Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1704
Springer: Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Jan M. Żytkow, Jan Rauch (Eds.)
Principles of Data Mining and Knowledge Discovery Third European Conference, PKDD’99 Prague, Czech Republic, September 15-18, 1999 Proceedings
Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany Volume Editors Jan M. Żytkow University of North Carolina, Department of Computer Science Charlotte, NC 28223, USA E-mail:
[email protected] Jan Rauch University of Economics, Faculty of Informatics and Statistics Laboratory of Intelligent Systems W. Churchill Sq. 4, 13067 Prague, Czech Republic E-mail:
[email protected]
Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Principles of data mining and knowledge discovery : third European conference ; proceedings / PKDD ’99, Prague, Czech Republic, September 15 - 18, 1999. Jan M. Zytkow ; Jan Rauch (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; Vol. 1704 : Lecture notes in artificial intelligence) ISBN 3-540-66490-4
CR Subject Classification (1998): I.2, H.3, H.5, G.3, J.1 ISBN 3-540-66490-4 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. c Springer-Verlag Berlin Heidelberg 1999 Printed in Germany Typesetting: Camera-ready by author SPIN 10704907 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Preface
This volume contains papers selected for presentation at PKDD'99, the Third European Conference on Principles and Practice of Knowledge Discovery in Databases. The first meeting was held in Trondheim, Norway, in June 1997, the second in Nantes, France, in September 1998. PKDD'99 was organized in Prague, Czech Republic, on September 15-18, 1999. The conference was hosted by the Laboratory of Intelligent Systems at the University of Economics, Prague. We wish to express our thanks to the sponsors of the Conference, the Komerční banka, a.s. and the University of Economics, Prague, for their generous support. Knowledge discovery in databases (KDD), also known as data mining, provides tools for turning large databases into knowledge that can be used in practice. KDD has been able to grow very rapidly since its emergence a decade ago, by drawing its techniques and data mining experiences from a combination of many existing research areas: databases, statistics, mathematical logic, machine learning, automated scientific discovery, inductive logic programming, artificial intelligence, visualization, decision science, and high performance computing. The strength of KDD came initially from the value added to the creative combination of techniques from the contributing areas. In order to establish its identity, KDD has to create its own theoretical principles and to demonstrate how they stimulate KDD research, facilitate communication and guide practitioners towards successful applications. Seeking the principles that can guide and strengthen practical applications has always been a part of the European research tradition. Thus "Principles and Practice of KDD" (PKDD) make a suitable focus for annual meetings of the KDD community in Europe. The main long-term interest is in theoretical principles for the emerging discipline of KDD and in practical applications that demonstrate the utility of those principles. Other goals of the PKDD series are to provide a European-based forum for interaction among all theoreticians and practitioners interested in data mining and knowledge discovery as well as to foster interdisciplinary collaboration. A Discovery Challenge hosted at PKDD'99 is a new initiative promoting cooperative research on new real-world databases, supporting a broad and unified view of knowledge and methods of discovery, and emphasizing business problems that require an open-minded search for knowledge in data. Two multi-relational databases, in banking and in medicine, were widely available. The Challenge was born out of the conviction that knowledge discovery in real-world databases requires an open-minded discovery process rather than application of one or another tool limited to one form of knowledge. A discoverer should consider a broad scope of techniques that can reach many forms of knowledge. The discovery process cannot be rigid and selection of techniques must be driven by knowledge hidden in the data, so that the most and the best of knowledge can be reached.
The contributed papers were selected from 106 full papers (45% growth over PKDD'98) by the following program committee: Pieter Adriaans (Syllogic, Netherlands), Petr Berka (U. Economics, Czech Rep.), Pavel Brazdil (U. Porto, Portugal), Henri Briand (U. Nantes, France), Leo Carbonara (British Telecom, UK), David L. Dowe (Monash U., Australia), A. Fazel Famili (IIT-NRC, Canada), Ronen Feldman (Bar Ilan U., Israel), Alex Freitas (PUC-PR, Brazil), Patrick Gallinari (U. Paris 6, France), Jean Gabriel Ganascia (U. Paris 6, France), Attilio Giordana (U. Torino, Italy), Petr Hájek (Acad. Science, Czech Rep.), Howard Hamilton (U. Regina, Canada), David Hand (Open U., UK), Bob Henery (U. Strathclyde, UK), Mikhail Kiselev (Megaputer Intelligence, Russia), Willi Kloesgen (GMD, Germany), Yves Kodratoff (U. Paris 11, France), Jan Komorowski (Norwegian U. Sci. & Tech.), Jacek Koronacki (Acad. Science, Poland), Nada Lavrac (Josef Stefan Inst., Slovenia), Heikki Mannila (Microsoft Research, Finland), Gholamreza Nakhaeizadeh (DaimlerChrysler, Germany), Gregory Piatetsky-Shapiro (Knowledge Stream, Boston, USA), Jaroslav Pokorný (Charles U., Czech Rep.), Lech Polkowski (U. Warsaw, Poland), Mohamed Quafafou (U. Nantes, France), Jan Rauch (U. Economics, Czech Rep.), Zbigniew Ras (UNC Charlotte, USA), Wei-Min Shen (USC, USA), Arno Siebes (CWI, Netherlands), Andrzej Skowron (U. Warsaw, Poland), Derek Sleeman (U. Aberdeen, UK), Nicolas Spyratos (U. Paris 11, France), Olga Štěpánková (Czech Tech. U.), Shusaku Tsumoto (Tokyo U., Japan), Raul Valdes-Perez (CMU, USA), Rüdiger Wirth (DaimlerChrysler, Germany), Stefan Wrobel (GMD, Germany), Ning Zhong (Yamaguchi U., Japan), Wojtek Ziarko (U. Regina, Canada), Djamel A. Zighed (U. Lyon 2, France), Jan Żytkow (UNC Charlotte, USA). The following colleagues also reviewed for the conference and are due our special thanks: Thomas Ågotnes, Mirian Halfeld Ferrari Alves, Joao Gama, A. Giacometti, Claire Green, Alipio Jorge, P. Kuntz, Dominique Laurent, Terje Løken, Aleksander Øhrn, Tobias Scheffer, Luis Torgo, and Simon White. Classified according to the first author's nationality, papers submitted to PKDD'99 came from 31 countries on 5 continents (Europe: 71 papers; Asia: 15; North America: 12; Australia: 5; and South America: 3), including Australia (5 papers), Austria (2), Belgium (3), Brazil (3), Bulgaria (1), Canada (2), Czech Republic (4), Finland (3), France (10), Germany (12), Greece (1), Israel (3), Italy (6), Japan (8), Korea (1), Lithuania (1), Mexico (1), Netherlands (3), Norway (2), Poland (4), Portugal (2), Russia (3), Slovak Republic (1), Slovenia (1), Spain (1), Switzerland (1), Taiwan (1), Thailand (2), Turkey (1), United Kingdom (9), and USA (9). Further authors represent: Australia (6 authors), Austria (1), Belgium (5), Brazil (5), Canada (4), Colombia (2), Czech Republic (3), Finland (2), France (12), Germany (9), Greece (1), Israel (13), Italy (9), Japan (15), Korea (1), Mexico (1), Netherlands (3), Norway (3), Poland (4), Portugal (4), Russia (1), Slovenia (3), Spain (4), Switzerland (1), Taiwan (1), Thailand (1), Turkey (1), Ukraine (1), United Kingdom (9), and USA (9).
Many thanks to all who submitted papers for review and for publication in the proceedings. The accepted papers were divided into two categories: 28 oral presentations and 48 poster presentations. In addition to poster sessions, each poster paper has been allocated a 3-minute highlight presentation at a plenary session. Invited speakers included Rüdiger Wirth (DaimlerChrysler, Germany) and Wolfgang Lehner (IBM Almaden Research Center, USA). Six tutorials were offered to all Conference participants on 15 September: (1) Data Mining for Robust Business Intelligence Solutions by Jan Mrazek; (2) Query Languages for Knowledge Discovery Processes by Jean-François Boulicaut; (3) The ESPRIT Project CreditMine and its Relevance for the Internet Market by Michael Krieger and Susanne Köhler; (4) Logics and Statistics for Association Rules and Beyond by Petr Hájek and Jan Rauch; (5) Data Mining for the Web by Myra Spiliopoulou; and (6) Relational Learning and Inductive Logic Programming Made Easy by Luc De Raedt and Hendrik Blockeel. Members of the PKDD'99 organizing committee have done an enormous amount of work and deserve the special gratitude of all participants: Petr Berka – Discovery Challenge Chair, Leonardo Carbonara – Industrial Program Chair, Jiří Ivánek – Local Arrangement Chair, Vojtěch Svátek – Publicity Chair, and Jiří Kosek, Marta Sochorová, and Dalibor Šrámek. Special gratitude is also due Milena Zeithamlová and Lucie Váchová, and their organizing agency, Action M Agency. Special thanks go to Alfred Hofmann of Springer-Verlag for his continuous help and support.
July 1999
Jan Rauch and Jan Żytkow, PKDD'99 Program Co-Chairs
Table of Contents
Session 1A – Time Series Scaling up Dynamic Time Warping to Massive Dataset . . . . . . . . . . . . . . . . . E.J. Keogh, M.J. Pazzani
1
The Haar Wavelet Transform in the Time Series Similarity Paradigm . . . . . 12 Z.R. Struzik, A. Siebes Rule Discovery in Large Time-Series Medical Databases . . . . . . . . . . . . . . . . 23 S. Tsumoto
Session 1B – Applications Simultaneous Prediction of Multiple Chemical Parameters of River Water Quality with TILDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 H. Blockeel, S. Dˇzeroski, J. Grbovi´c Applying Data Mining Techniques to Wafer Manufacturing . . . . . . . . . . . . . . 41 E. Bertino, B. Catania, E. Caglio An Application of Data Mining to the Problem of the University Students’ Dropout Using Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 S. Massa, P.P. Puliafito
Session 2A – Taxonomies and Partitions Discovering and Visualizing Attribute Associations Using Bayesian Networks and Their Use in KDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 G. Masuda, R. Yano, N. Sakamoto, K. Ushijima Taxonomy Formation by Approximate Equivalence Relations, Revisited . . . 71 ˙ F.A. El-Mouadib, J. Koronacki, J.M. Zytkow On the Use of Self-Organizing Maps for Clustering and Visualization . . . . . 80 A. Flexer Speeding Up the Search for Optimal Partitions . . . . . . . . . . . . . . . . . . . . . . . . 89 T. Elomaa, J. Rousu
Session 2B – Logic Methods Experiments in Meta-level Learning with ILP . . . . . . . . . . . . . . . . . . . . . . . . . . 98 L. Todorovski, S. Dˇzeroski
Boolean Reasoning Scheme with Some Applications in Data Mining . . . . . . 107 A. Skowron, H.S. Nguyen On the Correspondence between Classes of Implicational and Equivalence Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 J. Iv´ anek Querying Inductive Databases via Logic-Based User-Defined Aggregates . . 125 F. Giannotti, G. Manco
Session 3A – Distributed and Multirelational Databases Peculiarity Oriented Multi-database Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 N. Zhong, Y.Y. Yao, S. Ohsuga Knowledge Discovery in Medical Multi-databases: A Rough Set Approach . 147 S. Tsumoto Automated Discovery of Rules and Exceptions from Distributed Databases Using Aggregates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 R. P´ airc´eir, S. McClean, B. Scotney
Session 3B – Text Mining and Feature Selection Text Mining via Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 R. Feldman, Y. Aumann, M. Fresko, O. Liphstat, B. Rosenfeld, Y. Schler TopCat: Data Mining for Topic Identification in a Text Corpus . . . . . . . . . . 174 C. Clifton, R. Cooley Selection and Statistical Validation of Features and Prototypes . . . . . . . . . . 184 M. Sebban, D.A. Zighed, S. Di Palma
Session 4A – Rules and Induction Taming Large Rule Models in Rough Set Approaches . . . . . . . . . . . . . . . . . . . 193 T. ˚ Agotnes, J. Komorowski, T. Løken Optimizing Disjunctive Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 D. Zelenko Contribution of Boosting in Wrapper Models . . . . . . . . . . . . . . . . . . . . . . . . . . 214 M. Sebban, R. Nock Experiments on a Representation-Independent ”Top-Down and Prune” Induction Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 R. Nock, M. Sebban, P. Jappy
Session 5A – Interesting and Unusual Heuristic Measures of Interestingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 R.J. Hilderman, H.J. Hamilton Enhancing Rule Interestingness for Neuro-fuzzy Systems . . . . . . . . . . . . . . . . 242 T. Wittmann, J. Ruhland, M. Eichholz Unsupervised Profiling for Identifying Superimposed Fraud . . . . . . . . . . . . . . 251 U. Murad, G. Pinkas OPTICS-OF: Identifying Local Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 M.M. Breunig, H.-P. Kriegel, R.T. Ng, J. Sander
Posters Selective Propositionalization for Relational Learning . . . . . . . . . . . . . . . . . . . 271 ´ Alphonse, C. Rouveirol E. Circle Graphs: New Visualization Tools for Text-Mining . . . . . . . . . . . . . . . . 277 Y. Aumann, R. Feldman, Y.B. Yehuda, D. Landau, O. Liphstat, Y. Schler On the Consistency of Information Filters for Lazy Learning Algorithms . . 283 H. Brighton, C. Mellish Using Genetic Algorithms to Evolve a Rule Hierarchy . . . . . . . . . . . . . . . . . . 289 R. Cattral, F. Oppacher, D. Deugo Mining Temporal Features in Association Rules . . . . . . . . . . . . . . . . . . . . . . . . 295 X. Chen, I. Petrounias The Improvement of Response Modeling: Combining Rule-Induction and Case-Based Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 F. Coenen, G. Swinnen, K. Vanhoof, G. Wets Analyzing an Email Collection Using Formal Concept Analysis . . . . . . . . . . 309 R. Cole, P. Eklund Business Focused Evaluation Methods: A Case Study . . . . . . . . . . . . . . . . . . . 316 P. Datta Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 W. Ertel, M. Schramm Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 A. Feelders
Rough Dependencies as a Particular Case of Correlation: Application to the Calculation of Approximative Reducts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 M.C. Fernandez-Baiz´ an, E. Menasalvas Ruiz, J.M. Pe˜ na S´ anchez, S. Mill´ an, E. Mesa A Fuzzy Beam-Search Rule Induction Algorithm . . . . . . . . . . . . . . . . . . . . . . . 341 C.S. Fertig, A.A. Freitas, L.V.R. Arruda, C. Kaestner An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 Z. Fu Extension to C-means Algorithm for the Use of Similarity Functions . . . . . . 354 J.R. Garc´ıa-Serrano, J.F. Mart´ınez-Trinidad Predicting Chemical Carcinogenesis Using Structural Information Only . . . 360 C.J. Kennedy, C. Giraud-Carrier, D.W. Bristol LA - A Clustering Algorithm with an Automated Selection of Attributes, which Is Invariant to Functional Transformations of Coordinates . . . . . . . . . 366 M.V. Kiselev, S.M. Ananyan, S.B. Arseniev Association Rule Selection in a Data Mining Environment . . . . . . . . . . . . . . . 372 M. Klemettinen, H. Mannila, A.I. Verkamo Multi-relational Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 A.J. Knobbe, A. Siebes, D. van der Wallen Learning of Simple Conceptual Graphs from Positive and Negative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 S.O. Kuznetsov An Evolutionary Algorithm Using Multivariate Discretization for Decision Rule Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 W. Kwedlo, M. Kr¸etowski ZigZag, a New Clustering Algorithm to Analyze Categorical Variable Cross-Classification Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 S. Lallich Efficient Mining of High Confidence Association Rules without Support Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 J. Li, X. Zhang, G. Dong, K. Ramamohanarao, Q. Sun A Logical Approach to Fuzzy Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 C.-J. Liau, D.-R. Liu AST: Support for Algorithm Selection with a CBR Approach . . . . . . . . . . . . 418 G. Lindner, R. Studer Efficient Shared Near Neighbours Clustering of Large Metric Data Sets . . . 424 S. Lodi, L. Reami, C. Sartori
Discovery of ”Interesting” Data Dependencies from a Workload of SQL Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430 S. Lopes, J.-M. Petit, F. Toumani Learning from Highly Structured Data by Decomposition . . . . . . . . . . . . . . . 436 R. Mac Kinney-Romero, C. Giraud-Carrier Combinatorial Approach for Data Binarization . . . . . . . . . . . . . . . . . . . . . . . . 442 E. Mayoraz, M. Moreira Extending Attribute-Oriented Induction as a Key-Preserving Data Mining Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448 M.K. Muyeba, J.A. Keane Automated Discovery of Polynomials by Inductive Genetic Programming . 456 N. Nikolaev, H. Iba Diagnosing Acute Appendicitis with Very Simple Classification Rules . . . . . 462 A. Øhrn, J. Komorowski Rule Induction in Cascade Model Based on Sum of Squares Decomposition 468 T. Okada Maintenance of Discovered Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 ˇ ep´ M. Pˇechouˇcek, O. Stˇ ankov´ a, P. Mikˇsovsk´ y A Divisive Initialization Method for Clustering Algorithms . . . . . . . . . . . . . . 484 C. Pizzuti, D. Talia, G. Vonella A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 T. Poddig, K. Huber Mining Lemma Disambiguation Rules from Czech Corpora . . . . . . . . . . . . . . 498 L. Popel´ınsk´y, T. Pavelek Adding Temporal Semantics to Association Rules . . . . . . . . . . . . . . . . . . . . . . 504 C.P. Rainsford, J.F. Roddick Studying the Behavior of Generalized Entropy in Induction Trees Using a M-of-N Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 R. Rakotomalala, S. Lallich, S. Di Palma Discovering Rules in Information Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518 Z.W. Ras Mining Text Archives: Creating Readable Maps to Structure and Describe Document Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524 A. Rauber, D. Merkl
Neuro-fuzzy Data Mining for Target Group Selection in Retail Banking . . . 530 J. Ruhland, T. Wittmann Mining Possibilistic Set-Valued Rules by Generating Prime Disjunctions . . 536 A.A. Savinov Towards Discovery of Information Granules . . . . . . . . . . . . . . . . . . . . . . . . . . . 542 A. Skowron, J. Stepaniuk Classification Algorithms Based on Linear Combinations of Features . . . . . . 548 ´ ezak, J. Wr´ D. Sl¸ oblewski Managing Interesting Rules in Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . . 554 M. Spiliopoulou Support Vector Machines for Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . 561 S. Sugaya, E. Suzuki, S. Tsumoto Regression by Feature Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568 ˙ Uysal, H.A. G¨ I. uvenir Generating Linguistic Fuzzy Rules for Pattern Classification with Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 N. Xiong, L. Litz
Tutorials Data Mining for Robust Business Intelligence Solutions . . . . . . . . . . . . . . . . . 580 J. Mrazek Query Languages for Knowledge Discovery in Databases . . . . . . . . . . . . . . . . 582 J.-F. Boulicaut The ESPRIT Project CreditMine and Its Relevance for the Internet Market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584 S. K¨ ohler, M. Krieger Logics and Statistics for Association Rules and Beyond . . . . . . . . . . . . . . . . . 586 P. H´ ajek, J. Rauch Data Mining for the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 M. Spiliopoulou Relational Learning and Inductive Logic Programming Made Easy . . . . . . . 590 L. De Raedt, H. Blockeel
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Scaling up Dynamic Time Warping to Massive Datasets Eamonn J. Keogh and Michael J. Pazzani Department of Information and Computer Science University of California, Irvine, California 92697 USA {eamonn, pazzani}@ics.uci.edu Abstract. There has been much recent interest in adapting data mining algorithms to time series databases. Many of these algorithms need to compare time series. Typically some variation or extension of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations, however it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher level abstraction of the data, in particular, a piecewise linear representation. We demonstrate that our approach allows us to outperform DTW by one to three orders of magnitude. We experimentally evaluate our approach on medical, astronomical and sign language data.
1 Introduction

Time series are a ubiquitous form of data occurring in virtually every scientific discipline and business application. There has been much recent work on adapting data mining algorithms to time series databases. For example, Das et al (1998) attempt to show how association rules can be learned from time series. Debregeas and Hebrail (1998) demonstrate a technique for scaling up time series clustering algorithms to massive datasets. Keogh and Pazzani (1998) introduced a new, scaleable time series classification algorithm. Almost all algorithms that operate on time series data need to compute the similarity between time series. Euclidean distance, or some extension or modification thereof, is typically used. However, Euclidean distance can be an extremely brittle distance measure. Consider the clustering produced by Euclidean distance in Fig. 1. Sequence 3 is judged as most similar to the line in sequence 4, yet it appears more similar to 1 or 2. The reason why Euclidean distance may fail to produce an intuitively correct measure of similarity between two sequences is because it is
Fig. 1. An unintuitive clustering produced by the Euclidean distance measure. Sequences 1, 2 and 3 are astronomical time series (Derriere 1998). Sequence 4 is simply a straight line with the same mean and variance as the other sequences
very sensitive to small distortions in the time axis. Consider Fig 2.A. The two sequences have approximately the same overall shape, but those shapes are not exactly aligned in the time axis. The nonlinear alignment shown in Fig 2.B would
Fig. 2. Two sequences that represent the Y-axis position of an individual’s hand while signing the word "pen" in Australian Sign Language. Note that while the sequences have an overall similar shape, they are not aligned in the time axis. Euclidean distance, which assumes the ith point on one sequence is aligned with ith point on the other (A), will produce a pessimistic dissimilarity measure. A nonlinear alignment (B) allows a more sophisticated distance measure to be calculated
allow a more sophisticated distance measure to be calculated. A method for achieving such alignments has long been known in the speech processing community (Sakoe and Chiba 1978). The technique, Dynamic Time Warping (DTW), was introduced to the data mining community by Berndt and Clifford (1994). Although they demonstrate the utility of the approach, they acknowledge that the algorithm's time complexity is a problem and that "…performance on very large databases may be a limitation". As an example of the utility of DTW, compare the clustering shown in Figure 1 with Figure 3. In this paper we introduce a technique which speeds up DTW by a large constant. The value of the constant is data dependent but is typically one to three orders of magnitude. The algorithm, Segmented Dynamic Time Warping (SDTW), takes advantage of the fact that we can efficiently approximate most time series by a set of piecewise linear segments.
Fig. 3. When the dataset used in Fig. 1 is clustered using DTW, the results are much more intuitive
The rest of this paper is organized as follows. Section 2 contains a review of the classic DTW algorithm. Section 3 introduces the piecewise linear representation and SDTW algorithm. In Section 4 we experimentally compare DTW, SDTW and Euclidean distance on several real world datasets. Section 5 contains a discussion of related work. Section 6 contains our conclusions and areas of future research.
2 The Dynamic Time Warping Algorithm

Suppose we have two time series Q and C, of length n and m respectively, where:
Q = q_1, q_2, …, q_i, …, q_n    (1)

C = c_1, c_2, …, c_j, …, c_m    (2)

To align two sequences using DTW we construct an n-by-m matrix where the (i-th, j-th) element of the matrix contains the distance d(q_i, c_j) between the two points q_i and c_j (with Euclidean distance, d(q_i, c_j) = (q_i − c_j)²). Each matrix element (i, j) corresponds to the alignment between the points q_i and c_j. This is illustrated in Figure 4. A warping path W is a contiguous (in the sense stated below) set of matrix elements that defines a mapping between Q and C. The k-th element of W is defined as w_k = (i, j)_k, so we have:
W = w_1, w_2, …, w_k, …, w_K,    max(m, n) ≤ K < m + n − 1    (3)
The warping path is typically subject to several constraints.
Boundary Conditions: w1 = (1,1) and wK = (m,n), simply stated, this requires the warping path to start and finish in diagonally opposite corner cells of the matrix.
Continuity: Given w_k = (a, b) then w_{k−1} = (a', b') where a − a' ≤ 1 and b − b' ≤ 1. This restricts the allowable steps in the warping path to adjacent cells (including diagonally adjacent cells).
Monotonicity: Given w_k = (a, b) then w_{k−1} = (a', b') where a − a' ≥ 0 and b − b' ≥ 0. This forces the points in W to be monotonically spaced in time.
There are exponentially many warping paths that satisfy the above conditions, however we are interested only in the path which minimizes the warping cost:

DTW(Q, C) = min{ √( ∑_{k=1}^{K} w_k ) / K }    (4)

The K in the denominator is used to compensate for the fact that warping paths may have different lengths.
Fig. 4. An example warping path
This path can be found very efficiently using dynamic programming to evaluate the following recurrence, which defines the cumulative distance g(i, j) as the distance d(i, j) found in the current cell and the minimum of the cumulative distances of the adjacent elements:

g(i, j) = d(q_i, c_j) + min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }    (5)
The Euclidean distance between two sequences can be seen as a special case of DTW where the k-th element of W is constrained such that w_k = (i, j)_k, i = j = k. Note that it is only defined in the special case where the two sequences have the same length. The time complexity of DTW is O(nm). However this is just for comparing two sequences. In data mining applications we typically have one of the following two situations (Agrawal et. al. 1995). 1) Whole Matching: We have a query sequence Q, and X sequences of approximately the same length in our database. We want to find the sequence that is most similar to Q. 2) Subsequence Matching: We have a query sequence Q, and a much longer sequence R of length X in our database. We want to find the subsection of R that is most similar to Q. To find the best match we "slide" the query along R, testing every possible subsection of R.
In either case the time complexity is O(n²X), which is intractable for many real-world problems. This review of DTW is necessarily brief; we refer the interested reader to Kruskall and Liberman (1983) for a more detailed treatment.
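To make the recurrence of Eq. 5 concrete, the following minimal Python sketch (ours, not the authors' code; the variable names and the path-length bookkeeping are our assumptions) fills the cumulative-distance matrix g for two 1-D numeric sequences and returns the warping cost normalised by the length of the recovered path:

```python
import numpy as np

def dtw(q, c):
    """Minimal dynamic-programming DTW following the recurrence of Eq. 5."""
    n, m = len(q), len(c)
    g = np.full((n + 1, m + 1), np.inf)      # cumulative distances g(i, j)
    steps = np.zeros((n + 1, m + 1))         # length K of the path reaching (i, j)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2   # d(q_i, c_j), squared Euclidean distance
            # the three allowed predecessors enforce continuity and monotonicity
            prev = min((g[i - 1, j - 1], steps[i - 1, j - 1]),
                       (g[i - 1, j], steps[i - 1, j]),
                       (g[i, j - 1], steps[i, j - 1]))
            g[i, j] = d + prev[0]
            steps[i, j] = prev[1] + 1
    return np.sqrt(g[n, m]) / steps[n, m]    # normalise by path length, as in Eq. 4
```

The two nested loops make the O(nm) cost of Eq. 5 explicit; this is the baseline that the segmented variant of the next section is designed to avoid.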
3 Exploiting a Higher Level Representation

Because working with raw time series is computationally expensive, several researchers have proposed using higher level representations of the data. In previous work we have championed a piecewise linear representation, demonstrating that the linear segment representation can be used to allow relevance feedback in time series databases (Keogh and Pazzani 1998) and that it allows a user to define probabilistic queries (Keogh and Smyth 1997).

3.1 Piecewise Linear Representation

We will use the following notation throughout this paper. A time series, sampled at n points, is represented as an italicized uppercase letter such as A. The segmented version of A, containing N linear segments, is denoted as a bold uppercase letter such as A, where A is a 4-tuple of vectors of length N:

A ≡ {A_XL, A_XR, A_YL, A_YR}
The i-th segment of sequence A is represented by the line between (A_{XLi}, A_{YLi}) and (A_{XRi}, A_{YRi}). Figure 5 illustrates this notation.
We will denote the ratio n/N as c, the compression ratio. We can choose to set this ratio to any value, adjusting the tradeoff between compactness and fidelity. For brevity we omit details of how we choose the compression ratio and how the segmented representation is obtained, referring the interested reader to Keogh and Smyth (1997) instead. We do note however that the segmentation can be obtained in linear time.

Fig. 5. We represent a time series by a sequence of straight segments

3.2 Warping with the Piecewise Linear Representation
To align two sequences using SDTW we construct an N-by-M matrix where the (i-th, j-th) element of the matrix contains the distance d(Q_i, C_j) between the two segments Q_i and C_j. The distance between two segments is defined as the square of the distance between their means:
d(Q_i, C_j) = [ (Q_{YLi} + Q_{YRi})/2 − (C_{YLj} + C_{YRj})/2 ]²    (6)
Apart from this modification the matrix-searching algorithm is essentially unaltered. Equation 5 is modified to reflect the new distance measure:

g(i, j) = d(Q_i, C_j) + min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }    (7)
When reporting the DTW distance between two time series (Eq. 4) we compensated for different length paths by dividing by K, the length of the warping path. We need to do something similar for SDTW, but we cannot use K directly, because different elements in the warping matrix correspond to segments of different lengths and therefore K only approximates the length of the warping path. Additionally we would like SDTW to be measured in the same units as DTW to facilitate comparison. We measure the length of SDTW's warping path by extending the recurrence shown in Eq. 7 to return and recursively sum an additional variable, max([Q_{XRi} − Q_{XLi}], [C_{XRj} − C_{XLj}]), with the corresponding element from min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }. Because the length of the warping path is measured in the same units as DTW we have:

SDTW(Q, C) ≈ DTW(Q, C)    (8)

Figure 6 shows strong visual evidence that SDTW finds alignments that are very similar to those produced by DTW. The time complexity for SDTW is O(MN), where M = m/c and N = n/c. This means that the speedup obtained by using SDTW should be approximately c², minus some constant factors because of the overhead of obtaining the segmented representation.
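A sketch of the segmented variant under the same dynamic-programming scheme follows. It is only one reasonable reading of Eqs. 6-8, not the authors' implementation: the segment container (plain arrays of left/right endpoints in a dict) and the normalisation by the recursively summed segment lengths are our assumptions.

```python
import numpy as np

def sdtw(q_seg, c_seg):
    """Segmented DTW sketch; each argument is a dict with arrays 'xl', 'xr', 'yl', 'yr'."""
    qm = (np.asarray(q_seg['yl']) + np.asarray(q_seg['yr'])) / 2.0   # segment means
    cm = (np.asarray(c_seg['yl']) + np.asarray(c_seg['yr'])) / 2.0
    qlen = np.asarray(q_seg['xr']) - np.asarray(q_seg['xl'])         # segment lengths in datapoints
    clen = np.asarray(c_seg['xr']) - np.asarray(c_seg['xl'])
    N, M = len(qm), len(cm)
    g = np.full((N + 1, M + 1), np.inf)
    length = np.zeros((N + 1, M + 1))        # warping-path length measured in datapoint units
    g[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = (qm[i - 1] - cm[j - 1]) ** 2                         # Eq. 6: squared distance of means
            prev = min((g[i - 1, j - 1], length[i - 1, j - 1]),
                       (g[i - 1, j], length[i - 1, j]),
                       (g[i, j - 1], length[i, j - 1]))
            g[i, j] = d + prev[0]
            length[i, j] = prev[1] + max(qlen[i - 1], clen[j - 1])   # summed as described above
    return np.sqrt(g[N, M]) / length[N, M]
```

The matrix here is only N-by-M, which is where the roughly c² reduction in work relative to the sketch of Eq. 5 comes from.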
Fig. 6. A and B both show two similar time series and the alignment between them, as discovered by DTW. A’ and B’ show the same time series in their segmented representation, and the alignment discovered by SDTW. This presents strong visual evidence that SDTW finds approximately the same warping as DTW
4 Experimental Results

We are interested in two properties of the proposed approach: the speedup obtained over the classic DTW algorithm and the quality of the alignment. In general, the quality of the alignment is subjective, so we designed experiments that indirectly, but objectively, measure it.

4.1 Clustering

For our clustering experiment we utilized the Australian Sign Language Dataset from the UCI KDD archive (Bay 1999). The dataset consists of various sensors that measure the X-axis position of a subject's right hand while signing one of 95 words in Australian Sign Language (there are other sensors in the dataset, which we ignored in this work). For each of the words, 5 recordings were made. We used a subset of the database which corresponds to the following 10 words, "spend", "lose", "forget", "innocent", "norway", "happy", "later", "eat", "cold" and "crazy". For every possible pairing of words, we clustered the 10 corresponding sequences, using group average hierarchical clustering. At the lowest level of the corresponding dendrogram, the clustering is subjective. However, the highest level of the dendrogram (i.e. the first bifurcation) should divide the data into the two classes. There are 34,459,425 possible ways to cluster 10 items, of which 11,025 correctly partition the two classes, so the default rate for an algorithm which guesses randomly is only 0.031%. We compared three distance measures: 1) DTW: The classic dynamic time warping algorithm as presented in Section 2. 2) SDTW: The segmented dynamic time warping algorithm proposed here. 3) Euclidean: We also tested Euclidean to facilitate comparison to the large body of literature that utilizes this distance measure. Because the Euclidean distance is only defined for sequences of the same length, and there is a small variance in the length of the sequences in this dataset, we did the following. When comparing sequences of different lengths, we "slid" the shorter of the two sequences across the longer and recorded the minimum distance.
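The clustering protocol above can be reproduced with a short script. The SciPy calls below are our choice of tooling (the paper does not name any library), and dist can be any of the three measures, e.g. the dtw sketch given earlier or the sliding Euclidean comparison for sequences of unequal length:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sliding_euclidean(a, b):
    """Slide the shorter sequence across the longer one and keep the minimum distance."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    k = len(short)
    return min(np.sqrt(np.sum((long_[o:o + k] - short) ** 2))
               for o in range(len(long_) - k + 1))

def top_split(series, dist):
    """Group-average hierarchical clustering; return the two groups at the first bifurcation."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(series[i], series[j])
    Z = linkage(squareform(D), method='average')   # group average (UPGMA) linkage
    return fcluster(Z, t=2, criterion='maxclust')  # cluster labels at the top bifurcation
```

A run would then be counted as correct when the two labels returned by top_split coincide exactly with the two signed words.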
Figure 7 shows an example of one experiment and Table 1 summarizes the results.
Fig. 7. An example of a single clustering experiment. The time series 1 to 5 correspond to 5 different readings of the word "norway", the time series 6 to 10 correspond to 5 different readings of the word "later". Euclidean distance is unable to differentiate between the two words. Although DTW and SDTW differ at the lowest levels of the dendrogram, where the clustering is subjective, they both correctly divide the two classes at the highest level
Distance measure    Mean Time (Seconds)    Correct Clusterings (Out of 45)
Euclidean           3.23                   2
DTW                 87.06                  22
SDTW                4.12                   21
Table 1: A comparison of three distance measures on a clustering task

Although the Euclidean distance can be quickly calculated, its performance is only slightly better than random. DTW and SDTW have essentially the same accuracy, but SDTW is more than 20 times faster.

4.2 Query by Example

The clustering example in the previous section demonstrated the ability of SDTW to do whole matching. Another common task for time series applications is subsequence matching, which we consider here. Assume that we have a query Q of length n, and a much longer reference sequence R, of length X. The task is to find the subsequence of R which best matches Q, and report its offset within R. If we use the Euclidean distance as our distance measure, we can use an indexing technique to speed up the search (Faloutsos et. al. 1994, Keogh & Pazzani 1999). However, DTW does not obey the triangular inequality and this makes
it impossible to utilize standard indexing schemes. Given this, we are resigned to using sequential search, "sliding" the query along the reference sequence, repeatedly recalculating the distance at each offset. Figure 8 illustrates the idea.
Fig. 8. Subsequence matching involves sequential search, "sliding" the query Q against the reference sequence R, repeatedly recalculating the distance measure at each offset
Berndt and Clifford (1994) suggested the simple optimization of skipping every second datapoint in R, noting that as Q is slid across R, the distance returned by DTW changes slowly and smoothly. We note that sometimes it would be possible to skip much more than 1 datapoint, because the distance will only change dramatically when a new feature (i.e. a plateau, one side of a peak or valley etc.) from R falls within the query window. The question then arises of how to tell where features begin and end in R. The answer to this problem is given automatically, because the process of obtaining the linear segmentation can be considered a form of feature extraction (Hagit & Zdonik 1996). We propose searching R by anchoring the leftmost segment in Q against the left edge of each segment in R. Each time we slide the query to measure the distance at the next offset, we effectively skip as many datapoints as are represented by the last anchor segment. As noted in Section 3, the speedup for SDTW over DTW is approximately c²; however, this is for whole matching. For subsequence matching the speedup is approximately c³. For this experiment we used the EEG dataset from the UCI KDD repository (Bay 1999). This dataset contains 10,240 datapoints. In order to create queries with objectively correct answers, we extracted a 100-point subsection of data at random, then artificially warped it. To warp a sequence we begin by randomly choosing an anchor point somewhere on the sequence. We randomly shifted the anchor point W time-units left or right (with W = 10, 20, 30). The other datapoints were moved to compensate for this shift by an amount that depended on their inverse squared distance to the anchor point, thus localizing the effect. After this transformation we interpolated the data back onto the original, equi-spaced X-axis. The net effect of this transformation is a smooth local distortion of the original sequence, as shown in Figure 9. We repeated this ten times for each W.

Fig. 9. An example of an artificially warped time series used in our experiments. An anchor point (black dot) is chosen in the original sequence (solid line). The anchor point is moved W units (here W = 10) and the neighboring points are also moved by an amount related to the inverse square of their distance to the anchor point. The net result is that the transformed sequence (dashed line) is a smoothly warped version of the original sequence
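The synthetic warping can be sketched in a few lines. This is our reading of the procedure: the exact displacement law is not spelled out in the text, so the inverse-square falloff formula, the helper names and the NumPy re-interpolation below are assumptions.

```python
import numpy as np

def warp(series, W, rng=None):
    """Shift a random anchor point by W samples and smoothly drag its neighbourhood along."""
    rng = rng or np.random.default_rng()
    y = np.asarray(series, dtype=float)
    n = len(y)
    t = np.arange(n, dtype=float)
    anchor = int(rng.integers(1, n - 1))
    shift = W * rng.choice([-1.0, 1.0])             # W time-units left or right
    dist = np.abs(t - anchor)
    dist[anchor] = 1.0                              # placeholder to avoid division by zero
    move = shift / dist ** 2                        # inverse-square falloff localises the effect
    move[anchor] = shift                            # the anchor itself moves the full W
    warped_t = t + move
    order = np.argsort(warped_t)                    # keep the axis monotone for interpolation
    return np.interp(t, warped_t[order], y[order])  # back onto the original equi-spaced axis
```

Applying warp ten times for each of W = 10, 20, 30 reproduces the experimental setup described above.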
As before, we compared three distance measures, measuring both accuracy and time. The results are presented in Table 2.

Distance measure    Mean Accuracy (W = 10)    Mean Accuracy (W = 20)    Mean Accuracy (W = 30)    Mean Time (Seconds)
Euclidean           20%                       0%                        0%                        147.23
DTW                 100%                      90%                       60%                       15064.64
SDTW                100%                      90%                       50%                       26.16

Table 2: A comparison of three distance measures on query by example

Euclidean distance is fast to compute, but its performance degrades rapidly in the presence of time axis distortion. Both DTW and SDTW are able to detect matches in spite of warping, but SDTW is approximately 575 times faster.
5 Related Work

Dynamic time warping has enjoyed success in many areas where its time complexity is not an issue. It has been used in gesture recognition (Gavrila & Davis 1995), robotics (Schmill et. al 1999), speech processing (Rabiner & Juang 1993), manufacturing (Gollmer & Posten 1995) and medicine (Caiani et. al 1998). Conventional DTW, however, is much too slow for searching large databases. For this problem, Euclidean distance, combined with an indexing scheme, is typically used. Faloutsos et al. (1994) extract the first few Fourier coefficients from the time series and use these to project the data into multi-dimensional space. The data can then be indexed with a multi-dimensional indexing structure such as an R-tree. Keogh and Pazzani (1999) address the problem by de-clustering the data into bins, and optimizing the data within the bins to reduce search times. While both these approaches greatly speed up query times for Euclidean distance queries, many real world applications require non-Euclidean notions of similarity. The idea of using piecewise linear segments to approximate time series dates back to Pavlidis and Horowitz (1974). Later researchers, including Hagit and Zdonik (1996) and Keogh and Pazzani (1998), considered methods to exploit this representation to support various non-Euclidean distance measures; however, this paper is the first to demonstrate the possibility of supporting time warped queries with linear segments.
6 Conclusions and Future Work

We demonstrated a modification of DTW that exploits a higher level representation of time series data to produce one to three orders of magnitude speed-up with no appreciable decrease in accuracy. We experimentally demonstrated our approach on several real world datasets. Future work includes a detailed theoretical examination of SDTW, and extensions to multivariate time series.
References

Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In VLDB, September.

Bay, S. (1999). UCI Repository of KDD databases [http://kdd.ics.uci.edu/]. Irvine, CA: University of California, Department of Information and Computer Science.

Berndt, D. & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, Washington.

Caiani, E.G., Porta, A., Baselli, G., Turiel, M., Muzzupappa, S., Pieruzzi, F., Crema, C., Malliani, A. & Cerutti, S. (1998). Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology. Vol. 25 Cat. No.98CH36292, NY, USA.

Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule discovery from time series. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining. pp 16-22, AAAI Press.

Debregeas, A. & Hebrail, G. (1998). Interactive interpretation of Kohonen maps applied to curves. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining. pp 179-183, AAAI Press.

Derriere, S. (1998). D.E.N.I.S strip 3792: [http://cdsweb.u-strasbg.fr/DENIS/qual_gif/cpl3792.dat]
Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proc. ACM SIGMOD Conf., Minneapolis, May. Gavrila, D. M. & Davis,L. S.(1995). Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In International Workshop on Automatic Face- and Gesture-Recognition. IEEE Computer Society, Zurich. Gollmer, K., & Posten, C. (1995) Detection of distorted pattern using dynamic time warping algorithm and application for supervision of bioprocesses. On-Line Fault Detection and Supervision in the Chemical Process Industries (Edited by: Morris, A.J.; Martin, E.B.). Hagit, S., & Zdonik, S. (1996). Approximate queries and representations for large data sequences. Proc. 12th IEEE International Conference on Data Engineering. pp 546-553, New Orleans, Louisiana, February.
Keogh, E., & Pazzani, M. (1998). An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining. pp 239-241, AAAI Press.

Keogh, E., & Pazzani, M. (1999). An indexing scheme for fast similarity search in large time series databases. To appear in Proceedings of the 11th International Conference on Scientific and Statistical Database Management.

Keogh, E., Smyth, P. (1997). A probabilistic approach to fast pattern matching in time series databases. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining. pp 24-20, AAAI Press.

Kruskall, J. B. & Liberman, M. (1983). The symmetric time warping algorithm: From continuous to discrete. In Time Warps, String Edits and Macromolecules: The Theory and Practice of String Comparison. Addison-Wesley.

Pavlidis, T., Horowitz, S. (1974). Segmentation of plane curves. IEEE Transactions on Computers, Vol. C-23, No. 8, August.

Rabiner, L. & Juang, B. (1993). Fundamentals of speech recognition. Englewood Cliffs, N.J, Prentice Hall.

Sakoe, H. & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26, 43-49.

Schmill, M., Oates, T. & Cohen, P. (1999). Learned models for continuous planning. In Seventh International Workshop on Artificial Intelligence and Statistics.
The Haar Wavelet Transform in the Time Series Similarity Paradigm Zbigniew R. Struzik, Arno Siebes Centre for Mathematics and Computer Science (CWI) Kruislaan 413, 1098 SJ Amsterdam The Netherlands email:
[email protected] Abstract. Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the rapid evaluation of similarity between time series in large data bases. It is based on the orthonormal decomposition of the time series into the Haar basis. We demonstrate that this approach is capable of providing estimates of the local slope of the time series in the sequence of multi-resolution steps. The Haar representation and a number of related representations derived from it are suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations closely corresponds to the subjective feeling of similarity between the time series. In order to test the validity of subjective criteria, we test the records of currency exchanges, finding convincing levels of correlation.
1 Introduction
Explicitly or implicitly, record similarity is a fundamental aspect of most data mining algorithms. For traditional, tabular data the similarity is often measured by attribute-value similarity or even attribute-value equality. For more complex data, e.g., financial time series, such simple similarity measures do not perform very well. For example, assume we have three time series A, B, and C, where B is constantly 5 points below A, whereas C is randomly 2 points below or above A. Such a simple similarity measure would rate C as far more similar to A than B, whereas a human expert would rate A and B as very similar because they have the same shape. This example illustrates that the similarity of time series data should be based on certain characteristics of the data rather than on the raw data itself. Ideally, these characteristics are such that the similarity of the time series is simply given by the (traditional) similarity of the characteristics. In that case, mining a database of time series is reduced to mining the database of characteristics using the traditional algorithms. This observation is not new, but can also (implicitly) be found in papers such as [1-7]. Which characteristics are computed depends very much on the application one has in mind. For example, many models and paradigms of similarity introduced to date are unnecessarily complex because they are designed to suit too large a spectrum of applications. The context of data mining applications in which matching time series are required often involves a smaller number of degrees of freedom than assumed. For
example, in comparing simultaneous financial time series, the time variable is explicitly known and time and scale shift are not applicable. In addition, there are strong heuristics which can be applied to these time series. For example, the concern in trading is usually to reach a certain level of index or currency exchange within a certain time. This is nothing other than the rate of increase, or simply the slope, of the time series in question. Consider a financial record over one year which we would like to compare with another such record from another source. The values of both are unrelated, the sampling density may be different or vary with time. Nevertheless, it ought to be possible to state how closely the two are related. If we were to do it in as few steps as possible, the first question to ask would probably be about the increase/decrease in (log)value over the year. In fact, just the sign of the change over the year may be sufficient, showing whether there has been a decrease or an increase in the stock value. Given this information, the next question might be what the increase/decrease was in the first half of the year and what it was in the second half. The reader will not be surprised if we suggest that perhaps the next question might be related to the increase/decrease in each quarter of the year. This is exactly the strategy we are going to follow. The wavelet transform using the Haar wavelet (the Haar WT for short) will provide exactly the kind of information we have used in the above example, through the decomposition of the time series in the Haar basis. In section 2, we will focus on the relevant aspects of the wavelet transformation with the Haar wavelet. From the hierarchical scale-wise decomposition provided by the wavelet transform, we will next select a number of interesting representations of the time series in section 3. In section 4, these time series' representations will be subject to evaluation of their correlation products. Section 5 gives a few details on the computational efficiency of the convolution product. This is followed by several test cases of correlating examples of currency exchange rates in section 6. Section 7 closes the paper with conclusions and suggestions for future developments.
2 The Haar Wavelet Transform

As already mentioned above, the recently introduced Wavelet Transform (WT), see e.g. Ref. [9, 10], provides a way of analysing local behaviour of functions. In this, it fundamentally differs from global transforms like the Fourier Transform. In addition to locality, it possesses the often very desirable ability of filtering the polynomial behaviour to some predefined degree. Therefore, correct characterisation of time series is possible, in particular in the presence of non-stationarities like global or local trends or biases. Conceptually, the wavelet transform is an inner product of the time series with the scaled and translated wavelet ψ(x), usually an n-th derivative of a smoothing kernel θ(x). The scaling and translation actions are performed by two parameters; the scale parameter s 'adapts' the width of the wavelet to the microscopic resolution required, thus changing its frequency contents, and the location of the analysing wavelet is determined by the parameter b:

W_ψ f(s, b) = ⟨f, ψ⟩(s, b) = (1/s) ∫_Ω dx f(x) ψ((x − b)/s),    (1)
where s, b ∈ ℝ and s > 0 for the continuous version (CWT), or are taken on a discrete, usually hierarchical (e.g. dyadic) grid of values s_i, b_j for the discrete version (DWT, or just WT). Ω is the support of f(x), or the length of the time series. The choice of the smoothing kernel θ(x) and the related wavelet ψ(x) depends on the application and on the desired properties of the wavelet transform. In [6, 7, 11], we used the Gaussian as the smoothing kernel. The reason for this was the optimal localisation both in frequency and position of the related wavelets, and the existence of derivatives of any degree n. In this paper, for the reasons which will become apparent later (see section 3), we will use a different smoothing function, namely a simple block function:

θ(x) = 1 for 0 ≤ x < 1,  and θ(x) = 0 otherwise.    (2)
The wavelets obtained from this kernel are defined on finite support and go by the name of their inventor Haar:
ψ(x) = 1 for 0 ≤ x < 1/2,  ψ(x) = −1 for 1/2 ≤ x < 1,  and ψ(x) = 0 otherwise.    (3)
For a particular choice of rescaling and position shift parameters (dyadic pyramidal scheme), the Haar system constitutes an orthonormal basis:
ψ_{m,n}(x) = 2^{−m} ψ(2^{−m} x − n),    m > 0, n = 0 … 2^m.    (4)

Assume an arbitrary time series f = {f_i}, i = 1 … 2^N, on the normalised support Ω(f) = [0, 1]. Using the orthonormal basis just described, the function f can be represented with the linear combination of Haar wavelets:
f = f_0 + ∑_{m=0}^{N} ∑_{l=0}^{2^m} c_{m,l} ψ_{m,l},    (5)
where f_0 is the most coarse approximation of the time series, f_0 = ⟨f, θ⟩, and each coefficient c_{m,l} of the representation can be obtained as c_{m,l} = ⟨f, ψ_{m,l}⟩. In particular, the approximations f^j of the time series f with the smoothing kernels θ_{j,k} form a 'ladder' of multi-resolution approximations:

f^{j−1} = f^j + ∑_{k=0}^{2^j} ⟨f, ψ_{j,k}⟩ ψ_{j,k},    (6)

where f^j = ⟨f, θ_{j,k}⟩ and θ_{j,k} = 2^{−j} θ(2^{−j} x − k).
It is thus possible to 'move' from one approximation level j − 1 to another level j by simply adding (subtracting for the j to j − 1 direction) the detail contained in the corresponding wavelet coefficients c_{j,k}, k = 0 … 2^j. In Figure 1, we show an example decomposition and reconstruction with the Haar wavelet. The time series analysed is f_{1..4} = {9, 7, 3, 5}.
Fig. 1. Decomposition of the example time series into Haar components. Right: reconstruction of the time series from the Haar components.
Note that the set of wavelet coefficients can be represented in a hierarchical (dyadic) tree structure, through which it is obtained. In particular, the reconstruction of each single point f_i of the time series is possible (without reconstructing all the f_j ≠ f_i), by following a single path along the tree, converging to the point f_i in question. This path determines a unique 'binary address' of the point f_i.
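The decomposition of Figure 1 amounts to a pairwise average/difference pyramid. The Python sketch below is ours, not the authors' code, and it omits the 2^{−m}-type normalisation factors carried by the true Haar coefficients; it reproduces the {9, 7, 3, 5} example, giving averages (8, 4) and then 6, with details (+1, −1) at the fine level and +2 at the coarse level.

```python
import numpy as np

def haar_decompose(f):
    """Return the coarsest average and the per-scale detail coefficients (unnormalised)."""
    f = np.asarray(f, dtype=float)
    details = []
    while len(f) > 1:
        details.append((f[0::2] - f[1::2]) / 2.0)   # half-differences at this scale
        f = (f[0::2] + f[1::2]) / 2.0               # pairwise averages -> next coarser level
    return f[0], details[::-1]                      # coarsest scale first

def haar_reconstruct(coarse, details):
    """Walk back down the ladder, adding/subtracting the stored details (cf. Eq. 6)."""
    f = np.array([coarse])
    for d in details:
        f = np.ravel(np.column_stack((f + d, f - d)))
    return f

coarse, details = haar_decompose([9, 7, 3, 5])      # coarse = 6.0, details = [[2.], [1., -1.]]
assert list(haar_reconstruct(coarse, details)) == [9.0, 7.0, 3.0, 5.0]
```

Reading a single point back only requires the coefficients on its root-to-leaf path, which is the 'binary address' mentioned above.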
3 Time Series Representations with Haar Family

Note that the Haar wavelet implements the operation of derivation at the particular scale at which it operates. From the definition of the Haar wavelet $\psi$ (eq. 3, see also figure 2) we have:

$$ \psi(x) = \langle D(2x), \theta(2x) \rangle , \qquad (7) $$

where D is the derivative operator

$$ D(x) = \begin{cases} 1 & \text{for } x = 0 \\ -1 & \text{for } x = 1 \\ 0 & \text{otherwise.} \end{cases} $$

For the wavelet transform of f, we have the following:
$$ \begin{aligned} \langle f(x), \psi_{l,n}(x) \rangle &= \langle f(x), \langle D_{l,n}(2x), \theta_{l,n}(2x) \rangle \rangle \\ &= \langle f(x), 2^{-1} \langle D_{l-1,n}(x), \theta_{l-1,n}(x) \rangle \rangle \\ &= \langle \langle 2^{-1} D_{m,n}(x), f(x) \rangle, \theta_{m,n}(x) \rangle \\ &= 2^{-1} \langle Df_{m,n}(x), \theta_{m,n}(x) \rangle , \end{aligned} \qquad (8) $$
where Df is the derivative of the function f and $\theta$ is the smoothing kernel. The wavelet coefficients obtained with the Haar wavelet are therefore proportional to the local averages of the derivative of the time series f at a given resolution. This is a particularly interesting property of our representation, which suggests that the representations derived from the Haar decomposition will be quite useful in time series mining. Indeed, in the analysis of patterns in time series, the local slope is probably the most appealing feature for many applications.
Fig. 2. Convolution of the block function with the derivative operator gives the Haar wavelet after rescaling the time axis $x \rightarrow x/2$. The symbol $*$ stands for the convolution product.
The most direct representation of the time series with the Haar decomposition scheme would be encoding a certain predefined, highest, i.e. most coarse, resolution level $s_{max}$, say one year resolution, and the details at the lower scales: half (a year), quarter (of a year) etc., down to the minimal (finest) resolution of interest $s_{min}$, which would often be defined by the lowest sampling rate of the signals.¹ The coefficients of the Haar decomposition between the scales $s_{max} \dots s_{min}$ will be used for the representation:

$$ \mathrm{Haar}(f) = \{ c_{i,j} : i = s_{max} \dots s_{min},\ j = 1 \dots 2^i \} . $$

¹ In practice one may need to interpolate and re-sample signals in order to arrive at a certain common or uniform sampling rate. This is, however, a problem of the implementation and not of the representation, and it is related to how the convolution operation is implemented.

The Haar representation is directly suitable to serve for comparison purposes when the absolute (i.e. not relative) values of the time series (and the local slope) are relevant. In many applications one would, however, rather work with value independent, scale invariant representations. For that purpose, we will use a number of different, special representations derived from the Haar decomposition WT. To begin with, we will use the sign based representation. It uses only the sign of the wavelet coefficient and it has been shown to work in the CWT based decomposition, see [6]:

$$ s_{i,j} = \mathrm{sgn}(c_{i,j}), \quad \text{where } \mathrm{sgn}(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ -1 & \text{for } x < 0 . \end{cases} $$
The sign representation is an extreme case of a discretised representation, since it reduces the range of coefficients in the representation to two discrete levels. For some purposes this may be too coarse. Another possibility to arrive at a scale invariant representation is to use the difference of the logarithms (DOL) of the values of the wavelet coefficient at the highest scale and at the working scale:

$$ v^{DOL}_{i,j} = \log(|c_{i,j}|) - \log(|c_{1,1}|) , $$

where i, j are the working scale and position respectively, and $c_{1,1}$ is the first coefficient of the corresponding Haar representation. Note that the sign representation $s_{i,j}$ of the time series is complementary/orthogonal to the DOL representation. The DOL representation can be conveniently normalised to give the rate of increase of $v^{DOL}$ with scale:

$$ h_{i,j} = \frac{v^{DOL}_{i,j}}{\log(2^{(i)})} \quad \text{for } i > 0 . $$

This representation resembles the Hölder exponent approximation of the time series' local roughness at the particular scale of resolution i, as introduced in [7].
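A small sketch may help make the derived representations concrete. The code below is our own illustration (names such as sign_repr and dol_repr are ours, and it assumes Haar coefficients stored in a dict keyed by (scale, position), e.g. as produced by a decomposition like the earlier sketch); it is not the authors' implementation.

```python
import math

def sign_repr(coeffs):
    """Sign representation: s[i,j] = sgn(c[i,j]), with sgn(x) = 1 for x >= 0 and -1 otherwise."""
    return {(i, j): (1 if c >= 0 else -1) for (i, j), c in coeffs.items()}

def dol_repr(coeffs, eps=1e-12):
    """Difference-of-logarithms representation relative to the first coefficient c[1,1].

    eps guards against log(0); how zero coefficients are handled is our assumption.
    """
    ref = math.log(abs(coeffs[(1, 1)]) + eps)
    return {(i, j): math.log(abs(c) + eps) - ref for (i, j), c in coeffs.items()}

def hoelder_repr(coeffs, eps=1e-12):
    """DOL normalised by log(2**i): a Hoelder-exponent-like local roughness estimate (i > 0)."""
    dol = dol_repr(coeffs, eps)
    return {(i, j): v / math.log(2 ** i) for (i, j), v in dol.items() if i > 0}
```

For comparison purposes one would keep only the coefficients between the scales s_max and s_min, exactly as in the Haar(f) set above.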
4 Distance Evaluation with Haar Representations

The measure of the correlation between the components $c^f_{i,j}$ and $c^g_{k,l}$ of two respective time series f and g can be put as:

$$ C(f,g) = \sum_{\{i,j,k,l\}=0}^{m,n} w_i\, c^f_{i,j}\; w_k\, c^g_{k,l}\; \delta_{i,j;k,l} , $$

where the indicator $\delta_{i,j;k,l} = 1$ iff $i = k$ and $j = l$,
and the (optional) weights $w_i$ and $w_k$ depend on their respective scales i and k. In our experience the orthogonality of the coefficients is best employed without weighting. Normalisation is necessary in order to arrive at a correlation product between [0, 1] and will simply take the form of

$$ C_{\mathrm{normalised}}(f,g) = \frac{C(f,g)}{\sqrt{C(f,f)\, C(g,g)}} . $$
The distance of two representations can be easily obtained as
$$ \mathrm{Distance}(f,g) = -\log\!\left(|C_{\mathrm{normalised}}(f,g)|\right) . $$
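As an illustration only (our sketch, not the authors' code), the correlation product, its normalisation and the resulting distance can be computed from two coefficient dicts as follows; unit weights are used, following the remark above that the representation works best without weighting.

```python
import math

def correlation(cf, cg):
    """Unweighted correlation product: sum of products over coefficients with matching (scale, position)."""
    common = set(cf) & set(cg)
    return sum(cf[key] * cg[key] for key in common)

def normalised_correlation(cf, cg):
    denom = math.sqrt(correlation(cf, cf) * correlation(cg, cg))
    return correlation(cf, cg) / denom if denom > 0 else 0.0

def distance(cf, cg, eps=1e-12):
    """Distance(f, g) = -log |C_normalised(f, g)|; eps avoids log(0) for uncorrelated pairs (our choice)."""
    return -math.log(abs(normalised_correlation(cf, cg)) + eps)
```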
Fig. 3. Top plot contains the input signal (a Brownian walk sample). The top colour (gray-scale) panel contains the Haar decomposition with six scale levels from i = 1 to i = 6; the smoothed component is not shown. The colour (gray shade) encodes the value of the decomposition from dark blue (white) for −1 to dark red (black) for 1. The centre panel shows the sign of the decomposition coefficients, i.e. dark blue (white) for $c_{i,j} \ge 0$ and dark red (black) for $c_{i,j} < 0$. The bottom colour (gray-scale) panel contains the Hölder decomposition with five scale levels i = 2 ... 6.
5 Incremental Calculation of the Decomposition Coefficients and the Correlation Product

One of the severe disadvantages of the Haar WT is its lack of translation invariance; when the input signal shifts by $\Delta t$ (e.g. as the result of acquiring some additional input samples), the coefficients of the Haar wavelet transform need to be recalculated. This is rather impractical when one considers systematically updated inputs like financial records.
When the representation is to be updated on each new sample, little can be done other than to recalculate the coefficients. The cost of this resides mainly in the cost of calculating the inner product. Direct calculation is of nm complexity, where $n = 2^N$ is the length of the time series and m is the length of the wavelet. The cost of calculating the inner product therefore grows quickly with the length of the wavelet, and for the largest scale it is $n^2$. The standard way to deal with this problem is to use the Fast Fourier Transform for calculating the inner product of two time series, which in the case of equal length reduces the complexity to $n \log(n)$.
Additional savings can be obtained if the update of the WT does not have to be performed on every new input sample, but can be done periodically, on every n new samples (corresponding to some time period $\Delta t$). In this case, when $\Delta t$ coincides with the working scale of the wavelet at a given resolution, a particular situation arises:
– only the coefficients at scales larger than the $\Delta t$ scale have to be recalculated;
– coefficients of $f|_{x_0}^{x_0+\Delta t}$ must be calculated anew;
– other coefficients have to be re-indexed or removed.
This is also illustrated in figure 4.
Fig. 4. Representation update scheme in the case of the shift of the input time series by $\Delta t$ = working scale of the wavelet.
As expected, the larger the time shift $\Delta t$, the fewer the coefficients that have to be recalculated and the larger the number of coefficients that have to be re-indexed (plus, of course, the number of coefficients that have to be calculated from $f|_{x_0}^{x_0+\Delta t}$). For the full details of the incremental calculation of coefficients the reader may wish to consult [8].
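The bookkeeping described above can be sketched as follows. This is our own, simplified illustration, not the incremental algorithm of [8]; it assumes coefficients stored in a dict keyed by (level, position), dyadic levels with small indices meaning coarse scales, and a shift equal to the working scale of level j.

```python
def shift_representation(coeffs, finest_level, j, recompute):
    """Update Haar coefficients after the window advances by the working scale of level j.

    coeffs: {(level, pos): value}; level i holds 2**i coefficients, pos = 0 .. 2**i - 1
    (our indexing). recompute(level, pos) is a user-supplied callback that recomputes a
    coefficient from the currently buffered data.
    """
    updated = {}
    for (i, pos), value in coeffs.items():
        if i < j:
            # coarser scales than the shift: their support straddles old and new data,
            # so they have to be recalculated
            updated[(i, pos)] = recompute(i, pos)
        else:
            # finer (or equal) scales: the shift is an integer number of positions
            shift = 2 ** (i - j)
            new_pos = pos - shift
            if new_pos >= 0:
                updated[(i, new_pos)] = value      # re-index surviving coefficients
            # coefficients falling off the left edge of the window are removed
    # coefficients over the newly acquired segment are calculated anew
    for i in range(j, finest_level + 1):
        for pos in range(2 ** i - 2 ** (i - j), 2 ** i):
            updated[(i, pos)] = recompute(i, pos)
    return updated
```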
6
Experimental Results
We took the records of the exchange rates with respect to USD over the period 01/06/73 - 21/05/87. The data set contains daily records of the exchange rates of five currencies with respect to USD: Pound Sterling, Canadian Dollar, German Mark, Japanese Yen and Swiss Franc. (Some records were missing; we used the last known value to interpolate missing values.) Below, in figure 5, we show the plots of the records.
Fig. 5. Left above, all the records of the exchange rate used, with respect to USD over the period 01/06/73 - 21/05/87. In small inserts, single exchange rates renormalised, from top right to bottom left (clockwise), Pound Sterling, Canadian Dollar, German Mark, Japanese Yen and Swiss Franc, all with respect to USD.
All three representation types were made for each of the time series: the Haar, sign and Hölder representation. Only six scale levels (64 values) of the representation (five for Hölder, 63 points) were retained. These were next compared for each pair to give the correlation product.
In figure 6, we plot the values of the correlation for each of the pairs compared. The reader can visually compare the Haar representation results with his/her own ‘visual estimate’ of the degree of (anti-)correlation for pairs of plots in figure 5.
Fig. 6. The values of the correlation products for each of the pairs compared, obtained with the Haar representation, the sign representation, and the H¨older representation.
One can verify that the results obtained with the sign representation follow those obtained with the Haar representation but are weaker in their discriminating power (a flatter plot). Also, the Hölder representation is practically independent of the sign representation. In terms of the correlation product, its distance to the sign representation approximately equals the distance of the Haar representation to the sign representation, but with the opposite sign. This confirms the fact that the correlation in the Hölder exponent captures the value oriented, sign independent features (roughness exponent) of the time series.
7 Conclusions

We have demonstrated that the Haar representation and a number of related representations derived from it are suitable for providing estimates of similarity between time series in a hierarchical fashion. In particular, the correlation obtained with the local slope of the time series (or its sign) in the sequence of multi-resolution steps closely corresponds to the subjective feeling of similarity between the example financial time series. Larger scale experiments with one of the major Dutch banks confirm these findings. The next step is the design and development of a module which will compute and update these representations for the 2.5 million time series which this bank maintains. Once this module is running, mining on the database of time series representations will be the next step.
References

1. R. Agrawal, C. Faloutsos, A. Swami, Efficient Similarity Search in Sequence Databases, in Proc. of the Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, (1993).
2. R. Agrawal, K-I. Lin, H.S. Sawhney, K. Shim, Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time Series Databases, in Proceedings of the 21 VLDB Conference, Zürich, (1995).
3. G. Das, D. Gunopulos, H. Mannila, Finding Similar Time Series, in Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence 1263, Springer, (1997).
4. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Eds., Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, (1996).
5. C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast Subsequence Matching in Time-Series Databases, in Proc. ACM SIGMOD Int. Conf. on Management of Data, (1994).
6. Z.R. Struzik, A. Siebes, Wavelet Transform in Similarity Paradigm I, CWI Report, INS-R9802, (1998); also in Research and Development in Knowledge Discovery and Data Mining, Xindong Wu, Ramamohanarao Kotagiri, Kevin B. Korb, Eds., Lecture Notes in Artificial Intelligence 1394, 295-309, Springer (1998).
7. Z.R. Struzik, A. Siebes, Wavelet Transform in Similarity Paradigm II, CWI Report, INS-R9815, CWI, Amsterdam (1998); also in Proc. 10th Int. Conf. on Database and Expert System Applications (DEXA'99), Florence, (1999).
8. Z.R. Struzik, A. Siebes, The Haar Wavelet Transform in Similarity Paradigm, CWI Report, INS-R99xx, CWI, Amsterdam (1999). http://www.cwi.nl/htbin/ins1/publications
9. I. Daubechies, Ten Lectures on Wavelets, S.I.A.M. (1992).
10. M. Holschneider, Wavelets - An Analysis Tool, Oxford Science Publications, (1995).
11. Z.R. Struzik, 'Local Effective Hölder Exponent Estimation on the Wavelet Transform Maxima Tree', Fractals: Theory and Applications in Engineering, Michel Dekking, Jacques Lévy Véhel, Evelyne Lutton, Claude Tricot, Eds., Springer (1999).
Rule Discovery in Large Time-Series Medical Databases

Shusaku Tsumoto
Department of Medicine Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo City, Shimane 693-8501, Japan
E-mail: [email protected]
Abstract. Since hospital information systems have been introduced in large hospitals, a large amount of data, including laboratory examinations, has been stored as temporal databases. The characteristics of these temporal databases are: (1) each record is inhomogeneous with respect to time-series, including short-term effects and long-term effects; (2) each record has more than 1000 attributes when a patient is followed for more than one year; (3) when a patient is admitted for a long time, a large amount of data is stored in a very short term. Even medical experts cannot deal with these large databases, and the interest in mining useful information from the data is growing. In this paper, we introduce a combination of an extended moving average method and a rule induction method, called CEARI, to discover new knowledge in temporal databases. This CEARI was applied to a medical dataset on Motor Neuron Diseases, the results of which show that interesting knowledge is discovered from each database.
1
Introduction
Since hospital information systems have been introduced in large hospitals, a large amount of data, including laboratory examinations, has been stored as temporal databases[11]. For example, in a university hospital, where more than 1000 patients visit from Monday to Friday, a database system stores more than 1 GB of numerical data of laboratory examinations. Thus, it is highly expected that data mining methods will find interesting patterns from databases, because medical experts cannot deal with such a large amount of data. The characteristics of these temporal databases are: (1) each record is inhomogeneous with respect to time-series, including short-term effects and long-term effects; (2) each record has more than 1000 attributes when a patient is followed for more than one year; (3) when a patient is admitted for a long time, a large amount of data is stored in a very short term. Even medical experts cannot deal with these large temporal databases, and the interest in mining useful information from the data is growing. In this paper, we introduce a combination of an extended moving average method and a rule induction method, called CEARI, to discover new knowledge in temporal databases. In the system, the extended moving average method is used
for preprocessing, to deal with the irregularity of the temporal data. Using several parameters for time-scaling, given by users, this moving average method generates a new database for each time scale with summarized attributes. Then, a rule induction method is applied to each new database with summarized attributes. This CEARI was applied to two medical datasets, the results of which show that interesting knowledge is discovered from each database.
2
Temporal Databases in Hospital Information Systems
Since incorporating temporal aspects into databases is still an ongoing research issue in the database area [1], temporal data are stored as a table in hospital information systems (H.I.S.). Table 1 shows a typical example of medical data retrieved from an H.I.S. The first column denotes the ID number of each patient, and the second one denotes the date when the data in this row were examined. Each row with the same ID number describes the results of laboratory examinations, which were taken on the date in the second column. For example, the second row shows the data of the patient ID 1 on 04/19/1986. This simple database shows the following characteristics of medical temporal databases:
(1) The number of attributes is too large. Even when the dataset of a patient focuses on the transition of a single examination (attribute), it is difficult to see its trend when the patient is followed for a long time. If one wants to see long-term interactions between attributes, it becomes almost impossible. In order to solve this problem, most H.I.S. systems provide several graphical interfaces to capture temporal trends [11]. However, the interactions among more than three attributes are difficult to study even if visualization interfaces are used.
(2) Irregularity of temporal intervals. Temporal intervals are irregular. Although most patients come to the hospital every two weeks or every month, physicians may not make laboratory tests each time. When a patient has an acute fit or suffers from an acute disease, such as pneumonia, laboratory examinations will be made every one to three days. On the other hand, when his/her status is stable, these tests may not be made for a long time. Patient ID 1 is a typical example. Between 04/30 and 05/08/1986, he suffered from pneumonia and was admitted to a hospital. During the therapeutic procedure, laboratory tests were made every few days. On the other hand, when he was stable, such tests were ordered every one or two years.
(3) Missing values. In addition to the irregularity of temporal intervals, the datasets have many missing values. Even though medical experts will make laboratory examinations, they may not take the same tests at each instant. Patient ID 1 in Table 1 is a typical example. On 05/06/1986, the physician selected a specific test to confirm his diagnosis, so he did not choose other tests. On 01/09/1989, he focused only on GOT, not other tests. In this way, missing values are observed very often in clinical situations.
These characteristics have already been discussed in the KDD area [5]. However, in real-world domains, especially domains in which follow-up studies are crucial, such as medical domains, these ill-posed situations are especially prominent.
Table 1. An Example of Temporal Database

ID  Date      GOT GPT LDH  γ-GTP TP   edema ...
1   19860419  24  12  152  63    7.5        ...
1   19860430  25  12  162  76    7.9  +     ...
1   19860502  22  8   144  68    7.0  +     ...
1   19860506                                ...
1   19860508  22  13  156  66    7.6        ...
1   19880826  23  17  142  89    7.7        ...
1   19890109  32                            ...
1   19910304  20  15  369  139   6.9  +     ...
2   19810511  20  15  369  139   6.9        ...
2   19810713  22  14  177  49    7.9        ...
2   19880826  23  17  142  89    7.7        ...
2   19890109  32                            ...
...
If one wants to describe each patient (record) as one row, then each row has too many attributes, the number of which depends on how many times laboratory examinations were made for the patient. It is notable that although the above discussion is made according to the medical situation, similar situations may occur in other domains with long-term follow-up studies.
3
Extended Moving Average Methods
3.1
Moving Average Methods
Averaging mean methods have been introduced in statistical analysis [6]. Temporal data often suffer from noise, which is observed as a spike or sharp wave during a very short period, typically at one instant. Averaging mean methods remove such an incidental effect and make temporal sequences smoother. With one parameter w, called the window, the moving average $\hat{y}_w$ is defined as follows:

$$ \hat{y}_w = \frac{1}{w} \sum_{j=1}^{w} y_j . $$
For example, in the case of GOT of patient ID 1, $\hat{y}_5$ is calculated as: $\hat{y}_5 = (24 + 25 + 22 + 22 + 22)/5 = 23.0$. It is easy to see that $\hat{y}_w$ will remove a noise effect which continues for less than w points.
The advantage of the moving average method is that it enables us to remove the noise effect when inputs are given periodically [6]. For example, when some tests are measured every several days¹, the moving average method is useful to remove the noise and to extract periodical domains.
¹ This condition guarantees that measurement is approximately continuous.
However, in real-world domains,
inputs are not always periodical, as shown in Table 1. Thus, when the time-series to which the method is applied are irregular or discrete, ordinary moving average methods are powerless. Another disadvantage of this method is that it is not applicable to categorical attributes. In the case of numerical attributes, the average can be used as a summarized statistic; such an average cannot be defined for categorical attributes. Thus, we introduce the extended averaging method to solve these two problems in the subsequent subsections.

3.2 Extended Moving Average for Continuous Attributes
In this extension, we first focus on how moving average methods remove noise. The key idea is that the window parameter w is closely related to periodicity. If w is larger, then the periodical behavior whose time-constant is lower than w will be removed. Usually, a spike caused by noise is observed as a single event, and this effect will be removed when w is taken as a large value. Thus, the choice of w separates different kinds of time-constant behavior in each attribute, and in the extreme case when w is equal to the total number of temporal events, all the temporal behavior will be removed. We refer to this extreme case as w = ∞.
The extended moving average method is executed as follows: first, calculate $\hat{y}_\infty$ for an attribute y. Second, the method outputs its maximum and minimum values. Then, according to the selected values for w, a set of sequences $\{y_w(i)\}$ for each record is calculated. For example, if {w} is equal to {10 years, 5 years, 1 year, 3 months, 2 weeks}, then for each element in {w}, the method uses the time-stamp attribute for the calculation of each $\{y_w(i)\}$ in order to deal with irregularities. In the case of Table 1, when w is taken as 1 year, all the rows are aggregated into several components as shown in Table 2. From this aggregation, a sequence $y_w$ for each attribute is calculated as in Table 3.

Table 2. Aggregation for w = 1 (year)

ID  Date      GOT GPT LDH  γ-GTP TP   edema ...
1   19860419  24  12  152  63    7.5        ...
1   19860430  25  12  162  76    7.9  +     ...
1   19860502  22  8   144  68    7.0  +     ...
1   19860506                                ...
1   19860508  22  13  156  66    7.6        ...
1   19880826  23  17  142  89    7.7        ...
1   19890109  32                            ...
1   19910304  20  15  369  139   6.9  +     ...
Table 3. Moving Average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema ...
1   1       23.25  11.25  153.5  68.25  7.5   ?     ...
1   2       23     17     142    89     7.7   ?     ...
1   3       32                                ?     ...
1   4                                         ?     ...
1   5       20     15     369    139    6.9   ?     ...
1   ∞       24     12.83  187.5  83.5   7.43  ?     ...
...
3.3
Categorical Attributes
One of the disadvantages of the moving average method is that it cannot deal with categorical attributes. To solve this problem, we classify categorical attributes into three types, whose information should be given by users. The first type is constant, which will not change during the follow-up period. The second type is ranking, which is used to rank the status of a patient. The third type is variable, which will change temporally, but for which ranking is not useful. For the first type, the extended moving average method will not be applied. For the second one, an integer will be assigned to each rank and the extended moving average method for continuous attributes is applied. For the third one, the temporal behavior of the attribute is transformed into statistics as follows. First, the occurrence of each category (value) is counted for each window. For example, in Table 2, edema is a binary attribute and variable. In the first window, the attribute edema takes {-, +, +, -}.² So, the occurrences of − and + are 2 and 2, respectively. Then, each conditional probability will be calculated. In the above example, the probabilities are equal to p(−|w1) = 2/4 and p(+|w1) = 2/4. Finally, for each probability, a new attribute is appended to the table (Table 4).
Table 4. Final Table with Moving Average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema(+) edema(-) ...
1   1       23.25  11.25  153.5  68.25  7.5   0.5      0.5      ...
1   2       23     17     142    89     7.7   0.0      1.0      ...
1   3       32                                0.0      1.0      ...
1   4                                         0.0      1.0      ...
1   5       20     15     369    139    6.9   1.0      0.0      ...
1   ∞       24     12.83  187.5  83.5   7.43  0.43     0.57     ...
...
² Missing values are ignored for counting.
Summary of Extended Moving Average. The whole extended moving average process is used to construct a new table for each window parameter as the first preprocessing step. Then, the second preprocessing method will be applied to the newly generated tables. The first preprocessing method is summarized as follows (a small illustrative sketch is given below).
1. Repeat for each w in the list Lw:
   a) Select an attribute in a list La;
      i. If the attribute is numerical, then calculate the moving average for w;
      ii. If the attribute is constant, then break;
      iii. If the attribute is rank, then assign an integer to each ranking and calculate the moving average for w;
      iv. If the attribute is variable, calculate the frequency of each category;
   b) If La is not empty, go to (a).
   c) Construct a new table with each moving average.
2. Construct a table for w = ∞.
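The following Python sketch illustrates this first preprocessing step under our own simplifying assumptions (records given as (date, {attribute: value}) tuples, windows expressed in days, and attribute types supplied by the user); it is not the CEARI implementation.

```python
from collections import defaultdict

def extended_moving_average(records, window_days, attr_types):
    """First preprocessing: aggregate irregular records into fixed windows.

    records: list of (date, values) sorted by date, values = {attr: value or None}.
    window_days: window length w in days (use float('inf') for the w = infinity table).
    attr_types: {attr: 'numerical' | 'constant' | 'rank' | 'variable'}.
    Returns one summarized row per window: averages for numerical/rank attributes,
    category frequencies for variable attributes (missing values are ignored).
    """
    start = records[0][0]
    buckets = defaultdict(list)
    for date, values in records:
        idx = 0 if window_days == float('inf') else (date - start).days // window_days
        buckets[idx].append(values)
    table = []
    for idx in sorted(buckets):
        row = {'period': idx + 1}
        for attr, kind in attr_types.items():
            observed = [v[attr] for v in buckets[idx] if v.get(attr) is not None]
            if kind == 'constant' or not observed:
                continue
            if kind in ('numerical', 'rank'):   # ranks are assumed already mapped to integers
                row[attr] = sum(observed) / len(observed)
            else:                               # 'variable': conditional probabilities per category
                for cat in set(observed):
                    row[f'{attr}({cat})'] = observed.count(cat) / len(observed)
        table.append(row)
    return table
```

For the medical example one would call this once per element of {10 years, 5 years, 1 year, 3 months, 2 weeks}, plus once with w = ∞, producing one summarized table per window as in Tables 3 and 4.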
4 Second Preprocessing and Rule Discovery

4.1 Summarizing Temporal Sequences
From the data table obtained after applying the extended moving average methods, several preprocessing methods may be applied in order for users to detect the temporal trends in each attribute. One way is discretization of time-series by clustering, introduced by Das [4]. This method transforms time-series into symbols representing qualitative trends by using a similarity measure; time-series data are then represented as a symbolic sequence. After this preprocessing, a rule discovery method is applied to the sequential data. Another way is to find auto-regression equations from the sequence of moving averages. These quantitative equations can then be used directly to extract knowledge, or their qualitative interpretation may be used, and rule discovery [3], other machine learning methods [7], or the rough set method [9] can be applied to extract qualitative knowledge. In this research, we adopt two modes and transform the databases into two forms: one mode applies a temporal abstraction method [8] as second preprocessing and transforms all continuous attributes into temporal sequences; the other mode applies rule discovery to the data after the first preprocessing, without the second one. The reason why we adopted these two modes is that we focus not only on the temporal behavior of each attribute, but also on the associations among several attributes. Although Miksch's method [8] and Das's approach [4] are very efficient for extracting knowledge about transitions, they cannot focus on associations between attributes in an efficient way. For the latter purpose, a much simpler rule discovery algorithm is preferred.
4.2 Continuous Attributes and Qualitative Trend
To characterize the deviation and temporal change of continuous attributes, we introduce standardization of continuous attributes. For this, we only need the
total average $\hat{y}_\infty$ and its standard deviation $\sigma_\infty$. With these parameters, the standardized value is obtained as:

$$ z_w = \frac{y_w - \hat{y}_\infty}{\sigma_\infty} . $$

The reason why standardization is introduced is that it makes comparison between continuous attributes much easier and clearer; in particular, statistical theory guarantees that the coefficients of an auto-regression equation can be compared with those of another equation [6]. After the standardization, an extraction algorithm for qualitative trends is applied [8]. This method proceeds as follows. First, it uses data smoothing with window parameters. Secondly, the smoothed values for each attribute are classified into seven categories given as domain knowledge about laboratory test values: extremely low, substantially low, slightly low, normal range, slightly high, substantially high, and extremely high. With these categories, qualitative trends are calculated and classified into the following categories by using guideline rules: decrease too fast (A1), normal decrease (A2), decrease too slow (A3), zero change (ZA), dangerous increase (C), increase too fast (B1), normal increase (B2), increase too slow (B3), dangerous decrease (D). For example, if the value of some laboratory test changes from substantially high to the normal range within a very short time, the qualitative trend will be classified into A1 (decrease too fast). For further information, please refer to [8].
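As a rough illustration of this step (our own sketch: the seven value categories and the trend guideline rules are domain knowledge in [8], so the thresholds and the simplified trend mapping below are placeholders, not the published rules):

```python
def standardize(y_w, y_inf_mean, y_inf_std):
    """z_w = (y_w - mean_inf) / std_inf, computed from the w = infinity table."""
    return (y_w - y_inf_mean) / y_inf_std

# Placeholder cut points on the standardized value; the real method uses seven
# laboratory-specific categories supplied as domain knowledge.
VALUE_BINS = [(-3.0, 'extremely low'), (-2.0, 'substantially low'), (-1.0, 'slightly low'),
              (1.0, 'normal range'), (2.0, 'slightly high'), (3.0, 'substantially high')]

def value_category(z):
    for upper, label in VALUE_BINS:
        if z < upper:
            return label
    return 'extremely high'

def qualitative_trend(z_prev, z_next, fast=1.0):
    """Toy stand-in for the guideline rules mapping consecutive values to trends A1..D."""
    delta = z_next - z_prev
    if abs(delta) < 0.1:
        return 'ZA'                                   # zero change
    if delta < 0:
        return 'A1' if -delta > fast else 'A2'        # decrease too fast / normal decrease
    return 'B1' if delta > fast else 'B2'             # increase too fast / normal increase
```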
4.3 Rule Discovery Algorithm
For rule discovery, a simple rule induction algorithm discussed in [10] is applied, where continuous attributes are transformed into categorical attributes with a cut-off point. As discussed in Section 3, the moving average method removes temporal effects shorter than the window parameter. Thus, w = ∞ removes all temporal effects, so this moving average can be viewed as data without any temporal characteristics. If rule discovery is applied to these data, it will generate rules which represent non-temporal associations between attributes. In this way, data obtained with the w-moving average are used to discover associations with a time-effect of w or longer. Ideally, from w = ∞ down to w = 1, we decompose all the independent time-effect associations between attributes. However, the time-constants in which users are interested will be limited, and the moving average method shown in Section 3 uses a set of w given by users. Thus, application of rule discovery to each table will generate a sequence of temporal associations between attributes. If some temporal associations differ from the associations found with w = ∞, then these specific relations may be related to a new discovery.
4.4 Summary of Second Preprocessing and Rule Discovery
The second preprocessing method and rule discovery are summarized as follows.
1. Calculate $\hat{y}_\infty$ and $\sigma_\infty$ from the table of w = ∞;
2. Repeat for each w in the list Lw (w is sorted in descending order):
   a) Select the table of w: Tw;
      i. Standardize continuous and ranking attributes;
      ii. Calculate qualitative trends for continuous and ranking attributes;
      iii. Construct a new table for the qualitative trends;
      iv. Apply the rule discovery method for temporal sequences;
   b) Apply rule induction methods to the original table Tw;
5
Experimental Results
The above rule discovery system is implemented in CEARI (Combination of Extended moving Average and Rule Induction). CEARI was applied to a clinical database on motor neuron diseases, which consists of 1477 samples and 3 classes. Each patient is followed for 15 years. The list of windows {w} was set to {10 years, 5 years, 1 year, 3 months, 2 weeks}, and the thresholds δP(D|R) and δP(R|D) were set to 0.60 and 0.30, respectively.
One of the most interesting problems of motor neuron diseases (MND) is how long it takes each patient to suffer from respiratory failure, which is the main cause of death.³ It is empirically known that some types of MND are more progressive than other types and that their survival period is much shorter than that of the others. The database for this analysis describes all the data of patients suffering from MND.
Non-temporal Knowledge. The most interesting discovered rules are:
[Major Pectolis < 3] → [PaCO2 > 50] (P(D|R): 0.87, P(R|D): 0.57),
[Minor Pectolis < 3] → [PaO2 < 61] (P(D|R): 0.877, P(R|D): 0.65).
Both rules mean that if some of the muscles of the chest, called Major Pectolis and Minor Pectolis, are weak, then respiratory function is low, which suggests that the muscle power of the chest is closely related to respiratory function, although these muscles are not directly used for respiration.
Short-Term Effect. Several interesting rules are discovered:
[Major Pectolis = 2] → [PaO2 : D] (P(D|R): 0.72, P(R|D): 0.53, w = 3 months),
[Biceps < 3] → [PaO2 : A2] (P(D|R): 0.82, P(R|D): 0.62, w = 3 months),
[Biceps > 4] → [PaO2 : ZA] (P(D|R): 0.88, P(R|D): 0.72, w = 3 months).
³ The prognosis of MND is generally not good, and most of the patients will die within ten years because of respiratory failure. The only way to survive is to use an automatic ventilator [2].
These rules suggest that if the power of the muscles around the chest is low, then respiratory function will decrease within one year, and that if the power of the muscles in the arms is low, then respiratory function will decrease within a few years.
Long-Term Effect. The following interesting rules are discovered:
[Major Pectolis : A3] ∧ [Quadriceps : A3] → [PaO2 : A3] (P(D|R): 0.85, P(R|D): 0.53, w = 1 year),
[Gastro : A3] → [PaO2 : A3] (P(D|R): 0.87, P(R|D): 0.52, w = 1 year).
These rules suggest that if the power of the muscles of the legs changes very slowly, then respiratory function will also decrease very slowly. In summary, the system discovers that the power of the muscles around the chest and its chronological characteristics are very important for predicting respiratory function and how long it takes a patient to reach respiratory failure.
References

1. Abiteboul, S., Hull, R., and Vianu, V. Foundations of Databases, Addison-Wesley, New York, 1995.
2. Adams, R.D. and Victor, M. Principles of Neurology, 5th edition, McGraw-Hill, NY, 1993.
3. Agrawal, R., Imielinski, T., and Swami, A. Mining association rules between sets of items in large databases, in Proceedings of the 1993 International Conference on Management of Data (SIGMOD 93), pp. 207-216, 1993.
4. Das, G., Lin, K.I., Mannila, H., Renganathan, G. and Smyth, P. Rule discovery from time series. In: Proceedings of Fourth International Conference on Knowledge Discovery and Data Mining, pp. 16-22, 1998.
5. Fayyad, U.M., et al. (eds.). Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.
6. Hamilton, J.D. Time Series Analysis, Princeton University Press, 1994.
7. Langley, P. Elements of Machine Learning, Morgan Kaufmann, CA, 1996.
8. Miksch, S., Horn, W., Popow, C., and Paky, F. Utilizing temporal data abstraction for data validation and therapy planning for artificially ventilated newborn infants. Artificial Intelligence in Medicine, 8, 543-576, 1996.
9. Tsumoto, S. and Tanaka, H. PRIMEROSE: Probabilistic Rule Induction Method based on Rough Sets and Resampling Methods. Computational Intelligence, 11, 389-405, 1995.
10. Tsumoto, S. Knowledge Discovery in Medical MultiDatabases: A Rough Set Approach, Proceedings of PKDD99 (in this issue), 1999.
11. Van Bemmel, J. and Musen, M. A. Handbook of Medical Informatics, Springer-Verlag, New York, 1997.
Simultaneous Prediction of Multiple Chemical Parameters of River Water Quality with TILDE

Hendrik Blockeel¹, Sašo Džeroski², and Jasna Grbović³

¹ Katholieke Universiteit Leuven, Dept. of Computer Science, Celestijnenlaan 200A, B-3001 Heverlee, Belgium, [email protected]
² Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia, [email protected]
³ Hydrometeorological Institute, Vojkova 1b, SI-1000 Ljubljana, Slovenia, [email protected]
Abstract. Environmental studies form an increasingly popular application domain for machine learning and data mining techniques. In this paper we consider two applications of decision tree learning in the domain of river water quality: a) the simultaneous prediction of multiple physico-chemical properties of the water from its biological properties using a single decision tree (as opposed to learning a different tree for each different property) and b) the prediction of past physico-chemical properties of the river water from its current biological properties. We discuss some experimental results that we believe are interesting both to the application domain experts and to the machine learning community.
1
Introduction
The quality of surface waters, including rivers, depends on their physical, chemical and biological properties. The latter are reflected by the types and densities of living organisms present in the water. Based on the above properties, surface waters are classified into several quality classes which indicate the suitability of the water for different kinds of use (drinking, swimming, . . . ). It is well known that the physico-chemical properties give a limited picture of water quality at a particular point in time, while living organisms act as continuous monitors of water quality over a period of time [6]. This has increased the relative importance of biological methods for monitoring water quality, and many different methods for mapping biological data to discrete quality classes or continuous scales have been developed [7]. Most of these approaches use indicator organisms (bioindicator taxa), which have well known ecological requirements and are selected for their sensitivity / tolerance to various kinds of pollution. Given a biological sample, information on the presence and density of all indicator organisms present in the sample is usually combined to derive a biological index that reflects the quality of the water at the site where the sample was taken. Examples are the Saprobic Index [14], used in many countries of Central Europe, and the Biological Monitoring Working Party Score (BMWP) [13] and its derivative Average Score Per Taxon (ASPT), used in the United Kingdom.
The main problem with the biological indices described above is their subjectivity [18]. The computation of these indices makes use of weights and other numbers that were assigned to individual bioindicators by (committees of) expert biologists and ecologists and are based on the experts’ knowledge about the ecological requirements of the bioindicators, which is not always complete. The assigned bioindicator values are thus subjective and often inappropriate [19]. An additional layer of subjectivity is added by combining the scores of the individual bioindicators through ad-hoc procedures based on sums, averages, and weighted averages instead of using a sound method of combination. While a certain amount of subjectivity cannot be avoided (water quality itself is a subjective measure, tuned towards the interests humans have in river water), this subjectivity should only appear at the target level (classification) and not at the intermediate levels described above. This may be achieved by gaining insight into the relationships between biological, physical and chemical properties of the water and its overall quality, which is currently a largely open research topic. To this aim data mining techniques can be employed [18,11,9]. The importance of gaining such insight stretches beyond water quality prediction. E.g., the problem of inferring chemical parameters from biological ones is practically relevant, especially in countries where extensive biological monitoring is conducted. Regular monitoring for a very wide range of chemical pollutants would be very expensive, if not impossible. On the other hand, biological samples may, for example, reflect an increase in pollution and indicate likely causes or sources of (chemical) pollution. The work described in this paper is situated at this more general level. The remainder of the paper is organized as follows. Section 2 describes the goals of this study and the difference with earlier work. Section 3 describes the available data and the experimental setup. Section 4 describes the machine learning tool that was used in these experiments. Section 5 presents in detail the experiments and their results and in Sect. 6 we conclude.
2
Goals of This Study
In earlier work [10,11] machine learning techniques have been applied to the task of inferring biological parameters from physico-chemical ones by learning rules that predict the presence of individual bioindicator taxa from the values of physico-chemical measurements, and to the task of inferring physico-chemical parameters from biological ones [9]. Dˇzeroski et al. [9] discuss the construction of predictive models that allow prediction of a specific physico-chemical parameter from biological data. For each parameter a different regression tree is built using Quinlan’s M5 system [17]. A comparison with nearest neighbour and linear regression methods shows that the induction of regression trees is competitive with the other approaches as far as predictive accuracy is concerned, and moreover has the advantage of yielding interpretable theories. A comparison of the different trees shows that the trees for different target variables are often similar, and that some of the taxa occur in many trees (i.e.,
they are sensitive to many physico-chemical properties). This raises the question whether it would be possible to predict many or all of the properties with only one (relatively simple) tree, and without significant loss in predictive accuracy. As such, this application seems a good test case for recent research on simultaneous prediction of multiple variables [1]. A second extension with respect to the previous work is the prediction of past physico-chemical properties of the water; more specifically, the maximal, minimal and average values of these properties over a period of time. As mentioned before, physico-chemical properties of water give a very momentary view of the water quality; watching these properties over a longer period of time may alleviate this problem. This is the second scientific issue we investigate in this paper.
3
The Data
The data set we have used is the same one as used in [9]. The data come from the Hydrometeorological Institute of Slovenia (HMZ) that performs water quality monitoring for Slovenian rivers and maintains a database of water quality samples. The data cover a six year period (1990–1995). Biological samples are taken twice a year, once in summer and once in winter, while physical and chemical samples are taken more often (periods between measurements varying from one to several months) for each sampling site. The physical and chemical samples include the measured values of 16 different parameters: biological oxygen demand (BOD), electrical conductivity, chemical oxygen demand (K2 Cr2 O7 and KMnO4 ), concentrations of Cl, CO2 , NH4 , PO4 , SiO2 , NO2 , NO3 and dissolved oxygen (O2 ), alkalinity (pH), oxygen saturation, water temperature, and total hardness. The biological samples include a list of all taxa present at the sampling site and their density. The frequency of occurrence (density) of each present taxon is recorded by an expert biologist at three different qualitative levels: 1=incidentally, 3=frequently and 5=abundantly. Our data are stored in a relational database represented in Prolog; in Prolog terminology each relation is a predicate and each tuple is a fact. The following predicates are relevant for this text: – chem(Site, Year, Month, Day, ListOf16Values) : this predicate contains all physico-chemical measurements. It consists of 2580 facts. – bio(Site, Day, Month, Year, ListOfTaxa): this predicate lists the taxa that occur in a biological sample; ListOfTaxa is a list of couples (taxon, abundancelevel) where the abundance level is 1, 3 or 5 (taxa that do not occur are simply left out of the list). This predicate contains 1106 facts. Overall the data set is quite clean, but not perfectly so. 14 physico-chemical measurements have missing values; moreover, although biological measurements are usually taken on exactly the same day as some physico-chemical measurement, for 43 biological measurements no physico-chemical data for the same day are available. Since this data pollution is very limited, we have just disregarded the examples with missing values in our experiments. This leaves a total of 1060
water samples for which complete biological and physico-chemical information is available; our experiments are conducted on this set.
4
Predictive Clustering and TILDE
Building a model for simultaneous prediction of many variables is strongly related to clustering. Indeed, clustering systems are often evaluated by measuring the average predictability of attributes, i.e., how well the attributes of an object can be predicted given that it belongs to a certain cluster (see, e.g., [12]). In our context, the predictive modelling can then be seen as clustering the training examples into clusters with small intra-cluster variance, where this variance is measured as the sum of the variances of the individual variables that are to be predicted, or equivalently: as the mean squared euclidean distance of the instances to their mean in the prediction space. More formally: given a cluster C consisting of n examples $e_i$ that are each labelled with a target vector $x_i \in \mathbb{R}^D$, the intra-cluster variance of C is defined as

$$ \sigma_C^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})'(x_i - \bar{x}) , \qquad (1) $$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. (We assume the target vector to have only numerical components here, as is the case in our application; in general however predictive clustering can also be used for nominal targets (i.e., classification), see [1].)
In our experiments we used the decision tree learner TILDE [2,3]. TILDE is an ILP system¹ that induces first-order logical decision trees (FOLDTs). Such trees are the first-order equivalent of classical decision trees [2]. TILDE can induce classification trees, regression trees and clustering trees and can handle both attribute-value data and structural data. It uses the basic TDIDT algorithm [16], in its clustering or regression mode employing as heuristic the variance as described above. The system is fit for our experiments for the following reasons:
– Most machine learning and data mining systems that induce predictive models can handle only single target variables (e.g., C4.5 [15], CART [5], M5 [17], . . . ). Building a predictive model for a multi-dimensional prediction space can be done using clustering systems, but most clustering systems consider clustering as a descriptive technique, where evaluation criteria are still slightly different from the ones we have here. (Using terminology from [12], descriptive systems try to maximise both predictiveness and predictability of attributes, whereas predictive systems maximise predictability of the attributes belonging to the prediction space.)
¹ Inductive logic programming (ILP) is a subfield of machine learning where first order logic is used to represent data and hypotheses. First order logic is more expressive than the attribute-value representations that are classically used by machine learning and data mining systems. From a relational database point of view, ILP corresponds to learning patterns that extend over multiple relations, whereas classical (propositional) methods can find only patterns that link values within the same tuple of a single relation to one another. We refer to [8] for details.
– Although the problem at hand is not, strictly speaking, an ILP problem (i.e., it can be transformed into attribute-value format; the number of different attributes would become large but not unmanageable for an attribute-value learner), the use of an ILP learner has several advantages:
  – No data preprocessing is needed: the data can be kept in their original, multi-relational format. This was especially advantageous for us because the experiments described here are part of a broader range of experiments, many of which would demand different and extensive preprocessing steps.
  – Prolog offers the same querying capabilities as relational databases, which allows for non-trivial inspection of the data (e.g., counting the number of times a biological measurement is accompanied by at least 3 physico-chemical measurements during the last 2 months, . . . ).
The main disadvantage of ILP systems, compared to attribute-value learners, is their low efficiency. For our experiments however this inefficiency was not prohibitive and amply compensated by the additional flexibility ILP offers.
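To make the variance heuristic of eq. (1) concrete, here is a small illustrative sketch (ours, not TILDE): it scores a candidate test by the weighted intra-cluster variance of the two subsets it induces over a multi-dimensional target space, which is the quantity a predictive clustering tree minimises at each split.

```python
def intra_cluster_variance(targets):
    """Mean squared euclidean distance of target vectors to their mean (eq. 1)."""
    n = len(targets)
    dim = len(targets[0])
    mean = [sum(t[d] for t in targets) / n for d in range(dim)]
    return sum(sum((t[d] - mean[d]) ** 2 for d in range(dim)) for t in targets) / n

def split_score(targets, test):
    """Weighted variance after splitting examples on a boolean test; lower is better."""
    left = [t for t, ok in zip(targets, test) if ok]
    right = [t for t, ok in zip(targets, test) if not ok]
    if not left or not right:
        return float('inf')
    n = len(targets)
    return (len(left) * intra_cluster_variance(left)
            + len(right) * intra_cluster_variance(right)) / n
```

A clustering tree learner would evaluate split_score for every candidate test and recurse on each subset, stopping when a minimal-coverage constraint (such as the 20 instances per leaf used in the experiments below) is reached.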
5
Experiments
TILDE was consistently run with default parameters, except for one parameter controlling the minimal number of instances in each leaf, which was set to 20. From preliminary experiments this value is known to combine high accuracy with reasonable tree size. All results are obtained using 10-fold cross-validations.

5.1 Multi-valued Predictions
For this experiment we have run TILDE with two settings: predicting a single variable at a time (the results of which serve as a reference for the other setting), and predicting all variables simultaneously. When predicting all variables at once, the variables were first standardised ($z_x = (x - \mu_x)/\sigma_x$ with $\mu_x$ the mean and $\sigma_x$ the standard deviation); this ensures that all target variables will be considered equally important for the prediction.² As a bonus the results are more interpretable for non-experts; e.g., "BOD=16.0" may not tell a non-expert much, but a standardised score of +1 always means "relatively high".
The predictive quality of the tree for each single variable is measured as the correlation of the predictions with the actual values. Table 1 shows these correlations; correlations previously obtained with M5.1 [9] are given as reference. It is clear from the table that overall, the multi-prediction tree performs approximately as well as the set of 16 single trees. For a few variables there is a clear decrease in predictive performance (T, NO2, NO3), but surprisingly this effect is compensated for by a gain in accuracy for other variables (conductivity, CO2,
² Since the system minimises total variance, i.e. the sum of the variances of each single variable, the "weight" of a single variable is proportional to its variance; standardisation gives all variables an equal variance of 1.
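The evaluation metric used in Table 1 (per-variable correlation between predicted and actual values) is straightforward to compute; the helper below is our own illustration, not code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def per_variable_correlation(y_true, y_pred, names):
    """One correlation r per target variable.

    y_true, y_pred: lists of equally long target vectors (one vector per example).
    """
    return {name: pearson([t[d] for t in y_true], [p[d] for p in y_pred])
            for d, name in enumerate(names)}
```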
Table 1. Comparison of predictive quality of a single tree predicting all variables at once with that of a set of 16 different trees, each predicting one variable.

variable   TILDE, all variables (r)  TILDE, single variable (r)  M5.1, single variable (r)
T          0.482                     0.563                       0.561
pH         0.353                     0.356                       0.397
conduct.   0.538                     0.464                       0.539
O2         0.513                     0.523                       0.484
O2-sat.    0.459                     0.460                       0.424
CO2        0.407                     0.335                       0.405
hardness   0.496                     0.475                       0.475
NO2        0.330                     0.417                       0.373
NO3        0.265                     0.349                       0.352
NH4        0.500                     0.489                       0.664
PO4        0.441                     0.445                       0.461
Cl         0.603                     0.602                       0.570
SiO2       0.369                     0.400                       0.411
KMnO4      0.509                     0.435                       0.546
K2Cr2O7    0.561                     0.514                       0.602
BOD        0.640                     0.605                       0.652
avg        0.467                     0.465                       0.498
Fig. 1. An example of a clustering tree. [Figure: the first levels of the multi-prediction clustering tree. The root tests the abundance of Chironomus thummi; further internal nodes test Chlorella vulgaris, Sphaerotilus natans, Gammarus fossarum and Ceratoneis arcus. Each leaf lists standardised predictions for all 16 physico-chemical variables.]
KMnO4). A possible explanation for this is that when the variables to be predicted are not independent, they contain mutual information about one another that may help the learner distinguish random fluctuations in a single variable from structural fluctuations. The table also shows that TILDE's performance is slightly worse than that of M5.1 (possibly because of different settings).
Note that because of the constant "minimal coverage" of 20, all trees have approximately equal size (about 35 nodes). Hence, when predicting all 16 variables at once the total theory size is effectively reduced by a factor of 16 without predictive accuracy suffering from this.
Figure 1 shows the first levels of a multi-prediction tree that was induced during the experiment. The tree indicates, e.g., that Chironomus thummi has the greatest overall influence on the physico-chemical properties; its occurrence indicates a low oxygen level, high conductivity, a very high NH4 concentration, etc.
5.2 Predicting Past Values
In this experiment we try to predict the average, maximal and minimal values of physico-chemical parameters over a period of three months before the date when the biological sample was taken. Although three months is a relatively long period (according to our domain expert, 1 to 2 months would be optimal), for this data set we faced the problem that physico-chemical measurements are not always available for each month; in some cases the only measurement available for the last 5 months is taken on the same day as the biological measurement, which means that the minimal, maximal and average value over the period of time are equal to the current value. The problem is quantified in [4]; here we just mention that by using a period of 3 months we ensure that for a reasonably-sized subset of the data set at least 2 or 3 measurements are available.
Results of this experiment are shown in Table 2. This table confirms most of the expert's expectations. For instance, for oxygen it was expected that the minimal oxygen level during a period of time, rather than its average or maximum, is most related to the biological data. Especially for O2-saturation, and to a lesser extent for O2, this is confirmed by the experiment. The expectation that for chemical oxygen demand (KMnO4, K2Cr2O7) the average value would be most important (because this parameter has a cumulative effect) is confirmed, although the minimal value also shows a high correlation, which was not expected.
5.3 Discussion
Both experiments show the potential of decision tree learning for gaining insight in the water quality domain. The first experiment shows that simultaneous prediction of multiple parameters is feasible and increases the potential of decision trees for providing compact, interpretable theories. The second experiment confirms that it is possible to predict past properties of water from its current biological properties; moreover, the results may lead to more insight into the mechanisms through which physico-chemical properties influence biological properties over a longer period of time.
Table 2. Comparison of predictive quality of trees when predicting the current value of a property vs. its minimal, maximal or average value during the last three months.

variable   minimum (r)  maximum (r)  average (r)  current (r)
T          0.444        0.591        0.578        0.563
pH         0.351        0.316        0.355        0.356
conduct.   0.410        0.405        0.443        0.464
O2         0.540        0.435        0.514        0.523
O2-sat.    0.522        0.388        0.472        0.460
CO2        0.359        0.401        0.403        0.335
hardness   0.412        0.451        0.497        0.475
NO2        0.236        0.446        0.416        0.417
NO3        0.313        0.359        0.336        0.349
NH4        0.373        0.494        0.475        0.489
PO4        0.271        0.400        0.418        0.445
Cl         0.513        0.311        0.413        0.602
SiO2       0.344        0.432        0.394        0.400
KMnO4      0.524        0.461        0.526        0.435
K2Cr2O7    0.627        0.529        0.697        0.514
BOD        0.609        0.575        0.653        0.605
avg        0.428        0.437        0.474        0.465
6 Conclusions
We have used the decision tree learner TILDE to test two hypotheses: a) is it feasible to predict many properties at once with a single decision tree; b) is it feasible to predict past chemical properties from current biological data? In both cases the answer is positive. Our experiments globally confirm the expert's expectations, but here and there also contain some unexpected and interesting results. Some insight has been gained in the interdependencies of physico-chemical parameters and the way in which the properties of the water in the recent past can be predicted from current biological data. From the machine learning point of view, the feasibility and potential advantages of a hitherto little explored technique, simultaneous prediction of multiple variables, have been demonstrated. Related work in the machine learning domain includes the use of (descriptive) clustering systems for prediction of multiple variables [12]. In the application domain, we mention [9], [10] and [11] (on which this work builds further), and [4] which discusses a broad range of preliminary experiments in this domain. There are many opportunities for further work: some of our results need to be studied in more detail by domain experts; simultaneous prediction of subsets of the 16 used variables, or of a mixture of current and past values, seems an interesting direction; and many of the preliminary experiments described in [4], investigating other kinds of relationships in this domain, deserve further study.
Acknowledgements The authors thank Damjan Demˇsar for his practical support and the Hydrometeorological Institute of Slovenia for making the data set available.
References

1. H. Blockeel. Top-down induction of first order logical decision trees. PhD thesis, Department of Computer Science, Katholieke Universiteit Leuven, 1998. http://www.cs.kuleuven.ac.be/~ml/PS/blockeel98:phd.ps.gz.
2. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1-2):285–297, June 1998.
3. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In Proc. 15th Int'l Conf. on Machine Learning, pages 55–63, 1998. http://www.cs.kuleuven.ac.be/~ml/PS/ML98-56.ps.
4. H. Blockeel, S. Džeroski, and J. Grbović. Experiments with TILDE in the river water quality domain. Technical Report IJS-DP 8089, Jožef Stefan Institute, Ljubljana, Slovenia, 1999.
5. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
6. J. Cairns, W.A. Douglas, F. Busey, and M.D. Chaney. The sequential comparison index – a simplified method for non-biologists to estimate relative differences in biological diversities in stream pollution studies. J. Wat. Pollut. Control Fed., 40:1607–1613, 1968.
7. N. De Pauw and H.A. Hawkes. Biological monitoring of river water quality. In Proc. Freshwater Europe Symposium on River Water Quality Monitoring and Control, pages 87–111. Aston University, Birmingham, 1993.
8. L. De Raedt. Attribute-value learning versus inductive logic programming: the missing links. In Proc. 8th Int'l Conf. on Inductive Logic Programming, pages 1–8. Springer-Verlag, 1998.
9. S. Džeroski, D. Demšar, and J. Grbović. Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 1999. In press.
10. S. Džeroski and J. Grbović. Knowledge discovery in a water quality database. In Proc. 1st Int'l Conf. on Knowledge Discovery and Data Mining (KDD'95). AAAI Press, Menlo Park, CA, 1995.
11. S. Džeroski, J. Grbović, and W.J. Walley. Machine learning applications in biological classification of river water quality. In Machine Learning, Data Mining and Knowledge Discovery: Methods and Applications. Wiley and Sons, Chichester, 1997.
12. D.H. Fisher. Iterative optimization and simplification of hierarchical clusterings. Journal of Artificial Intelligence Research, 4:147–179, 1996.
13. ISO-BMWP. Assessment of the biological quality of rivers by a macroinvertebrate score. Technical Report ISO/TC147/SC5/WG6/N5, International Standards Organization, 1979.
14. R. Pantle and H. Buck. Die biologische Überwachung der Gewässer und die Darstellung der Ergebnisse. Gas- und Wasserfach, 96:603, 1978.
15. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, 1993.
16. J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
17. J.R. Quinlan. Combining instance-based and model-based learning. In Proc. 10th International Workshop on Machine Learning. Morgan Kaufmann, 1993.
18. W.J. Walley. Artificial intelligence in river water quality monitoring and control. In Proceedings of the Freshwater Europe Symposium on River Water Quality Monitoring and Control, pages 179–193. Aston University, Birmingham, 1993.
19. W.J. Walley and H.A. Hawkes. A computer-based reappraisal of the Biological Monitoring Working Party scores using data from the 1990 river quality survey of England and Wales. Water Research, 30:2086–2094, 1996.
Applying Data Mining Techniques to Wafer Manufacturing

Elisa Bertino (Università degli Studi di Milano, Via Comelico 39/41, 20135 Milano, Italy; [email protected]), Barbara Catania (Università degli Studi di Genova, Via Dodecaneso 35, 16146 Genova, Italy; [email protected]), and Eleonora Caglio (Progres Progetti, Via Varesina 76, 20156 Milano; [email protected])
Abstract. In this paper we report an experience from the use of data mining techniques in the area of semiconductor fabrication. The specific application we dealt with is the analysis of data concerning the wafer production process, with the goal of determining possible causes of the errors that result in faulty wafer lots. Even though our application is very specific and deals with a single manufacturing sector (semiconductor fabrication), we believe that our experience can be relevant to other manufacturing sectors and provide significant feedback on the use of data mining techniques.
1 Introduction
Background. In this paper we report an experience from the use of data mining techniques in the area of semiconductor fabrication. The problem we have considered refers to a specific request of a well known semiconductor producer. Such a request is related to the detection of the causes of failures arising during the semiconductor manufacturing process. The semiconductor manufacturing process is a very complex activity composed of several phases (see [10] for details). Experience has shown that, if one of the constructed devices does not satisfy the design requirements, some problem arose during the so-called wafer fabrication phase. The aim of the wafer fabrication step is to create several identical integrated circuits in and on the wafer surface. More precisely, given a wafer with a polished surface, during the wafer fabrication step a new wafer is generated whose surface contains several hundred completed chips. In a single wafer fabrication phase, more than one hundred individual steps are performed. However, each individual step belongs to one of four basic phases, which are repeated several times.
The phases are the following: layering, covering the wafer surface with a thin layer of conducting, semiconducting or insulating material; patterning, removing some portions of the layers inserted in the layering phase, thus reproducing a specific pattern on the wafer surface; doping, putting specific amounts of dopants on the wafer surface, forming dopant regions and junctions that determine the electric characteristics of the semiconductor; and heat treatments, during which the wafer is heated and cooled to achieve specific results. Each main phase is then composed of several substeps, which may be repeated, each executed by a specific machine. The order in which such operations are executed depends on the type of device that has to be constructed. In order to reduce time and costs, all the operations are often applied to wafer lots. A lot, which usually contains 25 wafers, is thus the processing unit.

The problem and the considered approach. In the considered application, during wafer fabrication, information about each step is collected in a production process database. This information, which refers to process data, electrical measurements, and so on, is used to establish the correctness of the produced devices. Whenever a wafer lot is faulty, production process engineers must go through the data collected in the production process database to find out possible causes for the production errors. This analysis is in general carried out "by hand", in the sense that all the tuples are extensively analyzed without the support of any automatic mechanism. Such an analysis may take the process engineers quite a few days. The goal of the experimental project we are describing was to use data mining techniques in order to reduce the analysis time from several days to a few hours. In our project, we used two commercially available data mining systems. One, MineSet, is a multistrategy system supporting several techniques [6]. The other system, Q-Yield, is, by contrast, specialized for manufacturing applications [5]. Neither system was successful for our application and therefore we have developed our own analysis technique.

Related work. Because of the relevance of defect analysis in semiconductor manufacturing, issues related to the management of defect data have received some attention. In particular, a relevant effort is represented by the defect-data management system developed at SEMATECH [7], a non-profit consortium of semiconductor manufacturers. However, no data mining tools are provided to help in determining possible explanations for the errors in the production. The use of data mining techniques for this specific problem is briefly discussed by Turney [8]. However, the emphasis of the discussion is on how to pre-process data in order to enable an effective use of a specific data mining technique. No specific experience is reported on the use of different techniques or systems.

Organization of the paper. The remainder of this paper is organized as follows. The data mining techniques and tools we have used in the project are described in Section 2. Section 3 presents the results we have obtained from the experiments. The new approach we propose is then introduced in Section 4. Finally, Section 5 presents some concluding remarks.
2 Data Mining Techniques and Systems Applied
In the considered project, we were interested in two different types of data mining techniques: association rules and decision trees. Due to space constraints, we refer the reader to [4] for additional details about such techniques. Here we just recall their main characteristics.

Association rules. Let I be a set of items (in a relational database, an item represents an attribute/value association). An association rule is a rule of the form X → Y, where X ⊆ I, Y ⊆ I, X ∩ Y = ∅. Such a rule specifies that the presence of the items in X determines the presence of the items in Y. An association rule X → Y has a support level s in a set of tuples T if s% of the tuples in T satisfy X and Y. It has a confidence level c in T if c% of the tuples in T satisfying X also satisfy Y. In general, it is useful to generate only the rules with a sufficiently high support.

Classification techniques. Classification is a function determining whether an object belongs to a given class, chosen among a set of predefined classes, based on the values of some object attribute (the label). Among the proposed techniques, decision trees are becoming a widely used technique for classification [4]. The leaves of the tree correspond to all the possible values of the label. Each internal node is associated with an attribute and has a child for each value that can be associated with that attribute. Often, single values are replaced by mutually disjoint conditions. The disjunction of all the conditions appearing in the children of a given node completely characterizes the domain of the attribute. These properties guarantee that, given a certain item to be classified, there is always a single path from the root to a leaf whose conditions are satisfied by the item. Such a path associates with the item the class corresponding to the reached leaf. Decision trees are typically constructed starting from a given set of tuples (the training set) and are then used to classify all the items. In the context of the considered project, two commercially available data mining systems supporting the techniques described above were used; both systems can analyze flat files as well as data stored in commercial relational DBMSs. One of them (MineSet) is a multistrategy system supporting several techniques, such as association rules, clustering techniques, and decision trees. The other system (Q-Yield) is, by contrast, specialized for manufacturing applications and uses a combination of artificial intelligence and statistics techniques to find relationships in production data. In the following, both systems are shortly described.

MineSet. MineSet is a tool produced by Silicon Graphics [6]. The main data mining techniques provided by MineSet are the following:
– Association rules generator: the association rule generator produces a set of association rules, together with their support and confidence levels.
– Classifiers: the tool supports two different types of classifiers, one based on decision trees and one Naive-Bayes classifier [3]. In both cases, an attribute has to be identified as the label; the classifier is constructed by considering a training set containing about 2/3 of the considered data. Then, such a classifier
is used to classify additional data. For the aim of the project, we considered the decision tree classifier. Q-Yield. Q-Yield is a single strategy tool produced by Quadrillion Corporation [5]. It is based on decision trees and it has been specifically designed to support designers in the detection of specific causes of a certain event and in making predictions. The tool provides the following main functionalities: – It allows one to graphically represent how the values of a given attribute vary inside a relation. The graph may point out some exceptional situations (the event) that can be better analyzed by using the other tools provided by the system. – The event can be described by declaratively specifying a logical condition on the considered data. Input data are then analyzed by the system to determine the causes (the logical conditions) of the specified event. The result is a set of rules, specifying the data conditions under which the event arises, together with some statistical information.
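To make the support and confidence measures used throughout this section concrete, the following sketch counts them for a rule X → Y over a small set of tuples. It is not part of the original paper; the tuples and attribute values below are hypothetical.

```python
# Hedged sketch: support and confidence of an association rule X -> Y,
# following the definitions given in Section 2 (percentages over a set of tuples T).

def satisfies(tuple_, items):
    """True if the tuple contains every attribute=value pair in `items`."""
    return all(tuple_.get(attr) == val for attr, val in items.items())

def support_confidence(T, X, Y):
    """Return (support %, confidence %) of the rule X -> Y over the tuples T."""
    n = len(T)
    n_x = sum(1 for t in T if satisfies(t, X))
    n_xy = sum(1 for t in T if satisfies(t, X) and satisfies(t, Y))
    support = 100.0 * n_xy / n if n else 0.0
    confidence = 100.0 * n_xy / n_x if n_x else 0.0
    return support, confidence

if __name__ == "__main__":
    T = [
        {"eqt": "PE06", "event": "layering", "fail": 1},
        {"eqt": "PE06", "event": "doping",   "fail": 1},
        {"eqt": "WJ1",  "event": "layering", "fail": 0},
        {"eqt": "WJ1",  "event": "layering", "fail": 1},
    ]
    print(support_confidence(T, {"eqt": "PE06"}, {"fail": 1}))  # (50.0, 100.0)
```

Rules whose support or confidence falls below a tool's built-in thresholds are simply not reported, which is the effect discussed for MineSet in Section 3.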
3 Experiments and Results
In order to address the problem described in Section 1, we have used both MineSet and Q-Yield on the production process database of a well known semiconductor producer. The process database is maintained in a relational database managed by the INGRES database management system (DBMS) [1]. A single relation is used to maintain all relevant information related to the production of a given device. Each tuple of such a relation represents, among other things, the type of operation performed by an equipment at a certain instant on a given wafer lot. The equipment (for example, the tube furnace used for layering) is identified by a string (eqt). The operation performed on a given wafer lot at a certain instant is identified by the triple (event, step, script), where event represents the main operation (for example, layering, patterning or doping), step identifies the specific step performed, and script provides additional information about the performed operation. The failure (respectively, the success) of an operation performed on a given wafer lot is represented by setting the field fail to 0 (respectively, 1) in all tuples corresponding to that lot. This setting is provided by process engineers during a pre-analysis step. During the production of a given product, the same operation can be executed on various equipments and the same step may correspond to various events. The process database typically contains data related to 80-90 lots, generated in a period of three to four months. Such data are very large in volume and are used for daily activities. The goal of the experimental project we are describing was to use data mining techniques to detect the causes of process failures. A cause is certain if it allows one to determine all, and only those, lots that are faulty. A cause is uncertain if it allows one to determine all lots that are faulty, together with some non-faulty lots. The problem we deal with is a typical problem for which Q-Yield should give good results, since we exactly need to determine the causes of an event (the generation
of faulty wafers). In the following we present the results of the experiments we have carried out by using both tools.
3.1 MineSet
Association rules. The aim of the first group of experiments was to generate association rules representing the possible causes of faulty wafers. Therefore, the rules to be generated must have the form: attribute1 = value1 ∧ ... ∧ attributen = valuen → fail = 0. The experiments we have carried out are based on the following choices:
– We have projected the starting relation on different sets of attributes. Indeed, the result of the application of data mining techniques strongly depends on the input set of data. By changing the starting set, different results are usually obtained.
– MineSet allows one to specify the required confidence and support levels. We have considered two main confidence levels: (a) Confidence = 50%: with this setting, we restrict the system to generate significant rules. However, since the confidence is 50%, the generated rules do not necessarily return certain causes. (b) Confidence = 100%: with this setting, only certain causes are generated by the system. Since the number of faulty lots is typically low with respect to the total number of lots, the support of association rules describing the causes of failure should be low. For this reason, we have set the support level to 1% and to 0.5%.
The obtained results have not been very satisfactory. By setting the support level to 0.5%, no rules have been generated. By setting the support level to 1%, in case (a), the tool has generated some rules, similar to the following ones:

Support   Confidence   Rule
1.3096    92.06        eqt = PE06 → fail = 1
1.4902    89.19        eqt = WJ1 → fail = 1

Such rules do not characterize faulty lots of wafers but correct lots. No rule with fail = 0 on the right-hand side has been generated. In case (b), no rule has been generated. By analyzing the tool in depth, we have discovered that this behavior depends on the fact that MineSet generates only rules having a support greater than 1% and a confidence higher than 50%. This assumption makes MineSet association rules useless in determining causes of failure in our application.

Decision trees. The aim of the second group of experiments was to extract information about faulty lots by using decision trees. We have performed several experiments by varying the set of attributes of the starting relation and by varying the label field. In all the experiments, the resulting decision trees represent only non-significant information. For example, if we choose fail as label, we
obtain a 1-level tree, in which each leaf corresponds to a specific value of the wafer lot. No causes are determined, since no attributes besides lot and fail are considered. Similar results have been obtained by varying the starting set of attributes. Starting from these results, we have analyzed in depth the algorithm for the construction of decision trees and we have discovered that the maximum number of children for a node is 25. In our case, the only attribute with fewer than 25 values is lot. Therefore, for our application, we need decision trees able to support a much larger number of children for a given node. Moreover, we have observed that decision trees generate significant results only when attribute values can be meaningfully compared. This happens when an attribute takes only a few different values in the input relation and when such values are numbers. In our case, neither condition is satisfied, and this is the reason why the technique fails. As an additional experiment, we have transformed the values of eqt into numerical values, choosing lot as label. The obtained tree allows one to determine, with a certain confidence, some numerical intervals containing the equipments processing some faulty lots of wafers. However, the conditions under which all faulty wafer lots have been generated have not been determined.
3.2 Q-Yield
In order to use Q-Yield, a condition representing the event to be analyzed has to be specified. In our case, the condition is fail = 0. Each rule returned by the system in the performed experiments specifies some conditions on equipments and on operations under which the event is satisfied. The following are some examples of rules returned by the system:

Rule                                         Coverage   Error rate   Correct   Incorrect
eqt = STZZ1                                  1.0%       0.0%         8         0
eqt ≠ STKZ1 ∧ eqt ≠ STKP4 ∧ eqt = 16870      0.5%       0.0%         4         0
eqt ≠ STKZ1 ∧ eqt = STKP4                    1.0%       27.3%        8         3

In the previous rules, error rate (incorrect) represents the percentage (the number) of tuples satisfying the rule but not the event, whereas coverage (correct) represents the percentage (the number) of tuples satisfying the rule and the event. From the previous rules, if we knew the number of faulty lots and if the number of tuples were equal to the number of lots, the exact causes of the failures would be represented by the rules for which the number of tuples making the rule correct coincides with the number of faulty lots. However, in our application each lot appears in several tuples, so the previous method cannot be applied. The only way to get this answer is to consider the rules satisfied by a number of tuples higher than or equal to the number of faulty lots (representing uncertain causes) and then analyze the file directly. Since the number of tuples is very high, this solution does not seem reasonable. Even if Q-Yield was also unable to determine certain causes, it returned more information than the MineSet tools. An additional advantage of Q-Yield with
respect to MineSet is that it works well on both numerical and non-numerical attributes. This is the reason why we get some results.
4 A New Approach
The analysis of the performed experiments shows that current data mining techniques are not adequate to solve the considered problem. Such inadequacy is mainly due to two different reasons: (i) the considered problem is related to the detection of the causes of an infrequent event, whereas data mining tools are in general successfully used when some regularities in the data set have to be found; (ii) even if the data considered in the experiments are exactly the data that the semiconductor producer uses daily in its activity, they are not significant from the point of view of data mining techniques: each attribute contains values that are not easily comparable and the number of distinct values is typically very high. The considered problem is therefore a typical case in which data mining techniques may fail. Therefore, we have designed an ad hoc algorithm. Though this algorithm has been designed for fault analysis in the semiconductor manufacturing process, it can also be used in other manufacturing processes characterized by the following properties:
– The process database is composed of a single relation, called the process relation, containing not necessarily numerical attributes. Each tuple in the relation contains information about a given process step.
– With each tuple, information is associated concerning the result of the process operation (success/failure) (attribute fail). Success/failure is assumed to be related to a specific element (a lot, in our application).
– The aim of the analysis is the detection of the certain or uncertain causes of failure.
In the following, we first present the general algorithm, then we describe the obtained results (see [2] for additional details).
4.1 Definition of the Approach
Given a manufacturing process, characterized by the properties described above, the causes that have to be detected are generally represented by specific assignments of values to a subset of the attributes belonging to the problem domain. For example, in the domain under consideration, the engineers can be interested in determining the equipments always generating failures, the operations always generating failures or the time intervals at which all failing lots have been generated. If no equipment, no operation, no time instant represent a certain cause, a combination of them may correspond to the solution, identifying, for example, that an operation performed on a given instant of time on a certain equipment was the cause of the problem. Starting from the previous considerations, in order to determine certain and uncertain causes of a given event, it is important to specify which attribute combinations have to be considered. The combinations are assumed to be partially
ordered. Such an ordering specifies which combination has to be considered first and how combinations can be refined. After that, we have to determine which attribute assignments represent certain or uncertain causes. In the following, both steps are analyzed in more detail.

Representation of significant combinations. Let D be the set of attributes of the process relation. Let S1, ..., Sn ⊆ 2^D be all significant combinations, as specified by the process engineers. Let ⪯ be a partial order on {S1, ..., Sn}, called the interest order, such that if Si ⪯ Sj, the engineers are first interested in determining whether Si is a certain cause. If it is not, but it is an uncertain cause, they are interested in refining Si into Sj, to determine whether Sj is a certain cause. Significant combinations and the interest order can be represented as a directed graph, called the interest graph, having the significant combinations as nodes and one edge (Si, Sj) if and only if Si ⪯ Sj. The nodes in the graph having an empty fan-in represent the combinations from which the analysis has to start.

Causes detection. Let G be an interest graph. If an edge (Si, Sj) exists in G, Si has to be analyzed before Sj, and Sj has to be analyzed only if no assignment for Si represents a certain cause. This visit corresponds to a slightly modified breadth-first search of the graph. In particular, the search from a node must stop if no certain or uncertain causes can be associated with that node. In the following, we assume that two functions, certain and uncertain, exist, taking a significant combination and returning the set of assignments representing certain or uncertain causes, respectively. In order to modify the breadth-first search accordingly, we associate with each node an additional attribute marked, initialized to true. Let n be a node visited by a breadth-first search, associated with a significant combination S. If marked(n) = true, we determine whether there exists an assignment for S representing a certain cause (certain(S) ≠ ∅). If it exists, we return certain(S) to the user and we set marked(n) to false. If it does not exist, we determine whether there exists an assignment for S representing an uncertain cause (uncertain(S) ≠ ∅). If it does not exist, we set marked(n) to false. When all nodes have been visited, if no certain cause has been returned to the user, we report the uncertain causes. Such causes are generated from the combinations associated with all nodes n having marked(n) = true. In order to complete the description of the algorithm, we have to specify how the functions certain and uncertain are implemented. Since we assume that the process relation is stored in a relational DBMS, simple SQL aggregate queries can be used to determine certain and uncertain causes. Due to space constraints, we refer the reader to [2] for additional details. A tool based on the previous algorithm has been implemented under the Informix database management system, using ESQL/C as the programming language.
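The following sketch is our own simplified illustration of the strategy just described, not the authors' implementation: lots are grouped per value assignment of a combination, certain and uncertain causes are derived by comparing each group with the set of faulty lots, and combinations are refined along the interest graph only while no certain cause has been found. In the paper these groupings are SQL aggregate queries over the process relation; here they are computed in memory, and all attribute names and tuples are hypothetical.

```python
# Hedged sketch of the cause-detection strategy (simplified variant).
from collections import deque, defaultdict

def causes(tuples, combination):
    """Map each value assignment of `combination` to the set of lots showing it."""
    lots_by_value = defaultdict(set)
    for t in tuples:
        lots_by_value[tuple(t[a] for a in combination)].add(t["lot"])
    return lots_by_value

def certain_and_uncertain(tuples, combination, faulty_lots):
    certain, uncertain = [], []
    for value, lots in causes(tuples, combination).items():
        if lots == faulty_lots:                      # all and only the faulty lots
            certain.append(dict(zip(combination, value)))
        elif faulty_lots <= lots:                    # all faulty lots plus some others
            uncertain.append(dict(zip(combination, value)))
    return certain, uncertain

def detect(tuples, interest_graph, root):
    """interest_graph maps a combination (tuple of attributes) to its refinements."""
    faulty = {t["lot"] for t in tuples if t["fail"] == 0}   # fail = 0 marks faulty lots
    uncertain_report = []
    queue = deque([root])
    while queue:
        comb = queue.popleft()
        certain, uncertain = certain_and_uncertain(tuples, comb, faulty)
        if certain:
            return {"certain": certain}              # stop as soon as certain causes exist
        if uncertain:
            uncertain_report.extend(uncertain)       # keep and refine this combination
            queue.extend(interest_graph.get(comb, []))
    return {"uncertain": uncertain_report}

if __name__ == "__main__":
    data = [
        {"lot": 1, "eqt": "STZZ1", "event": "layering", "fail": 0},
        {"lot": 1, "eqt": "STKP4", "event": "doping",   "fail": 0},
        {"lot": 2, "eqt": "STKP4", "event": "doping",   "fail": 1},
        {"lot": 3, "eqt": "STZZ1", "event": "doping",   "fail": 1},
    ]
    graph = {("eqt",): [("eqt", "event")], ("eqt", "event"): []}
    print(detect(data, graph, ("eqt",)))   # certain cause: eqt = STZZ1 with event = layering
```

An equivalent certain(S) test in SQL would group the process relation by the attributes in S and compare, for each group, the set of associated lots with the set of faulty lots.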
4.2 Results
The attributes that are typically used in the analysis are lot, eqt, event, step, time, and fail (see Section 3); in the requirement analysis for the considered application we noted that the attribute script is often useless, since it frequently contains redundant information. From the requirement analysis, we have discovered that the interest graph of the considered application is indeed a tree and has the form presented in Figure 2. This means that process engineers are first interested in finding the equipments generating failures. In this case, a hardware problem is detected: an equipment does not work as it should. If no equipment represents a certain cause for the problem, the event is also considered. If no certain cause is detected in this case either, the step is taken into account. In both situations, a problem still concerning an equipment is detected; however, in this case it refers to a specific equipment functionality. As a final refinement, the time is considered. Different time aggregations, with different granularities (day, week), have been considered. In this case, the cause could represent an accidental problem that happened at a certain time. Note that in the considered application a monotonic refinement of combinations is used: a combination is always refined by adding attributes. The algorithm presented in Subsection 4.1 has then been applied to the same datasets on which we have applied the data mining techniques. In most cases, the proposed algorithm determined the certain causes, often represented by a single equipment (identifying a hardware problem). In some other cases, it determined only uncertain causes, often represented by some assignments for (eqt, event, step). This information has then been used by the process engineers to further investigate the problem. They discovered that, when no certain causes are determined, the faults are mainly due to some accidental problem.
5 Concluding Remarks
This paper has reported the results we have obtained in investigating the effectiveness of current data mining techniques in determining the causes of failures of a wafer fabrication process. The experiments we carried out have shown that data mining techniques fail to solve this problem. We have therefore proposed a new approach tailored to the detection of the causes of failure in manufacturing processes. As overall conclusion, we would like to point out that, in addition to technical problems that we found out in our experiments (see Section 3), we noticed some more general problems in applying data mining techniques to the semiconductor manufacturing process. The first problem is that it is difficult to determine right away which specific data mining technique to use for the problem at hand. Another problem is that data must often be pre-processed before being used for data mining. Such pre-processing can be non trivial and depends on the specific data mining technique one expects to use. Both problems point out the need of some general methodologies and guidelines that could support the users in the development of data mining applications. 3
INPUT: an interest graph G
OUTPUT: a set of certain or uncertain causes
METHOD:
  for each node n set marked(n) = true
  certain ← false; Cause = ∅
  apply a breadth-first visit to G
  for each visited node n, associated with a combination S do
    if marked(n) and certain(S) ≠ ∅ then
      marked(n) ← false; certain ← true; Cause = Cause ∪ certain(S)
    else if uncertain(S) = ∅ then
      marked(n) ← false
  if not certain then
    visit the graph; for each node n such that marked(n) = true return uncertain(S)
  else return Cause
Fig. 1. The fault detection algorithm

{eqt} → {eqt, event} → {eqt, event, step} → {eqt, event, step, time}

Fig. 2. The interest graph corresponding to the semiconductor manufacturing process
References

1. ASK Computer Co: INGRES/SQL Reference Manual.
2. Bertino, E., Catania, B.: Applying Data Mining Techniques to Wafer Manufacturing. Technical Report, University of Milano, Italy (1999)
3. Cheeseman, P., Stutz, J.: Bayesian Classification (Autoclass): Theory and Results. In Fayyad, U.M., Piatetsky-Shapiro, G., Smith, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996)
4. Fayyad, U.M., Piatetsky-Shapiro, G., Smith, P., Uthurusamy, R.: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996)
5. Q-Yield Product Information. http://www.quadrillion.com/qyover.htm
6. Silicon Graphics MineSet - Supporting the Discovery Research Process. Silicon Graphics White Paper, http://www.sgi.com/software/mineset/mineset_data.html (1997)
7. Singh, H., Lakhani, F., Proctor, P., Kazakoff, A.: Defect Data Management System at SEMATECH. Solid State Technology, 75–80. Pennwell Publishing (1995)
8. Turney, P.: Data Engineering for the Analysis of Semiconductor Manufacturing Data. In Proc. of the IJCAI-95 Workshop on Data Engineering for Inductive Learning (1995)
9. Viveros, M.S., Nearhos, J.P., Rothman, M.: Applying Data Mining Techniques to a Health Insurance Information System. In Proc. of the 22nd International Conference on Very Large Databases (1996)
10. van Zant, P.: Microchip Fabrication. McGraw-Hill (1997)
An Application of Data Mining to the Problem of the University Students' Dropout Using Markov Chains

S. Massa and P.P. Puliafito
DIST, Department of Communication, Computer and System Sciences, University of Genoa, via Opera Pia, 13, 16145 Genova, Italy; {silviam, ppp}@dist.unige.it
Abstract. A new application of data mining to the problem of University dropout is presented. A new modeling technique, based on Markov chains, has been developed to mine information from data about University students' behavior. The information extracted by means of the proposed technique has been used to understand the dropout problem in depth, to find the high-risk population and to drive the design of suitable policies to reduce dropout. To represent the behavior of the students, the available data have been modeled as a Markov chain and the associated transition probabilities have been used as a basis to extract the aforesaid behavioral patterns. The developed technique is general and can be successfully used to study a large range of decision problems dealing with data in the form of events or time series. The results of the application of the proposed technique to the students' data are presented.
1 Introduction
Data mining represents the core activity of the so-called Knowledge Discovery in Databases (KDD) process, which aims at extracting hidden information from large collections of data. Data mining techniques can be divided into five classes of methods according to their different goals, that is, the different kind of knowledge they aim to extract [1]. These methods include predictive modeling (i.e. decision trees [2]), clustering [3], data summarization (i.e. association rules [4]), dependency modeling (i.e. causal modeling [5], [6]) and finally change and deviation detection [7]. The work presented in this paper deals with the application of a new predictive modeling technique to the problem of University students' dropout. A modeling technique based on Markov chains has been developed in order to mine the students' behavior during their University period and to identify the population at risk. In the dropout problem, time represents an important attribute characterizing the available information. The available data about the students' University career can be associated with a time-ordered sequence of events, which can represent, for example, the passed examinations. The analysis of such a sequence could then provide knowledge about the behavior of the system that, at least ideally, has generated the data. The mined knowledge in such cases could successively be used to
predict, with a sort of “black box” pattern matching approach, the evolution of the considered system from the observation of its past behavior. The addressed application can be considered a good reference for a large range of problems which deal with data available in the form of time ordered sequence of events. As a consequence the proposed technique should be intended as general and can be used in different contexts. The approach that has been studied in this work tries to exploit a model based on the theory of Markov chains in order to provide a statistical representation of the properties of the observed system. The data are mined in order to extract the probability of transition among the possible states in which the system could evolve and to show implicit correlation between the different elements of a student state (the number of passed exams, the average mark, changes of residence and so on) and the decision to give up studying. In our case the goal consists in the extraction of expected patterns from data ending with dropout or degree, through data mining. The analysis of such patterns leads to identify the set of students who run the risk of dropping-out and therefore to determine high-risk situations in the students’ careers. The paper is organized as follows. It begins with a short introduction to the problem we mean to address. Then the Markov chains are briefly introduced and the proposed modeling technique is explained step by step with reference to dropouts. Then the results of the application of the proposed technique to the data about the students are discussed. Finally the further developments of the work are presented.
2 Markov Chains
The theory of Markov chains ([8], [9], [10], [11], [12]) is often used to describe the asymptotic behavior of a system by means of relevant simulation algorithms (Gibbs sampling [13], Metropolis, Metropolis-Hastings [14]). The use of Markov chains simplifies the modeling of a complex, multi-variant population by focusing on the information associated with the system state. This basic property of Markov chains makes it easy to describe the behavior of systems whose evolution can be modeled by a sequence of stochastic transitions from one state to another in a discrete set of possible states, occurring in correspondence with time instants or events. Markov chain methods are considered a standard tool in statistical physics, in the simulation of biological systems and for performing probabilistic inference in statistical applications. Such methods are also successfully employed in expert systems to carry out probabilistic inferences, in the discovery of latent classes from data and in Bayesian learning for neural networks.

2.1 Definitions and Basic Properties
Let X^(k) be the set of possible states of a system, at the k-th step or time instant, for any entity of a considered population. If the state of an entity at a generic k-th step can be expressed through a vector of variables, then X^(k) can be written as follows:

$$X^{(k)} = \{\, x_1^{(k)}, x_2^{(k)}, \ldots, x_w^{(k)} \,\} \qquad (1)$$
where w is the number of possible states for the considered entity at the k-th step. Then, such a system can be modeled through Markov chains only if the probability distribution of a generic x_{i_{k+1}}^(k+1) ∈ X^(k+1) depends entirely on the value of the state vector assumed at the k-th step, i.e., x_{i_k}^(k) ∈ X^(k). Formally:

$$p\big(x_j^{(k+1)} \mid x_i^{(k)}, x_v^{(k-1)}, \ldots, x_w^{(0)}\big) = p\big(x_j^{(k+1)} \mid x_i^{(k)}\big) \qquad \forall\, i, v, \ldots, w \qquad (2)$$
Of course, equation (2) holds for any step k. To define the Markov chain we need to know the initial probability p^(0)(x_j) of a generic state x_j^(0), for all j, and the transition probability for any possible state x_j^(k+1) to follow the state x_i^(k), denoted by the matrix T^(k)_{x_i,x_j}. If the transition probability does not depend on the step k (e.g. for stationary systems), the Markov chain is said to be homogeneous and the transition probability can be written as T_{x_i,x_j}. Using the transition probabilities, the probability p^(k+1)_{x_j} for the state x_j at time k+1 can easily be computed from the corresponding probabilities at time k as follows:

$$p^{(k+1)}_{x_j} = \sum_i p^{(k)}_{x_i}\, T^{(k)}_{x_i,x_j} \qquad (3)$$
Given the vector of initial probabilities p^(0), equation (3) determines the behavior of the chain for all time instants. The probabilities at step k can be viewed as a row vector p^(k), and the transition probabilities at step k as a matrix T^(k), or simply T if the chain is homogeneous. Equation (3) can then be expressed as:

$$p^{(k+1)} = p^{(k)}\, T^{(k)} \qquad (4)$$
For a homogeneous chain, T^k, that is, the k-th power of the matrix T, gives the transition probabilities over k steps, so that:

$$p^{(k+1)} = p^{(0)}\, T^{(k)} \qquad (5)$$
3 Application of Markov Chains to the Mining of Time Series
The class of addressed problems takes the form of time series analysis to extract nonevident behavioral pattern from data.
Time series analysis is a typical subject for data mining that can model a wide range of real cases; examples can be reported from economic-financial problems as the analysis of sale trends and of price and market behavior, from medical and diagnostics problems and from environmental contexts. In the next sections a modeling technique aiming at applying Markov chain theory to data mining problems, which can be modeled with time-series, is presented with particular reference to the analysis of University dropouts.
3.1 Definition of the Problem
In general, given a population made of a finite number of entities, each entity can be associated with a series of successive events that characterize its behavior. Let e be a generic entity from the considered population and <s1 s2 s3 … sn> a sequence of successive events. Then the association between the entity e and its relevant series of events can be written as follows:

e ↔ <s1 s2 s3 … sn>    (6)

where n = n(e) is the number of events of the series.

Table 1. An example of exam database

Student's number  Exam_data  Exam_ID  Exam_mark
1                 20/1/98    10       5
1                 20/1/98    12       2
2                 21/1/98    2        3
3                 22/1/98    22       6
…                 …          …        …
Let us consider the administrative database of a University where the various entities are represented by the University students and the events by their passed examinations (Table 1). The students, who are univocally identified by their matriculation number, are observed for a period of twenty years (from 1978 to 1998). Therefore the data set includes either students who have already left the University or students who are still attending in 1998. The students' personal data are inserted in a table (Table 2) that reports, for each student, the matriculation date, the date of degree or the date of the first "non-enrollment", which can be considered as the dropout date.

Table 2. An example of the students' personal data

Student's number  Matriculation_date  Degree_date  Dropout_date
1                 1/11/88             20/4/94
2                 1/11/89             25/7/95
…                 …                   …            …
The examinations passed by the students are considered as the "events" that characterize their curriculum; therefore a student's state is given by a vector of three aggregated variables (Table 3): the number of passed exams, the average mark and the student's condition (attending/graduate/dropped out), coded as in Table 4. Time here represents the distance (in months) from the matriculation date and it is sampled non-homogeneously to reflect only specific instants that are particularly significant during an academic year.

Table 3. The state variable values for the students' data

ID_code  Sampling_time  Num_exams  Average_mark  Condition_code
1        5              1          25            A
1        12             2          27            A
…        …              …          …             …
1        72             28         26            D
2        5              2          20            A
…        …              …          …             …
2        24             3          22            G
…        …              …          …             …

Table 4. The condition codes

Description  Condition code
Attending    A
Graduate     G
Dropout      D
The expression (6), for the dropout case study and a given time horizon, becomes:

f(e_j, t_i) = x,  j = 1, …, n, with n the number of observed students;
t_i ∈ T, T = <5, 12, …, 180>, the vector of time instants;
x ∈ X, the set of possible states.    (7)
The students are observed over a time horizon of 15 years, corresponding to 180 months, by which time 95% of them have ended their University career; the remaining 5% can be disregarded. The description of the student's state as the combination of the number of passed exams, the average mark and the student's condition produces an excessive number of states and a consequent lowering of the support. The concept of support is crucial in data mining to give a measure of the statistical importance of the probabilistic information resulting from this kind of analysis. In our case the support gives a measure of the statistical importance of the transitions leaving from a state x_i at time k. The support of the considered state can be defined as follows:

$$\sigma\big(x_i^k\big) = \frac{n_i^k}{\sum_i n_i^k} \qquad (8)$$
As the number of states increases, the number of students n_i^k in each state decreases and consequently the support of each state drops. A discretization step is then needed to reduce the number of states; this is achieved here by considering suitable ranges for the average mark values, so as to avoid an excessive state scatter and to maintain a sufficient support level.

Table 5. Classes for the average mark

Average mark  Code
27-30         High
23-26         Medium
18-22         Low
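The discretization of Table 5 amounts to a simple binning of the average mark. The following one-line sketch is not from the paper, but illustrates the mapping that keeps the number of states, and hence the support, manageable:

```python
# Hedged sketch: mapping an average mark to the classes of Table 5.
def mark_class(average_mark):
    if average_mark >= 27:
        return "High"
    if average_mark >= 23:
        return "Medium"
    return "Low"

assert [mark_class(m) for m in (28, 24, 19)] == ["High", "Medium", "Low"]
```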
Let k and k+1 be two generic successive stages; then, let N, F and S represent the components of the state, respectively the number of passed exams, the average mark range and the condition code. Each state component obviously has values depending on the stage. The transition probability can be computed as follows using the basic Markov chain property (2). Considering a pair of states that are contiguous in time, i.e. which are associated with two successive sampling times, the transition probability can be expressed as:

$$p_{i,j}^{(k,k+1)} = \frac{n_{i,j}^{(k,k+1)}}{n_i^{(k)}}, \qquad \sum_j p_{i,j} = 1 \qquad (9)$$
The computation of the transition probabilities is performed through (9) and is summarised in Table 6. The final result of the process of computing the transition probabilities is a sequence of matrices T^(k), corresponding to the transitions from states in the k-th stage to states in the (k+1)-th one. For each stage there are two absorbing states, one associated with degree (G) and the other with dropout (D). For students still attending in a generic state x_i^(k) it is possible to calculate the probability of reaching each of the absorbing states, p_{i,D}^(k) and p_{i,G}^(k). The two probabilities are obtained in a similar way. Taking for example p_{i,D}^(k):

$$p_{i,D}^{(k)} = \sum_{w=k+1}^{h_f} e_i \cdot \Big( \prod_{z=k}^{w-1} T^{(z)} \Big) \cdot e_D \qquad (10)$$
where e_D is the vector with a 1 in the position corresponding to the absorbing state D, e_i the analogous vector for state x_i, and h_f the final stage of the considered time horizon. The result of the application of Markov chains can be used, in the present context, to discover sets of students with different degrees of dropout risk, but it can also constitute the basis for a further, more accurate analysis of individual behavior.
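The two steps of the technique, estimating stage-wise transition matrices from observed state sequences as in (9) and propagating a state distribution towards the absorbing dropout state in the spirit of (10), can be sketched as follows. This is not the authors' implementation; the state names and sequences are toy data chosen only to make the sketch runnable.

```python
# Hedged sketch: transition-matrix estimation and dropout-risk propagation.
import numpy as np

def estimate_transitions(sequences, states):
    """sequences: one state sequence per student, aligned by stage."""
    idx = {s: i for i, s in enumerate(states)}
    n_stages = max(len(seq) for seq in sequences) - 1
    mats = []
    for k in range(n_stages):
        counts = np.zeros((len(states), len(states)))
        for seq in sequences:
            if len(seq) > k + 1:
                counts[idx[seq[k]], idx[seq[k + 1]]] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        mats.append(np.divide(counts, row_sums,
                              out=np.zeros_like(counts), where=row_sums > 0))
    return mats

def dropout_probability(mats, states, start_state, k):
    """Propagate the distribution from stage k to the final stage and read off state D."""
    idx = {s: i for i, s in enumerate(states)}
    p = np.zeros(len(states)); p[idx[start_state]] = 1.0
    for T in mats[k:]:
        p = p @ T
    return p[idx["D"]]   # since D is absorbing, the final mass in D approximates the dropout risk

if __name__ == "__main__":
    states = ["A_low", "A_high", "G", "D"]   # attending (few/many exams), graduate, dropout
    seqs = [["A_low", "A_low", "D", "D"],
            ["A_low", "A_high", "A_high", "G"],
            ["A_low", "A_high", "G", "G"]]
    mats = estimate_transitions(seqs, states)
    print(round(dropout_probability(mats, states, "A_low", 0), 3))   # 0.333
```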
Table 6. The computed transition probabilities. [Excerpt: for pairs of successive stages k and k+1, the table lists example transitions between (N, F, S) states together with their estimated probabilities, e.g. 0.75, 0.25, 0.2 and 0.1.]
The second goal comes from the analysis of the dropout probability for each intermediate state and then from the construction of clusters of students grouped by dropout risk level. Such clusters can be useful to identify appropriate actions to try to influence the behavior of the dropout-risk students and, as a consequence, the evolution of their careers. Since T^(k) depends on those actions, it can be inferred that the transition probabilities T^(k) could change significantly in the long range. In this case the model based on Markov chains could be used for the purposes of planning and control. Another use of the proposed Markov chain based modeling technique is the possibility to forecast the single student's behavior, that is, his position in the state space at time k+1, knowing his position in the state space at time k. This kind of application needs a sufficient number of variables to be included in the state space to provide the appropriate description and, as a consequence, the possibility of a local lack of support generally increases. In the following sections the first results of the application of the proposed technique to dropouts are presented, together with further improvements of the model intended to minimize the effects of the above mentioned problem.
4. Results
In this chapter the students’ behavior is summarized through some graphs. The presented graphs are based on the data relative to the University of Genoa, Faculty of Engineering and they refer to the students that were attending University in the years between 1978 and 1998. The observed sample consists of about 15000 students relative to the twenty years’ period of time under consideration. Figure 1 compares two histograms for each year. The black one represents the local dropout probability, that is the probability to dropout over years from the matriculation date, while the white one represents the cumulative dropout probability.
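The two histograms of Figure 1 can be reproduced from per-student outcomes with a few lines. The sketch below is not from the paper; the input format (years from matriculation to dropout, or None for students who did not drop out) and the numbers are our own assumptions.

```python
# Hedged sketch: yearly and cumulative dropout probabilities as in Fig. 1.
def dropout_histograms(years_to_dropout, horizon=15):
    n = len(years_to_dropout)
    yearly, cumulative, cum = [], [], 0.0
    for year in range(1, horizon + 1):
        p = sum(1 for y in years_to_dropout if y == year) / n
        cum += p
        yearly.append(p)
        cumulative.append(cum)
    return yearly, cumulative

if __name__ == "__main__":
    outcomes = [2, 2, 3, 5, None, None, None, 9, None, None]   # 10 students, 5 dropouts
    yearly, cumulative = dropout_histograms(outcomes)
    print(yearly[1], cumulative[-1])   # 0.2 in year 2, 0.5 overall
```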
Fig. 1. Dropout probability over the years from the matriculation date (dropout probability for each year vs. cumulative dropout probability, plotted against the years passed from the matriculation date)
Now a graph, which is based on the probabilities previously computed for each state, is provided. This representation takes into account only the time passed from the matriculation date and the number of passed exams. These two variables define the state to which a dropout probability is associated. A white ball represents a state where the dropout probability is under a fixed threshold while a black ball represents a state where the same threshold is exceeded.
Fig. 2. Dropout risk-zone (number of passed exams vs. years passed from the matriculation date; black = dropout probability over 60%, white = dropout probability under 60%)
Figure 2 refers to a dropout threshold of 60%, but such a threshold may be tuned to account for possible deviations in the dropout rate over time or to support special advising policies. Figure 3 represents the dropout probability after 4 years from the matriculation date, with reference to the number of passed exams, together with the corresponding shape of the support (see (8)).

Fig. 3. Dropout probability (A) and support (B) with reference to the passed exams after 4 years from the matriculation date
5. Conclusions
The behavior and the choices of an individual can often be referred to the behavior of the groups of people that statistically represent them. This paper defines an approach based on Markov chains to define clusters of people with a homogeneous behavior and to identify individual pattern that represent the behavior of the single component of the cluster. Such behaviors can be described through Markov chains as a series of transitions characterized by time. The proposed method has been applied to a case study concerning the problem of University dropouts. In such a context the proposed modeling technique can be used in order to define clusters of students associated with different dropout risk degree. Another use of the method concerns the analysis of the individual patterns in order to identify possible policies aimed at lowering the dropout risk levels. Then, in this sense, the proposed method can be used for planning and control activities.
References

1. Geman S., Geman D. (1984) "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721-741
2. Gelfand A.E., Smith A.F.M. (1990) "Sampling based approaches to calculating marginal densities", Journal of the American Statistical Association, vol. 85, pp. 398-409
3. Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller E. (1953) "Equation of state calculations by fast computing machines", Journal of Chemical Physics, vol. 21, pp. 1087-1092
4. Hastings W.K. (1970) "Monte Carlo sampling methods using Markov chains and their applications", Biometrika, vol. 57, pp. 97-109
5. Agrawal R., Srikant R. (1994) "Fast algorithms for mining association rules in large databases", in Proc. of the VLDB Conference, Santiago, Chile
6. Mannila H., Toivonen H., Verkamo I. (1994) "Efficient algorithms for discovering association rules", in KDD-94: AAAI Workshop on Knowledge Discovery in Databases
7. Agrawal R., Srikant R. (1995) "Mining Sequential Patterns", IBM Research Report
8. Howard R.A. (1960) Dynamic Programming and Markov Processes, John Wiley
9. Neal R. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Dept. of Computer Science, University of Toronto
10. Diaconis P., Stroock D. (1991) "Geometric bounds for eigenvalues of Markov chains", Annals of Applied Probability, 1, 36-61
11. Taha H.A. (1971) Operations Research, Macmillan, 400-406
12. Howard R.A. (1960) Dynamic Programming and Markov Processes, Wiley
13. Arnold S.F. (1993) "Gibbs Sampling", in Handbook of Statistics, 9, 599-626
14. Chib S., Greenberg E. (1995) "Understanding the Metropolis-Hastings Algorithm", The American Statistician, 49, #4, 329-335
Discovering and Visualizing Attribute Associations Using Bayesian Networks and Their Use in KDD

Gou Masuda (1), Rei Yano (1), Norihiro Sakamoto (2), and Kazuo Ushijima (1)
(1) Graduate School of Information Science and Electrical Engineering, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan; {masuda,yano,ushijima}@csce.kyushu-u.ac.jp
(2) Department of Medical Informatics, Kyushu University Hospital, 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan; [email protected]
Abstract. In this paper we describe a way to discover attribute associations and a way to present them to users using Bayesian networks. We describe a three-dimensional visualization to present them effectively to users. Furthermore we discuss two applications of attribute associations to the KDD process. One application involves using them to support feature selection. The result of our experiment shows that feature selection using visualized attribute associations works well in 17 data sets out of the 24 that were used. The other application uses them to support the selection of data mining methods. We discuss the possibility of using attribute associations to help in deciding if a given data set is suited to learning decision trees. We found 3 types of structural characteristics in Bayesian networks obtained from the data. The characteristics have strong relevance to the results of learning decision trees.
1 Introduction
Remarkable progress in data collecting and storing technologies has been generating a large number of huge databases such as astronomy databases and human genome databases. Knowledge Discovery in Databases (KDD) [3] aims automatically to analyze such huge databases and extract useful and interesting knowledge. A number of heuristic methods and strategies have been proposed for improving efficiency and accuracy in KDD. In general there is no single best method or strategy for all knowledge discovery tasks. Users therefore have to select an appropriate method for their specific task. However there are no clear theoretical metrics for selecting an appropriate method under a given circumstance. Consequently users have to apply a range of methods to their own data and repeatedly compare results to determine which provides the best fit. The KDD process thus has an iterative and interactive nature. In this situation, it is essential to visualize and present to users as much information on data as possible. ˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 61–70, 1999. c Springer-Verlag Berlin Heidelberg 1999
The purpose of this study is to discover attribute associations and to present them to users in the KDD process. An attribute association is one kind of information implicit among data and it possesses at least two features. One is the degree of relevance between a pair of attributes the data have. The other is the structure that exists between them. Discovering such attribute associations and presenting them to users make it possible to conduct data mining effectively. We propose a way using Bayesian networks [4], which are one of the graphical representations of knowledge that employ directed acyclic graph. Further consideration is given to the applicability of attribute associations to two different steps of the KDD process. One application we describe involves utilization of attribute associations to support feature selection [5,6]. The other application we discuss is the possibility of using attribute associations to support the selection of data mining methods. In this study we use them for deciding whether a given data set is suited to learning decision trees [9]. The remainder of this paper is organized as follows. Section 2 reviews the Bayesian networks and describes a way of discovering and visualizing attribute associations using Bayesian networks. Section 3 presents an application using attribute associations to support feature selection in the KDD process. Section 4 argues the possibility of using attribute associations to support discrimination of data suitable for learning decision trees. Section 5 concludes this paper.
2 Discovering and Visualizing Attribute Associations
2.1 Attribute Associations
To begin, we describe the data format that we deal with in this study and define several terms. A case, a tuple of data, is expressed in terms of a fixed collection of attributes. Each attribute has either discrete or numeric values. A case also has a predefined category of the target concept. A data set is a set of cases for an event. Attribute associations of data are information on the degree of relevance between a pair of attributes and on the structure existing between them. "Degree of relevance between attributes" is a numeric value which represents the strength of relevance, such as covariance, correlation coefficient or mutual information. "Structure existing between attributes" is an indication of which pairs of attributes have relevance. Since the data we deal with contain a target concept, we need to consider associations not only among attributes but also between attributes and the target concept. We simply deal with the target concept as an attribute of the data because a target concept can be regarded as a discrete attribute.
2.2 Discovering Attribute Associations Using Learning Bayesian Networks
The first question to be discussed here is how we obtain attribute associations that exist implicitly in data. We propose a method of discovering attribute associations via learning Bayesian networks. A Bayesian network is a directed
acyclic graph with a conditional probability distribution for each node. Each node represents an attribute in the data. Arcs between nodes represent probabilistic dependencies among the attributes. A set of conditional probability distributions defines these dependencies. The task of discovering attribute associations from data is equivalent to learning Bayesian networks from the data. The problem of learning a Bayesian network can be informally stated as: given a training set of data, find a network that best matches the data [2]. We used the following algorithm for learning Bayesian networks, which is based on a greedy search strategy.
1. Let N(V, A) be a network where V = { all the nodes corresponding to the attributes of a data set } and A = {}. Let L and I be empty lists which are used in this algorithm. For each candidate arc (vi, vj), vi, vj ∈ V, compute a score for the arc. Sort all the arcs by score and put them into list L in decreasing order.
2. Select arcs from L created in step 1 and put them into list I, which is used as input for the construction of a network.
3. Create 3 candidates N1, N2 and N3 by adding the arc ai at the head of I to the current network N. N1 is a network to which ai is added in some direction. N2 is a network to which ai is added in the opposite direction to N1. N3 is a network to which no edge is added. Remove ai from I.
4. Compute scores for the 3 candidates created in step 3. Select the network that gives the best score.
5. In a case in which N1 or N2 is selected in step 4, add ai to A and go back to step 3. If I becomes empty, return N.
In step 1, we use the mutual information of each pair of nodes as the score for the arc. The mutual information of two nodes Xi, Xj is defined as

I(X_i, X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)}    (1)
where xi is a possible attribute value of the attribute corresponding to the node Xi, P(xi) is the probability calculated as the ratio of the number of cases which have xi to the total number of cases in a data set, and P(xi, xj) is the probability calculated as the ratio of the number of cases which have both xi and xj to the total number of cases. We apply the drafting [1] to select arcs in step 2. It selects n − 1 arcs from the head of L, where n is the number of nodes in N. This prevents there being an excess of edges in a network. We adopt the Bayesian Dirichlet (BD) metric [4] as a scoring metric for a Bayesian network in step 3. It calculates the relative posterior probability of a network structure given a data set. Let D be the data set, BSh be the hypothesis that a data set D is generated by network structure BS and ξ be given background knowledge. The BD metric is calculated as

p(D, B_S^h \mid \xi) = p(B_S^h \mid \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N_{ij} + N'_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + N'_{ijk})}{\Gamma(N'_{ijk})}    (2)
where ri is the number of possible attribute values of the i-th node Xi, qi is the number of states of Πi, Nijk is the number of cases in D in which Xi = k and Πi = j, N′ijk are the Dirichlet exponents, Nij = Σk Nijk and N′ij = Σk N′ijk, with the sums running over k = 1, ..., ri. Πi denotes a parent node set of Xi such that Xi and {X1, ..., Xi−1} are conditionally independent given Πi. Γ is the Gamma function, which satisfies Γ(x + 1) = xΓ(x) and Γ(1) = 1. As the Dirichlet exponents we use N′ijk = 2/(ri qi). Because this learning algorithm is not able to deal with numeric attributes, a discretization is required beforehand. We adopt the gain criterion [9] to find a threshold value that divides numeric attribute values into two discrete ones.

Fig. 1. Visualization of attribute associations
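The sketch below illustrates steps 1–2 of the learning algorithm of this subsection: every candidate arc is scored by the mutual information of Eq. (1) and the best n − 1 arcs are kept (the drafting step). It is only an illustration under our own naming and data-format assumptions, not the authors' implementation, and the BD-metric comparison of steps 3–5 is omitted.

```python
# Sketch of steps 1-2: score attribute pairs by mutual information and
# keep the n-1 best-scoring arcs ("drafting"). Data format and names are
# illustrative, not taken from the paper.
import math
from collections import Counter
from itertools import combinations

def mutual_information(cases, a, b):
    """I(A, B) estimated from relative frequencies in `cases` (list of dicts)."""
    n = len(cases)
    pa = Counter(c[a] for c in cases)
    pb = Counter(c[b] for c in cases)
    pab = Counter((c[a], c[b]) for c in cases)
    mi = 0.0
    for (va, vb), nab in pab.items():
        p_ab = nab / n
        mi += p_ab * math.log(p_ab / ((pa[va] / n) * (pb[vb] / n)))
    return mi

def draft_arcs(cases, attributes):
    """Sort all candidate arcs by score and keep the best n-1 of them."""
    scored = sorted(((mutual_information(cases, a, b), (a, b))
                     for a, b in combinations(attributes, 2)),
                    reverse=True)
    return [arc for _, arc in scored[:len(attributes) - 1]]

# toy usage in the style of the golf-tournament example of Sect. 2.3
cases = [{"outlook": "sunny", "humidity": "high", "classes": "no"},
         {"outlook": "rain", "humidity": "high", "classes": "no"},
         {"outlook": "sunny", "humidity": "normal", "classes": "yes"},
         {"outlook": "overcast", "humidity": "normal", "classes": "yes"}]
print(draft_arcs(cases, ["outlook", "humidity", "classes"]))
```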
2.3 Visualization of Attribute Associations
We propose a three-dimensional visualization of attribute associations obtained from data. We describe how we visualize them in this section.
1. Degree of relevance between a pair of attributes: We represent the degree of relevance by the size of a node, the color of a node and the thickness of an arc between nodes. We arrange a node indicating the target concept in the center. Attributes relevant to the target concept are arranged on the inner circumference surrounding it. Attributes irrelevant to the target concept are arranged on the outer circumference. When an attribute is a constituent of the network including the target concept it is regarded as relevant to the target concept. On the other hand an attribute is regarded as irrelevant to the target concept when it is not contained in the network. As regards the color and size of nodes, we give a different color and size to each node according to its kind. A node indicating the target concept is red. Attributes relevant to the target concept are purple while attributes irrelevant to the target concept are blue and are smaller than those relevant to the target concept. Thickness of an arc indicates the strength of mutual information the arc has.
2. Structure existing between a pair of attributes: We represent an arc in a Bayesian network by an arrow and the structure of cause and effect by
the direction of the arrow from cause to effect. Attributes are topologically sorted by their causal direction and are arranged on the circumference. This enables users intuitively to understand the causal direction among attributes.
3. Extra basic information: We represent the type of an attribute by the shape of the node indicating the attribute. A discrete attribute is expressed by a square, while a numerical attribute is expressed by a sphere. An attribute name is labeled on each node. The number of possible discrete attribute values is also labeled on discrete attributes. Attributes which have a large number of attribute values (the default is 5) are colored in yellow in order that users can easily distinguish them.
We implemented a tool for discovering and visualizing attribute associations from data. It visualizes them in a three-dimensional view and provides manipulations such as rotation and zoom for users. These manipulations allow users closely to examine an area of interest. Figure 1 is an example of visualized attribute associations from a data set for a golf tournament. Each case in the data set indicates the target concept, whether a golf tournament takes place or not, under a set of conditions such as outlook, humidity and temperature. In this example, the target concept classes is arranged in the center of the figure. The attributes relevant to the target concept are arranged on the inner circumference (outlook, humidity, temperature, number of participants). These attributes are topologically sorted clockwise by their causal direction. Attributes irrelevant to the target concept are arranged on the outer circumference (ID, windy). Users can find that the attributes outlook and humidity have relevance to whether or not a golf tournament takes place. Furthermore users can see that the attribute outlook has strong relevance to the attribute number of participants.
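As a rough illustration of the encoding rules of this section, the following sketch maps a single attribute to the node style (shape, colour, ring and size) described above. The function name, the concrete size values and the dictionary layout are our own illustrative choices, not the interface of the authors' tool.

```python
# Sketch of the visual-encoding rules: target concept in the center (red),
# relevant attributes on the inner ring (purple), irrelevant ones on the
# outer ring (blue, smaller); squares for discrete and spheres for numeric
# attributes; discrete attributes with many values are flagged in yellow.
def node_style(name, is_target, is_relevant, is_discrete, n_values=None,
               many_values_threshold=5):
    style = {"label": name,
             "shape": "square" if is_discrete else "sphere"}
    if is_target:
        style.update(color="red", size=1.0, ring="center")
    elif is_relevant:
        style.update(color="purple", size=0.8, ring="inner")
    else:
        style.update(color="blue", size=0.5, ring="outer")
    if is_discrete and n_values is not None:
        style["value_count_label"] = n_values
        if not is_target and n_values > many_values_threshold:
            style["color"] = "yellow"   # attribute with many possible values
    return style

print(node_style("outlook", is_target=False, is_relevant=True,
                 is_discrete=True, n_values=3))
```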
3 Using Attribute Associations to Support Feature Selection
3.1 Feature Selection Using Attribute Associations
Feature selection [5,6] eliminates irrelevant and/or redundant attributes in a data set in order to obtain simple and interpretable patterns and to decrease the size of the search space in data mining. We propose an interactive feature selection using visualized attribute associations. Our visualization shows which attributes have relevance to the target concept and the strength of the relevance. It enables users interactively to select attributes by looking at the visualized attribute associations for their data set. We believe that it is important for users to be able easily to reflect their intention in the KDD process. The following are examples of policies for feature selection which users can lay down.
– Rule 1: Eliminate all the attributes arranged on the outer circumference.
– Rule 2: Eliminate the discrete attributes which have a large number of possible attribute values.
Table 1. Results of experiment on feature selection

Data set                    Tree size            Predicted error rate (%)   Types of Network
                            No FS   FS    Rate   No FS   FS     Rate
australian                    53     37   0.70    14.0   14.5   1.04        DENSE EX
balance-scale(*)              41     41   1.00    34.1   34.1   1.00        STAR
breast-cancer-wisconsin        7      3   0.43     8.6    9.0   1.05        DENSE
breast-cancer                 23      6   0.26    29.3   28.6   0.98        SPARSE
crx                           46     31   0.67    14.3   14.3   1.00        DENSE EX
german                       118    125   1.06    25.9   25.0   0.96        SPARSE
glass                         59     43   0.73    21.1   25.9   1.23        —
hayes-roth                    34     25   0.74    32.2   34.1   1.06        STAR
heart                         45     28   0.62    19.3   20.0   1.04        —
hepatitis                     15     17   1.13    17.1   17.3   1.01        —
iris(**)                       7      7   1.00     6.3    6.3   1.00        DENSE
labor(**)                      7      7   1.00    17.4   17.4   1.00        —
liver                         81     79   0.98    23.1   27.1   1.17        SPARSE
lymphography                  33     15   0.45    25.5   24.9   0.98        DENSE EX
monk1                         18     29   1.61    28.6   26.5   0.93        STAR
pima-diabet                   25     27   1.08    22.5   22.5   1.00        —
post-operative(*)              1      1   1.00    32.9   32.9   1.00        SPARSE
primary-tumor                 61     59   0.97    56.6   56.6   1.00        STAR
segmentation                  25     25   1.00    10.6   10.7   1.01        DENSE
tic-tac-toe(*)               139    139   1.00    18.4   18.4   1.00        SPARSE
vehicle(*)                   183    183   1.00    15.4   15.4   1.00        SPARSE
voting(**)                     7      7   1.00     6.9    6.9   1.00        DENSE
wine(**)                       9      9   1.00     5.1    5.1   1.00        DENSE
zoo                           21     13   0.62    15.8   12.2   0.77        DENSE
Average                                   0.877                 1.009       —

Boldface denotes some improvement. * denotes that no attribute was eliminated in feature selection. ** denotes that several attributes were eliminated but there was no change in the result.
Attributes arranged on the outer circumference do not have relevance to the target concept. It is therefore a reasonable policy to eliminate such attributes (Rule 1). Moreover an attribute which has a large number of possible discrete attribute values tends to affect the size of patterns obtained from the data. Eliminating such attributes would also be a reasonable policy (Rule 2).
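A minimal sketch of how Rule 1 and Rule 2 could be applied automatically is given below; the attribute names, the helper function and the default threshold of five values are illustrative assumptions based on the description above, not the authors' code.

```python
# Sketch of the two elimination policies: Rule 1 drops attributes on the
# outer circumference (not relevant to the target concept), Rule 2 drops
# discrete attributes with many possible values.
def select_features(attributes, relevant, n_discrete_values, max_values=5):
    kept = []
    for a in attributes:
        if a not in relevant:                            # Rule 1
            continue
        if n_discrete_values.get(a, 0) > max_values:     # Rule 2
            continue
        kept.append(a)
    return kept

attrs = ["outlook", "humidity", "windy", "ID"]
print(select_features(attrs,
                      relevant={"outlook", "humidity"},
                      n_discrete_values={"outlook": 3, "ID": 14}))
# -> ['outlook', 'humidity']
```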
3.2 Experimental Results
We carried out an experiment in order to confirm the effectiveness of our proposed feature selection on the 24 data sets stored in the UCI Machine Learning Repository [10]. The two rules we described above were used as a policy of feature selection. We used a decision tree learning system [7,8] developed in our research group as a data mining method. It is based on C4.5 [9]. However flexibility
and extensibility of the system are emphasized in order easily to modify parts of the system using object-oriented technology. We adopted tree size, namely the number of nodes and the predicted error rate described in [9] as criteria for evaluation of results. In general it is to be desired that the decision tree has both a small size and a low predicted error rate. For comparison, we also analyzed the same data sets without feature selection. Table 1 shows the results. “FS” represents the results with feature selection, while “No FS” represents the results without feature selection. The results show that tree size obtained with feature selection is on average 0.88 times as large as that obtained without feature selection. However the predicted error rate did not worsen greatly with feature selection compared to without feature selection. To discuss these results in some detail, differences in either tree size or predicted error rate were found in 16 data sets out of 24. Of these, both tree size and predicted error rate were improved in 3 data sets. Tree size alone was improved in 8 data sets. Predicted error rate alone was improved in 2 data sets. Both got worse in 2 data sets. For the remaining 8 data sets, no attribute could be eliminated in 4 data sets (mark * in Table 1), while the eliminated attributes were not used originally in the other 4 data sets (mark ** in Table 1). However computation time was improved in these 4 data sets. These results indicate that feature selection using visualized attribute associations works well.
4 Using Attribute Associations to Support the Selection of Data Mining Methods
As [3] states, the ability to suggest to users the most appropriate data mining method is an important requirement for KDD tools. We describe our attempt to determine whether a given data set is suitable for learning decision trees by capturing the characteristics of the data via visualized attribute associations. We investigated 24 data sets and analyzed the results of learning decision trees used in the previous section. We found 3 types of structural characteristic in Bayesian networks obtained from the data. Moreover we found that these characteristics have a strong relevance to the analysis results of learning decision trees. We call these 3 characteristics of data DENSE, SPARSE and STAR according to the topology of the Bayesian networks we obtained from the data. We present the 3 types as follows.
DENSE: The Bayesian networks in Fig. 2 have thick arcs between attributes arranged on the inner circumference. They also have thick arcs between the target concept and attributes arranged on the inner circumference. We classify the data that produce such Bayesian networks as the DENSE type. We found 6 data sets belonged to the DENSE type among the 24 data sets we used in the previous section ("DENSE" type in Table 1). The results show that decision trees derived from such data tend to be small and to have a low predicted error rate. We believe the reason to be that attributes which have strong relevance to the target concept tend to be used as nodes of the decision trees.
Fig. 2. Examples of the DENSE type: breast-cancer-wisconsin(left) and wine(right)
Fig. 3. Examples of the SPARSE type: liver(left) and german(right)
Some exceptions were observed. Tree size tends to be large when the target concept has strong relevance to discrete attributes which have a large number of possible attribute values. For example, the data set lymphography includes 2 discrete attributes which have 8 possible attribute values. By eliminating these 2 attributes using feature selection, we were able to obtain a smaller decision tree while keeping its predicted error rate almost the same. We found similar results in the data sets crx and australian (“DENSE EX” type in Table 1). SPARSE: In the Bayesian networks shown in Fig. 3, the arcs between attributes arranged on the inner circumference are very narrow and the networks look quite sparse. In the Bayesian network obtained from the data set german (Fig. 3 right), in particular, almost all attributes are arranged on the outer circumference. We classify the data from which such Bayesian networks are obtained as the SPARSE type. We found 6 data sets belonged to the SPARSE type out of the 24 data sets we used (“SPARSE” type in Table 1). Decision trees obtained from such data tend to be large and to have high predicted error rates. We believe the reason is that few attributes have strong relevance to the target concept, and it is hence difficult to classify data with such attributes.
Fig. 4. Examples of the STAR type: balance-scale(left) and primary-tumor(right)
STAR: In the Bayesian networks shown in Fig. 4, each attribute is arranged like a star surrounding the target concept in the center. In other words, each attribute arranged on the inner circumference is relevant to the target concept, while the attributes themselves are irrelevant to each other. We classify the data from which such Bayesian networks are obtained as the STAR type. We found 4 data sets belonged to the STAR type out of the 24 data sets ("STAR" type in Table 1). It was found that decision trees derived from the STAR type data tend to be large and to have high predicted error rates. We believe the main reason for this is that such data are governed by a rule which is a conjunction of all the attributes relevant to the target concept. Decision trees therefore tend to be large and to overfit the training data, which worsens their predicted error rate.
Up to this point we have presented 3 types of structural characteristics which are found in the Bayesian networks we obtained from the data. As a result we were able to set up the following criteria for learning decision trees.
– In cases in which a data set belongs to the DENSE type, learning decision trees is suited for analyzing the data as a data mining method.
– In cases in which a data set belongs to the SPARSE or STAR types, learning decision trees is not suited for analyzing the data.
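The DENSE, SPARSE and STAR types are identified in the paper by visual inspection of the networks. The following sketch is only a hedged guess at how such a check could be automated from the learned structure; the thresholds and all names are arbitrary illustrations and do not come from the paper.

```python
# Heuristic sketch (our own assumption): classify a learned network as
# DENSE, SPARSE or STAR from its arc weights. `arcs` maps each arc (a, b)
# to a mutual-information weight, `relevant` is the set of inner-ring
# attributes and `target` is the class attribute; `strong` is an arbitrary
# threshold.
def network_type(arcs, relevant, target, strong=0.1):
    to_target = [w for (a, b), w in arcs.items() if target in (a, b)]
    among_relevant = [w for (a, b), w in arcs.items()
                      if a in relevant and b in relevant
                      and target not in (a, b)]
    if len(relevant) <= 1 or not to_target:
        return "SPARSE"            # few attributes relate to the target concept
    if not among_relevant:
        return "STAR"              # attributes tied only to the target concept
    if (sum(to_target) / len(to_target) >= strong
            and sum(among_relevant) / len(among_relevant) >= strong):
        return "DENSE"
    return "SPARSE"

arcs = {("classes", "outlook"): 0.4, ("classes", "humidity"): 0.3}
print(network_type(arcs, relevant={"outlook", "humidity"}, target="classes"))
# -> 'STAR'
```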
5 Conclusion
In this paper we have described a way of discovering and visualizing attribute associations using Bayesian networks. In addition we have described two applications using visualized attribute associations to the KDD process. As regards feature selection with attribute associations, we ascertained that our proposed method worked well in 17 data sets out of the 24 tested, all of which come from the UCI Machine Learning Repository. In our approach users can directly reflect their intentions in feature selection. This advantage is very important for KDD tools because of the interactive nature of the KDD process. As regards supporting the selection of data mining methods, we found 3 types of structural characteristic in Bayesian networks which have strong relevance to the results
of learning decision trees. Based on these characteristics we set up criteria for discrimination of data suitable for learning decision trees. Users can estimate whether a given data set should be analyzed by learning decision trees according to these criteria. However they are based on a visual characteristic. There may exist other essential characteristics which affect the correlation between the types of Bayesian networks and the result of learning decision trees. Further analysis on this will be necessary as part of our future work. We have not considered conditional probabilities of Bayesian networks in our Bayesian networks learning algorithm. Using the conditional probabilities enables users to obtain more information on their data. Introducing conditional probabilities into our learning algorithm is a future task.
Acknowledgments We would like to thank Nick May for his advice on the presentation of English. We would also like to thank the maintainers and contributors to the UCI repository of machine learning databases.
References
1. Cheng, J., Bell, D.A. and Liu, W. "Learning belief networks from data: an information theory based approach," Proceedings of the Sixth ACM International Conference on Information and Knowledge Management, 1997.
2. Friedman, N., Geiger, D. and Goldszmidt, M. Bayesian Network Classifiers, Machine Learning 29, pp. 131-163, 1997.
3. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
4. Heckerman, D., Geiger, D. and Chickering, D. "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Technical Report MSR-TR-94-09, Microsoft Research, 1994.
5. John, G., Kohavi, R. and Pfleger, K. "Irrelevant Features and the Subset Selection Problem," In Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129, Morgan Kaufmann Publishers, 1994.
6. Liu, H., Setiono, R. "Feature Selection and Classification – A Probabilistic Wrapper Approach," Proc. 9th Int. Conf. on IEA/AIE, pp. 419-424, 1996.
7. Masuda, G., Sakamoto, N. and Ushijima, K. "A Practical Object-Oriented Concept Learning System in Clinical Medicine," Proc. 9th Int. Conf. on IEA/AIE, pp. 449-454, 1996.
8. Masuda, G., Sakamoto, N. and Ushijima, K. "Applying Design Patterns to Decision Tree Learning System," Proc. of ACM SIGSOFT Sixth International Symposium on the Foundations of Software Engineering, pp. 111-120, 1998.
9. Quinlan, J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
10. Blake, C., Keogh, E. and Merz, C. J. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science, 1998.
Taxonomy Formation by Approximate Equivalence Relations, Revisited
F.A. El-Mouadib 1, J. Koronacki 1,2, and J.M. Zytkow 1,3
1 Institute of Computer Science, Polish Academy of Sciences
2 Polish-Japanese Institute of Computer Technology
3 Computer Science Department, UNC Charlotte
mouadib, [email protected], [email protected]
Abstract. Unsupervised classification of objects involves formation of classes and construction of one or more taxonomies that include those classes. Meaningful classes can be formed in feedback with acquisition of knowledge about each class. We demonstrate how contingency tables can be used to construct one-level taxonomy elements by relying only on approximate equivalence relations between attribute pairs, and how a multi-level taxonomy formation can be guided by a partition utility function. Databases with different types of attributes and a large number of records can be dealt with.
1 Introduction
As a part of their 49er system, Zytkow and Zembowicz (1993, 1995) and Troxel et al. (1994) proposed a mechanism for taxonomy formation based on approximate equivalence relations in two-way contingency tables. In this paper, we limit our attention to construction of binary taxonomy trees. At each node of the tree, only a binary split is allowed. Like 49er, our mechanism starts from one-level taxonomies. A systematic search is performed for pairs of attributes whose association is "strong enough" in a well defined sense. If such approximately equivalent pairs are found at a particular node of the tree, one of the one-level taxonomies derived from approximate equivalence is added under this node. In what follows, we deal with some modifications and extensions of the original methodology of Zytkow and others. In particular,
1. due to their simplicity and arguably clear interpretation, measures of association introduced by Goodman and Kruskal (1954) are used in lieu of the χ²-like Cramer's V statistics applied in 49er (Troxel et al., 1994; Zytkow and Zembowicz, 1996);
2. instead of using the original heuristic approach intended to minimize the description length of the taxonomy, the choice of a one-level taxonomy to be attached under a given node is based on maximization of the so-called partition utility (as suggested, e.g., by Fisher and Hapanyengwi (1993) and based either on entropy or on the Gini index);
3. a data-driven discretization of continuous variables is proposed; the original system uses discretization based on equal intervals or any other user-defined discretization of attributes.
Given the system's algorithmic simplicity, it is suitable for data mining applications. In the process of building a taxonomy tree, the system does not use any similarity/dissimilarity measures but instead relies only on approximating equivalence relations and on using partition utility functions. It should also be emphasized that, like 49er, the system works in a strictly unsupervised fashion. No a priori model of the data is assumed (neither is the number of classes prespecified nor are any assumptions about the data distribution made).
2 Approximate Equivalence Relations
Let us begin with a pair of attributes measured on a nominal scale. A two-way contingency table represents the equivalence relation between the two attributes involved if and only if both attributes can be binned so that the entire population is concentrated in cells no two of which are in the same row or column of the table. In real world databases, relations represented by such tables are rare. Therefore, like 49er, our system relies on approximate equivalence relations. But rather than comparing observed cell counts with the expected counts under independence to determine whether a contingency table approximates equivalence, as is the case when χ²-like statistics are used, we use a set of lambda (λ) measures of association introduced by Goodman and Kruskal (1954). Following Goodman and Kruskal (1954), let A and B be two nominal attributes assuming "levels", that is, values Aa, a = 1, ..., α, and Bb, b = 1, ..., β, respectively. Consider first the following asymmetric case: the classification of the values of A precedes the B classification chronologically, causally, or otherwise. An individual is chosen at random from the population and we are asked to guess its B-level as well as possible, either 1. given no further information, or 2. given the individual's A-level. One can then define the following measure of association

\lambda_b = \frac{\text{(Probab. of wrong guess in case 1)} - \text{(Probab. of wrong guess in case 2)}}{\text{(Probab. of wrong guess in case 1)}}.
It is easy to see that: λb is indeterminate if and only if attribute B assumes the same level for all the population (i.e., iff the population lies in one column of the contingency table); λb ∈ [0, 1]; λb = 0 if and only if knowledge of the A classification is of no help in predicting the B classification; λb = 1 if and only if the A classification completely specifies the B classification; if A and B are independent and λb is determinate, then λb = 0; λb is unchanged by permutation of rows or columns.
The λa measure can be defined analogously. In turn, consider a symmetric case when an individual is chosen at random and the problem is to guess its A value half the time and its B value half the time (at random), either: 1. given no further information, or 2. given the individual's A-level when the B level is guessed and vice versa. Clearly, the measure λ can now be defined in the same way as λb. Both measures have analogous properties: λ is indeterminate if and only if the population lies in one cell of the contingency table; λ ∈ [0, 1]; λ = 1 if and only if all the population is concentrated in cells no two of which are in the same row or column of the table; if A and B are independent and λ is determinate, then λ = 0; λ is unchanged by permutation of rows or columns; λ lies between λa and λb inclusive. The computation of λa, λb and λ is extremely simple. In particular, in terms of cell counts,

\lambda = \frac{\sum_a n_{am} + \sum_b n_{mb} - n_{\cdot m} - n_{m \cdot}}{2n - (n_{\cdot m} + n_{m \cdot})},

where n is the size of the population (total count), nam is the maximum cell count in row a, nmb is the maximum cell count in column b, n·m = maxb {Σa nab} and nm· = maxa {Σb nab}. See Goodman and Kruskal (1954) and Bishop et al. (1975) for a detailed discussion of lambda measures, as well as for the definitions and a general treatment of other measures of association. A cursory look at the properties of lambda measures shows that they are very well suited to measure not only association between nominal attributes but also their "closeness" to equivalence. In the nominal case, the strength of association and closeness to equivalence can be considered as essentially the same. For attributes measured on an ordinal scale, association between them can be represented by probabilities for like and unlike orders (positive or negative association). On the other hand, the issue of (approximate) equivalence is not tied to any notion of order. Thus, although the λ measures are designed specifically to measure association when no order is present, they can be used to evaluate approximate equivalence between any ordered and nominal attributes. Usually, clusters which are present in data described by ordered attributes reflect the ordered structure of the data: say, "small" values of an attribute appear in one cluster and "medium and large" values in another, not "small and large" in one and "medium" in another. While in this paper we stick to the lambda measures, in El-Mouadib and Koronacki (1999) a measure of closeness to equivalence based on probabilities of like and unlike orders is proposed (that other measure is a minor modification of Goodman and Kruskal's (1954) γ measure of association of ordered attributes). In summary, in order to form one-level taxonomies, a systematic search is performed for pairs of attributes whose strength of association is above some controllable threshold value. In our implementation, a threshold can be set for λa and λb, and any pair of attributes whose λa or λb is above that threshold is considered conducive to a one-level taxonomy. If such a pair is found, its contingency table is then aggregated into a two-by-two table by merging values of
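A minimal sketch of the lambda computations in terms of cell counts follows; it mirrors the formulas above, but the function name and table format are our own choices (indeterminate cases are not guarded).

```python
# Goodman-Kruskal lambda measures from a contingency table `t`
# (list of rows of cell counts).
def lambda_measures(t):
    n = sum(sum(row) for row in t)
    row_sums = [sum(row) for row in t]
    col_sums = [sum(col) for col in zip(*t)]
    sum_row_max = sum(max(row) for row in t)          # sum_a n_am
    sum_col_max = sum(max(col) for col in zip(*t))    # sum_b n_mb
    n_dot_m = max(col_sums)                           # n_.m
    n_m_dot = max(row_sums)                           # n_m.
    lambda_b = (sum_row_max - n_dot_m) / (n - n_dot_m)
    lambda_a = (sum_col_max - n_m_dot) / (n - n_m_dot)
    lam = (sum_row_max + sum_col_max - n_dot_m - n_m_dot) / (2 * n - (n_dot_m + n_m_dot))
    return lambda_a, lambda_b, lam

# near-equivalence: most of the mass on the diagonal
table = [[40, 2], [3, 55]]
print(lambda_measures(table))   # all three values close to 0.88
```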
the two attributes involved. Aggregation of values should be performed in such a way that the resulting 2 × 2 table has the largest possible value of λ. If λ proves sufficiently close to one, the approximate equivalence described by the 2 × 2 table provides a natural basis for a binary split (i.e., creation of a hierarchy element). If an attribute is nominal, aggregation of its values is preceded by their rearrangement guided by correspondence analysis (rearrangement by correspondence analysis is a standard statistical tool: see, e.g., Mardia et al. (1979) for an illustrative example, Krzanowski (1988) for another example and a focus on singular value decomposition to replace spectral decomposition, and Krzanowski and Marriott (1994) for interpretation in terms of level profiles of attributes involved). Loosely speaking, correspondence analysis of rows of the contingency table (the same applies to columns) may enable one to rearrange the rows in such a way that the least distant rows (in terms of the so-called χ² distance between the row profiles or, simply, the rows) are adjacent after rearrangement and, generally, the order in which rearranged rows appear in the table reflects the distances between the rows. After rows and columns rearrangement, or without it if the attributes are ordered, aggregation can be easily done automatically by merging rows 1 to a into one aggregated level of attribute A and rows a + 1 to α into another level of A, and merging columns 1 to b into one aggregated level of attribute B and columns b + 1 to β into another. Running over all possible pairs (a, b), one finds a 2 × 2 table with the largest value of λ. If the maximum value of λ is above a prespecified threshold, the table is retained for further analysis.
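The following sketch illustrates the aggregation step just described: with rows and columns already (re)arranged, every cut point (a, b) is tried and the 2 × 2 table with the largest symmetric λ is returned. It is an illustration under our own naming, not the authors' implementation.

```python
# Search over all cut points (a, b): rows 1..a and columns 1..b are merged
# into one aggregated level each, the remaining rows/columns into another.
def symmetric_lambda(t):
    n = sum(sum(r) for r in t)
    row_max = sum(max(r) for r in t)
    col_max = sum(max(c) for c in zip(*t))
    n_dot_m = max(sum(c) for c in zip(*t))
    n_m_dot = max(sum(r) for r in t)
    denom = 2 * n - (n_dot_m + n_m_dot)
    return (row_max + col_max - n_dot_m - n_m_dot) / denom if denom else 0.0

def best_2x2(table):
    rows, cols = len(table), len(table[0])
    best_lambda, best_table = -1.0, None
    for a in range(1, rows):
        for b in range(1, cols):
            agg = [[sum(table[i][j] for i in ri for j in ci)
                    for ci in (range(b), range(b, cols))]
                   for ri in (range(a), range(a, rows))]
            lam = symmetric_lambda(agg)
            if lam > best_lambda:
                best_lambda, best_table = lam, agg
    return best_lambda, best_table

t = [[20, 3, 1],
     [2, 18, 4],
     [1, 2, 25]]
print(best_2x2(t))
```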
3 Discretization
There are a variety of discretization methods. Some of them are simple and obvious, such as the equal-width-intervals and equal-frequency-intervals methods. Others are not so obvious, such as ChiMerge (see Kerber, 1992), entropy-based methods (see Fayyad and Keki, 1992, 1993) and a similarity-based method (Van de Merckt, 1993). Assuming that the attribute's measurements come from a continuous distribution, El-Mouadib and Koronacki (1999) proposed a discretization method which is based on the idea of building first a histogram with a bandwidth which is oversmoothed from the point of view of the (asymptotic) mean integrated squared error (MISE),

\mathrm{MISE} = E \int_{-\infty}^{\infty} \left( f(x) - \hat{f}(x) \right)^2 dx,
where E is the expected value, f(·) is the true density underlying the data and f̂(·) denotes the histogram. See Scott (1992) for a detailed exposition; let us recall here only that, if one takes a family of all one-dimensional densities f(·) having the same variance, then the oversmoothed bin width can be defined as

h = 2.603\,(IQ)\,n^{-1/3} \equiv \tilde{h}_{os},
where IQ is the interquartile range and n is the sample size (i.e., the number of measurements of the attribute being discretized). The rationale behind the given initial choice of the bin width is simple: an oversmoothed bin width can be thought of as providing an upper bound for the discretization interval, which is then to be sought in a way consistent with the ultimate task of building a taxonomy. The final bin width or discretization interval is found by trial and error; see the iris example in El-Mouadib and Koronacki (1999). A few remarks are in order here. First, the proposed approach is clearly ad hoc, even if natural and straightforward. However, unless some restrictions on distributions involved are imposed, no general theory, treating attributes in different scales in a unified way, is available so far (an interesting approach to discretization is given, e.g., in Ciok (1998), but there the number of clusters is assumed known a priori). Second, since partition is based on two-way tables, discretization should in fact rest on building two-dimensional histograms. And third, since simple histograms are vulnerable to bin edge problems, they should be replaced by the so-called averaged shifted histograms (ASHs). Indeed, while a simple histogram can perform well in an easy case of the famous iris data, in particular the bin edge problem can have a disastrous effect on the algorithm's ability to find clusters in not so nice data (actually, we encountered this problem in the soybean example discussed in the next section before we switched to using ASHs). Bivariate ASHs are constructed by averaging m² shifted bivariate histograms, each with bins of prespecified dimension h1 × h2. Let the bin mesh of the first (in a sense the "leftmost", as we shall see) of the m² histograms be fixed. In order to be specific, assume that the bin origin of this first histogram is the point (0, 0) and the histogram is built on the positive quadrant of the Cartesian plane. Then the remaining m² − 1 histograms are constructed by coordinate shifts that are multiples of δi = hi/m, i = 1, 2, and the ASH assumes the form

\hat{f}(\cdot,\cdot) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \hat{f}_{ij}(\cdot,\cdot),
where the bin origin for the bivariate shifted histogram f̂ij(·, ·) is the point ((i − 1)δ1, (j − 1)δ2). The given construction practically alleviates the bin edge problem (see Scott (1992) for a detailed discussion of ASHs). In our examples we begin with the bin width for a given attribute equal to h̃os for that attribute (admittedly, the issue of properly choosing the initial dimension of bins requires an additional study). Concluding this section, let us notice that binning of genuinely discrete data is often needed as well. Indeed, it can be advantageous to produce a histogram with a coarser, often much coarser, mesh than the original one. This can be done by trial and error, as in the handwritten digits example in El-Mouadib and Koronacki (1999).
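A small sketch of the oversmoothed bin width and of a one-dimensional averaged shifted histogram is given below (the system described here uses the bivariate version); numpy is used for brevity and all names are illustrative.

```python
# Oversmoothed bin width and a 1-D averaged shifted histogram.
import numpy as np

def oversmoothed_bin_width(x):
    """h_os = 2.603 * IQ * n**(-1/3), as recalled from Scott (1992)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.603 * (q75 - q25) * len(x) ** (-1.0 / 3.0)

def ash_1d(x, h, m=8):
    """Average of m histograms whose bin origins are shifted by delta = h/m."""
    n, delta = len(x), h / m
    edges = np.arange(x.min() - h, x.max() + h + delta, delta)
    counts, _ = np.histogram(x, bins=edges)
    # averaging the m shifted histograms is equivalent to smoothing the
    # fine-mesh counts with triangular weights (1 - |i|/m)
    i = np.arange(-(m - 1), m)
    weights = 1.0 - np.abs(i) / m
    ash = np.convolve(counts, weights, mode="same") / (n * h)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, ash

x = np.random.default_rng(0).normal(size=200)
h = oversmoothed_bin_width(x)
centers, density = ash_1d(x, h)
print(round(h, 3), round(density.sum() * (h / 8), 3))  # integrates roughly to 1
```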
4 Partition Utility and Creating Many-level Taxonomies
In the taxonomy formation process, the search for approximate equivalence relations may result in creation of many one-level taxonomies (also called hierarchy elements by Zembowicz and Zytkow, 1996). If two such taxonomies have approximately the same data range and share common descriptors in the children classes, then the pair is merged into one one-level taxonomy. Still, after merging, several one-level taxonomies may remain. One of them is chosen as the root of the multi-level taxonomy. To make that choice we use the partition utility function (see Fisher (1996), Fisher and Hapanyengwi (1993) for details). Our taxonomy formation algorithm picks the one-level taxonomy with the greatest partition utility value as the one which guides the split at this level of the tree. The partition utility function (PU) based on the Gini index is given by (an analogous formula can be given for the PU function based on entropy)

PU = \frac{1}{K} \sum_{k=1}^{K} CU(C_k),

CU(C_k) = P(C_k) \sum_{i=1}^{I} \sum_{j=1}^{J_i} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right],
where K is the number of categories in a partition (default=2), I is the number of attributes in the given data, and Ji is the number of values of attribute Ai .
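A minimal sketch of the Gini-based partition utility defined above follows; the data format (clusters as lists of attribute-value dictionaries) and the function name are our own illustrative choices.

```python
# Gini-based partition utility: PU = (1/K) * sum_k CU(C_k), with
# CU(C_k) = P(C_k) * sum_i sum_j [P(A_i=V_ij|C_k)^2 - P(A_i=V_ij)^2].
from collections import Counter

def partition_utility(partition, attributes):
    all_cases = [c for cluster in partition for c in cluster]
    n, K = len(all_cases), len(partition)

    def value_probs(cases, a):
        counts = Counter(c[a] for c in cases)
        total = len(cases)
        return {v: cnt / total for v, cnt in counts.items()}

    base = {a: value_probs(all_cases, a) for a in attributes}
    pu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n
        cu = 0.0
        for a in attributes:
            within = value_probs(cluster, a)
            values = set(base[a]) | set(within)
            cu += sum(within.get(v, 0.0) ** 2 - base[a].get(v, 0.0) ** 2
                      for v in values)
        pu += p_ck * cu
    return pu / K

left = [{"A1": "x"}, {"A1": "x"}]
right = [{"A1": "y"}, {"A1": "y"}]
print(partition_utility([left, right], ["A1"]))   # well-separated split -> 0.25
```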
5 Experiments and Results
In this section, we demonstrate the results obtained by our taxonomy formation system for two examples. A general algorithm for building a taxonomy tree has been given by Zytkow and others and is not repeated here. To see if the algorithm works as desired, we have used real-life examples with data labeled and coming from known classes. Of course, class labels have been removed before building a taxonomy tree. The first is the satellite data example of Merz and Murphy (1996) from the UCI repository. We used the set of 2000 records coming from 6 classes, each with 36 attributes of continuous type. As claimed in the commentary to the repository and confirmed by principal component analysis of the data, attributes No. 17, 18, 19 and 20 contain essentially all the information hidden in the data set. Our analysis was therefore confined to this subset of attributes. The tree obtained, with discretization provided by ASHs and bin widths always equal to the h̃os's, is shown in Fig. 1. The split at the top level of the tree was guided by an approximate equivalence relation with λ = 0.882, obtained after suitable aggregation for attributes 18 and 20, and depicted in the following 2 × 2 table with rows corresponding to aggregated values of attribute 18 and columns corresponding to aggregated values of attribute 20:
    9   1810
  172      9
Fig. 1. Taxonomy tree for the satellite data (the root holds all 2000 records; it is split into nodes of 1810 and 172 records, and the 1810-record node is split further into nodes of 861 and 814 records; class counts are shown at every node)
Of all 2000 records, 18 records which do not fit the relation, and hence do not fit the rule for the split, were “lost” from the taxonomy tree. For the lower level split, λ was equal to 0.819 (and 135 records were “lost”). The best regularities for the two leaves had λ’s equal, respectively, to 0.668 and 0.563, and were therefore not used to expand the taxonomy. The second example deals with the soybean data (again from the UCI repository). The data are nominal, with 35 attributes. The set of 307 records coming from 19 classes was chosen but all the records with missing values were excluded from the set. In this way, a subset of 266 records from 15 classes was obtained as depicted in the table below.
No.  Class description       N. of cases    No.  Class description        N. of cases
1    diaporthe-stem-canker   10             9    bacterial-blight         10
2    charcoal-rot            10             10   bacterial-pustule        10
3    rhizoctonia-root-rot    10             11   purple-seed-stain        10
4    phytophthora-rot        16             12   anthracnose              20
5    brown-stem-rot          20             13   phyllosticta-leaf-spot   10
6    powdery-mildew          10             14   alternarialeaf-spot      40
7    downy-mildew            10             15   frog-eye-leaf-spot       40
8    brown-spot              40
The classification is shown in Fig. 2. Only 15 cases were "lost" from the taxonomy tree as violating the equivalence relation in one of the 2 × 2 tables used to build the tree.
6 Concluding Remarks
In addition to the two examples from the previous section and those from El-Mouadib and Koronacki (1999), the taxonomy formation system was examined on several other databases from the UCI repository. In all the examples, results were encouraging and of similar quality. While in most of the examples dendrograms based on similarity/dissimilarity measures could have been easily built, our study has confirmed the validity of the idea of using approximate equivalence relations as a basis for taxonomy formation.
References
Bishop, Y., Fienberg, S., Holland, P.: Discrete Multivariate Analysis: Theory and Practice. The MIT Press (1975)
Ciok, A.: Discretization as a tool in cluster analysis. In: Rizzi, A., Vichi, M., Bock, H.H. (eds.): Advances in Data Science and Classification Societies. Springer (1998) 349–354
El-Mouadib, F.A., Koronacki, J.: On Taxonomy Formation by Approximate Equivalence Relations. Proc. of the 8th Workshop on Intelligent Information Systems. Polish Acad. Sci. (1999) 37–46
Fayyad, U.M., Keki, I.B.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. Proc. of 13th Intern. Joint Conf. on Artificial Intelligence. Chambery, France (1993) 1022–1027
Fayyad, U.M., Keki, I.B.: Technical Note: On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning 8 (1992) 87–102
Fisher, D.: Iterative Optimization and Simplification of Hierarchical Clusterings. J. of Artificial Intelligence Research (1996) 147–179
Fisher, D., Hapanyengwi, G.: Database Management and Analysis Tools of Machine Induction. Journal of Intelligent Information Systems 2 (1993) 5–38
Goodman, L., Kruskal, W.: Measure of Association for Cross Classification. Springer Series in Statistics, Springer-Verlag (1979). Reprinted from J. of the American Statistical Association 49 (1954) 732–764
Kerber, R.: ChiMerge: Discretization of Numeric Attributes. Proc. of the 10th National Conference on Artificial Intelligence, AAAI/MIT Press (1992) 123–128
Fig. 2. Taxonomy tree for the soybean data (m in a node gives the number of records in the node and m/n in leaves means "m records of class n in the given leaf")
Krzanowski, W.: Principles of Multivariate Analysis. Oxford University Press (1988)
Krzanowski, W., Marriott, F.: Multivariate Analysis. Part 1. Edward Arnold (1994)
Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press (1979)
Merz, C.J., Murphy, P.M.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1996)
Scott, D.W.: Multivariate Density Estimation: Theory, Practice and Visualization. Wiley (1992)
Troxel, M., Swarm, K., Zembowicz, R., Zytkow, J.: Concept Hierarchies: a Restricted Form of Knowledge Derived From Regularities. In: Ras, Z., Zemankova, M. (eds.): Proc. of the Seventh International Symposium on Methodologies for Intelligent Systems (1994) 437–447
Van de Merckt, T.: Decision Trees in Numerical Attribute Space. Proc. of 13th Intern. Joint Conf. on Artificial Intelligence, Chambery, France (1993) 1016–1021
Zembowicz, R., Zytkow, J.: From Contingency Tables to Various Forms of Knowledge in Databases. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press (1995)
Zytkow, J., Zembowicz, R.: Database Exploration in Search of Regularities. Journal of Intelligent Information Systems 2 (1993) 39–81
On the Use of Self-Organizing Maps for Clustering and Visualization
Arthur Flexer
The Austrian Research Institute for Artificial Intelligence
Schottengasse 3, A-1010 Vienna, Austria
[email protected]
Abstract. We show that the number of output units used in a self-organizing map (SOM) influences its applicability for either clustering or visualization. By reviewing the appropriate literature and theory and presenting our own empirical results, we demonstrate that SOMs can be used for clustering or visualization separately, for simultaneous clustering and visualization, and even for clustering via visualization. For all these different kinds of application, SOM is compared to other statistical approaches. This will show SOM to be a flexible tool which can be used for various forms of explorative data analysis, but it will also be made obvious that this flexibility comes at a price in terms of impaired performance. The usage of SOM in the data mining community is covered by discussing its application in the data mining tools CLEMENTINE and WEBSOM.
1 Introduction
Self-organizing maps (SOM) [12] are a very popular tool used for a range of different purposes including clustering and visualization of high dimensional data spaces. SOM is also used in two prominent data mining tools: it is one of the algorithms implemented in CLEMENTINE [6] and it is at the heart of WEBSOM, a system for automatic organization of large text document collections (see [15] and [14]). Although there is vast literature available concerning SOMs (a recent survey [13] contains about 2000 entries), it is still far from clear when and how to apply SOMs for either clustering or visualization or even how these two purposes and goals relate to each other. In a recent comprehensive monograph [13] SOM is said to "project and visualize high-dimensional data spaces". The fact that there is a relation to clustering and visualization techniques is also well known, see e.g. [1], [10], [13], [4] and [21]. Theoretical analysis of SOM concentrates on issues within the method (e.g. convergence) rather than commenting on how and for what SOM should actually be used (see [7] for a survey of results). However, there is also a considerable amount of criticism formulated both in terms of empirical and theoretical comparison. In [1] as well as [24] SOM is compared to various clustering algorithms on artificial data. In [2] SOM is compared to principal component analysis and Sammon mapping on a series of artificial and real world data sets. In [10] SOM is compared to a combined
method of vector quantization plus Sammon mapping of the codebook using multivariate normal data. Most of these empirical studies show SOM to perform equally well as or worse than the statistical approaches. There also exist two alternative re-formulations of the original idea of SOMs in more principled probabilistic frameworks ([4] and [21]). In [4] SOM is criticized for not defining a density model, for not optimizing an objective error function and for the lack of a guaranteed convergence property. Despite the wealth of work that has been done using and analysing SOMs, and even though considerable criticism has already been formulated, what is still missing are constructive guidelines that clarify when and how to use SOMs for either clustering or visualization and how these notions relate to each other in the context of SOMs. This is exactly what this paper tries to achieve by showing that the number of output units used in a SOM influences its applicability for either clustering or visualization. Appropriate literature and theory will be reviewed and our own empirical results will be presented which compare SOM to other statistical approaches. The usage of SOM in the two data mining tools CLEMENTINE and WEBSOM will be discussed.
2 SOM for Clustering
According to a standard text book on pattern recognition [19] "Clustering algorithms are methods to divide a set of n observations into g groups so that members of the same group are more alike than members of different groups...the groups are called clusters". A classical technique to achieve such a grouping is the K-means approach developed in the cluster analysis literature (starting from [16]). Closely related to SOM is online K-means clustering (oKMC), consisting of the following steps:
1. Initialization: Given N = number of codebook vectors, k = dimensionality of the vectors, n = number of input vectors, a training sequence {xj; j = 0, ..., n − 1}, an initial set Â0 of N codebook vectors x̂ and a discrete-time coordinate t = 0, ..., n − 1.
2. Given Ât = {x̂i; i = 1, ..., N}, find the minimum distortion partition P(Ât) = {Si; i = 1, ..., N}. Compute d(xt, x̂i) for i = 1, ..., N. If d(xt, x̂i) ≤ d(xt, x̂l) for all l, then xt ∈ Si (d is usually Euclidean distance).
3. Update the codebook vector with the minimum distortion

\hat{x}^{(t)}(S_i) = \hat{x}^{(t-1)}(S_i) + \alpha \left[ x(t) - \hat{x}^{(t-1)}(S_i) \right]    (1)

where α is a learning parameter to be defined by the user. Define Ât+1 = x̂(P(Ât)), replace t by t + 1; if t = n − 1, halt. Else go to step 2.
The main difference between the SOM-algorithm and oKMC is the fact that the codebook vectors are the weight vectors of the output units, which are ordered either on a line or on a planar grid (i.e. in a one or two dimensional output space). The iterative procedure is the same as with oKMC, where Equ. 1 is replaced by

\hat{x}^{(t)}(S_i) = \hat{x}^{(t-1)}(S_i) + h \left[ x(t) - \hat{x}^{(t-1)}(S_i) \right]    (2)
and this update is not only computed for the x̂i that gives minimum distortion, but also for all the codebook vectors which are in the neighbourhood of this x̂i on the line or planar grid. The degree of neighbourhood and the number of codebook vectors which are updated together with the x̂i that gives minimum distortion is expressed by h, a function that decreases both with distance on the line or planar grid and with time and that also includes an additional learning parameter α. If the degree of neighbourhood is decreased to zero, the SOM-algorithm becomes equal to the oKMC-algorithm. Whereas local convergence is guaranteed for oKMC (at least for decreasing α, [5]), no general proof for the convergence of SOM with nonzero neighbourhood is known. In [13] it is noted that the last steps of the SOM algorithm should be computed with zero neighbourhood in order to guarantee "the most accurate density approximation of the input samples". One of the main problems in clustering data is to decide on the correct number of clusters (i.e. codebook vectors). Clearly N, the number of cluster centers or output units, should be equal to g, the number of clusters present in the data. In [8] it is argued that one should compute successive partitions of the data with an ever growing number of clusters N. If samples are really grouped into g compact, well separated clusters, one would expect to see any error function based on within or between cluster variance (the same obviously holds for average distortion) decrease rapidly until N = g. Such error functions should decrease much more slowly thereafter until they reach zero at N = n. The two most comprehensive studies on SOM's clustering ability ([1] and [24]) use SOMs and cluster algorithms with N always set equal to g, the number of clusters known to be in the data. In [24] SOM is compared to five different cluster algorithms on 2580 artificial data sets. One-dimensional SOMs are being used with zero neighbourhood at the end of learning and consequently SOMs and K-means clustering perform equally well in terms of data points misclassified1, both being better than the other hierarchical cluster methods. In [1] SOM is compared to K-means clustering on 108 multivariate normal clustering problems but the SOM neighbourhood is not decreased to zero at the end of learning. SOM performs significantly worse in terms of data points misclassified since the additional neighbourhood term tends to pull the obtained cluster centers away from the true ones (the SOM cluster centers are pulled towards each other). In [13] this effect is described as two "opposing forces" where the weight vectors of the output units tend to describe the density function of the inputs and the local interactions between output units tend to preserve topology.
1 Although SOM is an unsupervised technique not built for classification, the number of points misclassified to a wrong cluster center is an appropriate and commonly used performance measure for cluster procedures if the true cluster structure is known. Given N = g, all members of one true cluster in the data space should be members of just one cluster in the obtained partition. All exchanges between clusters constitute data points misclassified.
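The sketch below illustrates the update of Equ. 1 and 2: one training step moves the winning codebook vector (oKMC) or the winner together with its grid neighbours (SOM) towards the input. The Gaussian neighbourhood, the schedules and all names are illustrative assumptions, not the exact procedures of the cited implementations.

```python
# One online training step for oKMC (zero neighbourhood) and SOM
# (neighbourhood h decaying with distance on the output grid).
import numpy as np

def train_step(x, codebook, grid, alpha=0.1, sigma=0.0):
    """codebook: (N, k) array; grid: (N, 2) output-space coordinates."""
    winner = np.argmin(np.linalg.norm(codebook - x, axis=1))
    if sigma == 0.0:                      # zero neighbourhood -> oKMC (Equ. 1)
        h = np.zeros(len(codebook))
        h[winner] = alpha
    else:                                 # SOM (Equ. 2): h decays with grid distance
        grid_dist = np.linalg.norm(grid - grid[winner], axis=1)
        h = alpha * np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
    codebook += h[:, None] * (x - codebook)
    return codebook

rng = np.random.default_rng(1)
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
codebook = rng.normal(size=(9, 4))
for t, x in enumerate(rng.normal(size=(100, 4))):
    sigma = max(0.0, 1.5 * (1 - t / 80))   # shrink neighbourhood to zero at the end
    codebook = train_step(x, codebook, grid, alpha=0.05, sigma=sigma)
```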
3 SOM for Simultaneous Clustering and Visualization
SOM is however more than just a technique to cluster data. It has the appealing property to do clustering and visualization at the same time by preserving the topological ordering of the input data reflected by an ordering of the codebook vectors in a one or two dimensional output space. Note that in order to use SOM for visualization and clustering at the same time it is again necessary that N, the number of output units, is equal to g, the number of clusters in the data set. Formally, a topology preserving algorithm is a transformation Φ : R^k → R^p that either preserves similarities or just similarity orderings of the points in the input space R^k when they are mapped into the output space R^p. For most algorithms it is the case that both the number of input vectors |x ∈ R^k| and the number of output vectors |ẋ ∈ R^p| are equal to n. A transformation Φ : ẋ = Φ(x) that preserves similarities poses the strongest possible constraint since d(xi, xj) = ḋ(ẋi, ẋj) for all xi, xj ∈ R^k, all ẋi, ẋj ∈ R^p, i, j = 1, ..., n − 1, d (ḋ) being a measure of distance in R^k (R^p). Such a transformation is said to produce an isometric image. Techniques for finding such transformations Φ are, among others, various forms of multidimensional scaling (MDS) like Sammon mapping [20] (note that for MDS not the actual coordinates of the points in the input space but only their distances or the ordering of the latter are needed), but also principal component analysis (PCA) (see e.g. [11]) or SOM. Sammon mapping is doing MDS by minimizing the following via steepest descent:

E = \frac{1}{\sum_{i=0}^{n-1} \sum_{j<i} d(x_i, x_j)} \sum_{i=0}^{n-1} \sum_{j<i} \frac{\left( d(x_i, x_j) - \dot{d}(\dot{x}_i, \dot{x}_j) \right)^2}{d(x_i, x_j)}    (3)
where ḋ(ẋi, ẋj) is the distance in the output space that corresponds to the distance d(xi, xj) in the input space. Since SOM has been designed heuristically and not to find an extremum for a certain energy function (in [9] it is even shown that such an objective function cannot exist for SOM), the theoretical connection to other MDS algorithms remains unclear. It should be noted that for SOM the number of output vectors |ẋ ∈ R^p| is limited to N, the number of cluster centroids x̂, and that the ẋ are further restricted to lie on a planar grid. This restriction entails a discretization of the output space R^p which allows only Σi=2..s i (s ≥ 2) different distances in an s × s planar grid instead of N(N − 1)/2 different distances for N = s × s cluster centroids mapped via e.g. Sammon mapping. In what we believe to be the only existing empirical study on SOM's ability of doing both clustering and visualization at the same time, we have compared SOM to a combined technique of online K-means clustering plus Sammon mapping of the cluster centroids. Our new combined approach (abbreviated oKMC+) consists of simply finding the set of Â = {x̂i, i = 1, ..., N} codebook vectors that give the minimum distortion partition P(Â) = {Si; i = 1, ..., N} via oKMC and then using the x̂i as input vectors to Sammon mapping and thereby obtaining a
two dimensional representation of the x̂i via minimizing the term in Equ. 3. Contrary to SOM, this two dimensional representation is not restricted to any fixed form and the distances between the N mapped x̂i directly correspond to those in the original higher dimension. In [21] a similar combined technique is proposed with the difference that clustering and visualization is achieved simultaneously and not one after the other. The empirical comparison was done using multivariate normal distributions generated by a procedure which is standard for comparisons of cluster algorithms (see [18] and [1]). We produced 36 data sets with the number of clusters being 4 or 9, and the number of dimensions being 4, 6 or 8. All clusters showed internal cohesion as well as external isolation. The latter was defined as having all clusters non-overlapping in the first dimension. We compared two-dimensional SOMs with numbers of output units set equal to the numbers of clusters known to be in the data (4 or 9) to oKMC+ models with corresponding sizes of codebooks. SOM performed almost equally well as oKMC+ in recovering the structure of the clusters (measured via the so-called Rand index, which is closely related to data points misclassified), which is as expected since we set the neighbourhood to zero at the end of training. We used Pearson correlation to measure how well the topology is preserved by both SOM and oKMC+. We computed the Pearson correlation of the distances d(xi, xj) in the input space and the distances ḋ(ẋi, ẋj) in the output space for all possible pairwise comparisons of data points. Note that for SOM the coordinates of the codebook vectors on the planar grid were used to compute the ḋ. An algorithm that preserves all distances in every neighbourhood would produce an isometric image and yield a value of 1.0 (see [2] for a discussion of measures of topology preservation). SOM performed significantly worse in preserving the topology: we obtained a correlation of 0.67 for SOM and of 0.88 for oKMC+. This is a direct implication of SOM's restriction to planar grids described above. Using a nonzero neighbourhood at the end of SOM training did not warrant any significant improvements. Full details of this study are given in [10].
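A small sketch of the topology-preservation measure used in this study follows: the Pearson correlation between all pairwise distances in the input space and the corresponding distances in the output space (for SOM, the grid coordinates of the codebook vectors serve as output vectors). The toy data and names are illustrative.

```python
# Pearson correlation of pairwise input- and output-space distances;
# a value of 1.0 corresponds to an isometric mapping.
import numpy as np

def topology_preservation(inputs, outputs):
    n = len(inputs)
    d_in, d_out = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d_in.append(np.linalg.norm(inputs[i] - inputs[j]))
            d_out.append(np.linalg.norm(outputs[i] - outputs[j]))
    return np.corrcoef(d_in, d_out)[0, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                     # input vectors
Y = X[:, :2] + 0.1 * rng.normal(size=(50, 2))    # toy two-dimensional projection
print(topology_preservation(X, Y))
```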
4 SOM for Visualization
Another possibility to apply SOM is to use it for visualization only, thereby neglecting its clustering ability. It is then not necessary to try to set the number of output units equal to a presumed number of clusters in the data. It is possible and even common practice to apply SOM with numbers of output units N that are a multiple of the number of input vectors n available for training (see e.g. the "poverty map" example given in [13]). This means of course that SOMs employing numbers of codebook vectors which are comparable to or even a multiple of the number of input vectors available can be used for visualization purposes only. If one uses as many codebook vectors as input vectors during clustering, or even more, each codebook vector will become identical to one of the input vectors in the limit of learning. So every x_i is replaced with an identical x̂_i, which does not make any sense in terms of clustering.
In [2] SOM is compared to principal component analysis and Sammon mapping on six artificial data sets with different numbers of points and dimensionality and different shapes of input distributions, and on the Anderson IRIS data. The degree of preservation of the spatial ordering of the input data is measured via a Spearman rank correlation instead of a Pearson correlation, similar to our approach described above. The traditional techniques preserve the distances much more effectively than SOM, whose performance decreases rapidly with increasing dimensionality of the input data.
Fig. 1. Output representations after mapping nine eight-dimensional clusters via Sammon mapping (left) and SOM (right). Numbers indicate true cluster membership.
We did our own study on visualization with SOM using the same 36 data sets described in Sec. 3. We computed SOMs consisting of 20 × 20 (for data sets consisting of 4 clusters and 100 points) or 30 × 30 (for 9 clusters and 225 points) codebook vectors for all 36 data sets, which gave an average correlation of 0.77 between the distances d_i and ḋ_i. This is significantly worse at the .05 error level compared to the average correlation of 0.95 achieved by Sammon mapping applied to the input data directly. This result, together with the previously described study [2], indicates that even using more output units than input vectors available does not help against the drawbacks of SOM's discretization of the output space. This rigidity of the output map is clearly visible if one compares the examples of output maps given in Fig. 1.
5 SOM for Clustering via Visualization
Yet another possible application of SOM is to use it to cluster data via visualization. This is done by first visualizing the data via a SOM output map and then using one's own subjective judgement by just looking at the resulting output map
and counting how many clusters one is able to see. Reviewing clustering studies employing SOM quickly shows that SOMs are indeed often used for this kind of clustering via visualization. There is even work on trying to augment cluster visibility in SOM output maps (see e.g. [23] and [3]). It should be clear that for this type of application SOMs with large amounts of output units will be best suited. However, it has long been known within the clustering community that doing clustering via visualization bears some pitfalls. In [22] it is shown that there is a high probability that a researcher will conclude that a subset of points comprise one cluster, when in fact the points comprise two or more clusters. This is due to the reduction in dimensionality produced by the mapping to the output space, which impairs the user's ability to detect clusters that existed in the space defined by the original variables. In [17] it is shown that even if researchers are asked to determine cluster membership from identical two-dimensional representations, their inter-rater reliability is on average as low as 0.77. If one compares the output maps obtained by SOM and Sammon mapping given in Fig. 1, it seems that whereas the 9 clusters are still clearly visible in the Sammon mapping picture, this is not so clear in SOM's output map. Clusters 2 and 4 are no longer coherent and members of clusters 5 and 7 appear as outliers. Both data mining applications CLEMENTINE and WEBSOM use SOM for clustering via visualization. The CLEMENTINE user guide [6, p.8] states that SOMs "are a type of neural network that perform clustering" but does not advise the reader how such a clustering can be achieved. However, besides an "Expert Training Method", which requires the user herself to choose the number of output units, there is a "Simple Training Method" available which automatically chooses this parameter. We trained SOMs using the CLEMENTINE function "Train Kohonen" with the "Simple Training Method" on all of the 36 data sets described in Sec. 3. Although the data sets with either 100 or 225 data points contained either 4 or 9 easily separable clusters, CLEMENTINE always automatically chose a 7 × 5 grid of output units. This means that if CLEMENTINE's SOMs are used with the "Simple Training Method", the aim is not to do clustering or simultaneous clustering and visualization as described in Secs. 2 and 3, since the number of output units is far from estimating the correct number of clusters present in the data. CLEMENTINE rather uses SOM for visualization only or, if we follow the user guide's advice that SOMs "perform clustering", for clustering via visualization. WEBSOM organizes large collections of text documents by mapping vectorial representations (which are related to word frequencies) onto a two-dimensional display using a SOM. In an example given in [14], 1,124,134 documents from "80 very different Usenet newsgroups" are mapped onto a SOM with 104,040 output units. Again it should be clear that SOM is used for clustering via visualization, since the huge number of output units stands in no relation to the assumed number of clusters present in the data (80 clusters corresponding to 80 different newsgroups). The method described in [23] is used "to indicate the clustering tendency" as shades of gray on the output grid.
6 Conclusions
In this work we tried to make the notion of using SOM as a "data visualization tool" more concrete by showing that the number of output units used in a SOM influences its applicability for either clustering or visualization. We showed that if the number of output units N is set equal to g, the number of clusters present in the data set, SOM can be used both for clustering alone and for clustering plus simultaneous visualization. Theoretical as well as empirical results make clear that for these purposes the degree of neighbourhood should be set to zero at the end of learning, which makes SOM equivalent to online K-means clustering. Our own empirical results show that the simultaneous visualization of cluster centers (output units) is impaired due to SOM's discretization of the output space. SOM can also be used for visualization only or for clustering via visualization, and then the number of output units N can be in the order of the number of input vectors n or even a multiple of it. SOM's visualization ability does again suffer from the discretization of the output space, which is exemplified via empirical results. As for clustering via visualization, it is known from the literature that this bears the high risk of missing the true cluster structure. We conclude that SOM is a flexible tool which can be used for various forms of clustering and visualization, but that this flexibility comes with a price in terms of impaired performance. Concerning the use of SOM in the data mining community as discussed in the context of CLEMENTINE and WEBSOM, it has to be said that these tools rely on SOM's ability to do clustering via visualization. Users of CLEMENTINE and WEBSOM should be aware of the possible pitfall of missing the true cluster structure as well as of the impaired visualization due to the discretization of the output display.

Acknowledgements: Parts of this work were done within the BIOMED-2 BMH4-
CT97-2040 project SIESTA, funded by the EC DG XII. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and Transport. The author was supported by a doctoral grant of the Austrian Academy of Sciences.
References

1. Balakrishnan P.V., Cooper M.C., Jacob V.S., Lewis P.A.: A study of the classification capabilities of neural networks using unsupervised learning: a comparison with k-means clustering, Psychometrika, Vol. 59, No. 4, 509-525, 1994.
2. Bezdek J.C., Nikhil R.P.: An index of topological preservation for feature extraction, Pattern Recognition, Vol. 28, No. 3, pp.381-391, 1995.
3. Bishop C.M., Svensen M., Williams C.K.I.: Magnification factors for the SOM and GTM algorithms, Proc. of WSOM'97: Workshop on Self-Organizing Maps, Helsinki, pp. 333-338, 1997.
4. Bishop C.M., Svensen M., Williams C.K.I.: GTM: The Generative Topographic Mapping, Neural Computation, Vol. 10, Issue 1, p.215-234, 1998.
5. Bottou L., Bengio Y.: Convergence Properties of the K-Means Algorithms, in Tesauro G., et al.(eds.), Advances in Neural Information Processing System 7, MIT Press, Cambridge, MA, pp.585-592, 1995.
6. Clementine User Guide, Integral Solutions Limited, 1998.
7. Cottrell M., Fort J.C., Pages G.: Theoretical aspects of the SOM algorithm, Neurocomputing, (21)1-3, pp.119-138, 1998.
8. Duda R.O., Hart P.E.: Pattern Classification and Scene Analysis, John Wiley & Sons, N.Y., 1973.
9. Erwin E., Obermayer K., Schulten K.: Self-organizing maps: ordering, convergence properties and energy functions, Biological Cybernetics, 67, 47-55, 1992.
10. Flexer A.: Limitations of Self-Organizing Maps for Vector Quantization and Multidimensional Scaling, in Mozer M.C., et al.(eds.), Advances in Neural Information Processing Systems 9, MIT Press/Bradford Books, pp.445-451, 1997.
11. Jolliffe I.T.: Principal Component Analysis, Springer, 1986.
12. Kohonen T.: Self-Organization and Associative Memory, Springer, 1984.
13. Kohonen T.: Self-organizing maps, Springer, Second Extended Edition, Springer Series in Information Sciences, Vol. 30, 1997.
14. Kohonen T.: Self-Organization of Very Large Document Collections: State of the Art, in Niklasson L., et al.(eds.), Proceedings of the 8th International Conference on Artificial Neural Networks, Springer, 2 vols., pp.65-74, 1998.
15. Lagus K., Honkela T., Kaski S., Kohonen T.: Self-Organizing Maps of Document Collections: A New Approach to Interactive Exploration, in Simoudis E. & Han J.(eds.), KDD-96: Proceedings Second International Conference on Knowledge Discovery & Data Mining, AAAI Press/MIT Press, pp.238-243, 1996.
16. MacQueen J.: Some Methods for Classification and Analysis of Multivariate Observations, Proc. of the Fifth Berkeley Symposium on Math., Stat. and Prob., Vol. 1, pp. 281-296, 1967.
17. Mezzich J.: Evaluating clustering methods for psychiatric diagnosis, Biological Psychiatry, 13, 265-346, 1978.
18. Milligan G.W., Cooper M.C.: An examination of procedures for determining the number of clusters in a data set, Psychometrika 50(2), 159-179, 1985.
19. Ripley B.D.: Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
20. Sammon J.W.: A Nonlinear Mapping for Data Structure Analysis, IEEE Transactions on Comp., Vol. C-18, No. 5, p.401-409, 1969.
21. Schwenker F., Kestler H., Palm G.: Adaptive Clustering and Multidimensional Scaling of Large and High-Dimensional Data Sets, in Niklasson L., et al.(eds.), Proceedings of the 8th International Conference on Artificial Neural Networks, ICANN'98, Springer, pp.911-916, 1998.
22. Sneath P.H.A.: The risk of not recognizing from ordinations that clusters are distinct, Classification Society Bulletin, 4, 22-43, 1980.
23. Ultsch A.: Self-organizing Neural Networks for Visualization and Classification, in Opitz O., et al.(eds.), Information and Classification, Springer, Berlin, 307-313, 1993.
24. Waller N.G., Kaiser H.A., Illian J.B., Manry M.: A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms, Psychometrika, Vol. 63, No.1, 5-22, 1998.
Speeding Up the Search for Optimal Partitions

Tapio Elomaa¹ and Juho Rousu²

¹ Department of Computer Science, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 Univ. of Helsinki, Finland, [email protected]
² VTT Biotechnology and Food Research, Tietotie 2, P. O. Box 1501, FIN-02044 VTT, Finland, [email protected]
Abstract. Numerical value range partitioning is an inherent part of inductive learning. In classification problems, a common partition ranking method is to use an attribute evaluation function to assign a goodness score to each candidate. Optimal cut point selection constitutes a potential efficiency bottleneck, which is often circumvented by using heuristic methods. This paper aims at improving the efficiency of optimal multisplitting. We analyze convex and cumulative evaluation functions, which account for the majority of commonly used goodness criteria. We derive an analytical bound that lets us filter out, when searching for the optimal multisplit, all partitions containing a specific subpartition as their prefix. Thus, the search space of the algorithm can be restricted without losing optimality. We compare the partition candidate pruning algorithm with the best existing optimization algorithms for multisplitting. For the pruning algorithm the numbers of evaluated partition candidates are, on average, only approximately 25% and 50% of those evaluated by the comparison methods. In time this amounts to up to 50% less evaluation time per attribute.
1 Introduction
In inductive processes numerical attribute domains often need to be discretized, which may be time consuming if the domain at hand has a very high number of candidate cut points. This affects both binarization [4,14] methods and, in particular, algorithms that need to partition numerical ranges into more than two subsets; e.g., off-line discretization algorithms [5] and optimal [8,11] or greedy [5,10] multisplitters in decision tree learning, rule induction, and nearest neighbor methods. In data mining applications numerical attributes may constitute a significant time consumption bottleneck. In this paper we continue to explore ways to enhance the efficiency of numerical attribute handling in classification learning. Previous work has shown that the class of well-behaved evaluation functions, for which only a part of the potential cut points needs to be examined in optimal partition selection, contains all the most commonly used attribute evaluation functions [8]. In this paper we analyze convex attribute evaluation functions. The analysis brings out new opportunities for pruning the set of candidate partitions. Empirical evaluation shows that the speed-up obtained is substantial; on the average,
the evaluation of half of the partition candidates can be omitted without sacrificing the optimality of the resulting partition.
2 Preliminaries and an Overview
The processing of a numerical attribute begins by sorting the training data by the value of the attribute. We consider a categorized version of the data, where all examples with an equal value constitute a bin of examples. In supervised learning, the task in numerical value range discretization is to find a set of cut points to partition the range into a small number of intervals that have good class coherence. The coherence is usually measured by an evaluation function. Many, though not all [6,8], of the most widely used attribute evaluation functions are either convex (upwards) or concave (i.e., convex downwards); both are usually referred to as convex functions.

Definition 1. A function f(x) is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ ρ ≤ 1, f(ρx_1 + (1 − ρ)x_2) ≤ ρf(x_1) + (1 − ρ)f(x_2). A function f is said to be strictly convex if equality holds only if ρ = 0 or ρ = 1. A function f is concave if −f is convex.

Fayyad and Irani's [9] analysis of the binarization technique proved that for the information gain function [13,14] only boundary points need to be considered as potential cut points due to the convexity of the function.

Definition 2. Let a sequence S of examples be sorted by the value of a numerical attribute A. The set of boundary points is defined as follows: a value T ∈ Dom(A) is a boundary point if and only if there exists a pair of examples s_1, s_2 ∈ S, having different classes, such that val_A(s_1) = T < val_A(s_2), and there does not exist another example s ∈ S such that val_A(s_1) < val_A(s) < val_A(s_2).

A block of examples is the sequence of examples in between two consecutive boundary points. Blocks can be obtained from bins by merging adjacent class uniform bins with the same class label. A well-behaved function always has an optimal multisplit on boundary points. All the most commonly used attribute evaluation functions fall into this category [8], including all convex evaluation functions and some non-convex ones such as the gain ratio, GR [13,14], and the normalized distance measure, ND [12]. Table 1 summarizes the current knowledge of optimization algorithms for families of evaluation functions. Brute-force exhaustive search can be used to optimize any evaluation function, but the search method is exponential in the (maximum) arity, k, of the partition for each attribute. With well-behaved functions like GR and ND, only boundary points need to be examined, which leads to slightly better efficiency.
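As an illustration of Definition 2 and of the bin/block preprocessing it implies, here is a small sketch (ours, not from the paper; the data layout is an assumption) that groups examples sorted by an attribute into bins of equal value and then merges adjacent class-uniform bins of the same class into blocks, so that candidate cut points need only be considered on block borders.

from collections import Counter
from itertools import groupby

def make_bins(examples):
    # examples: iterable of (attribute_value, class_label), sorted by value.
    # Returns a list of (value, Counter of class labels).
    return [(v, Counter(c for _, c in grp))
            for v, grp in groupby(examples, key=lambda e: e[0])]

def make_blocks(bins):
    # Merge adjacent bins that are uniform in the same single class.
    blocks = []
    for value, dist in bins:
        if (blocks and len(dist) == 1 and len(blocks[-1][1]) == 1
                and next(iter(dist)) == next(iter(blocks[-1][1]))):
            blocks[-1][1].update(dist)        # extend the previous block
        else:
            blocks.append((value, Counter(dist)))
    return blocks

data = sorted([(1.0, 'a'), (1.0, 'a'), (2.0, 'a'), (3.0, 'b'), (3.5, 'b'), (4.0, 'a')])
print(make_blocks(make_bins(data)))   # three blocks: a-uniform, b-uniform, a-uniform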
Table 1. Types of functions, their optimization algorithms, asymptotic time and space requirements for these algorithms, and examples of functions in these categories.

Type             Algorithm             Time           Space         Functions
Any              Brute-force (bins)    O(mV^k)        O(mV)
Well-behaved     Brute-force (blocks)  O(mB^k)        O(mB)         GR, ND
Cumulative       Bin-Opt               O((k+m)V^2)    O((k+m)V)
+ Well-behaved   Block-Opt             O((k+m)B^2)    O((k+m)B)
+ Convex         Block-Opt-P           O((k+m)B^2)    O((k+m)B)     ACE, IG, GI
Monotonic        One-pass              O(kmn)         O(km)         TSE
Cumulative evaluation functions, i.e., functions that compute a (weighted) sum of goodness scores of the subsets, can be optimized in time quadratic in the number of bins using the general algorithm which uses dynamic programming [11,8]. Subsequently we refer to this algorithm as Bin-Opt. If the evaluation function, additionally, is well-behaved, then an algorithm called Block-Opt [8] can be used to optimize it in time quadratic in the number of blocks. This paper introduces a pruning method for minimization of concave and cumulative evaluation functions, which improves the efficiency of the Block-Opt algorithm. The asymptotic time requirement does not change, but as demonstrated in the subsequent experiments, the practical speed-up is substantial. Examples of concave and cumulative evaluation functions include the gini index (of diversity), GI [4], and the average class entropy, ACE. For a partition ⋃_i S_i of the data set S, ACE is defined to be

ACE(⋃_i S_i) = (1/|S|) Σ_i |S_i| H(S_i) = (1/n) Σ_i |S_i| H(S_i),

where H is the entropy function H(S) = −Σ_{j=1}^{m} P(C_j, S) log_2 P(C_j, S), in which m denotes the number of classes and P(C, S) stands for the proportion of examples in S that have class C. Many other evaluation functions use ACE as their building block. Such functions include, e.g., the information gain function, IG [13,14], GR, and ND. In the experiments of Section 5 we use IG, which is defined as

IG(⋃_i S_i) = H(S) − ACE(⋃_i S_i).

Finally, there is only one concave and cumulative evaluation function that is known to be optimizable in linear time by one-pass evaluation [1,2,11]: training set error, TSE. Unfortunately, the function has many defects, which disqualify it from application in multi-class induction.
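For concreteness, here is a small sketch (our illustration, not the authors' code) of ACE and IG as defined above, computed from class-frequency counts of the intervals of a partition; the dictionary-based representation is an assumption.

import math

def entropy(dist):
    # H(S) = -sum_j P(C_j, S) log2 P(C_j, S) for a class-count dict
    n = sum(dist.values())
    return -sum((c / n) * math.log2(c / n) for c in dist.values() if c > 0)

def ace(partition):
    # average class entropy of a partition given as a list of class-count dicts
    n = sum(sum(d.values()) for d in partition)
    return sum(sum(d.values()) / n * entropy(d) for d in partition)

def info_gain(partition):
    # IG = H(S) - ACE(partition), where S is the union of all intervals
    total = {}
    for d in partition:
        for cls, cnt in d.items():
            total[cls] = total.get(cls, 0) + cnt
    return entropy(total) - ace(partition)

# Two candidate binary splits of the same 20 examples:
print(info_gain([{'+': 8, '-': 2}, {'+': 2, '-': 8}]))   # fairly pure intervals
print(info_gain([{'+': 5, '-': 5}, {'+': 5, '-': 5}]))   # uninformative split -> 0.0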
3 Pruning Partition Candidates
The algorithm Block-Opt uses dynamic programming to efficiently search all boundary point combinations in order to find the best partition [8]. It uses a left-to-right scan over the blocks and tabulates the goodness scores of prefix
partitions to avoid repetitive calculation of the scores. Although the algorithm works well when there is a moderate amount of boundary points in the range, its efficiency suffers when they are more frequent. This section studies how the search space of the algorithm can be restricted by utilizing the convexity properties of the evaluation functions.

Let X be a variable with domain 𝒳. Let E denote the expectation. In the discrete case EX = Σ_{x∈𝒳} p(x)x, where p(x) = Pr{X = x}.

Theorem 1 (Jensen's inequality [7, pp. 25–26]). If f is a convex function and X is a random variable, then Ef(X) ≥ f(EX).

Jensen's inequality does not restrict the probability distribution underlying the expectation. Hence, for a concave function f it holds that

Σ_i α_i f(t_i) ≤ f(Σ_i α_i t_i)     (1)

for α_i ≥ 0, Σ_i α_i = 1. Typically, partition ranking functions give each interval a score using another function, which tries to estimate the class coherence of the interval. A common class of such functions are the impurity functions [4]. The interval scores are weighted relative to the sizes of the intervals. Thus, a common form of an evaluation function F is

F(⋃_i S_i) = Σ_i (|S_i|/|S|) I(S_i),     (2)

where I is an impurity function. Now, |S_i|/|S| ≥ 0 and Σ_i (|S_i|/|S|) = 1. If the impurity function I is concave, then by Eq. 1:

Σ_i (|S_i|/|S|) I(S_i) ≤ I(Σ_i (|S_i|/|S|) S_i)  ⇔  F(⋃_i S_i) ≤ F(S),     (3)

in which F(S) is the score of the unpartitioned data. Observe that, since I is concave, any splitting of the data can only decrease the value of F. Thus splitting on all cut points will lead to the best score. Hence, in practice, the arity of the partition needs to be bounded, either a priori or by using some penalizing term. For example, the evaluation function ACE fulfills the requirements of the function F above; the entropy function, H, is concave, because the function x log x is convex [7].

Theorem 2. Let F be the evaluation function defined in Eq. 2 and let I be a concave impurity function. Let S be a sequence of examples consisting of consecutive intervals S_1, S_2, . . . , S_m. Let P_1 be, for some fixed k ≥ 2, a (k − 1)-partition for the interval S_1 and P_2 be a (k − 1)-partition for S_1 ∪ S_2. If

|S_1 ∪ S_2| F(P_2) − |S_1| F(P_1) − |S_2| I(S_2) ≤ 0,     (4)

then for any example set S_*

F(P_2 ⊎ S_*) ≤ F(P_1 ⊎ {S_2 ∪ S_*}).
Fig. 1. P1 and P2 are two (k − 1)-partitions of the prefixes of the data set. They can be extended into k-partitions P∗1 and P∗2 , respectively, for a larger sample by augmenting a new interval to them.
Proof. Let us now consider different k-partitions of S_1 ∪ S_2 ∪ S_*, where S_* is a combination of any number of bins immediately following S_2. P*_1 = P_1 ⊎ (S_2 ∪ S_*) and P*_2 = P_2 ⊎ S_* are two k-partitions of S_1 ∪ S_2 ∪ S_* (see Fig. 1). Assume that the inequality 4 holds. According to the inequality of Eq. 3,

|S| F(P*_1) = |S_1| F(P_1) + |S_2 ∪ S_*| I(S_2 ∪ S_*) ≥ |S_1| F(P_1) + |S_2| I(S_2) + |S_*| I(S_*)

and

|S| F(P*_2) = |S_1 ∪ S_2| F(P_2) + |S_*| I(S_*).

The difference of these two candidates can be bounded from above by the inequality 4:

|S| F(P*_2) − |S| F(P*_1) ≤ |S_1 ∪ S_2| F(P_2) − |S_1| F(P_1) − |S_2| I(S_2) ≤ 0,

from where the claim follows by dividing by |S|.

The theorem gives us a possibility of pruning the search space significantly: we can test the bound for empty S_* and, if the pruning condition is satisfied, subsequently drop all partitions containing P_1 from further consideration.
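The practical content of Theorem 2 with an empty S_* is a single comparison of already-tabulated scores. A minimal sketch (ours; the argument names are illustrative and the quantities follow Eq. 2 and Eq. 4):

def can_prune_prefix(n1, n2, f_p1, f_p2, i_s2):
    # n1 = |S1|, n2 = |S2|, f_p1 = F(P1), f_p2 = F(P2), i_s2 = I(S2).
    # True when all partitions containing P1 as their prefix may be dropped.
    return (n1 + n2) * f_p2 - n1 * f_p1 - n2 * i_s2 <= 0

In the Search procedure of the next section this comparison appears, with both sides expressed as size-weighted scores, as the test current ≥ rejectlevel.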
4 The Algorithm for Finding Optimal Partitions
We incorporate the candidate pruning method into the algorithm Block-Opt, which uses a dynamic programming scheme similar to that suggested by Fulton et al. [11]. The main modification is that the algorithm works on blocks of examples rather than on individual examples. The blocks are extracted in two-pass preprocessing. That entails bin construction from the sorted example sequence and merging of adjacent class uniform bins (of the same class) into blocks. The time and space complexity of preprocessing is O(n + mV). The search algorithm inputs a sequence µ of class distributions of the blocks b_1, . . . , b_B, an upper limit for the arity of the partition, and an evaluation function of the form g(µ) = |S| I(S), where I is a concave function and µ is the class distribution of the set S.
Table 2. The search algorithm for multisplits. After executing the algorithm, for each i and k, P_{i,k} is the cost of the optimal k-split of the first i blocks and L_{i,k} is the index to the block which is situated immediately left from the rightmost cut point.

procedure Search(µ, g, aritymax)
/* µ = {µ1,...,µB} contains the class frequency distributions of blocks b1,...,bB,
   g(µj) = |Sj|I(Sj), where I is a concave function */
method:
 1. for i ← 1 to B do
 2.   for j ← 1 to i-1 do µj ← µj + µi; costj ← g(µj) od;
 3.   Pi,1 ← cost1; Ni,1 ← i-1;
 4.   if i = B then limit ← aritymax else limit ← aritymax-1 fi;
      /* Compute the best k-split of b1 ∪...∪ bi for each k */
 5.   for k ← 2 to min(i,limit) do
 6.     minimum ← ∞; rejectlevel ← Pi,k−1;
 7.     l ← i; j ← Nl,k−1;
 8.     while j ≥ k do /* Scan the remaining candidate (k-1)-splits */
 9.       current ← Pj,k−1 + costj+1;
10.       if current ≥ rejectlevel then Nl,k−1 ← Nj,k−1 /* prune */
11.       else /* This candidate could be the optimal one */
12.         if current < minimum then
13.           minimum ← current; indexofmin ← j fi;
14.         l ← j fi;
15.       j ← Nj,k−1 od;
16.     Pi,k ← minimum; Li,k ← indexofmin; Ni,k ← i-1 od od
The search algorithm (Table 2) scans the blocks b_1, . . . , b_B from left to right. Array P stores the costs of the best multisplits: P_{i,k} is the minimum cost obtained when the first i intervals are split optimally into k subsets. At step i, array P is updated according to the formula

P_{i,k} ← min_{j ∈ N_{i,k−1}} {P_{j,k−1} + g(µ_{j+1})},

which denotes that the optimal partitioning of b_1, . . . , b_i into k subsets is the minimum cost, over all combinations remaining in the search space, of fixing the last interval ⋃_{l=j+1}^{i} b_l and adding the cost of the best (k − 1)-split of b_1 ∪ · · · ∪ b_j. As the scan proceeds, the distributions of blocks are merged, so that at point i, each µ_j, j ≤ i, represents the class distribution of b_j ∪ · · · ∪ b_i. The corresponding evaluation function score is stored in array cost. Array L stores the corresponding cut points: L_{i,k} is an index to the block that contains the rightmost cut point of the multisplit having the cost P_{i,k}. The search space is pruned incrementally by comparing the best (k − 1)-split of the intervals processed so far with each remaining candidate k-split of the same range: if P_{i,k−1} ≤ P_{j,k−1} + cost_{j+1}, the candidate is eliminated. The connection to Theorem 2 is the following: P_{i,k−1} corresponds to P_2, P_{j,k−1} to
P_1, b_{j+1} ∪ · · · ∪ b_i to S_2, and an empty set to S_*. The array N stores the search space of remaining partition candidates in linked lists: j = N_{i,k−1} denotes that the next (k − 1)-partition to be considered as the prefix of an optimal k-split, after the best (k − 1)-split of the blocks b_1 ∪ · · · ∪ b_i, is the optimal (k − 1)-split of the blocks b_1, . . . , b_j. Note that testing against the best candidate so far is conditioned on passing the pruning test. The reason for this is that by convexity there is always a k-split that is at least as good as the best (k − 1)-split. Hence, the optimal k-split will always pass the first test. The asymptotic time and space complexities of the algorithm are the same as those of the algorithm Block-Opt [8]. The algorithm takes the time O((k+m)B^2) because the incremental merging of the class distributions takes the time O(mB^2) and scanning the table P takes the time O(kB^2) in the worst case. The tables P, L and N are of size O(kB) and the class distributions of the blocks allocate the space O(mB), which leads to the total space complexity of O((k + m)B).
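The following Python sketch is our simplified re-implementation of the procedure in Table 2 (it drops the linked-list array N and the arity bookkeeping, keeping only the P table, the incremental merging of block distributions, and the pruning test), with entropy as the concave impurity. It is meant to illustrate the control flow, not to reproduce the authors' implementation.

import math

def g(dist):
    # size-weighted entropy |S|*H(S) of a class-count dict
    n = sum(dist.values())
    return -sum(c * math.log2(c / n) for c in dist.values() if c > 0)

def merge(a, b):
    out = dict(a)
    for cls, cnt in b.items():
        out[cls] = out.get(cls, 0) + cnt
    return out

def best_multisplit(blocks, max_arity):
    # blocks: list of class-count dicts in attribute order.
    # Returns the costs of the best 1..max_arity-splits of the whole range.
    B = len(blocks)
    INF = float('inf')
    P = [[INF] * (max_arity + 1) for _ in range(B + 1)]   # P[i][k]
    suffix = [None] * (B + 1)   # suffix[j] = class counts of b_j ∪ ... ∪ b_i
    cost = [0.0] * (B + 1)      # cost[j] = g(suffix[j])
    for i in range(1, B + 1):
        for j in range(1, i):
            suffix[j] = merge(suffix[j], blocks[i - 1])
            cost[j] = g(suffix[j])
        suffix[i] = dict(blocks[i - 1])
        cost[i] = g(suffix[i])
        P[i][1] = cost[1]
        for k in range(2, min(i, max_arity) + 1):
            reject = P[i][k - 1]          # best (k-1)-split of the same prefix
            best = INF
            for j in range(k - 1, i):
                current = P[j][k - 1] + cost[j + 1]
                if current >= reject:     # pruning test from Theorem 2
                    continue
                if current < best:
                    best = current
            # if every candidate was pruned, the best k-split is only as good
            # as the best (k-1)-split (cf. the convexity argument above)
            P[i][k] = best if best < INF else reject
    return P[B][1:]

blocks = [{'a': 3}, {'b': 2}, {'a': 1, 'b': 1}, {'a': 4}]
print(best_multisplit(blocks, 3))   # size-weighted scores of the best 1-, 2- and 3-splits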
5 Empirical Evaluation
We contrast the multisplitting algorithms Bin-Opt and Block-Opt with and without the new candidate pruning technique. The pruning version is called Block-Opt-P. As baseline we use a breadth-first implementation of Fayyad and Irani's [10] widely used heuristic greedy multisplitting method. Keep in mind that this method does not produce optimal partitions, even though the scores of the resulting partitions often are very close to optimal [8]. As the evaluation function we use information gain [13], which is convex (thus also well-behaved) and cumulative. In the experiment we partition the numerical dimensions of 31 test domains, which come mainly from the UCI repository [3], using all four partitioning strategies. For each domain we record the number of candidate partitions evaluated in processing each numerical attribute. Fig. 2 depicts the results of this experiment. The figures on the top are the average numbers of evaluations per numerical attribute performed by the algorithm Bin-Opt, which operates on example bins; the white bars represent the relative number of evaluations per attribute for the algorithm Block-Opt operating on blocks, the gray bars are those of Block-Opt-P, where the new candidate pruning is employed, and the black ones correspond to those of the greedy heuristic selection. We can see that the average reduction in the number of examined partitions between Block-Opt and Bin-Opt is close to 50%. An average reduction of the same size is obtained when pruning is employed. Hence, the total average saving in candidate evaluations between Bin-Opt and Block-Opt-P is approximately 75%. These reductions do not correspond linearly to the search times. For instance, in the domain Adult pruning only filters 15% of the candidate partitions examined by Block-Opt, but in time that amounts to a relative saving of 32%. The time consumption of Block-Opt-P is only 14% of that of Bin-Opt. As another
Fig. 2. The relative average numbers of partition candidate evaluations per attribute performed by the algorithms Block-Opt (white bars), Block-Opt-P (gray bars), and the greedy approach (black bars). The figures on the top are the absolute averages for the algorithm Bin-Opt.
example, in the domain Vowel the filtering leaves unevaluated 43% of the partitions examined by Block-Opt, but the relative time saving is only 5%. Pruning attains only small savings in domains with the least numbers of initial comparisons (e.g., Breast W and Robot). In these domains the time consumption is low to begin with. On other domains better pruning results are observed. Unfortunately, the relative reduction in the number of examined partition candidates is small also for some of the hardest domains to evaluate (Abalone and Adult). The actual time saving, though, can be larger as demonstrated above. Only in some domains, those where the number of comparisons per attribute is the least, is the pruning technique's efficiency comparable with that of the O(kB) time greedy multisplitting method. However, the greedy method is not guaranteed to find the optimal partition.
6 Conclusion
Multipartition optimization lacks an efficient general solution. However, specific subclasses of attribute evaluation functions can be optimized in polynomial time. In particular, cumulative and well-behaved evaluation functions can be optimized in time quadratic in the number of blocks in the numerical domain. It
seems unlikely that this asymptotic bound could be improved without trading off generality. Linear-time optimization would seem to require that the goodness score of the best partition changes monotonically during the search procedure, as happens with TSE. The class of convex evaluation functions is a large one, including many of the commonly used functions. In this paper we bound the value that can be obtained by a partition determined by a convex evaluation function. With the analytical bound we were able to reduce the number of partition candidates that need to be evaluated in optimizing any convex evaluation function. The pruning technique does not improve the asymptotic time requirement of optimizing a convex function, but it does have a great impact on the practical time consumption of the search algorithm.
References

1. Auer, P.: Optimal splits of single attributes. Unpublished manuscript, Institute for Theoretical Computer Science, Graz University of Technology (1997)
2. Birkendorf, A.: On fast and simple algorithms for finding maximal subarrays and applications in learning theory. In: Ben-David, S. (ed.): Computational Learning Theory, Third European Conference. Lecture Notes in Artificial Intelligence, Vol. 1208, Springer-Verlag, Berlin Heidelberg New York (1997) 198–209
3. Blake, C., Keogh, E., Merz, C.: UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)
4. Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J.: Classification and Regression Trees. Wadsworth, Pacific Grove, CA (1984)
5. Catlett, J.: On changing continuous attributes into ordered discrete attributes. In: Kodratoff, Y. (ed.): Machine Learning – EWSL-91, Fifth European Working Session on Learning. Lecture Notes in Computer Science, Vol. 482. Springer-Verlag, Berlin Heidelberg New York (1991) 164–178
6. Codrington, C. W., Brodley, C. E.: On the qualitative behavior of impurity-based splitting rules I: The minima-free property. Mach. Learn. (to appear)
7. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)
8. Elomaa, T., Rousu, J.: General and efficient multisplitting of numerical attributes. Mach. Learn. 36 (1999) to appear
9. Fayyad, U. M., Irani, K. B.: On the handling of continuous-valued attributes in decision tree generation. Mach. Learn. 8 (1992) 87–102
10. Fayyad, U. M., Irani, K. B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Mateo, CA (1993) 1022–1027
11. Fulton, T., Kasif, S., Salzberg, S.: Efficient algorithms for finding multi-way splits for decision trees. In: Prieditis, A., Russell, S. (eds.): Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann, San Francisco, CA (1995) 244–251
12. López de Màntaras, R.: A distance-based attribute selection measure for decision tree induction. Mach. Learn. 6 (1991) 81–92
13. Quinlan, J. R.: Induction of decision trees. Mach. Learn. 1 (1986) 81–106
14. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
Experiments in Meta-level Learning with ILP

Ljupčo Todorovski¹,² and Sašo Džeroski²

¹ Faculty of Medicine, Institute for biomedical informatics, Vrazov trg 2, 1000 Ljubljana, Slovenia, [email protected]
² Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, [email protected]
Abstract. When considering new datasets for analysis with machine learning algorithms, we encounter the problem of choosing the algorithm which is best suited for the task at hand. The aim of meta-level learning is to relate the performance of different machine learning algorithms to the characteristics of the dataset. The relation is induced on the basis of empirical data about the performance of machine learning algorithms on the different datasets. In the paper, an Inductive Logic Programming (ILP) framework for meta-level learning is presented. The performance of three machine learning algorithms (the tree learning system C4.5, the rule learning system CN2 and the k-NN nearest neighbour classifier) was measured on twenty datasets from the UCI repository in order to obtain the dataset for meta-learning. The results of applying ILP to this meta-learning problem are presented and discussed.
1 Introduction
In the area of machine learning a large number of different algorithms have been developed. When considering new datasets for analysis using these algorithms, the problem of choosing the most suitable one(s) occurs. The choice of the appropriate machine learning algorithm can be especially time consuming in the process of knowledge discovery in very large datasets. This problem can be solved using meta-level machine learning, i.e., by learning to predict how well each of the machine learning algorithms can perform on the dataset on the basis of the dataset itself. Using this predictor, users can discard algorithms that are not suitable for the dataset at hand and save a lot of effort in trying out all the algorithms. The concept relating the performances of different machine learning algorithms to the characteristics of the datasets can be induced from empirical data using an arbitrary machine learning algorithm. The empirical data contain information about the performance of different machine learning algorithms on some set of datasets. In state-of-the-art meta-learning studies the concept is induced using attribute-oriented machine learning algorithms for rule induction [1,3]. In order to use such algorithms a fixed set of attributes describing the datasets has
to be chosen. This set usually includes statistical and information-theory measures [3]. The description of the datasets using a fixed set of attributes can be problematic in several ways. First, the characteristics of the dataset are always measured for the whole dataset only, which is much less informative than using measures for individual attributes. This lack of information can be partly compensated for using some advanced measures for the distribution of the data in the dataset. However, calculating these measures can be more complex than actually applying some of the machine learning algorithms to the dataset at hand. Using more powerful formalisms for dataset description can be a way to overcome these problems. In the paper, we introduce an Inductive Logic Programming (ILP) framework for meta-learning. This framework includes some of the measures used in previous state-of-the-art meta-level learning studies. But it also extends the possibilities for describing datasets by allowing statistical and information-theory measures to be included for parts of the dataset (for each attribute and example), and not only for the dataset as a whole. Using ILP learning systems, the concept relating this extended dataset description to the performance of different machine learning algorithms can be induced. In preliminary experiments with the presented ILP framework, we used the ILP system FOIL. The performance of three classification algorithms on twenty datasets was measured and related to the dataset features. The paper is organized as follows. The meta-learning ILP framework is introduced in Section 2. In Section 3 the preliminary results of the experiments with twenty datasets are presented. Section 4 concludes with a discussion on related work and some directions for further work.
2 Meta-level Learning: An ILP Framework
In state-of-the-art meta-learning studies, such as [3] and [1], a fixed set of properties for the whole dataset is used for the dataset description. Considering the whole dataset at once in calculating the properties can be problematic because of the mixture of different types of attributes in the dataset. Some standard statistical measures, such as mean and standard deviation, are used for continuous attributes only, and others, such as median and entropy, are preferred for discrete attributes in the dataset. All measures used in the propositional formalism for dataset description should be well defined for both continuous and discrete attributes in order to calculate their averages among all the attributes in the dataset. In the recent study [6] the problem of averaging the measures among different types of attributes has been addressed. However, the propositional framework used in the study prevents the use of measures for individual attributes in the dataset. In the ILP framework for dataset description, summarized in Table 1, the propositional properties are used along with some properties which are calculated for each attribute in the dataset. Measures used in the framework resemble the measures used in other meta-learning studies. These include standard simple measures like the number of attributes and examples, some statistical measures for
Table 1. Relations used for description of datasets.

Relation                        Description
dataset(D)                      dataset's identification
attr(D,A)                       attribute's identification
num of attrs(D,V)               number of attributes
attr cont(D,A)                  continuous attribute
num of cont attrs(D,V)          number of continuous attributes
attr disc(D,A)                  discrete attribute
num of disc attrs(D,V)          number of discrete attributes
num of bin attrs(D,V)           number of binary attributes
num of classes(D,V)             number of classes
class entropy(D,V)              entropy of class
num of examples(D,V)            number of examples
values(D,V)                     mean number of values of discrete attributes
entropy(D,V)                    mean entropy of discrete attributes
skewness(D,V)                   mean skewness of continuous attributes
kurtosis(D,V)                   mean kurtosis of continuous attributes
mutual inf(D,V)                 mean mutual information of class and attributes
perc of na values(D,V)          percentage of unknown values
attr values(D,A,V)              number of values of the discrete attribute
attr entropy(D,A,V)             the entropy of the discrete attribute
attr stddev(D,A,V)              standard deviation of the continuous attribute
attr skewness(D,A,V)            skewness of the continuous attribute
attr kurtosis(D,A,V)            kurtosis of the continuous attribute
attr class mutual inf(D,A,V)    mutual information of class and attribute
perc of attr na values(D,A,V)   percentage of unknown values of the attribute
continuous attributes (mean, standard deviation, skewness and kurtosis) and entropy of discrete attributes. The mutual information between attributes and class is calculated using Silverman's method [10]. Besides averages among all the attributes in the dataset, the calculations for each attribute are also used in the dataset description (the lowest part of Table 1).
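As an illustration of how such per-attribute measures and their dataset-level averages can be computed, here is a small sketch (ours, not the authors' code; the column representation and the use of the non-excess kurtosis are assumptions):

import numpy as np
from collections import Counter

def attr_entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def attr_skewness(values):
    x = np.asarray(values, dtype=float)
    s = x.std()
    return float((((x - x.mean()) / s) ** 3).mean()) if s > 0 else 0.0

def attr_kurtosis(values):
    # plain (non-excess) kurtosis; conventions differ between packages
    x = np.asarray(values, dtype=float)
    s = x.std()
    return float((((x - x.mean()) / s) ** 4).mean()) if s > 0 else 0.0

def describe(discrete_cols, continuous_cols):
    # discrete_cols / continuous_cols: dicts mapping attribute name to its values
    desc = {'num_of_attrs': len(discrete_cols) + len(continuous_cols),
            'num_of_disc_attrs': len(discrete_cols),
            'num_of_cont_attrs': len(continuous_cols)}
    if discrete_cols:
        desc['entropy'] = np.mean([attr_entropy(v) for v in discrete_cols.values()])
    if continuous_cols:
        desc['skewness'] = np.mean([attr_skewness(v) for v in continuous_cols.values()])
        desc['kurtosis'] = np.mean([attr_kurtosis(v) for v in continuous_cols.values()])
    return desc

print(describe({'colour': ['r', 'r', 'g', 'b']},
               {'length': [1.0, 1.2, 0.9, 5.0]}))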
3 Experiments
Three different propositional classification algorithms were used in the experiments: the tree-learning algorithm C4.5 [8], the rule-learning algorithm CN2 with the m-estimate [4,5] and the k-nearest neighbour (k-NN) algorithm [10]. These algorithms were used both for base-level and meta-level learning. For base-level learning, they were applied to twenty datasets from the UCI Repository of Machine Learning Databases and Domain Theories [7]. For meta-level learning, the three propositional algorithms as well as two ILP systems, FOIL [9] and TILDE [2], were applied to the results of base-level learning, as described below.
3.1 Experimental Setting
The measure of performance used in the experiments is the error rate of the classifier on the unseen examples. For each learning algorithm, the error rate for each of the twenty datasets was measured using stratified 10-fold cross validation. The dataset was first partitioned into ten folds with equal sizes and similar class distributions. The average error rates on unseen examples (over the ten folds) for the twenty datasets are given in Table 2.

Table 2. Average error rates (in percents) of three classification algorithms on twenty datasets.

Dataset         C4.5   CN2    k-NN      Dataset          C4.5   CN2    k-NN
australian      15.34  16.50  14.06     bridges-td       17.64  13.73  14.82
bridges-type    44.57  44.28  42.36     chess             0.33   0.45   3.47
diabetes        27.51  26.06  26.04     echocardiogram   33.54  36.63  28.19
german          28.90  25.90  27.00     glass            31.27  33.60  29.31
heart           23.31  22.94  17.41     hepatitis        17.90  17.43  15.38
hypothyroid      0.83   1.13   2.12     image             3.29   6.54   3.13
iris             5.33   6.68   6.00     labor            21.34  11.01  14.33
lenses          16.67  26.66  30.00     machine          28.19  31.99  30.59
soya             8.05   8.51  16.83     tic-tac-toe      14.83   1.77   4.60
vote             3.72   3.69  10.59     zoo               5.00   5.91   3.91
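The stratified cross-validation protocol just described can be summarized in a few lines; the sketch below (ours, not the authors' code; the fit/predict interface is an assumption) builds folds by dealing each class round-robin and averages the held-out error rates.

import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():        # deal each class round-robin
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def cv_error(fit, predict, examples, labels, k=10):
    folds = stratified_folds(labels, k)
    errors = []
    for test in folds:
        train = [i for f in folds if f is not test for i in f]
        model = fit([examples[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, examples[i]) != labels[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / k

With a learner plugged in via fit and predict, cv_error returns the average error rate over the ten held-out folds, matching the protocol used for Table 2.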
Additionally, the parameters for C4.5 and CN2 were optimized to minimize the error rate using 10-fold cross validation on the training data of each fold from the previous stage. Nine parts of the training data were used to build the classifier. Its error rate was then measured on the remaining part. The parameters that minimize the average error rate over the 10 folds of the training data were chosen to perform the experiment for measuring the performance of the classification algorithms on the testing data. The values of two C4.5 parameters were optimized: the minimal number of examples in a leaf node (possible values from 1 to 5), and the tree pruning parameter (from 0% to 100% with step 5%). In the experiments with CN2 the values of the parameter m (0, 0.01, 0.1, 0.2, 0.5, 1, 2, 4, 8, 16, 32, 64, 128) and the rule significance level (0%, 95% and 99%) were optimized. The optimal value of the parameter k (possible values from 1 to 100) in the experiments with the k-NN classifier was chosen using the leave-one-out method as described in [10]. To prepare the data for the meta-level learning task, we classified the algorithms for each of the twenty datasets in two classes: applicable and inapplicable. The algorithms with low error rates were considered applicable and others were considered inapplicable. The error rate limit for classification was used as in [3]: the algorithms with error rates within the interval

[err_min, err_min + k · √(err_min (1 − err_min) / n_test)]

are considered applicable. err_min denotes the lowest of the three error rates for the dataset, n_test is the number of test examples and k is an error margin
parameter. The classification of the algorithms for k = 0.25 is summarized in Table 3.

Table 3. Applicability of the machine learning algorithms for twenty datasets.

Dataset australian bridges-type diabetes german heart hypothyroid iris lenses soya vote
C4.5 CN2 k-NN √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √
Dataset bridges-td chess echocardiogram glass hepatitis image labor machine tic-tac-toe zoo
C4.5 CN2 k-NN √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √
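The applicability labelling defined by the interval above is a one-line computation per dataset. Here is a sketch (ours; the error rates in the example call are hypothetical, and taking n_test as the size of a single test set is our assumption):

import math

def applicable(error_rates, n_test, k=0.25):
    # error_rates: dict algorithm -> error rate (as a fraction, not a percent)
    err_min = min(error_rates.values())
    limit = err_min + k * math.sqrt(err_min * (1.0 - err_min) / n_test)
    return {alg: err <= limit for alg, err in error_rates.items()}

# hypothetical example: A is applicable, B and C are not
print(applicable({'A': 0.10, 'B': 0.12, 'C': 0.20}, n_test=100))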
We used the ILP system FOIL for the meta-level experiments, along with the base-level learning algorithms. Three different datasets for meta-level learning were constructed, one for each classification algorithm. The target relations were appl c45(D), appl cn2(D) and appl knn(D), defined as in Table 3. All the relations from Table 1 were used as background knowledge. The propositional learning algorithms use the attributes based on the subset of relations from Table 1 of the form relation(D,V). In order to evaluate the obtained models, we used the leave-one-out method. Following this method, we used all but one example to build a model, while the remaining example was used for testing.

3.2 Results of the Experiments
We used FOIL in two series of experiments. In the first one (labeled FOIL in the tables) the default values of the parameters were used. To examine the importance of the newly introduced relations, which cannot be included in experiments with propositional machine learning systems, we also performed another series of experiments (labeled FOIL-ND in the tables). In this second series, the values of the parameters are set so that no determinate literals are included in the model. The determinate literals in the case of meta-learning are exactly the literals of the form relation(D,V) used in the propositional experiments. When using FOIL with the default parameter setting, the induced concepts use the determinate literals only. Thus, the induced concepts do not include any of the newly introduced relations measuring the properties of individual attributes. In part, this is due to the heuristics used in FOIL. To overcome this, a different parameter setting was used, so that only indeterminate literals are included in the concept, if they are available. Still, some of the indeterminate literals used in our framework (e.g. attr class mutual inf(D,A,V)) are defined for all the
Table 4. Concepts induced with ILP system FOIL.

appl c45
  FOIL:
    appl c45(A) :- class entropy(A,B), B>0.991231.
    appl c45(A) :- num of bin attrs(A,B), B>13.
    appl c45(A) :- entropy(A,B), B>2.27248.
  FOIL-ND:
    appl c45(A) :- not(kurtosis(A, 1)), attr entropy(A,B,C), C<=1.
    appl c45(A) :- not(entropy(A, 1)), attr kurtosis(A,B,C), C>10.1512.

appl cn2
  FOIL:
    appl cn2(A) :- class entropy(A,B), perc of na values(A,C), C>4.70738, B>0.276716.
    appl cn2(A) :- mutual inf(A,B), B>4.32729.
  FOIL-ND:
    appl cn2(A) :- perc of attr na values(A,B,C), attr disc(A,B), C>2.30794.

appl knn
  FOIL:
    appl knn(A) :- num of attrs(A,B), num of disc attrs(A,C), C<=6, B<>C.
    appl knn(A) :- skewness(A,B), B<=1.45483.
    appl knn(A) :- values(A,B), B>4.15385.
  FOIL-ND:
    appl knn(A) :- not(entropy(A, 1)).
datasets and attributes and do not make any discrimination between positive and negative examples. With the heuristics used in FOIL, such literals are not included in the induced concepts. The concepts induced with FOIL are presented in Table 4. The only concept based on the property of a single attribute is the one for the applicability of CN2. It states that CN2 is applicable to datasets which contain a discrete attribute with more than 2.3% unknown values. It should be noted that this concept was induced with FOIL-ND, which gained maximal accuracy in leave-one-out experiments for CN2 (see Table 6). Other indeterminate literals that occur in the concepts are not(entropy(A, 1)) (stating that all attributes in the dataset A are continuous) and not(kurtosis(A, 1)) (stating that all attributes in A are discrete).
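To show how such an induced clause would be applied to a new dataset description, here is a small sketch (ours; the dictionary representation of the description is an assumption) of the single-attribute FOIL-ND concept for CN2:

def appl_cn2(dataset):
    # dataset: dict with 'discrete_attrs' -> {attribute name: percent of unknown values};
    # CN2 is judged applicable when some discrete attribute has > 2.30794% unknown values.
    return any(pct > 2.30794 for pct in dataset['discrete_attrs'].values())

print(appl_cn2({'discrete_attrs': {'colour': 0.0, 'outcome': 3.1}}))   # True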
Table 5. Concepts induced with ILP system TILDE.

appl c45
class entropy(A,C) , C > 0.991231 ?
+--yes: yes [9 / 9]
+--no:  num of bin attrs(A,D) , D > 13 ?
        +--yes: yes [2 / 2]
        +--no:  no [9 / 9]

appl cn2
attr kurtosis(A,C,D) , D > 22.7079 ?
+--yes: no [5 / 5]
+--no:  attr class mutual inf(A,E,F) , F > 0.576883 ?
        +--yes: kurtosis(A,G) , G > 3.87752 ?
        |       +--yes: yes [7 / 7]
        |       +--no:  num of examples(A,H) , H > 270 ?
        |               +--yes: yes [3 / 3]
        |               +--no:  no [3 / 3]
        +--no:  no [2 / 2]

appl knn
num of attrs(A,C) , C > 19 ?
+--yes: no [4 / 4]
+--no:  num of examples(A,D) , D > 57 ?
        +--yes: num of bin attrs(A,E) , E > 15 ?
        |       +--yes: no [1 / 1]
        |       +--no:  yes [12 / 13]
        +--no:  no [2 / 2]
The concepts induced with the ILP system TILDE are presented in Table 5. The only concept based on the property of a single attribute (kurtosis of a single attribute and mutual information between the class and the attribute) is the one for the applicability of CN2.

Table 6. Accuracy of the meta-level models measured using leave-one-out method.

Dataset     C4.5   CN2    k-NN   FOIL   FOIL-ND  TILDE  default
appl c45    16/20  16/20  14/20  16/20   7/20    18/20  11/20
appl cn2     9/20   5/20  11/20   9/20  13/20     9/20   0/20
appl knn     9/20  11/20   9/20  10/20  11/20    14/20  12/20
Sum         34/60  32/60  34/60  35/60  31/60    41/60  23/60
Finally, the results of the leave-one-out experiments are summarized in Table 6. Please note here that the model induced in each leave-one-out experiment can differ from the others (and from the ones presented in Tables 4 and 5), but the accuracy of the classifiers was our primary interest in these experiments. It can
be seen from the table that FOIL has a slightly better and FOIL-ND a comparable accuracy with respect to the propositional machine learning systems. TILDE outperforms other machine learning systems on two out of three meta-learning tasks.
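For completeness, the leave-one-out protocol behind Table 6 can be sketched as follows (our illustration; induce and predict stand for whichever meta-level learner is being evaluated):

def leave_one_out_accuracy(descriptions, labels, induce, predict):
    # each of the twenty dataset descriptions is held out in turn; a meta-level
    # model is induced from the remaining nineteen and scored on the held-out one
    hits = 0
    for i in range(len(descriptions)):
        train_x = descriptions[:i] + descriptions[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = induce(train_x, train_y)
        hits += predict(model, descriptions[i]) == labels[i]
    return hits, len(descriptions)    # reported as e.g. 16/20 in Table 6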
4 Discussion
The work presented in the paper extends the work already done in the area of meta-learning in several ways. First, an ILP framework for meta-level learning is introduced. It extends the methodology for dataset description used in [3] with non-propositional constructs which are not allowed when using propositional classification systems for meta-level learning. The ILP framework incorporates measures for individual attributes in the dataset description. The ILP framework is also open to incorporating prior expert knowledge about the applicability of classification algorithms. Also, all the datasets used in the experiments are in the public domain, so the experiments can be repeated. This was not the case with the StatLog dataset repository, where more than half of the datasets used are not publicly available. Another improvement is the use of a unified methodology for measuring the error rate of different classification algorithms and the optimization of their parameters. The ILP framework used in this paper was built to include the measures used in the state-of-the-art meta-learning studies. It can be extended in several different ways. Besides including other, more complex statistical and information theory based measures, it can also be extended with properties measured for any subset of attributes or examples in the dataset. Individual examples or sets of examples from the dataset can also be included in the description. From the preliminary results based on the experiments with only twenty datasets it is hard to draw strong conclusions about the usability of the ILP framework for meta-level learning. The obtained models can capture some chance regularities besides the relevant ones. However, the results of the leave-one-out evaluation method show a slight improvement in classification accuracy when using an ILP description of the datasets. This improvement should be further investigated and tested for statistical significance by performing experiments on other datasets from the UCI repository. To obtain a larger dataset for meta-level learning, experiments with artificial datasets should also be performed in the future.
Acknowledgments This work was supported in part by the Slovenian Ministry of Science and Technology and in part by the European Union through the ESPRIT IV Project 20237 Inductive Logic Programming 2. We greatly appreciate the comments of two anonymous reviewers of the proposed version of the paper.
References

1. Aha, D. (1992) Generalising case studies: a case study. In Proceedings of the 9th International Conference on Machine Learning, pages 1–10. Morgan Kaufmann.
2. Blockeel, H. and De Raedt, L. (1998) Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1–2): 285–297.
3. Brazdil, P. B. and Henery, R. J. (1994) Analysis of Results. In Michie, D., Spiegelhalter, D. J., and Taylor, C. C., editors: Machine learning, neural and statistical classification. Ellis Horwood.
4. Clark, P. and Boswell, R. (1991) Rule induction with CN2: Some recent improvements. In Proceedings of the Fifth European Working Session on Learning, pages 151–163. Springer.
5. Džeroski, S., Cestnik, B. and Petrovski, I. (1993) Using the m-estimate in rule induction. Journal of Computing and Information Technology, 1:37–46.
6. Kalousis, A. and Theoharis, T. (1999) NEOMON: An intelligent assistant for classifier selection. In Proceedings of the ICML-99 Workshop on Recent Advances in Meta-Level Learning and Future Work, pages 28–37.
7. Murphy, P. M. and Aha, D. W. (1994) UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
8. Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.
9. Quinlan, J. R. and Cameron-Jones, R. M. (1993) FOIL: A midterm report. In Brazdil, P., editor: Proceedings of the 6th European Conference on Machine Learning, volume 667 of Lecture Notes in Artificial Intelligence, pages 3–20. Springer-Verlag.
10. Wettschereck, D. (1994) A study of distance-based machine learning algorithms. PhD Thesis, Department of Computer Science, Oregon State University, Corvallis, OR.
Boolean Reasoning Scheme with Some Applications in Data Mining Andrzej Skowron and Hung Son Nguyen Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland Email: {skowron,son}@mimuw.edu.pl
Abstract. We present a general encoding scheme for a wide class of problems (including, among others, data reduction, feature selection, feature extraction, decision rule generation, pattern extraction from data, and conflict resolution in multi-agent systems) and we show how to combine it with propositional (Boolean) reasoning to develop efficient heuristics searching for (approximate) solutions of these problems. We illustrate our approach by examples, show some experimental results, and compare them with those reported in the literature. We also show that association rule generation is strongly related to reduct approximation.
1 Introduction
We discuss a representation scheme for a wide class of problems, including problems from areas such as decision support [14], [9], machine learning, data mining [4], and conflict resolution in multi-agent systems [10]. On the basis of the representation scheme we construct (monotone) Boolean functions with the following property: their prime implicants [3] (minimal valuations satisfying propositional formulas) directly correspond to the problem solutions (compare the George Boole idea from 1848 discussed e.g. in [3]). In all these cases the implicants close to prime implicants define approximate solutions of the considered problems (compare the discussion on Challenge 9 in [12]). The results show that efficient heuristics for feature selection, feature extraction, and pattern extraction from data can be developed using Boolean propositional reasoning. Moreover, the experiments show that these heuristics can give better results, in terms of classification quality and/or the time necessary for learning (discovery), than those derived using other methods. Our experience shows that formulating problems in the Boolean reasoning framework is a promising methodology for developing very efficient heuristics for solving real-life problems in many areas. Let us also mention applications of Boolean reasoning in other areas, such as negotiations and conflict resolution in multi-agent systems [10]. For lack of space we illustrate the approach using two examples related to symbolic value grouping and association rule extraction in data mining (or machine learning) problems.
2 Basic Notions
An information system is a pair S = (U, A), where U is a non-empty, finite set called the universe and A is a non-empty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the value set of a. Elements of U are called situations and are interpreted as, e.g., cases, states, patients, or observations. The set V = ∪_{a∈A} Va is said to be the domain of A. A decision table is any information system of the form S = (U, A ∪ {d}), where d ∉ A is a distinguished attribute called the decision. The elements of A are called conditional attributes (conditions). In a given information system, in general, we are not able to distinguish all pairs of situations (objects) using the attributes of the system. Namely, different situations can have the same values on the considered attributes. Hence, any set of attributes divides the universe U into classes which establish a partition [9] of the set of all objects U. With any subset of attributes B ⊆ A we associate a binary relation ind(B), called an indiscernibility relation, defined by ind(B) = {(u, u′) ∈ U × U : a(u) = a(u′) for every a ∈ B}. The B-discernibility relation is defined to be the complement of ind(B) in U × U. Let S = (U, A) be an information system, where A = {a1, ..., am}. Pairs (a, v) with a ∈ A, v ∈ V are called descriptors. By DESC(A, V) we denote the set of all descriptors over A and V. Instead of (a, v) we also write a = v or a_v. One can assign a Boolean variable to any descriptor. The set of terms over A and V is the least set containing the descriptors (over A and V) and closed with respect to the classical propositional connectives ¬ (negation), ∨ (disjunction), and ∧ (conjunction), i.e., 1. any descriptor (a, v) ∈ DESC(A, V) is a term over A and V; 2. if τ, τ′ are terms over A and V, then ¬τ, (τ ∨ τ′), (τ ∧ τ′) are terms over A and V too. The meaning ‖τ‖_S (in short ‖τ‖) of τ in S is defined inductively as follows: ‖(a, v)‖ = {u ∈ U : a(u) = v} for a ∈ A and v ∈ Va; ‖(τ ∨ τ′)‖ = ‖τ‖ ∪ ‖τ′‖; ‖(τ ∧ τ′)‖ = ‖τ‖ ∩ ‖τ′‖; ‖¬τ‖ = U − ‖τ‖. Two terms τ and τ′ are equivalent, τ ⇔ τ′, if and only if ‖τ‖ = ‖τ′‖. In particular we have: ¬(a = v) ⇔ ⋁{a = v′ : v′ ≠ v and v′ ∈ Va}. Information systems (decision tables) are representations of the knowledge bases discussed in the Introduction: rows correspond to consistent sets of propositional variables defined by all descriptors a = v, where v is the value of attribute a in a given situation, and the conflicting pairs, in the case of information systems, are all pairs of situations which are discernible by some attributes. Let S = (U, A) be an information system, where U = {u1, ..., un} and A = {a1, ..., am}. By M(S) we denote an n × n matrix (cij), called the discernibility matrix of S, such that cij = {a ∈ A : a(ui) ≠ a(uj)} for i, j = 1, ..., n. With every discernibility matrix M(S) one can associate a discernibility function fM(S), defined as follows. A discernibility function fM(S) for an information system S is a Boolean function of m propositional variables a∗1, ..., a∗m (where ai ∈ A for i = 1, ..., m) defined as the conjunction of all expressions ⋁ c∗ij, where
⋁ c∗ij is the disjunction of all elements of c∗ij = {a∗ : a ∈ cij}, where 1 ≤ j < i ≤ n and cij ≠ ∅. In the sequel we write a instead of a∗. One can show that every prime implicant of fM(S)(a∗1, ..., a∗m) corresponds exactly to one reduct in S. One can see that a set B ⊆ A is a reduct if B has a nonempty intersection with every nonempty set cij, i.e. B is a reduct in S iff ∀i,j (cij = ∅) ∨ (B ∩ cij ≠ ∅).
One can show that prime implicants of the discernibility function correspond exactly to reducts of information systems [9], [14]. Hence, Boolean reasoning can be used for information reduction. This can be extended to feature selection and decision rule synthesis (see e.g. [2], [9]). One can show that the problem of finding a minimal (with respect to cardinality) reduct is NP-hard [14]. In general the number of reducts of a given information system can be exponential with respect to the number of attributes (more exactly, any information system S has at most C(m, ⌊m/2⌋) reducts, where m = card(A)). Nevertheless, existing procedures for reduct computation are efficient in many applications, and for many cases one can apply some efficient heuristics (see e.g. [2]). Moreover, in some applications (see [13]), instead of reducts we prefer to use their approximations called α-reducts, where α ∈ [0, 1] is a real parameter: B is an α-reduct in S iff |{cij : B ∩ cij ≠ ∅}| / |{cij : cij ≠ ∅}| ≥ α. One can show that for a given α, the problems of searching for the shortest α-reducts and for all α-reducts are also NP-hard. Let us note that, e.g., the simple greedy Johnson strategy for computing implicants close to prime implicants of the discernibility function has time complexity of order O(k²n³), where n is the number of objects and k is the number of attributes. Hence, for large n this heuristic will not be feasible. We will show how to construct more efficient heuristics in the case when some additional knowledge is given about the problem encoded by the information system or decision table.
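To make the construction concrete, the discernibility matrix and the greedy Johnson-style strategy mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch on a hypothetical toy decision table, not the authors' implementation.

from itertools import combinations

# Toy decision table: each object maps attribute names to values; 'd' is the decision.
table = [
    {"a": 1, "b": 0, "d": 0},
    {"a": 1, "b": 1, "d": 1},
    {"a": 0, "b": 1, "d": 1},
    {"a": 0, "b": 0, "d": 0},
]
attrs = ["a", "b"]

# Discernibility matrix: for every pair of objects with different decisions,
# collect the conditional attributes on which they differ.
cells = []
for u, v in combinations(table, 2):
    if u["d"] != v["d"]:
        cell = frozenset(a for a in attrs if u[a] != v[a])
        if cell:
            cells.append(cell)

# Greedy (Johnson) heuristic: repeatedly pick the attribute occurring in the
# largest number of still-uncovered cells; the result approximates a short reduct.
uncovered, reduct = list(cells), set()
while uncovered:
    best = max(attrs, key=lambda a: sum(a in c for c in uncovered))
    reduct.add(best)
    uncovered = [c for c in uncovered if best not in c]

print(reduct)  # {'b'} for this toy table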
3 Feature Extraction by Grouping of Symbolic Values
In the case of symbolic value attributes (i.e., without a pre-assumed order on the values of the given attributes) the problem of searching for new features of the form a ∈ V is, from a practical point of view, in a sense more complicated than for real value attributes. However, it is possible to develop efficient heuristics for this case using Boolean reasoning. Let S = (U, A ∪ {d}) be a decision table. Any function Pa : Va → {1, ..., ma} (where ma ≤ card(Va)) is called a partition of Va. The rank of Pa is the value rank(Pa) = card(Pa(Va)). The family of partitions {Pa}a∈B is consistent with B (B-consistent) iff the condition [(u, u′) ∉ ind(B ∪ {d}) implies ∃a∈B Pa(a(u)) ≠ Pa(a(u′))] holds for any (u, u′) ∈ U × U. It means that if two objects u, u′ are discerned by B and d, then they must be discerned by the partition attributes defined by {Pa}a∈B. We consider the following optimization problem:
SYMBOLIC VALUE PARTITION PROBLEM: Given a decision table S = (U, A ∪ {d}) and a set of attributes B ⊆ A, search for a minimal B-consistent family of partitions (i.e., a B-consistent family {Pa}a∈B such that Σ_{a∈B} rank(Pa) is minimal).
To discern between pairs of objects we will use new binary features a_v^{v′} (for v ≠ v′) defined by a_v^{v′}(x, y) = 1 iff a(x) = v ≠ v′ = a(y). One can apply Johnson's heuristic to the new decision table with these attributes to search for a minimal set of new attributes that discerns all pairs of objects from different decision classes. After extracting these sets, for each attribute a we construct a graph Γa = ⟨Va, Ea⟩, where Ea is defined as the set of all new attributes (propositional variables) found for the attribute a. Any vertex coloring of Γa defines a partition of Va. The vertex colorability problem is solvable in polynomial time for k = 2, but remains NP-complete for all k ≥ 3. However, similarly to discretization [7], one can apply an efficient heuristic searching for an optimal partition. Let us consider the example decision table and (a reduced form of) its discernibility matrix presented in Figure 1. From the Boolean function fA with Boolean variables of the form a_{v1}^{v2} one can find the shortest prime implicant: a_{a1}^{a2} ∧ a_{a2}^{a3} ∧ a_{a1}^{a4} ∧ a_{a3}^{a4} ∧ b_{b1}^{b4} ∧ b_{b2}^{b4} ∧ b_{b2}^{b3} ∧ b_{b1}^{b3} ∧ b_{b3}^{b5}, which can be treated as the graphs presented in Figure 2. We can color the vertices of those graphs as shown in Figure 2. The colors correspond to the partitions: Pa(a1) = Pa(a3) = 1; Pa(a2) = Pa(a4) = 2; Pb(b1) = Pb(b2) = Pb(b5) = 1; Pb(b3) = Pb(b4) = 2. At the same time one can construct the new decision table (Figure 2).
The decision table A:

     a    b    d
u1   a1   b1   0
u2   a1   b2   0
u3   a2   b3   0
u4   a3   b1   0
u5   a1   b4   1
u6   a2   b2   1
u7   a2   b1   1
u8   a4   b2   1
u9   a3   b4   1
u10  a2   b5   1

Fig. 1. The decision table and the discernibility matrix (its entries are built from Boolean variables of the form a_v^{v′} and b_v^{v′}).

[Figure 2 shows the attribute value graphs for a and b with a 2-coloring of their vertices, together with the reduced table:]

aPa  bPb  d
 1    1   0
 2    2   0
 1    2   1
 2    1   1

Fig. 2. Coloring of attribute value graphs and the reduced table.
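The coloring step itself can be approximated by a standard greedy heuristic. The sketch below assumes that the edges of the value graph Γb are read off the prime implicant above; it reproduces the partition of Vb from the example and is only an illustration of the idea, not the heuristic actually used in [6].

# Edges of the value graph for attribute b: an edge {v, v'} means the values
# v and v' must fall into different blocks of the partition P_b.
edges = [("b1", "b4"), ("b2", "b4"), ("b2", "b3"), ("b1", "b3"), ("b3", "b5")]
values = ["b1", "b2", "b3", "b4", "b5"]

neighbours = {v: set() for v in values}
for v, w in edges:
    neighbours[v].add(w)
    neighbours[w].add(v)

# Greedy vertex coloring: assign each value the smallest color not used by its
# already-colored neighbours; each color corresponds to one block of P_b.
color = {}
for v in values:
    used = {color[w] for w in neighbours[v] if w in color}
    color[v] = min(c for c in range(1, len(values) + 1) if c not in used)

print(color)  # {'b1': 1, 'b2': 1, 'b3': 2, 'b4': 2, 'b5': 1}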
One can extend the presented approach (see e.g. [6]) to the case when both nominal and numeric attributes appear in a given decision system. The resulting heuristics are of very good quality.
Experiments with classification methods (see [6]) have been carried out on decision systems using two techniques called "train-and-test" and "n-fold cross-validation". In Table 1, some results of experiments obtained by testing the proposed methods MD (using only discretization based on the MD-heuristic [7] with the Johnson approximation strategy) and MD-G (using discretization and symbolic value grouping) for classification quality on well-known data tables from the UC Irvine repository are shown. The results reported in [5] are summarized in the columns labeled S-ID3 and C4.5 of Table 1. It is interesting to compare those results with respect to classification quality. Let us note that the heuristics MD and MD-G are also very efficient with respect to time complexity.
Names of Tables   S-ID3   C4.5    MD      MD-G
Australian        78.26   85.36   83.69   84.49
Breast (L)        62.07   71.00   69.95   69.95
Diabetes          66.23   70.84   71.09   76.17
Glass             62.79   65.89   66.41   69.79
Heart             77.78   77.04   77.04   81.11
Iris              96.67   94.67   95.33   96.67
Lympho            73.33   77.01   71.93   82.02
Monk-1            81.25   75.70   100     93.05
Monk-2            69.91   65.00   99.07   99.07
Monk-3            90.28   97.20   93.51   94.00
Soybean           100     95.56   100     100
TicTacToe         84.38   84.02   97.70   97.70
Average           78.58   79.94   85.48   87.00

Table 1. Classification accuracies (%): the quality comparison between decision tree methods. MD: MD-heuristics; MD-G: MD-heuristics with symbolic value partition.
4 Association Rule Generation
Let A = (U, A) be an information table. By descriptors we mean terms of the form (a = v), where a ∈ A is an attribute and v ∈ Va is a value in the domain of a (see [8]). The notion of descriptor can be generalized by using terms of the form (a ∈ S), where S ⊆ Va is a set of values. By a template we mean a conjunction of descriptors, i.e. T = D1 ∧ D2 ∧ ... ∧ Dm, where D1, ..., Dm are either simple or generalized descriptors. We denote by length(T) the number of descriptors in T. An object u ∈ U satisfies the template T = (ai1 = v1) ∧ ... ∧ (aim = vm) if and only if ∀j aij(u) = vj. Hence the template T describes the set of objects having the common property: "the values of the attributes ai1, ..., aim on these objects are equal to v1, ..., vm, respectively". The support of T is defined by support(T) = |{u ∈ U : u satisfies T}|. Long templates with large support are preferred in many data mining tasks. Problems of finding optimal large templates (for many optimization functions) are known to be NP-hard with respect to the number of attributes
involved in the descriptors (see e.g. [8]). Nevertheless, large templates can be found quite efficiently by the Apriori and AprioriTid algorithms (see [1,15]). A number of other methods for large template generation have been proposed e.g. in [8]. Association rules and their generation can be defined in many ways (see [1]). Here, according to the presented notation, association rules can be defined as implications of the form (P ⇒ Q), where P and Q are different simple templates, i.e. formulas of the form

(ai1 = vi1) ∧ . . . ∧ (aik = vik) ⇒ (aj1 = vj1) ∧ . . . ∧ (ajl = vjl)    (1)
These implications can be called generalized association rules, because association rules were originally defined as formulas P ⇒ Q where P and Q are sets of items (i.e. goods or articles in a stock market), e.g. {A, B} ⇒ {C, D, E} (see [1]). One can see that this form can be obtained from (1) by replacing the values in the descriptors by 1, i.e.: (A = 1) ∧ (B = 1) ⇒ (C = 1) ∧ (D = 1) ∧ (E = 1). Usually, for a given information table A, the quality of an association rule R = (P ⇒ Q) is evaluated by two coefficients, called support and confidence with respect to A. The support of the rule R is defined by the number of objects from A satisfying the condition (P ∧ Q), i.e. support(R) = support(P ∧ Q). The second coefficient – the confidence of R – is the ratio between the support of (P ∧ Q) and the support of P, i.e. confidence(R) = support(P ∧ Q) / support(P). The following problem has been investigated by many authors (see e.g. [1,15]): for a given information table A, an integer s, and a real number c ∈ [0, 1], find as many as possible association rules R = (P ⇒ Q) such that support(R) ≥ s and confidence(R) ≥ c. All existing association rule generation methods consist of two main steps:
1. Generate as many as possible templates T = D1 ∧ D2 ∧ ... ∧ Dk such that support(T) ≥ s and support(T ∧ D) < s for any descriptor D (i.e. maximal templates among those which are supported by more than s objects).
2. For any template T, search for a decomposition T = P ∧ Q such that support(P) < support(T)/c and P is the smallest template satisfying this condition.
In this paper we show that the second step can be solved using rough set methods and the Boolean reasoning approach.
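For concreteness, the support and confidence computations described above can be written down directly. The following Python sketch treats a template as a dictionary of (attribute, value) descriptors; the table and attribute names are hypothetical.

def satisfies(obj, template):
    # A template is a conjunction of simple descriptors (attribute = value).
    return all(obj.get(a) == v for a, v in template.items())

def support(table, template):
    return sum(1 for obj in table if satisfies(obj, template))

def confidence(table, p, q):
    # confidence(P => Q) = support(P and Q) / support(P)
    pq = {**p, **q}
    s_p = support(table, p)
    return support(table, pq) / s_p if s_p else 0.0

# Hypothetical information table
table = [
    {"a1": 0, "a3": 2, "a4": 1},
    {"a1": 0, "a3": 2, "a4": 2},
    {"a1": 1, "a3": 2, "a4": 1},
]
print(support(table, {"a1": 0, "a3": 2}))               # 2
print(confidence(table, {"a1": 0, "a3": 2}, {"a4": 1}))  # 0.5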
4.1 Boolean Reasoning Approach for Association Rule Generation
Let us assume that a template T = D1 ∧ D2 ∧ . . . ∧ Dm, supported by at least s objects, has been found. For a given confidence threshold c ∈ (0, 1), the decomposition T = P ∧ Q is called c-irreducible if confidence(P ⇒ Q) ≥ c and, for any decomposition T = P′ ∧ Q′ such that P′ is a sub-template of P, confidence(P′ ⇒ Q′) < c. One can prove:
Theorem 1. Let c ∈ [0, 1]. The problem of searching for the shortest association rule from the template T for a given table S with confidence limited by c (the Optimal c-Association Rules Problem) is NP-hard.
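A brute-force search for a shortest antecedent with the required confidence makes the notion of a c-irreducible decomposition concrete. The sketch below is naive (exponential in the template length) and is not the method proposed in the paper; the function names are made up.

from itertools import combinations

def support(table, descriptors):
    # descriptors: dict of attribute -> value (a conjunction of simple descriptors)
    return sum(all(obj.get(a) == v for a, v in descriptors.items()) for obj in table)

def shortest_c_rule(table, template, c):
    """Shortest antecedent P of the template T (a dict) with confidence(P => T\\P) >= c."""
    items = list(template.items())
    s_t = support(table, template)
    for k in range(1, len(items)):
        for left in combinations(items, k):
            p = dict(left)
            s_p = support(table, p)
            # Since P and (T \ P) together are just T, confidence(P => Q) = support(T)/support(P).
            if s_p and s_t / s_p >= c:
                return p, {a: v for a, v in items if a not in p}
    return None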
For solving the presented problem, we show that the problem of searching for optimal association rules from a given template is equivalent to the problem of searching for local α-reducts of a decision table, which is a well-known problem in rough set theory. We construct a new decision table S|T = (U, A|T ∪ {d}) from the original information table S and the template T as follows:
1. A|T = {aD1, aD2, ..., aDm} is a set of attributes corresponding to the descriptors of T, such that aDi(u) = 1 if the object u satisfies Di, and aDi(u) = 0 otherwise.
2. The decision attribute d determines whether the object satisfies the template T, i.e. d(u) = 1 if the object u satisfies T, and d(u) = 0 otherwise.
The following theorems describe the relationship between the association rule problem and the reduct searching problem.
Theorem 2. For a given information table S = (U, A), a template T, and a set of descriptors P, the implication ⋀_{Di∈P} Di ⇒ ⋀_{Dj∉P} Dj is
1. a 100%-irreducible association rule from T if and only if P is a reduct in S|T;
2. a c-irreducible association rule from T if and only if P is an α-reduct in S|T, where α = 1 − (1/c − 1)/(n/s − 1), n is the total number of objects from U and s = support(T).
Searching for minimal α-reducts is a well-known problem in rough set theory. One can show that the problem of searching for all α-reducts, as well as the problem of searching for the shortest α-reducts, is NP-hard. A great effort has been made to solve those problems. In subsequent papers we present rough set based algorithms for association rule generation for large data tables using SQL queries.
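The construction of S|T and the α threshold of Theorem 2 can be sketched as follows, under the assumption that a template is given as a list of (attribute, value) descriptors; the function names are illustrative.

def build_decision_table(table, descriptors):
    # descriptors: list of (attribute, value) pairs D1, ..., Dm forming the template T.
    rows = []
    for obj in table:
        flags = [1 if obj.get(a) == v else 0 for a, v in descriptors]
        d = 1 if all(flags) else 0   # d(u) = 1 iff u satisfies the whole template
        rows.append((flags, d))
    return rows

def alpha_threshold(confidence, n, s):
    # Theorem 2(2): alpha = 1 - (1/c - 1) / (n/s - 1)
    return 1 - (1 / confidence - 1) / (n / s - 1)

# With n = 18 objects and support s = 10, a 90% confidence threshold gives
print(round(alpha_threshold(0.9, 18, 10), 2))  # 0.86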
4.2 The Example
The following example illustrates the main idea of our method. Let us consider the information table A with 18 objects and 9 attributes shown in Table 2. Assume that the template T = (a1 = 0) ∧ (a3 = 2) ∧ (a4 = 1) ∧ (a6 = 0) ∧ (a8 = 1) has been extracted from the information table A. One can see that support(T) = 10 and length(T) = 5. The newly constructed decision table A|T is presented in Table 2. The discernibility function for A|T can be described as follows:
f(D1, D2, D3, D4, D5) = (D2 ∨ D4 ∨ D5) ∧ (D1 ∨ D3 ∨ D4) ∧ (D2 ∨ D3 ∨ D4) ∧ (D1 ∨ D2 ∨ D3 ∨ D4) ∧ (D1 ∨ D3 ∨ D5) ∧ (D2 ∨ D3 ∨ D5) ∧ (D3 ∨ D4 ∨ D5) ∧ (D1 ∨ D5).
After its simplification we obtain six reducts for the decision table A|T:
f(D1, D2, D3, D4, D5) = (D3 ∧ D5) ∨ (D4 ∧ D5) ∨ (D1 ∧ D2 ∧ D3) ∨ (D1 ∧ D2 ∧ D4) ∨ (D1 ∧ D2 ∧ D5) ∨ (D1 ∧ D3 ∧ D4).
Thus, we have found from T six association rules with 100% confidence. For c = 90%, we would like to find α-reducts for the decision
table A|T, where α = 1 − (1/c − 1)/(n/s − 1) = 0.86. Hence we would like to search for sets of descriptors covering at least ⌈(n − s) · α⌉ = ⌈8 · 0.86⌉ = 7 elements of the discernibility matrix M(A|T). One can see that the following sets of descriptors: {D1, D2}, {D1, D3}, {D1, D4}, {D1, D5}, {D2, D3}, {D2, D5}, {D3, D4} have nonempty intersection with exactly 7 members of the discernibility matrix M(A|T). In Table 3 we present all association rules constructed from those sets.
[Information table A has 18 objects u1, ..., u18 over the attributes a1, ..., a9; its values are omitted here.] The decision table A|T:

A|T   D1    D2    D3    D4    D5    d
      a1=0  a3=2  a4=1  a6=0  a8=1
u1    1     0     1     0     0     0
u2    1     1     1     1     1     1
u3    1     1     1     1     1     1
u4    1     1     1     1     1     1
u5    0     1     0     0     1     0
u6    1     0     0     0     1     0
u7    0     0     0     0     1     0
u8    1     1     1     1     1     1
u9    1     1     1     1     1     1
u10   1     1     1     1     1     1
u11   0     1     0     1     0     0
u12   1     0     0     1     0     0
u13   1     1     1     1     1     1
u14   1     1     0     0     0     0
u15   1     1     1     1     1     1
u16   1     1     1     1     1     1
u17   1     1     1     1     1     1
u18   0     1     1     1     0     0

Table 2. An example of information table A and template T supported by 10 objects, and the new decision table A|T constructed from A and template T.
M(A|T)   u2, u3, u4, u8, u9, u10, u13, u15, u16, u17
u1       D2 ∨ D4 ∨ D5
u5       D1 ∨ D3 ∨ D4
u6       D2 ∨ D3 ∨ D4
u7       D1 ∨ D2 ∨ D3 ∨ D4
u11      D1 ∨ D3 ∨ D5
u12      D2 ∨ D3 ∨ D5
u14      D3 ∨ D4 ∨ D5
u18      D1 ∨ D5

Association rules with 100% confidence:
D3 ∧ D5 ⇒ D1 ∧ D2 ∧ D4        D1 ∧ D2 ∧ D4 ⇒ D3 ∧ D5
D4 ∧ D5 ⇒ D1 ∧ D2 ∧ D3        D1 ∧ D2 ∧ D5 ⇒ D3 ∧ D4
D1 ∧ D2 ∧ D3 ⇒ D4 ∧ D5        D1 ∧ D3 ∧ D4 ⇒ D2 ∧ D5

Association rules with confidence ≥ 90%:
D1 ∧ D2 ⇒ D3 ∧ D4 ∧ D5        D2 ∧ D3 ⇒ D1 ∧ D4 ∧ D5
D1 ∧ D3 ⇒ D2 ∧ D4 ∧ D5        D2 ∧ D5 ⇒ D1 ∧ D3 ∧ D4
D1 ∧ D4 ⇒ D2 ∧ D3 ∧ D5        D3 ∧ D4 ⇒ D1 ∧ D2 ∧ D5
D1 ∧ D5 ⇒ D2 ∧ D3 ∧ D4

Table 3. The simplified version of discernibility matrix M(A|T) and association rules.
5 Conclusions
We have presented a general scheme for encoding a wide class of problems. This encoding scheme has proven to be very useful for solving many problems using propositional reasoning, e.g. information reduction, decision rule generation, feature extraction and feature selection, and conflict resolution in multi-agent systems. Our approach can be extended to consider only discernible pairs with a sufficiently large discernibility degree. Another possible extension is related to the
extension of our knowledge bases by adding a new component corresponding to concordant (indiscernible) pairs of situations and requiring that some constraints described by this component be preserved. We also plan to extend the approach using rough mereology [10].
Acknowledgement: This work was partially supported by the Research Program of the European Union – ESPRIT-CRIT2 No. 20288.
References 1. Agrawal R., Mannila H., Srikant R., Toivonen H., Verkamo A.I., 1996. Fast discovery of assocation rules. In V.M. Fayad, G.Piatetsky Shapiro, P. Smyth, R. Uthurusamy (eds): Advanced in Knowledge Discovery and Data Mining, AAAI/MIT Press, pp. 307-328. 2. J. Bazan. A comparison of dynamic non-dynamic rough set methods for extracting laws from decision tables. In: L. Polkowski and A. Skowron (Eds.), Rough Sets in Knowledge Discovery 1: Methodology and Applications, Physica-Verlag, Heidelberg, 1998, 321–365. 3. E.M. Brown. Boolean Reasoning, Kluwer Academic Publishers, Dordrecht, 1990. 4. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.). Advances in Knowledge Discovery and Data Mining. MIT/AAAI Press, Menlo Park, 1996. 5. J. Friedman, R. Kohavi, Y. Yun. Lazy decision trees, Proc. AAAI-96, 717–724. 6. H.S. Nguyen and S.H. Nguyen. Pattern extraction from data, Fundamenta Informaticae 34, 1998, pp. 129–144. 7. H.S. Nguyen and A. Skowron. Boolean reasoning for feature extraction problems, Proc. ISMIS’97, LNAI 1325, Springer–verlag, Berlin, 117–126. 8. Nguyen S. Hoa, A. Skowron, P. Synak. Discovery of data pattern with applications to Decomposition and classification problems. In L. Polkowski, A. Skowron (eds.): Rough Sets in Knowledge Discovery 2. Physica-Verlag, Heidelberg 1998, pp. 55–97. 9. Z. Pawlak. Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991. 10. L. Polkowski and A. Skowron. Rough sets: A perspective. In: L. Polkowski and A. Skowron (Eds.). Rough Sets in Knowledge Discovery 1: Methodology and Applications. Physica-Verlag, Heidelberg, 1998, 31–56. 11. J.R. Quinlan. C4.5. Programs for machine learning, Morgan Kaufmann, San Mateo, CA, 1993. 12. B. Selman, H. Kautz and D. McAllester. Ten Challenges in Propositional Reasoning and Search, Proc. IJCAI’97, Japan. 13. Skowron A. Synthesis of adaptive decision systems from experimental data. In A. Aamodt, J. Komorowski (eds), Proc. of the 5th Scandinavian Conference on AI (SCAI’95), IOS Press, May 1995, Trondheim, Norway, 220–238. 14. A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems, in: R. Slowi´ nski (Ed.), Intelligent decision support: Handbook of applications and advances of the rough sets theory, Kluwer Academic Publishers, Dordrecht, 1992, 331-362. 15. Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li. New Parallel Algorithms for Fast Discovery of Association Rules. In Data Mining and Knowledge Discovery : An International Journal, special issue on Scalable High-Performance Computing for KDD, Vol. 1, No. 4, Dec. 1997, pp 343-373.
On the Correspondence between Classes of Implicational and Equivalence Quantifiers Jiří Ivánek Laboratory of Intelligent Systems, Faculty of Informatics and Statistics, University of Economics, W. Churchill Sq. 4, 130 67 Prague, Czech Republic, e-mail: [email protected] Abstract. Relations between two Boolean attributes derived from data can be quantified by truth functions defined on four-fold tables corresponding to pairs of the attributes. In the paper, several classes of such quantifiers (implicational, double implicational, and equivalence ones) with truth values in the unit interval are investigated. The method of constructing the logically nearest double implicational and equivalence quantifiers to a given implicational quantifier (and vice versa) is described and proved correct.
1 Introduction
The theory of observational quantifiers was established in the frame of the GUHA method of mechanized hypothesis formation [4], [5]. It should be stressed that this method is one of the earliest methods of data mining [9]. Over the years the method was developed and various procedures were implemented, e.g. in the systems PC-GUHA [6], Knowledge Explorer [3], and 4FT-Miner [12]. Further investigations of its mathematical and logical foundations are going on nowadays [7], [10], [11]. We concentrate on the most widely used observational quantifiers, called in [11] four-fold table quantifiers. So far these quantifiers have been treated in classical logic as 0/1-valued truth functions. Some possibilities of a fuzzy logic approach are now being discussed [7]. In the paper, several classes of quantifiers (implicational, double implicational, equivalence ones) with truth values in the unit interval are investigated. Such quantification of rules derived from databases is used in modern methods of knowledge discovery in databases (see e.g. [13]). On the other hand, there is a connection between four-fold table quantifiers and measures of resemblance or similarity applied to Boolean vectors [2]. In Section 2, basic notions and classes of quantifiers are defined, and some examples of quantifiers of different types are given. In Section 3, the method of constructing double implicational quantifiers from implicational ones (and vice versa) is described. This method provides a logically strong one-to-one correspondence between the classes of implicational and so-called Σ-double implicational quantifiers. An analogous construction is used in Section 4 to introduce a similar correspondence between the classes of Σ-double implicational and Σ-equivalence quantifiers. Several theorems on these constructions are proved. As a conclusion, triads of affiliated quantifiers are introduced, and their importance in data mining applications is discussed.
2 Classes of Quantifiers
For two Boolean attributes ϕ and ψ (derived from given data), the corresponding four-fold table ⟨a, b, c, d⟩ (Table 1) is composed of the numbers of objects in the data satisfying the four different Boolean combinations of the attributes: a is the number of objects satisfying both ϕ and ψ, b is the number of objects satisfying ϕ and not satisfying ψ, c is the number of objects not satisfying ϕ and satisfying ψ, and d is the number of objects not satisfying ϕ and not satisfying ψ.
        ψ    ¬ψ
  ϕ     a     b
 ¬ϕ     c     d

Table 1. Four-fold table of ϕ and ψ
To avoid degenerate situations, we shall assume that all marginals of the four-fold table are non-zero: a + b > 0, c + d > 0, a + c > 0, b + d > 0.
Definition 1. A 4FT quantifier ∼ is a [0, 1]-valued function defined for all four-fold tables ⟨a, b, c, d⟩. We shall write ∼(a, b) if the value of the quantifier ∼ depends only on a, b; ∼(a, b, c) if the value of the quantifier ∼ depends only on a, b, c; and ∼(a, b, c, d) if the value of the quantifier ∼ depends on all of a, b, c, d. For simplicity, we shall omit the specification 4FT in this paper. The most common examples of quantifiers are the following:
Example 1. Quantifier ⇒ of basic implication (corresponds to the notion of the confidence of an association rule, see [1],[4],[5]): ⇒(a, b) = a/(a+b).
Example 2. Quantifier ⇔ of basic double implication (Jaccard 1900, [2],[5]): ⇔(a, b, c) = a/(a+b+c).
Example 3. Quantifier ≡ of basic equivalence (Kendall, Sokal-Michener 1958, [2],[5]): ≡(a, b, c, d) = (a+d)/(a+b+c+d).
If the four-fold table ⟨a, b, c, d⟩ represents the behaviour of the derived attributes ϕ and ψ in the given data, then we can interpret the above quantifiers in the following way: The quantifier of basic implication calculates the relative frequency of objects satisfying ψ among all objects satisfying ϕ, so it is measuring in a simple way
the validity of the implication ϕ ⇒ ψ in the data. The higher a and the smaller b, the better the validity ⇒(a, b) = a/(a+b). The quantifier of basic double implication calculates the relative frequency of objects satisfying ϕ ∧ ψ among all objects satisfying ϕ ∨ ψ, so it is measuring in a simple way the validity of the bi-implication (ϕ ⇒ ψ) ∧ (ψ ⇒ ϕ) in the data. The higher a and the smaller b, c, the better the validity ⇔(a, b, c) = a/(a+b+c). The quantifier of basic equivalence calculates the relative frequency of objects supporting the correlation of ϕ and ψ among all objects, so it is measuring in a simple way the validity of the equivalence ϕ ≡ ψ in the data. The higher a, d and the smaller b, c, the better the validity ≡(a, b, c, d) = (a+d)/(a+b+c+d). Properties of the basic quantifiers are at the core of the general definition of several useful classes of quantifiers [4], [5], [11]: (1) I – the class of implicational quantifiers, (2) DI – the class of double implicational quantifiers, (3) ΣDI – the class of Σ-double implicational quantifiers, (4) E – the class of equivalence quantifiers, (5) ΣE – the class of Σ-equivalence quantifiers. Each class of quantifiers ∼ is characterized in the following definition by a special truth preservation condition of the form: the fact that the four-fold table ⟨a′, b′, c′, d′⟩ is in some sense (implicational, ...) better than ⟨a, b, c, d⟩ implies that ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d).
Definition 2. Let a, b, c, d, a′, b′, c′, d′ be frequencies from an arbitrary pair of four-fold tables ⟨a, b, c, d⟩ and ⟨a′, b′, c′, d′⟩. (1) A quantifier ∼(a, b) is implicational, ∼ ∈ I, if always a′ ≥ a ∧ b′ ≤ b implies ∼(a′, b′) ≥ ∼(a, b). (2) A quantifier ∼(a, b, c) is double implicational, ∼ ∈ DI, if always a′ ≥ a ∧ b′ ≤ b ∧ c′ ≤ c implies ∼(a′, b′, c′) ≥ ∼(a, b, c). (3) A quantifier ∼(a, b, c) is Σ-double implicational, ∼ ∈ ΣDI, if always a′ ≥ a ∧ b′ + c′ ≤ b + c implies ∼(a′, b′, c′) ≥ ∼(a, b, c). (4) A quantifier ∼(a, b, c, d) is equivalence, ∼ ∈ E, if always a′ ≥ a ∧ b′ ≤ b ∧ c′ ≤ c ∧ d′ ≥ d implies ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d). (5) A quantifier ∼(a, b, c, d) is Σ-equivalence, ∼ ∈ ΣE, if always a′ + d′ ≥ a + d ∧ b′ + c′ ≤ b + c implies ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d).
Example 4. ⇒ ∈ I, ⇔ ∈ ΣDI, ≡ ∈ ΣE.
Proposition 3. I ⊂ DI ⊂ E, ΣDI ⊂ DI, ΣE ⊂ E.
In the original GUHA method [4],[5], some statistically motivated quantifiers were introduced. They are based on hypothesis testing, e.g.: given 0 < p < 1, the question is whether the conditional probability corresponding to the examined relation of the Boolean attributes ϕ and ψ is ≥ p. This question leads to the test of the null hypothesis that the corresponding conditional probability is ≥ p, against the alternative hypothesis that this probability is < p. The following quantifiers are derived from the appropriate statistical test.
Example 5. Quantifier ⇒?p of upper critical implication
⇒?p(a, b) = Σ_{i=0}^{a} (a+b)! / (i!(a+b−i)!) · p^i (1−p)^{a+b−i}
is implicational [4],[5].
Example 6. Quantifier ⇔?p of upper critical double implication
⇔?p(a, b, c) = Σ_{i=0}^{a} (a+b+c)! / (i!(a+b+c−i)!) · p^i (1−p)^{a+b+c−i}
is Σ-double implicational [5],[11].
Example 7. Quantifier ≡?p of upper critical equivalence
≡?p(a, b, c, d) = Σ_{i=0}^{a+d} (a+b+c+d)! / (i!(a+b+c+d−i)!) · p^i (1−p)^{a+b+c+d−i}
is Σ-equivalence [5],[11]. Let us note that all the above mentioned quantifiers are used (among others) in the GUHA procedure 4FT-Miner [12]. Some more examples of double implicational and equivalence quantifiers can be derived from the list of association coefficients (resemblance measures on Boolean vectors) included in [2]. In the next sections, a one-to-one correspondence with strong logical properties will be shown (i) between the classes of quantifiers I and ΣDI by means of the relation ⇔∗(a, b, c) = ⇒∗(a, b + c), and, analogously, (ii) between the classes of quantifiers ΣDI and ΣE by means of the relation ≡∗(a, b, c, d) = ⇔∗(a + d, b, c). First, let us prove the following auxiliary propositions.
Lemma 4. A quantifier ⇔∗ is Σ-double implicational iff the following conditions hold: (i) for all a, b, c, b′, c′ such that b′ + c′ = b + c, ⇔∗(a, b′, c′) = ⇔∗(a, b, c) holds; (ii) the quantifier ⇒∗ defined by ⇒∗(a, b) = ⇔∗(a, b, 0) is implicational.
Proof. For Σ-double implicational quantifiers, (i) and (ii) are clearly true. Let ⇔∗ be a quantifier satisfying (i), (ii), and let a′ ≥ a ∧ b′ + c′ ≤ b + c. Then ⇔∗(a′, b′, c′) = ⇔∗(a′, b′ + c′, 0) = ⇒∗(a′, b′ + c′) ≥ ⇒∗(a, b + c) = ⇔∗(a, b + c, 0) = ⇔∗(a, b, c).
Example 8. Quantifier ⇔+ (Kulczynski 1927, see [2]): ⇔+(a, b, c) = ½ (a/(a+b) + a/(a+c)) is double implicational but not Σ-double implicational, ⇔+ ∈ DI − ΣDI; for instance ⇔+(1, 1, 1) = ⇔+(1, 2, 0) does not hold.
Lemma 5. A quantifier ≡∗ is Σ-equivalence iff the following conditions hold: (i) for all a, b, c, d, a′, b′, c′, d′ such that a′ + d′ = a + d and b′ + c′ = b + c, ≡∗(a′, b′, c′, d′) = ≡∗(a, b, c, d) holds; (ii) the quantifier ⇔∗ defined by ⇔∗(a, b, c) = ≡∗(a, b, c, 0) is Σ-double implicational.
Proof. For Σ-equivalence quantifiers, (i) and (ii) are clearly true. Let ≡∗ be a quantifier satisfying (i), (ii), and let a′ + d′ ≥ a + d ∧ b′ + c′ ≤ b + c. Then ≡∗(a′, b′, c′, d′) = ≡∗(a′ + d′, b′, c′, 0) = ⇔∗(a′ + d′, b′, c′) ≥ ⇔∗(a + d, b, c) = ≡∗(a + d, b, c, 0) = ≡∗(a, b, c, d).
Example 9. Quantifier ≡+ (Sokal, Sneath 1963, see [2]): ≡+(a, b, c, d) = ¼ (a/(a+b) + a/(a+c) + d/(d+b) + d/(d+c)) is equivalence but not Σ-equivalence, ≡+ ∈ E − ΣE; for instance ≡+(1, 1, 1, 1) = ≡+(2, 1, 1, 0) does not hold.
We shall use the following definition to state relations between different quantifiers:
Definition 6. A quantifier ∼1 is less strict than ∼2 (or ∼2 is more strict than ∼1) if for all four-fold tables ⟨a, b, c, d⟩, ∼1(a, b, c, d) ≥ ∼2(a, b, c, d). From the (fuzzy) logic point of view, this means that in all models (data) the formula ϕ ∼1 ψ is at least as true as the formula ϕ ∼2 ψ, i.e. the deduction rule ϕ∼2ψ / ϕ∼1ψ is correct.
Example 10. ⇔ is more strict than ⇒, and less strict than ≡; ⇔+ is more strict than ⇔.
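For illustration, the quantifiers of Examples 1–3 and the upper critical implication of Example 5 are easy to compute directly from a four-fold table. The following Python sketch only mirrors the formulas above; it is not code from the GUHA systems, and the sample table is hypothetical.

from math import comb

def basic_implication(a, b):
    return a / (a + b)

def basic_double_implication(a, b, c):
    return a / (a + b + c)

def basic_equivalence(a, b, c, d):
    return (a + d) / (a + b + c + d)

def upper_critical_implication(a, b, p):
    # Sum_{i=0}^{a} C(a+b, i) * p^i * (1-p)^(a+b-i)
    n = a + b
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(a + 1))

# A four-fold table <a, b, c, d>
a, b, c, d = 30, 5, 10, 55
print(basic_implication(a, b))                # 0.857...
print(basic_double_implication(a, b, c))      # 0.666...
print(basic_equivalence(a, b, c, d))          # 0.85
print(upper_critical_implication(a, b, 0.9))  # binomial tail for p = 0.9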
3 Correspondence between Classes of Σ-Double Implicational Quantifiers and Implicational Ones
Let ⇒∗ be an implicational quantifier. There is a natural task to construct a Σ-double implicational quantifier ⇔∗ such that both implications ϕ ⇒∗ ψ and ψ ⇒∗ ϕ logically follow from the formula ϕ ⇔∗ ψ, i.e. the deduction rules ϕ⇔∗ψ / ϕ⇒∗ψ and ϕ⇔∗ψ / ψ⇒∗ϕ are correct. Such a quantifier ⇔∗ should be as little strict as possible, so as to be near to ⇒∗. The following two theorems show how to construct the logically nearest Σ-double implicational quantifier from a given implicational quantifier and vice versa.
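The constructions behind the theorems below — ⇔∗(a, b, c) = ⇒∗(a, b + c) and ≡∗(a, b, c, d) = ⇔∗(a + d, b, c) — can be expressed generically. The following sketch is only an illustration; applied to the basic implication it reproduces the basic double implication and basic equivalence.

def affiliated_triad(implicational):
    """Given an implicational quantifier (a, b) -> [0, 1], build the affiliated
    Sigma-double implicational and Sigma-equivalence quantifiers."""
    def double_implicational(a, b, c):
        return implicational(a, b + c)            # construction of Theorem 7
    def equivalence(a, b, c, d):
        return double_implicational(a + d, b, c)  # construction of Theorem 9
    return double_implicational, equivalence

basic_implication = lambda a, b: a / (a + b)
di, eq = affiliated_triad(basic_implication)
print(di(3, 1, 1))     # 3/5 = 0.6, the basic double implication
print(eq(3, 1, 1, 5))  # 8/10 = 0.8, the basic equivalence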
Theorem 7. Let ⇒∗ be an implicational quantifier and let ⇔∗ be the quantifier constructed from ⇒∗ for all four-fold tables ⟨a, b, c, d⟩ by the formula ⇔∗(a, b, c) = ⇒∗(a, b + c). Then ⇔∗ is the Σ-double implicational quantifier which is the least strict in the class of all Σ-double implicational quantifiers ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property ∼(a, b, c) ≤ min(⇒∗(a, b), ⇒∗(a, c)).
Remark. Let us mention that this means the following: (1) the deduction rules ϕ⇔∗ψ / ϕ⇒∗ψ and ϕ⇔∗ψ / ψ⇒∗ϕ are correct; (2) if ∼ is a Σ-double implicational quantifier such that the deduction rules ϕ∼ψ / ϕ⇒∗ψ and ϕ∼ψ / ψ⇒∗ϕ are correct, then ∼ is more strict than ⇔∗, i.e. the rule ϕ∼ψ / ϕ⇔∗ψ is also correct.
Proof. Since ⇒∗ is an implicational quantifier, ⇔∗ is a Σ-double implicational quantifier; moreover, ⇔∗(a, b, c) = ⇒∗(a, b + c) ≤ min(⇒∗(a, b), ⇒∗(a, c)) for all four-fold tables ⟨a, b, c, d⟩. Let ∼ be a Σ-double implicational quantifier satisfying the property ∼(a, x, y) ≤ min(⇒∗(a, x), ⇒∗(a, y)) for all four-fold tables ⟨a, x, y, d⟩. Then, using Lemma 4, we obtain ∼(a, b, c) = ∼(a, b + c, 0) ≤ ⇒∗(a, b + c) = ⇔∗(a, b, c) for all four-fold tables ⟨a, b, c, d⟩, which means that ∼ is more strict than ⇔∗.
Example 11. (1) For the basic implication ⇒(a, b) = a/(a+b), the basic double implication ⇔(a, b, c) = a/(a+b+c) is the least strict Σ-double implicational quantifier satisfying the deduction rules ϕ⇔ψ / ϕ⇒ψ and ϕ⇔ψ / ψ⇒ϕ. (2) For the upper critical implication ⇒?p(a, b) = Σ_{i=0}^{a} (a+b)!/(i!(a+b−i)!) p^i (1−p)^{a+b−i}, the upper critical double implication ⇔?p(a, b, c) = Σ_{i=0}^{a} (a+b+c)!/(i!(a+b+c−i)!) p^i (1−p)^{a+b+c−i} is the least strict Σ-double implicational quantifier satisfying the deduction rules ϕ⇔?pψ / ϕ⇒?pψ and ϕ⇔?pψ / ψ⇒?pϕ.
Theorem 8. Let ⇔∗ be a Σ-double implicational quantifier and ⇒∗ be the quantifier constructed from ⇔∗ for all four-fold tables < a, b, c, d > by the formula ⇒∗ (a, b) = ⇔∗ (a, b, 0). Then ⇒∗ is the implicational quantifier which is the most strict from the class of all implicational quantifiers ∼ satisfying for all four-fold tables < a, b, c, d > the property min(∼ (a, b), ∼ (a, c)) ≥ ⇔∗ (a, b, c).
Remark. Let us mention that this means the following: (1) the deduction rules ϕ⇔∗ψ / ϕ⇒∗ψ and ϕ⇔∗ψ / ψ⇒∗ϕ are correct; (2) if ∼ is an implicational quantifier such that the deduction rules ϕ⇔∗ψ / ϕ∼ψ and ϕ⇔∗ψ / ψ∼ϕ are correct, then ∼ is less strict than ⇒∗, i.e. the rule ϕ⇒∗ψ / ϕ∼ψ is also correct.
Proof. Since ⇔∗ is a Σ-double implicational quantifier, ⇒∗ is an implicational quantifier; moreover, ⇔∗(a, b, c) = ⇔∗(a, b + c, 0) ≤ min(⇔∗(a, b, 0), ⇔∗(a, c, 0)) = min(⇒∗(a, b), ⇒∗(a, c)) for all four-fold tables ⟨a, b, c, d⟩. Let ∼ be an implicational quantifier satisfying the property min(∼(a, b), ∼(a, c)) ≥ ⇔∗(a, b, c) for all four-fold tables ⟨a, b, c, d⟩. Then we obtain ∼(a, b) ≥ ⇔∗(a, b, 0) = ⇒∗(a, b) for all four-fold tables ⟨a, b, c, d⟩, which means that ∼ is less strict than ⇒∗.
4 Correspondence between Classes of Σ-Equivalence Quantifiers and Σ-Double Implicational Ones
This section is a close analogy of the previous one. Let ⇔∗ be a Σ-double implicational quantifier. There is a natural task to construct a Σ-equivalence ≡∗ such that the formula ϕ ≡∗ ψ logically follows both from the formula ϕ ⇔∗ ψ and from the formula ¬ϕ ⇔∗ ¬ψ, i.e. the deduction rules ϕ⇔∗ψ / ϕ≡∗ψ and ¬ϕ⇔∗¬ψ / ϕ≡∗ψ are correct. Such a quantifier ≡∗ should be as strict as possible, to be near to ⇔∗. The following theorems show how to construct the logically nearest Σ-equivalence quantifier from a given Σ-double implicational quantifier and vice versa. The proofs of these theorems are similar to the proofs of Theorems 7 and 8, so we omit them for lack of space.
Theorem 9. Let ⇔∗ be a Σ-double implicational quantifier and let ≡∗ be the quantifier constructed from ⇔∗ for all four-fold tables ⟨a, b, c, d⟩ by the formula ≡∗(a, b, c, d) = ⇔∗(a + d, b, c). Then ≡∗ is the Σ-equivalence which is the most strict in the class of all Σ-equivalences ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property ∼(a, b, c, d) ≥ max(⇔∗(a, b, c), ⇔∗(d, b, c)).
Example 12. (1) For the basic double implication ⇔(a, b, c) = a/(a+b+c), the basic equivalence ≡(a, b, c, d) = (a+d)/(a+b+c+d) is the most strict Σ-equivalence satisfying the deduction rules ϕ⇔ψ / ϕ≡ψ and ¬ϕ⇔¬ψ / ϕ≡ψ. (2) For the upper critical double implication ⇔?p(a, b, c) = Σ_{i=0}^{a} (a+b+c)!/(i!(a+b+c−i)!) p^i (1−p)^{a+b+c−i}, the upper critical equivalence ≡?p(a, b, c, d) = Σ_{i=0}^{a+d} (a+b+c+d)!/(i!(a+b+c+d−i)!) p^i (1−p)^{a+b+c+d−i} is the most strict Σ-equivalence satisfying the deduction rules ϕ⇔?pψ / ϕ≡?pψ and ¬ϕ⇔?p¬ψ / ϕ≡?pψ.
Theorem 10. Let ≡∗ be a Σ-equivalence quantifier and let ⇔∗ be the quantifier constructed from ≡∗ for all four-fold tables ⟨a, b, c, d⟩ by the formula ⇔∗(a, b, c) = ≡∗(a, b, c, 0). Then ⇔∗ is the Σ-double implicational quantifier which is the least strict in the class of all Σ-double implicational quantifiers ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property max(∼(a, b, c), ∼(d, b, c)) ≤ ≡∗(a, b, c, d).
5 Conclusions
The theorems proved in the paper show that quantifiers from the classes I, ΣDI, ΣE compose logically affiliated triads ⇒∗, ⇔∗, ≡∗, where ⇒∗ is an implicational quantifier, ⇔∗ is a Σ-double implicational quantifier, and ≡∗ is a Σ-equivalence. Examples of such triads included in this paper are:
Example 13. The triad of basic quantifiers ⇒, ⇔, ≡, where ⇒(a, b) = a/(a+b), ⇔(a, b, c) = a/(a+b+c), ≡(a, b, c, d) = (a+d)/(a+b+c+d).
Example 14. The triad of statistically motivated upper critical quantifiers ⇒?p, ⇔?p, ≡?p, where ⇒?p(a, b) = Σ_{i=0}^{a} (a+b)!/(i!(a+b−i)!) p^i (1−p)^{a+b−i}, ⇔?p(a, b, c) = Σ_{i=0}^{a} (a+b+c)!/(i!(a+b+c−i)!) p^i (1−p)^{a+b+c−i}, ≡?p(a, b, c, d) = Σ_{i=0}^{a+d} (a+b+c+d)!/(i!(a+b+c+d−i)!) p^i (1−p)^{a+b+c+d−i}.
Let us stress that for each given quantifier from the classes I, ΣDI, ΣE such a triad can be constructed. This can naturally extend the methodological approach used in the definition of a particular quantifier to cover all three types of relations (implication, double implication, equivalence). We proved that the following deduction rules are correct for the triads: ϕ⇔∗ψ / ϕ⇒∗ψ, ϕ⇔∗ψ / ψ⇒∗ϕ, ϕ⇔∗ψ / ϕ≡∗ψ, ¬ϕ⇔∗¬ψ / ϕ≡∗ψ. These deduction rules can be used in knowledge discovery and data mining methods in various ways: (1) to organize effectively the search for rules in databases (discovering some rules is a reason to skip others in the search, because some other rules simply follow from the
discovered ones; the nonvalidity of some rules means that some others are also not valid, ...); (2) to filter the results of a data mining procedure (results which follow from others are not so interesting for users); (3) to order rules according to different (but affiliated) quantifications. In practice, some of the above described ideas were used in the systems Combinational Data Analysis, ESOD [8], Knowledge Explorer [3], and 4FT-Miner [12]. This research has been supported by grant VS96008 of the Ministry of Education, Youth and Sports of the Czech Republic. The author is grateful to J. Rauch and R. Jiroušek for their valuable comments on a preliminary version of the paper.
References 1. Aggraval, R. et al.: Fast Discovery of Association Rules. In Fayyad, V.M. et al.: Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press 1996, p.307-328. 2. Batagelj, V., Bren, M.: Comparing Resemblance Measures. J. of Classification 12 (1995), p. 73-90. 3. Berka, P., Iv´ anek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In Machine Learning. ECML-94 Catania (ed. Bergadano, Raedt). Springer 1994, p.339-342. 4. H´ ajek,P., Havr´ anek,T.: Mechanising Hypothesis Formation - Mathematical Foundations for a General Theory. Springer-Verlag, Berlin 1978, 396 p. 5. H´ ajek,P., Havr´ anek,T., Chytil M.: Metoda GUHA. Academia, Praha 1983, 314 p. (in Czech) 6. H´ ajek, P., Sochorov´ a, A., Zv´ arov´ a, J.: GUHA for personal computers. Computational Statistics & Data Analysis 19 (1995), p. 149 - 153 7. H´ ajek, P., Holeˇ na, M.: Formal Logics of Discovery and Hypothesis Formation by Machine. In Discovery Science (Arikawa,S. and Motoda,H., eds.), Springer-Verlag, Berlin 1998, p.291-302 8. Iv´ anek, J., Stejskal, B.: Automatic Acquisition of Knowledge Base from Data without Expert: ESOD (Expert System from Observational Data). In Proc. COMPSTAT’88 Copenhagen. Physica-Verlag, Heidelberg 1988, p.175-180. 9. Rauch,J.: GUHA as a Data Mining Tool. In: Practical Aspects of Knowledge Management. Schweizer Informatiker Gesellshaft Basel, 1996 10. Rauch, J.: Logical Calculi for Knowledge Discovery in Databases. In Principles of Data Mining and Knowledge Discovery, (Komorowski,J. and Zytkow,J., eds.), Springer-Verlag, Berlin 1997, p. 47-57. 11. Rauch,J.: Classes of Four-Fold Table Quantifiers. In Principles of Data Mining and Knowledge Discovery, (Quafafou,M. and Zytkow,J., eds.), Springer Verlag, Berlin 1998, p. 203-211. 12. Rauch,J.: 4FT-Miner - popis procedury. Technical Report LISp-98-09, Praha 1999. 13. Zembowicz,R. - Zytkow,J.: From Contingency Tables to Various Forms of Knowledge in Databases. In Fayyad, U.M. et al.: Advances in Knowledge Discovery and Data Mining. AAAI Press/ The MIT Press 1996, p. 329-349.
Querying Inductive Databases via Logic-Based User-Defined Aggregates Fosca Giannotti and Giuseppe Manco CNUCE – CNR, Via S. Maria 36, 56125 Pisa, Italy
{F.Giannotti,G.Manco}@cnuce.cnr.it
Abstract. We show how a logic-based database language can support the various steps of the KDD process by providing: a high degree of expressiveness, the ability to formalize the overall KDD process, and the capability of separating the concerns between the specification level and the mapping to the underlying databases and data mining tools. We generalize the notion of Inductive Databases proposed in [4, 12] to the case of deductive databases. In our proposal, deductive databases resemble relational databases, while the user-defined aggregates provided by the deductive database language resemble the mining functions and results. In the paper we concentrate on association rules and show how the mechanism of user-defined aggregates allows one to specify the mining evaluation functions and the returned patterns.
1 Introduction
The rapid growth and spread of knowledge discovery techniques has highlighted the need to formalize the notion of the knowledge discovery process. While it is clear what the objectives of the various steps of the knowledge discovery process are, little support is provided to reach such objectives and to manage the overall process. The role of domain, or background, knowledge is relevant at each step of the KDD process: which attributes discriminate best, how we can characterize a correct/useful profile, what the interesting exception conditions are, etc., are all examples of domain-dependent notions. Notably, in the evaluation phase we need to associate with each inferred knowledge structure some quality function [HS94] that measures its information content. However, while it is possible to define quantitative measures for certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, speed-up, etc.), notions such as novelty and understandability are much more subjective to the task, and hence difficult to define. Here, in fact, the specific measurements needed depend on a number of factors: the business opportunity, the sophistication of the organization, the past history of measurements, and the availability of data. The position that we maintain in this paper is that a coherent formalism, capable of dealing uniformly with induced knowledge and background, or domain,
knowledge, would represent a breakthrough in the design and development of decision support systems, in several challenging application domains. Other proposal in the current literature have given experimental evidence that the knowledge discovery process can take great advantage of a powerful knowledge-representation and reasoning formalism [14, 11, 15, 5]. In this context, the notion of inductive database, proposed in [4, 12], is a rst attempt to formalize the notion of interactive mining process. An inductive database provides a uni ed and transparent view of both inferred (deductive) knowledge, and all the derived patterns, (the induced knowledge) over the data. The objective of this paper is to demonstrate how a logic-based database language, such as LDL++ [17], can support the various steps of the KDD process by providing: a high degree of expressiveness, the ability to formalize the overall KDD process and the capability of separating the concerns between the speci cation level and the mapping to the underlying databases and data mining tools. We generalize the notion of Inductive Databases proposed in [4, 12] to the case of Deductive Databases. In our proposal, deductive databases resemble relational databases while user de ned aggregates provided by LDL++ resemble the mining function and results. Such mechanism provides a exible way to customize, tune and reason on both the evaluation function and the extracted knowledge. In the paper we show how such a mechanism can be exploited in the task of association rules mining. The interested reader is referred to an extended version [7] of this paper, which covers the bayesian classi cation data mining task. 2
Logic Database Languages
Deductive databases are database management systems whose query languages and storage structures are designed around a logical model of data. The underlying technology is an extension to relational databases that increases the power of the query language. Among the other features, the rule-based extensions support the speci cation of queries using recursion and negation. We adopt the LDL++ deductive database system, which provides, in addition to the typical deductive features, a highly expressive query language with advanced mechanisms for non-deterministic, non-monotonic and temporal reasoning [9, 18]. In deductive databases, the extension of a relation is viewed as a set of facts, where each fact corresponds to a tuple. For example, let us consider the predicate assembly(Part Subpart) containing parts and their immediate subparts. The predicate partCost(BasicPart Supplier Cost) describes the basic parts, i.e., parts bought from external suppliers rather than assembled internally. Moreover, for each part the predicate describes the supplier, and for each supplier the price charged for it. Examples of facts are: ;
;
;
assembly(bike; frame): partCost(top tube; reed; 20): assembly(bike; wheel): partCost(fork; smith; 10): assembly(wheel; nipple):
Querying Inductive Databases via Logic−Based User−Defined Aggregates
127
Rules constitute the main construct of LDL++ programs. For instance, the rule multipleSupp(S) partCost(P1 S ) partCost(P2 S ) P1 6= P2 describes suppliers that sell more than one part. The rule corresponds to the SQL join query ;
;
;
;
;
;
:
SELECT P1.Supplier FROM partCost P1, partCost P2 WHERE P1.Supplier = P2.Supplier AND P1.BasicPart <> P2.BasicPart
In addition to the standard relational features, LDL++ provides recursion and negation. For example, the rule allSubparts(P S) assembly(P S) allSubparts(P S) allSubparts(P S1) assembly(S1 S) computes the transitive closure of the relation assembly. The following rule computes the least cost for each basic part by exploiting negation: cheapest(P C) partCost(P C) :cheaper(P C) cheaper(P C) partCost(P C1) C1 C ;
;
;
:
;
;
; ;
;
;
;
;
; ;
;
;
<
:
:
:
2.1 Aggregates
A remarkable capability is that of expressing distributive aggregates (i.e., aggregates computable by means of a distributive and associative operator), which are de nable by the user [18]. For example, the following rule illustrates the use of a sum aggregate, which aggregates the values of the relation sales along the dimension Dealer: supplierTot(Date Place sumhSalesi) sales(Date Place Dealer Sales) Such rule corresponds to the SQL statement ;
;
;
;
;
:
SELECT Date, Place, SUM(Sales) FROM sales GROUP BY Date, Place
From a semantic viewpoint, the above rule is a syntactic sugar for a program that exploits the notions of nondeterministic choice and XY-strati cation [6, 17, 9]. In order to compute the following aggregation predicate q(Y aggrhXi) p(X Y) we exploit the capability of imposing a nondeterministic order among the tuples of the relation p, ordP(Y nil nil) p(X Y) ordP(Z X Y) ordP(Z X) p(Y Z) choice(X Y) choice(Y X) ;
; ;
;
;
;
;
; ;
:
:
;
;
;
;
;
;
:
128
F. Giannotti and G. Manco
Here nil is a fresh constant, conveniently used to simplify the program. If the base relation p is formed by tuples for a given value of , then there are ! possible outcomes for the query ordP(X Y), namely a set: fordP(s nil nil) ordP(s nil t1) ordP(s t1 t2 ) ordP(s tk,1 tk)g for each permutation f(t1 s) (tk s)g of the tuples of P. Therefore, in each possible outcome of the mentioned query, the relation ordP is a total (intransitive) ordering of the tuples of p. The double choice constraint in the recursive rule speci es that the successor and predecessor of each tuple of p is unique. As shown in [17], we can then exploit such an ordering to de ne distributive aggregates, inductively de ned as (f g) = ( ) and ( [ f g) = ( ( ) ). By de ning the base and inductive cases by means of ad-hoc user-de ned predicates single and multi, we can then obtain an incremental computation of the aggregation function: aggrP(Aggr Z nil C) ordP(Z nil X) X 6= nil single(Aggr X C) aggrP(Aggr Z Y C) ordP(Z X Y) aggrP(Aggr X C1) multi(Aggr Y C1 C) Finally, the originary rule can be translated into q(Y C) ordP(Y X) :ordP(Y X ) aggrP(aggr Y X C) Example 1 ( [18]). The aggregate sum can be easily de ned by means of the following rules: single(sum X X) multi(sum X SO SN) SN = SO + X k
s
Y
k
;
;
;
;
;
;
;
;:::;
;
;
;
;
;
x
g x
;
;
;
;
;
; ;
;
;:::;
;
;
;
;
;
x
h f S ;x
;
;
;
f S
;
;
;
;
;
f
;
;
;
;
;
;
:
;
;
;
;
;
;
;
;
:
:
:
;
;
:
ut
In [18], a further extension to the approach is proposed, in order to deal with more complex aggregation functions. Practically, we can manipulate the results of the aggregation function by means of two predicates freturn and ereturn. The rule de nining the aggregation predicate is translated into the following: q(Z R) ordP(Z X Y) aggrP(aggr Z X C) ereturn(aggr Y C R) q(Z R) ordP(Z X Y) :ordP(Z Y ) aggrP(aggr Z Y C) freturn(aggr C R) where the rst rule de nes early returns (i.e., results of intermediate computations), and the second rule de nes nal returns, i.e., nal results. Example 2 ([18]). The aggregate maxpair considers tuples ( ), where is a real number, and returns the value with the greater value of . The aggregate can be de ned by means of single, multi and freturn: single(maxpair (C P) (C P)) multi(maxpair (C P) (CO PO) (C P)) P PO multi(maxpair (C P) (CO PO) (CO PO)) P PO freturn(maxpair (CO PO) CO) ;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
:
;
;
ci ; ni
ci
;
:
ni
ni
;
;
;
:
;
;
;
;
;
;
;
;
;
;
;
;
;
;
:
;
:
;
<
:
ut
Querying Inductive Databases via Logic−Based User−Defined Aggregates 3
129
Logic-Based Inductive Databases
In [4], an inductive database schema is de ned as a pair R = (R (QR e; V )), where R is a database schema, QR is a collection of patterns, V is a set of result values and e is an evaluation function mapping each instance r of R and each pattern 2 QR in V . An inductive database instance is then de ned as a pair (r; s), where r is an instance of R and s QR. A typical KDD process operates on both the components of an inductive database, by querying both components of the pair (assuming that s is materialized as a table, and that the value e(r; ) is available for each value of s). A simple yet powerful way of formalizing such ideas in a query language is that of exploiting user-de ned aggregates. Practically, we can formalize the inductive part of an inductive database (i.e., the triple (QR ; e; V )) by means of rules that instantiate the following general schema: ;
r(Y1 ; : : : ; Ym):
s(u d aggrhe; X1 ; : : : ; Xn i)
;
(1)
Intuitively, this rule de nes the format of any subset s of QR . The patterns in s are obtained from a rearranged subset X1; : : : ; Xn of the tuples Y1 ; : : : ; Ym in r. The structure of s is de ned by the formal speci cation of the aggregate u d aggr, in particular by the freturn rule. The tuples resulting from the evaluation of such rule, represent patterns in QR and depend by the evaluation function e. The computation of the evaluation function must be speci ed by u d aggr as well. Example 3. Consider the patterns \the items in the corresponding column of the relation transaction(Tid; Item; Price; Qty) with the average value more than a given threshold". The inductive database has R transaction, QR = fiji 2 dom(R[Item])g, V = IR and e(r; i) = avg(fp qj(t; i; p; q) 2 rg. The above inductive schema is formalized, according to (1) with the following rule:
h(; Itm; Val)i)
s(avgTh
;
;
;
;
transaction( Itm Prc Qty) Val = Prc
Qty:
Where the aggregate avgThres is de ned, as usual, by means of the predicates
; ; ; ; ; ; ; : multi(avgThres; (T; I; VN); (T; I; VO; NO); (T; I; V; N)) V = VN + VO; N = NO + 1: multi(avgThres; (T; I; VN); (T; I; VO; NO); (T; I; VO; NO)): multi(avgThres; (T; I; VN); (T; IO; VO; NO); (T; I; VN; 1)) I 6= IO: freturn(avgThres; (T; I; V; N); (I; A)) A = V=N; A T:
single(avgThres (T I V) (T I V 1))
For each item, both the sum and the count of the occurrences is computed. When all the tuples have been considered, the average value of each item is computed, and returned as answer if and only if it is greater than the given threshold. ut
130
F. Giannotti and G. Manco
The advantage of such an approach is twofold. First, we can directly exploit the schema (1) to de ne the evaluation function e. Second, the \inductive" predicate s itself can be used in the de nition of more complex queries. This de nes a uniform way of providing support for both the deductive and the inductive components.
4 Association Rules As shown in [2], the problem of nding association rules consist of two problems: the problem of nding frequent itemsets and consequently the problem to nd rules from frequent itemsets. Frequent itemsets are itemsets that appear in the database with a given frequency. So, from a conceptual point of view, they can be seen as the results of an aggregation function over the set possible values of an attribute. Hence, we can re ne the idea explained in the previous section, by de ning a predicate p by means of the rule p(X1
; : : : ; Xn; patternsh(min supp; [Y1; : : : ; Ym])i)
q(Z1
; : : : ; Zm ):
In this rule, the variables X1 ; : : : ; Xn; Y1 ; : : : ; Ym are a rearranged subset of the variables Z1 ; : : : ; Zk of q. The aggregate patterns computes the set of predicates p(s; f ) where: 1. s = fl1 ; : : : ; l g is a rearranged subset of the values of Y1 ; : : : ; Ym in a tuple resulting from the evaluation of q. 2. f is the support of the set s, such that f min supp. l
It is easy to provide a (naive) de nition of the patterns aggregate:
;
;
; ; ; subset(SSet; Set): multi(patterns; (Sp; SetN); (SSetO; Sp; N); (SSetO; Sp; N)) single(patterns (Sp Set) (SSet Sp 1))
:
;
:
subset(SSetO SetN) multi(patterns (Sp SetN) (SSetO Sp N) (SSetO Sp N + 1)) subset(SSetO SetN) multi(patterns (Sp SetN) (SSetO Sp N) (SSet Sp 1)) subset(SSetO SetN) subset(SSet SetN) subset(SSet SSetO)
;
;
;
; ; ;
;
;
;
; ; ;
;
; ; ;
:
; ;
:
;
freturn(patterns (SSet Sp N) (SSet N))
N
; ; ;
;
;
;
:
:
;
Sp:
For each tuple, the set of possible subsets are generated. The single predicate initializes the rst subset that can be computed from the rst tuple, by setting their frequency to 1. As soon as following tuples are examined (with the multi predicate), the frequency of the subsets computed before the tuple under consideration is incremented (provided that it is a subset of the current tuple), and the frequency of new subsets obtained from the current tuple are preset to 1.
;
Querying Inductive Databases via Logic−Based User−Defined Aggregates
131
The freturn predicate de nes the output format and conditions for the aggregation predicate: a suitable answer is a pair (SubSet N) such that SubSet is an itemset of frequency N Sp, where Sp is the minimal support required. A typical example application consists in the computation of the frequent itemsets of a basket relation: frequentPatterns(patternsh(m S)i) basketSet(S) basketSet(hEi) basket(T E) where the predicate basketSet collects the baskets in a set structure1 . Rules can be easily generated from frequent patterns by means of rules like ;
>
;
;
rules(L; R; S; C)
:
:
frequentPatterns(A; S); frequentPatterns(R; S1 ); subset(R; A); difference(A; R; L); C = S=S1:
( 1) r
Notice, however, that such an approach, though semantically clean, is very inecient, because of the large amount of computations needed at each step2 . In [10] we propose a technique which allows a compromise between loose and tight coupling, by adopting external specialized algorithms (and hence specialized data structures), but preserving the integration with the features of the language. In such proposal, inductive computations may be considered as aggregates, so that the proposed representation formalism is unaected. However, the inductive task is performed by an external ad-hoc computational engine. Such an approach has the main advantage of ensuring ad-hoc optimizations concerning the mining task transparently and independently from the deductive engine. In our case the patterns aggregate is implemented with some typical algorithm for the computation of the association rules. (e.g., Apriori algorithm [2]). The aggregation speci cation can hence be seen as a middleware between the core algorithm and the data set (de ned by the body of the rule) against which the algorithm is applied. The rest of the section shows some examples of complex queries whithin the resulting logic language. In the following we shall refer to the table with schema and contents exempli ed in 1. Example 4. \Find patterns with at least 3 occurrences from the daily transactions of each customer": frequentPatterns(patternsh(3 S)i) transSet(D C S) transSet(D C hIi) transaction(D C I P Q) By querying frequentPatterns(F S) we obtain, among the answers, the tuples (f g 3) and (f g 3). ut ;
;
;
;
;
;
;
;
;
:
:
;
pasta ;
pasta; wine ;
Again, in LDL++ the capability of de ning set-structures (and related operations) is guaranteed by the choice construct and by XY-strati cation. 2 Practically, the aggregate computation generates 2 I sets of items, where is the set of dierent items appearing in the tuples considered during the computation. Pruning of unfrequent subsets is made at the end of the computation of all subsets. Notice, however, that clever strategies can be de ned (e.g., computation of frequent maximal patterns [3]). 1
j j
I
132
F. Giannotti and G. Manco
transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(15-2-97, transaction(15-2-97,
cust1, beer, 10, 10). cust1, chips, 3, 20). cust1, wine, 20, 2). cust2, wine, 20, 2). cust2, beer, 10, 10). cust2, pasta, 2, 10). cust2, chips, 3, 20). cust2, jackets, 100, 1). cust2, col shirts, 30, 3). cust3, wine, 20, 1). cust3, beer, 10, 5). cust1, chips, 3, 20). cust1, beer,10,2). cust1,pasta,2,10). cust1,chips,3,10). Table 1.
transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97,
A sample transaction table.
cust1,jackets,120,1). cust2,wine,20,1). cust2,pasta,4,8). cust3, chips, 3, 20). cust3,col shirts,25,3). cust3,brown shirts,40,2). cust2,beer,8,12). cust2,beer,10,10). cust2,chips,3,20). cust2,chips,3,20). cust3,pasta,2,10). cust1,pasta,3,5). cust1,wine,25,1). cust1, chips, 3, 20). cust1, beer, 10, 10).
Example 5. \Find patterns with at least 3 occurrences from the transactions of each customers": frequentPatterns(patternsh(3; S)i) transSet(C; S): transSet(C; hIi) transaction(D; C; I; P; Q): Dierently from the previous example, where transactions were grouped by customer and by date, the previous rules group transactions by customer. We then compute the frequent patterns on the restructured transactions transSet(cust1 fbeer chips jackets pasta wineg) transSet(cust2 fbeer chips col shirts jackets pasta wineg) transSet(cust3 fbeer brown shirts chips col shirts pasta wineg) obtaining, e.g., the pattern (fbeer; chips; pasta; wineg; 3). t u Example 6. \Find association rules with a minimum support 3 from daily transactions of each customer". This can be formalized by rule (r1 ). Hence, by querying rules(L; R; S; C), we obtain the association rule (fpastag; fwineg; 3; 0:75). We can further postprocess the results of the aggregation query. For example, the query rules(fA; Bg; fbeerg; S; C) computes \two-to-one" those rules where the consequent is the beer item. An answer is (fchips; wineg; fbeerg; 3; 1). ut Example 7. The query \ nd patterns from daily transactions of high-spending customers (i.e., customers with at least 70 of total expense ad at most 3 items brought), such that each pattern has at least 3 occurrences" can be formalized as follows: frequentPatterns(patternsh(3; S)i) transSet(D; C; S; I; V); V > 70; I 3: transSet(D; C; hIi; counthIi; sumhVi) transaction(D; C; I; P; Q); V = P Q: ;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
The query frequentPatterns(F S) returns the patterns (beer 3), (chips 4) and (beer chips 3) that characterize the class of high-spending customers. ut ;
;
;
;
;
Querying Inductive Databases via Logic−Based User−Defined Aggregates
133
Example 8 ([10]). The query \ nd patterns from daily transactions of each customer, at each generalization level, such that each pattern has a given occurrency depending from the generalization level" is formalized as follows: itemsGeneralization(0; D; C; I; P; Q) transaction(D; C; I; P; Q): itemsGeneralization(I + 1; D; C; AI; P; Q) itemsGeneralization(I; D; C; S; P; Q); category(S; AI):
h i)
itemsGeneralization(I; D; C; S
h
i
itemsGeneralization(I; D; C; S; P; Q):
freqAtLevel(I; patterns (Supp; S) ) itemsGeneralization(I; D; C; S); suppAtLevel(I; S):
where the suppAtLevel predicate tunes the support threshold at a given item hierarchy. The query is the result of a tighter coupling of data preprocessing and result interpretation and postprocessing: we investigate the behaviour of rules over an item hierarchy. Suppose that the following tuples de ne a part-of hierarchy: category(beer; drinks) category(wine; drinks) category(pasta; food) category(chips; food) category(jackets; wear) category(col shirts; wear) category(brown shirts; wear)
Then, by querying freqAtLevel(I F S) we obtain, e.g., (0 (1 food 9), (1 drinks 7) and (1 drinks food 6). ;
;
;
;
;
;
;
;
;
beer; chips; wine; 3),
;
ut
Example 9. The query \ nd rules that are interestingly preserved by drillingdown an item hierarchy" is formalized as follows: rulesAtLevel(I; L; R; S; C) preservedRules(L; R; S; C)
freqAtLevel(I; A; S); freqAtLevel(I; R; S1); subset(R; A); difference(A; R; L); C = S=S1 :
rulesAtLevel(I + 1; L1 ; R1 ; S1 ; C1 ); rulesAtLevel(I; L; R; S; C); setPartOf(L; L1); setPartOf(R; R1); C > C1:
Preserved rules are de ned as those rules valid at any generalization level, such that their con dence is greater than their generalization3. ut
5 Final Remark We have shown that the mechanism of user-de ned aggregates is powerful enough to model the notion of inductive database, and to specify exible query answering capabilities. 3
The choice for such an interest measure is clearly arbitrary and subjective. Other signi cant interest measures can be speci ed (e.g., the interest measure de ned in [16]).
134
F. Giannotti and G. Manco
A major limitation in the proposal is eciency: it has been experimentally shown that specialized algorithms (on specialized data structures) have a better performance than database-oriented approaches (see, e.g., [1]). Hence, in order to improve performance considerably, a thorough modi cation of the underlying database abstract machine should be investigated. Notice in fact that, with respect to ad hoc algorithms, when the programs speci ed in the previous sections are executed on a Datalog++ abstract machine, the only available optimizations for such programs are the traditional deductive databases optimizations [8]. Such optimizations techniques, however, need to be further improved by adding ad-hoc optimizations. For the purpose of this paper, we have been assuming to accept a reasonable worsening in performance, by describing the aggregation formalism as a semantically clean representation formalism, and demanding the computational eort to external ad-hoc engines [10]. This, however, is only a partial solution to the problem, in that more re ned optimization techniques can be adopted. For example, in example 6, we can optimize the query by observing that directly computing rules with three items (even by counting the transactions with at least three items) is less expensive than computing the whole set of association rules, and then selecting those with three items. Some interesting steps in this direction have been made: e.g., [13] proposes an approach to the optimization of datalog aggregation-based queries, and in [13] a detailed discussion of the problem of the optimized computation of optimized computation of constrained association rules is made. However, the computational feasibility of the proposed approach to more general cases is an open problem.
References 1. R. Agrawal, S. Sarawagi, and S. Thomas. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In Procs. of ACM-SIGMOD'98, 1998. 2. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, 1994. 3. R. Bayardo. Eciently Mining Long Patterns from Databases. In Proc. ACM Conf. on Management of Data (Sigmod98), pages 85{93, 1998. 4. J-F. Boulicaut, M. Klemettinen, and H. Mannila. Querying Inductive Databases: A Case Study on the MINE RULE Operator. In Proc. 2nd European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD98), volume 1510 of Lecture Notes in Computer Science, pages 194{202, 1998. 5. U.M. Fayyad, G. Piatesky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/the MIT Press, 1996. 6. F. Giannotti, D. Pedreschi, and C. Zaniolo. Semantics and Expressive Power of Non Deterministic Constructs for Deductive Databases. To appear in Journal of Logic Programming. 7. F. Giannotti and G. Manco. Querying inductive databases via logic-based userde ned aggregates. Technical report, CNUCE-CNR, June 1999. Available at http://www-kdd.di.unipi.it. 8. F. Giannotti, G[iuseppe Manco, M. Nanni, and D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. Technical report, Department of Computer Science Univ. Pisa, September 1998. Submitted for publication.
Querying Inductive Databases via Logic−Based User−Defined Aggregates
135
9. F. Giannotti, G. Manco, M. Nanni, and D. Pedreschi. Query Answering in Nondeterministic, Nonmonotonic, Logic Databases. In Procs. of the Workshop on Flexible Query Answering, number 1395 in Lecture Notes in Arti cial Intelligence, march 1998. 10. F. Giannotti, G. Manco, M. Nanni, D. Pedreschi, and F. Turini. Integration of deduction and induction for mining supermarket sales data. In Proceedings of the International Conference on Practical Applications of Knowledge Discovery (PADD99), April 1999. 11. J. Han. Towards On-Line Analytical Mining in Large Databases. Sigmod Records, 27(1):97{107, 1998. 12. H. Mannila. Inductive databases and condensed representations for data mining. In International Logic Programming Symposium, pages 21{30, 1997. 13. R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In Proc. ACM Conf. on Management of Data (Sigmod98), June 1998. 14. S. Ceri R. Meo, G. Psaila. A New SQL-Like Operator for Mining Association Rules. In Proceedings of The Conference on Very Large Databases, pages 122{133, 1996. 15. W. Shen, K. Ong, B. Mitbander, and C. Zaniolo. Metaqueries for Data Mining. In Advances in Knowledge Discovery and Data Mining, pages 375{398. AAAI Press/The MIT Press, 1996. 16. R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the 21th Int'l Conference on Very Large Databases, 1995. 17. C. Zaniolo, N. Arni, and K. Ong. Negation and Aggregates in Recursive Rules: The LDL++ Approach. In Proc. 3rd Int. Conf. on Deductive and Object-Oriented Databases (DOOD93), volume 760 of Lecture Notes in Computer Science, 1993. 18. C. Zaniolo and H. Wang. Logic-Based User-De ned Aggregates for the Next Generation of Database Systems. In The Logic Programming Paradigm: Current Trends and Future Directions. Springer Verlag, 1998.
Peculiarity Oriented Multi-database Mining Ning Zhong1 , Y.Y. Yao2 , and Setsuo Ohsuga3 1 3
Dept. of Computer Science and Sys. Eng., Yamaguchi University 2 Dept. of Computer Science, University of Regina Dept. of Information and Computer Science, Waseda University
Abstract.
The paper proposes a way of mining peculiarity rules from
multiply statistical and transaction databases. We introduce the peculiarity rules as a new type of association rules, which can be discovered
from a relatively small number of the peculiar data by searching the relevance among the peculiar data. We argue that the peculiarity rules represent a typically unexpected, interesting regularity hidden in statistical and transaction databases. We describe how to mine the peculiarity rules in the multi-database environment and how to use the RVER (Reverse Variant Entity-Relationship) model to represent the result of multi-database mining. Our approach is based on the database reverse engineering methodology and granular computing techniques. Keywords:
Multi-Database Mining, Peculiarity Oriented, Relevance,
Database Reverse Engineering, Granular Computing (GrC).
1 Introduction Recently, it has been recognized in the KDD (Knowledge Discovery and Data Mining) community that multi-database mining is an important research topic [3, 14, 19]. So far most of the KDD methods that have been developed are on the single universal relation level. Although theoretically, any multi-relational database can be transformed into a single universal relation, practically this can lead to many issues such as universal relations of unmanageable sizes, in ltration of uninteresting attributes, losing of useful relation names, unnecessary join operation, and inconveniences for distributed processing. In particular, some concepts, regularities, causal relationships, and rules cannot be discovered if we just search a single database since the knowledge hides in multiply databases basically. Multi-database mining involves many related topics including interestingness checking, relevance, database reverse engineering, granular computing, and distributed data mining. Liu et al. proposed an interesting method for relevance measure and an ecient implementation for identifying relevant databases as the rst step for multi-database mining [10]. Ribeiro et al. described a way for extending the INLEN system for multi-database mining by the incorporation of primary and foreign keys as well as the development and processing of knowledge segments [11]. Wrobel extended the concept of foreign keys into foreign links because multi-database mining is also interested in getting to non-key attributes •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 136−146, 1999. Springer−Verlag Berlin Heidelberg 1999
Peculiarity Oriented Multi−database Mining
137
[14]. Aronis et al. introduced a system called WoRLD that uses spreading activation to enable inductive learning from multiple tables in multiple databases spread across the network [3]. Database reverse engineering is a research topic that is closely related to multi-database mining. The objective of database reverse engineering is to obtain the domain semantics of legacy databases in order to provide meaning of their executable schemas' structure [6]. Although database reverse engineering has been investigated recently, it was not researched in the context of multi-database mining. In this paper we take a uni ed view of multi-database mining and database reverse engineering. We use the RVER (Reverse Variant Entity-Relationship) model to represent the result of multi-database mining. The RVER model can be regarded as a variant of semantic networks that are a kind of well-known method for knowledge representation. From this point of view, multi-database mining can be regarded as a kind of database reverse engineering. A challenge in multi-database mining is semantic heterogeneity among multiple databases since no explicit foreign key relationships exist among them usually. Hence, the key issue is how to nd/create the relevance among different databases. In our methodology, we use granular computing techniques based on semantics, approximation, and abstraction [7, 18]. Granular computing techniques provide a useful tool to nd/create the relevance among dierent databases by changing information granularity. In this paper, we propose a way of mining peculiarity rules from multiply statistical and transaction databases, which is based on the database reverse engineering methodology and granular computing techniques.
2 Peculiarity Rules and Peculiar Data In this section, we rst de ne peculiarity rules as a new type of association rules and then describe a way of nding peculiarity rules. 2.1
Association Rules vs. Peculiarity Rules
Association rules are an important class of regularity hidden in transaction databases [1, 2]. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y . So far, two categories of the association rules, the general rule and the exception rule, have been investigated [13]. A general rule is a description of a regularity for numerous objects and represents the well-known fact with common sense, while an exception rule is for a relatively small number of objects and represents exceptions to the well-known fact. Usually, the exception rule should be associated with a general rule as a set of rule pairs. For example, the rule \using a seat belt is risky for a child" which represents exceptions to the general rule with common sense \using a seat belt is safe". The peculiarity rules introduced in this paper can be regarded as a new type of association rules for a dierent purpose. A peculiarity rule is discovered from
138
N. Zhong, Y.Y. Yao, and S. Ohsuga
the peculiar data by searching the relevance among the peculiar data. Roughly speaking, a data is peculiar if it represents a peculiar case described by a relatively small number of objects and is very dierent from other objects in a data set. Although it looks like the exception rule from the viewpoint of describing a relatively small number of objects, the peculiarity rule represents the well-known fact with common sense, which is a feature of the general rule. We argue that the peculiarity rules are a typical regularity hidden in statistical and transaction databases. Sometimes, the general rules that represent the well-known fact with common sense cannot be found from numerous statistical or transaction data, or although they can be found, the rules may be uninteresting ones to the user since data are rarely specially collected/stored in a database for the purpose of mining knowledge in most organizations. Hence, the evaluation of interestingness (including surprisingness, unexpectedness, peculiarity, usefulness, novelty) should be done before and/or after knowledge discovery [5, 9, 12]. In particular, unexpected (common sense) relationships/rules may be hidden a relatively small number of data. Thus, we may focus some interesting data (the peculiar data), and then we nd more novel and interesting rules (peculiarity rules) from the data. For example, the following rules are the peculiarity ones that can be discovered from a relation called Japan-Geography (see Table 1) in a Japan-Survey database: rule1 rule2
! !
: ArableLand(large) & Forest(large) : ArableLand(small) & Forest(small)
Table 1. Region
Hokkaido
Area
(
)
P opulationDensity low :
(
)
P opulationDensity high :
Japan-Geography
Population PopulationDensity PeasantFamilyN ArableLand Forest . . .
82410.58
5656
67.8
93
1209
Aomori
9605.45
1506
156.8
87
169
623
...
...
...
...
...
...
...
...
...
Tiba
5155.64
5673
1100.3
116
148
168
...
2183.42
11610
5317.2
21
12
80
...
...
...
...
...
...
...
...
1886.49
8549
4531.6
39
18
59
...
...
...
...
...
...
...
...
Tokyo Osaka ... ...
5355 . . .
In order to discover the rules, we rst need to search the peculiar data in the relation Japanese-Geography. From Table 1, we can see that the values of the attributes ArableLand and Forest for Hokkaido (i.e. 1209 Kha and 5355 Kha) and for Tokyo and Osaka (i.e. 12 Kha, 18 Kha, and 80 Kha, 59 Kha) are very dierent from other values in the attributes. Hence, the values are regarded as the peculiar data. Furthermore, rule and rule are generated by searching the relevance among the peculiar data. Note that we use the qualitative representation for 1
2
Peculiarity Oriented Multi−database Mining
139
the quantitative values in the above rules. The transformation of quantitative to qualitative values can be done by using the following background knowledge on information granularity: Basic granules: bg1 = bg4 =
f f
g g
high, low ; bg2 = far, close ; bg5 =
Speci c granules:
f f
g f g . . . . . ..
large, smal l ; bg3 =
biggest-cities =
f
kansei-area =
Osaka, Kyoto, Nara, ... ;
g
Tokyo, Osaka ; kanto-area =
f
g
many, little ;
long, short ;
g
f
. . . . . ..
g
Tokyo, Tiba, Saitama, ... ;
That is, ArableLand = 1209, Forest = 5355 and PopulationDensity = 67.8 for Hokkaido are replaced by the granules, \large" and \low", respectively. Furthermore, Tokyo and Osaka are regarded as a neighborhood (i.e. the biggest cities in Japan). Hence, rule2 is generated by using the peculiar data for both Tokyo and Osaka as well as their granules (i.e. \small" for ArableLand and Forest, and \high" for PopulationDensity). 2.2
Finding the Peculiar Data
There are many ways of nding the peculiar data. In this section, we describe an attribute-oriented method. Let X = fx1 ; x2 ; . . . ; xn g be a data set related to an attribute in a relation, and n is the number of dierent values in an attribute. The peculiarity of xi can be evaluated by the Peculiarity Factor, PF (xi ); ( )=
P F xi
Xq n
(
)
N xi ; xj :
(1)
j =1
It evaluates whether xi occurs relatively small number and is very dierent from other data xj by calculating the sum of the square root of the conceptual distance between xi and xj . The reason why the square root is used in Eq. (1) is that we prefer to evaluate more near distances for relatively large number of data so that the peculiar data can be found from relatively small number of data. Major merits of the method are { {
It can handle both the continuous and symbolic attributes based on a uni ed semantic interpretation; Background knowledge represented by binary neighborhoods can be used to evaluate the peculiarity if such background knowledge is provided by a user.
If X is a data set of a continuous attribute and no background knowledge is available, in Eq. (1), ( ) = jxi 0 xj j: (2) Table 2 shows an example for the calculation. On the other hand, if X is a data set of a symbolic attribute and/or the background knowledge for representing the N xi ; xj
140
N. Zhong, Y.Y. Yao, and S. Ohsuga
conceptual distances between xi and xj is provided by a user, the peculiarity factor is calculated by the conceptual distances, N (xi ; xj ): Table 3 shows an example in which the binary neighborhoods shown in Table 4 are used as the background knowledge for representing the conceptual distances of dierent type of restaurants [7, 15]. However, all the conceptual distances are 1, as default, if background knowledge is not available. Table 2.
An example of the peculiarity factor for a continue attribute Region ArableLand 1209
Hokkaido
Table 3.
PF
)
134.1
Tokyo
12
Osaka
18
Yamaguchi
162
60.5
Okinawa
147
59.4
An example of the peculiarity
factor for a symbolic attribute
Restaurant
Type
PF
Wendy
American
2.2
Le Chef
French
Great Wall
Chinese
Kiku
Japanese
1.6
South Sea
Chinese
1.6
)
=
2.6 1.6
=
60.9 60.3
Table 4.
The binary neighborhoods for a symbolic attribute
Type
Type
N
Chinese
Japanese
1
Chinese
American 3
Chinese
French
4
American
French
2
American Japanese French
Japanese
3 3
After the evaluation for the peculiarity, the peculiar data are elicited by using a threshold value,
threshold = mean of P F (xi ) + 2 variance of P F (xi) (3) where can be speci ed by a user. That is, if P F (xi ) is over the threshold value, xi is a peculiar data. Based on the preparation stated above, the process of nding the peculiar data can be outlined as follows:
Calculate the peculiarity factor PF (xi) in Eq. (1) for all values in a data set (i.e. an attribute). Step 2. Calculate the threshold value in Eq. (3) based on the peculiarity factor obtained in Step 1: Step 3. Select the data that is over the threshold value as the peculiar data. Step 4. If current peculiarity level is enough, then goto Step 6: Step 5. Remove the peculiar data from the data set and thus, we get a new data set. Then go back to Step 1: Step 6. Change the granularity of the peculiar data by using background knowledge on information granularity if the background knowledge is available. Step 1.
Peculiarity Oriented Multi−database Mining
141
Furthermore, the process can be done in a parallel-distributed mode for multiple attributes, relations and databases since this is an attribute-oriented nding method.
2.3 Relevance among the Peculiar Data A peculiarity rule is discovered from the peculiar data by searching the relevance among the peculiar data. Let X (x) and Y (y) be the peculiar data found in two attributes X and Y respectively. We deal with the following two cases:
{ If the X (x) and Y (y ) are found in a relation, the relevance between X (x) and Y (y) is evaluated in the following equation:
= P1 (X (x)jY (y))P2 (Y (y)jX (x)): (4) That is, the larger the product of the probabilities of P1 and P2 ; the stronger the relevance between X (x) and Y (y ). { If the X (x) and Y (y) are found in two dierent relations, we need to use a value (or its granule) in a key (or foreign key/link) as the relevance factor, K (k ), to nd the relevance between X (x) and Y (y ). Thus, the relevance between X (x) and Y (y) is evaluated in the following equation: R1
R2
= P1 (K (k )jX (x))P2 (K (k )jY (y)):
(5)
Furthermore, Eq. (4) and Eq. (5) are suitable for handling more than two peculiar data found in more than two attributes if X (x) (or Y (y )) is a granule of the peculiar data. 3
Mining Peculiarity Rules in Multi-Database
Building on the preparatory in Section 2, this section describes a methodology of mining peculiarity rules in multi-database.
3.1 Multi-Database Mining in Dierent Levels Generally speaking, the task of multi-database mining can be divided into two levels: 1. Mining from multiple relations in a database. 2. Mining from multiple databases. First, we need to extend the concept of foreign keys into foreign links because we are also interested in getting to non-key attributes for data mining from multiple relations in a database. A major work is to nd the peculiar data in multiple relations for a given discovery task while foreign link relationships exist. In other words, our task is to select n relations, which contain the peculiar data, among m relations (m n) with foreign links.
142
N. Zhong, Y.Y. Yao, and S. Ohsuga
We again use the Japan-Survey database as an example. There are many relations (tables) in this database such as Japan-Geography, Economy, AlcoholicSales, Crops, Livestock-Poultry, Forestry, Industry, and so on. Table 5 and Table 6 show two of them as examples (Table 1 is another one (Japan-Geography)). The method for selecting n relations among m relations can be brie y described as follows: Table 5. Economy
Region
Hokkaido
PrimaryInd SecondaryInd TertiaryInd . . .
9057
34697
96853
...
Aomori
2597
6693
22722
...
...
...
...
...
...
Tiba
3389
44257
76277
187481
484294
...
839 ...
...
...
397
99482
209492
...
...
...
Tokyo Osaka ... ...
... ... ... ...
Table 6. Alcoholic-Sales
Region
Sake
Hokkaido
42560
Aomori
18527
60425
...
...
...
...
...
Tiba
47753
Tokyo Osaka ... ...
Beer
...
257125 . . .
205168 . . .
150767 838581 ...
...
...
...
100080 577790
... ... ... ...
Focus on a relation as the main table and nd the peculiar data from this table. Then elicit the peculiarity rules from the peculiar data by using the methods stated in Section 2.2 and 2.3. For example, if we select the relation called Japan-Geography shown in Table 1 as the main table, rule1 and rule2 stated in Section 2.1 are a result for the step. Step 2. Find the value(s) of the focused key corresponding to the mined peculiarity rule in Step 1 and change its granularity of the value(s) of the focused key if the background knowledge on information granularity is available. For example, \Tokyo" and \Osaka" that are the values of the key attribute region can be changed into a granule, \biggest cities". Step 3. Find the peculiar data in the other relations (or databases) corresponding to the value (or its granule) of the focused key. Step 4. Select n relations that contain the peculiar data, among m relations (m n). In other words, we just select the relations that contain the peculiar data that are relevant to the peculiarity rules mined from the main table. Step 1.
Peculiarity Oriented Multi−database Mining
143
Here we need to nd the related relations by using foreign keys (or foreign links). For example, since the (foreign) key attribute is Region for the relations in the Japan-Survey database, and the value in the key, Region = Hokkaido, which is related to the mined rule1 ; we search the peculiar data in other relations that are relevant to the mined rule1 by using Region = Hokkaido as a relevance factor. The basic method for searching the peculiar data is similar to the one stated in Section 2.2. However, we just check the peculiarity of the data that are relevant to the value (or its granule) of the focused key in the relations. Furthermore, selecting n relations among m relations can be done in a parallel-distributed cooperative mode. Let \j" denote a relevance among the peculiar data (but not a rule currently, and can be used to induce rules as to be stated in Section 3.2). Thus, we can see that the peculiar data are found in the relations, Crops, Livestock-Poultry, Forestry, Economy, corresponding to the value of the focused key, Region = Hokkaido: In the relation, Crops,
Region(Hokkaido) j (WheatOutput(high) & RiceOutput(high)). In the relation, Livestock-Poultry, Region(Hokkaido) j (MilchCow(many) & MeatBull(many) & MilkOutput(many) & Horse(many)). In the relation, Forestry, Region(Hokkaido) j (TotalOutput(high) & SourceOutput(high)). In the relation, Economy, Region(Hokkaido) j PrimaryIndustry(high).
Hence the relations, Crops, Livestock-Poultry, Forestry, Economy are selected. On the other hand, the peculiar data are also found in the relations, AlcoholicSales and Economy, corresponding to the value of the focused key, Region = biggest-cities: In the relation, Alcoholic-Sales, Region(biggest-cities) j (Sake-sales(high) & RiceOutput(high)). In the relation, Economy, Region(biggest-cities) j TertiaryIndustry(high). Furthermore, the methodology stated above can be extended for mining from multiple databases. For example, if we found that the turnover was a marked drop in some day from a supermarket transaction database, maybe we cannot understand why. However, if we search a weather database, we can nd that there was a violent typhoon this day in which the turnover of the supermarket was a marked drop. Hence, we can discover the reason why the turnover was a marked drop. A challenge in multi-database mining is semantic heterogeneity among multiple databases since no explicit foreign key relationships exist among them usually. Hence, the key issue is how to nd/create the relevance among dierent databases. In our methodology, we use granular computing techniques based on semantics, approximation, and abstraction for solving the issue [7, 18].
144
3.2
N. Zhong, Y.Y. Yao, and S. Ohsuga
Representation and Re-learning
We use the RVER (Reverse Variant Entity-Relationship) model to represent the peculiar data and the conceptual relationships among the peculiar data discovered from multiply relations (databases). Figure 1 shows the general framework of the RVER model. The RVER model can be regarded as a variant of semantic networks that are a kind of well-known method for knowledge representation. From this point of view, multi-database mining can be regarded as a kind of database reverse engineering. Figure 2 shows a result mined from the JapanSurvey database; Figure 3 shows the result mined from two databases on the supermarkets at Yamaguchi prefecture and the weather of Japan. The point of which the RVER model is different from an ordinary ER model is that we just represent the attributes that are relevant to the peculiar data and the related peculiar data (or their granules) in the RVER model. Thus, the RVER model provides all interesting information that is relevant to some focusing (e.g. Region = Hokkaido and Region = biggest-cities in the Japan-Geography database) for learning the advanced rules among multiple relations (databases). Re-learning means learning the advanced rules (e.g., if-then rules and firstorder rules) from the RVER model. For example, the following rules can be learned from the RVER models shown in Figure 2 and Figure 3: rule3 : ArableLand(1arge) & Forest(1arge) -+ PrimaryIndustry(high). r ul e4 : Weat h e r(t yph oo n) + Turno v e r(v e ry-low) .
A peculiarity rule
the focused key value
Fig. 1. The RVER model
Peculiarity Oriented Multi−database Mining
145
ationDensity(low)c eLand(hrge)& Forest(larg
Fig. 2. The RVER model related to Region = Hokkaido
Fig. 3. The RVER model mined from two databases
4
Conclusion
We presented a way of mining peculiarity rules from multiply statistical and transaction databases. The peculiarity rules are defined as a new type of association rules. We described a variant of E R model and semantic networks as a way t o represent peculiar data and their relationship among multiple relations (databases). We can change the granularity of the peculiar data dynamically in the discovery process. Some of databases such as Japan-survey, web-log, weather, supermarket have been tested or have been testing for our approach.
146
N. Zhong, Y.Y. Yao, and S. Ohsuga
Since this project is very new, we just nished the rst step. Our future work includes developing a systematic method to mine the rules from multiply databases where there are no explicitly foreign key (link) relationships, and to induce the advanced rules from the RVER models discovered from multiple databases.
References 1. Agrawal R. et al. \Database Mining: A Performance Perspective", IEEE Trans. Knowl. Data
Eng., 5(6) (1993) 914-925. 2. Agrawal R. et al. \Fast Discovery of Association Rules", Advances in Knowledge Discovery and Data Mining, AAAI Press (1996) 307-328. 3. Aronis, J.M. et al \The WoRLD; Knowledge Discovery from Multiple Distributed Databases", Proc. 10th International Florida AI Research Symposium (FLAIRS-97) (1997) 337-341. 4. Fayyad, U.M., Piatetsky-Shapiro, G et al (eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press (1996). 5. Freitas, A.A. \On Objective Measures of Rule Surprisingness" J. Zytkow and M. Quafafou (eds.) Principles of Data Mining and Knowledge Discovery. Lecture Notes AI 1510, Springer-Verlag (1998) 1-9. 6. Chiang, Roger H.L. et al (eds.) \A Framework for the Design and Evaluation of Reverse Engineering Methods for Relational Databases", Data & Knowledge Engineering, Vol.21 (1997) 57-77. 7. Lin, T.Y. \Granular Computing on Binary Relations 1: Data Mining and Neighborhood Systems ", L. Polkowski and A. Skowron (eds.) Rough Sets in Knowledge Discovery 1, In Studies in Fuzziness and Soft Computing series, Vol. 18, Physica-Verlag (1998) 107-121. 8. Lin, T.Y., Zhong, N., Dong, J., and Ohsuga, S. \Frameworks for Mining Binary Relations in Data", L. Polkowski and A. Skowron (eds.) Rough Sets and Current Trends in Computing, LNAI 1424, Springer-Verlag (1998) 387-393. 9. Liu, B., Hsu W., and Chen, S. \Using General Impressions to Analyze Discovered Classi cation Rules", Proc. Third International Conference on Knowledge Discovery and Data Mining (KDD-97), AAAI Press (1997) 31-36. 10. Liu, H., Lu H., and Yao, J. \Identifying Relevant Databases for Multidatabase Mining", X. Wu et al. (eds.) Research and Development in Knowledge Discovery and Data Mining, Lecture Notes in AI 1394, Springer-Verlag (1998) 210-221. 11. Ribeiro, J.S., Kaufman, K.A., and Kerschberg, L. \Knowledge Discovery from Multiple Databases", Proc First Inter. Conf. on Knowledge Discovery and Data Mining (KDD-95), AAAI Press (1995) 240-245. 12. Silberschatz, A. and Tuzhilin, A. \What Makes Patterns Interesting in Knowledge Discovery Systems", IEEE Trans. Knowl. Data Eng., 8(6) (1996) 970-974. 13. Suzuki E.. \Autonomous Discovery of Reliable Exception Rules", Proc Third Inter. Conf. on Knowledge Discovery and Data Mining (KDD-97), AAAI Press (1997) 259-262. 14. Wrobel, S. \An Algorithm for Multi-relational Discovery of Subgroups", J. Komorowski and J. Zytkow (eds.) Principles of Data Mining and Knowledge Discovery. LNAI 1263, Springer-Verlag (1997) 367-375. 15. Yao, Y.Y. \Granular Computing using Neighborhood Systems", Roy, R., Furuhashi, T., and Chawdhry, P.K. (eds.) Advances in Soft Computing: Engineering Design and Manufacturing, Springer-Verlag (1999) 539-553. 16. Yao, Y.Y. and Zhong, N. \An Analysis of Quantitative Measures Associated with Rules", Zhong, N. and Zhou, L. (eds.) Methodologies for Knowledge Discovery and Data Mining, LNAI 1574, Springer-Verlag (1999) 479-488. 17. Yao, Y.Y. and Zhong, N. \Potential Applications of Granular Computing in Knowledge Discovery and Data Mining", Proc. The 5th.International Conference on Information Systems Analysis and Synthesis (IASA'99), edited in the invited session on Intelligent Data Mining and Knowledge Discovery (1999) (in press). 18. Zadeh, L. A. \Toward a Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic", Fuzzy Sets and Systems, Elsevier Science Publishers, 90 (1997) 111-127. 19. Zhong N. and Yamashita S. \A Way of Multi-Database Mining", Proc. 
the IASTED International Conference on Arti cial Intelligence and Soft Computing (ASC'98), IASTED/ACTA Press (1998) 384-387.
Knowledge Discovery in Medical Multi-databases: A Rough Set Approach Shusaku Tsumoto Department of Medicine Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho Izumo City, Shimane 693-8501 Japan E-mail: [email protected]
Abstract. Since early 1980’s, due to the rapid growth of hospital information systems (HIS), electronic patient records are stored as huge databases at many hospitals. One of the most important problems is that the rules induced from each hospital may be different from those induced from other hospitals, which are very difficult even for medical experts to interpret. In this paper, we introduce rough set based analysis in order to solve this problem. Rough set based analysis interprets the conflicts between rules from the viewpoint of supporting sets, which are closely related with dempster-shafer theory(evidence theory) and outputs interpretation of rules with evidential degree. The proposed method was evaluated on two medical databases, the experimental results of which show that several interesting relations between rules, including interpretation on difference and the solution of conflicts between induced rules, are discovered.
1
Introduction
Since early 1980’s, due to the rapid growth of hospital information systems (HIS), electronic patient records are stored as huge databases at many hospitals. One of the most important problems is that the rules induced from each hospital may be different from those induced from other hospitals, which are very difficult even for medical experts to interpret. In this paper, we introduce rough set based analysis in order to solve this problem. Rough set based analysis interprets the conflicts between rules from the viewpoint of supporting sets, which are closely related with dempster-shafer theory(evidence theory) and outputs interpretation of rules with evidential degree. The proposed method was evaluated on two medical databases, the experimental results of which show that several interesting relations between rules, including interpretation on difference and the solution of conflicts between induced rules, are discovered. The paper is organized as follows: Section 2 will make a brief description about distributed data analysis. Section 3 and 4 discusses the definition of rules and rough set model of distributed data analysis. Section 5 gives experimental results. Section 6 discusses the problems of our work and related work, and finally, Section 7 concludes our paper. ˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 147–155, 1999. c Springer-Verlag Berlin Heidelberg 1999
148
2
S. Tsumoto
Distributed Data Analysis
In distributed rule induction, the following three cases should be considered. (1) One database induces rules, whose attribute-value pairs do not appear in other database(independent type). (2) Rules induced from one database are overlapped with rules induced from other databases(boundary type). (3) Rules induced from one database are described by the subset of attribute-value pairs, which are used in rules induced from other databases(subcategory type). In the first case, it would be very difficult to interpret all the results because each database do no share the regularities with other databases. In the second case, shared information will be much more important than other information. In the third case, subset information will be important. It is notable that this classification on distributed data analysis can be applied to discussion on collaboration between domain experts and rule discovery methods: Empirical studies on medical data mining[2,11] show that medical experts try to interpret unexpected patterns with their domain knowledge, which can be viewed as hypothesis generation. In [2], gender is an attribute unexpected by experts, which led to a new hypothesis that body size will be closely related with complications of angiography. In [11,12], gender and age are unexpected attributes, which triggered reexamination of datasets and generated a hypothesis that immunological factors will be closely related with meningitis. These actions will be summerized into the following three patterns: 1. If induced patterns are completely equivalent to domain knowledge, then the patterns are commonsense. 2. If induced patterns partially overlap with domain knowledge, then the patterns may include unexpected or interesting subpatterns. 3. If induced patterns are completely different from domain knowledge, then the patterns are difficult to interpret. Then, the next step will be validation of a generated hypothesis: a dataset will be collected under the hypothesis in a prospective way. After the data collection, statistical analysis will be applied to detect the significance of this hypothesis. If the hypothesis is confirmed with statistical significance, these results will be reported. Thus, such a kind of interaction between human experts and rule discovery methods can be viewed as distributed data analysis.
3 3.1
Probabilistic Rules Accuracy and Coverage
In the subsequent sections, we adopt the following notations, which is introduced in [8]. Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a, respectively.Then, a decision table is defined as an information
Knowledge Discovery in Medical Multi-databases: A Rough Set Approach
149
system, A = (U, A ∪ {d}). The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va . The set F (B, V ) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For each f ∈ F (B, V ), fA denote the meaning of f in A, i.e., the set of all objects in U with property f , defined inductively as follows. 1. If f is of the form [a = v] then, fA = {s ∈ U |a(s) = v} 2. (f ∧ g)A = fA ∩ gA ; (f ∨ g)A = fA ∨ gA ; (¬f )A = U − fa By the use of this framework, classification accuracy and coverage, or true positive rate is defined as follows. Definition 1. Let R and D denote a formula in F (B, V ) and a set of objects which belong to a decision d. Classification accuracy and coverage(true positive rate) for R → d is defined as: αR (D) =
|RA ∩ D| |RA ∩ D| (= P (D|R)), and κR (D) = (= P (R|D)), |RA | |D|
where |A| denotes the cardinality of a set A, αR (D) denotes a classification accuracy of R as to classification of D, and κR (D) denotes a coverage, or a true positive rate of R to D, respectively. It is notable that these two measures are equal to conditional probabilities: accuracy is a probability of D under the condition of R, coverage is one of R under the condition of D. 3.2
Definition of Rules
By the use of accuracy and coverage, a probabilistic rule is defined as: α,κ
R →d
s.t.
R = ∧j ∨k [aj = vk ], αR (D) ≥ δα , κR (D) ≥ δκ .
This rule is a kind of probabilistic proposition with two statistical measures, which is an extension of Ziarko’s variable precision model(VPRS) [14].1
4
Rough Set Model of Distributed Data Analysis
4.1
Definition of Characterization Set
In order to model these three reasoning types, a statistical measure, coverage κR (D) plays an important role in modeling, which is a conditional probability of a condition (R) under the decision D (P (R|D)). 1
This probabilistic rule is also a kind of Rough Modus Ponens[6].
150
S. Tsumoto
Let us define a characterization set of D, denoted by L(D) as a set, each element of which is an elementary attribute-value pair R with coverage being larger than a given threshold, δκ . That is, Lδκ (D) = {[ai = vj ]|κ[ai =vj ] (D) > δκ }. Then, according to the descriptions in section 2, three types of differences will be defined as below: 1. Independent type: Lδκ (Di ) ∩ Lδκ (Dj ) = φ, 2. Boundary type: Lδκ (Di ) ∩ Lδκ (Dj ) 6= φ, and 3. Subcatgory type: Lδκ (Di ) ⊆ Lδκ (Dj ), where i and j denotes a table i and j. All three definitions correspond to the negative region, boundary region, and positive region[4], respectively, if a set of the whole elementary attribute-value pairs will be taken as the universe of discourse. Thus, here we can apply the technique which is similar to inductiong of decision rules from the partition of equivalence relations. In the cases of boundary and subcategory type, the lower and upper limits of characterization are defined as: Lδk appa (D) = ∩i Lδk appa (Di ) Lδk appa (D) = ∪i Lδk appa (Di ) Concerning independent type, the lower limit is empty: Lκ (D) = and only the upper limit of characterization is defined. The lower limit of characterization is a set whose elements are included in all the databases, which can be viewed as information shared by all the datasets. The upper limit of characterization is a set whose elements are included in at least one database, which can be viwed as possible information shared by datasets. It is notable that the size of those limits is dependent on the choice of the threshold δκ . 4.2
Characterization as Exclusive Rules
Characteristics of characterization set depends on the value of δκ . If the threshold is set to 1.0, then a characterization set is equivalent to a set of attributes in exclusive rules[9]. That is, the meaning of each attribute-value pair in L1.0 (D) covers all the examples of D. Thus, in other words, some examples which do not satisfy any pairs in L1.0 (D) will not belong to a class D. Construction of rules based on L1.0 are discussed in Subsection 4.4, which can also be found in [10,12]. The differences between these two papers are the following: in the former paper, independent type and subcategory type for L1.0 are focused on to represent diagnostic rules and applied to discovery of decision rules in medical databases. On the other hand, in the latter paper, a boundary type for L1.0 is focused on and applied to discovery of plausible rules.
Knowledge Discovery in Medical Multi-databases: A Rough Set Approach
4.3
151
Rough Inclusion
Concerning the boundary type, it is important to consider the similarities between classes. In order to measure the similarity between classes with respect to characterization, we introduce a rough inclusion measure µ, which is defined as follows: T |S T | . µ(S, T ) = |S| It is notable that if S ⊆ T , then µ(S, T ) = 1.0, which shows that this relation extends subset and superset relations. This measure is introduced by Polkowski and Skowron in their study on rough mereology[7], which focuses on set-inclusion to characterize a hierarchical structure based on a relation between a subset and superset. Thus, application of rough inclusion to capturing the relations between classes is equivalent to constructing rough hierarchical structure between classes, which is also closely related with information granulation proposed by Zadeh[13].
5
An Algorithm for Analysis
An algorithms for searching for the lower and upper limit of characterization and induction of rules based on these limits are given in Fig. 1 and Fig. 2. Since subcategory type and independent type can be viewed as special types of boundary type with respect to rough inclusion, rule induction algorithms for subcategory type and independent type are given if the thresholds for µ are set up to 1.0 and 0.0, respectively. Rule discovery(Fig 1.) consists of the following three procedures. First, the characterization of each given class is extracted from each database and the lower and upper limit of characterization is calculated. Second, from these limits, rule induction method(Fig.2) will be applied. Finally, all the characteriztion are classified into several groups with respect to rough inclusion and the degree of similarity will be output.
6 6.1
Experimental Results Applied Datasets
For experimental evaluation, a new system, called PRIMEROSE-REX5 (Probabilistic Rule Induction Method for Rules of Expert System ver 5.0), is developed with the algorithms discussed above. PRIMEROSE-REX5 was applied to the following three medical domains, whose information is shown in Table 1.
152
S. Tsumoto
procedure Rule Discovery (T otal P rocess); var i : integer; M, L, R : List; LD : List; /* A list of all databases */ begin Calculate αR (Di ) and κR (Di ) for each elementary relation R and each class Di ; (i: A dataset i) Make a list L(Di ) = {R|κR (D) ≥ δκ }) for Di ; Calculate L(D) = ∩L(Di ) and overlineL(D) = ∪L(Dj ). Apply Rule Induction methods for L(D) and L(D). while (LD 6= φ) do begin i := f irst(LD ); M := LD − {i}; while (M 6= φ) do begin j := f irst(M ); if (µ(L(Dj ), L(Di )) ≥ δµ ) then L2 (D) := L2 (D) + {(i, j, δµ )}; M := M − Dj ; end Store L2 (D) as a similarity of dataset with respect to δµ LD := LD − i; end end {Rule Discovery }; Fig. 1. An Algorithm for Rule Discovery Table 1. Databases Domain Tables Samples Classes Attributes Headache 10 52119 45 147 CVD 4 7620 22 285 Meningitis 5 1211 4 41
6.2
Discovery in Experiments
Characterization of Headache. Although all the rules from the lower and upper limit were not interesting for domain experts, several interesting and unexpected relations on the degree of similarity were found in characterization sets. Ten hospitals are grouped in three groups. Table 2 shows several information about these groups, each differentiated factor of which are regions. The first group is mainly located on the countryside, most of the people are farmers. The second one is mainly located in the housing area. Finally, the third group is in the business area. Those groups included several interesting features for differential diagnosis of headache. In the first group, hypertension was one of the most important attributes for differential diagnosis. In the housing area, the nature of headache
procedure Induction of Classification Rules;
  var i : integer;  M, Li : List;
begin
  L1 := Ler;  /* Ler: list of elementary relations (from L(D) or L̄(D)) */
  i := 1;  M := {};
  for i := 1 to n do  /* n: total number of attributes */
  begin
    while (Li ≠ {}) do
    begin
      Select one pair R = ∧[ai = vj] from Li;
      Li := Li − {R};
      if (αR(D) ≥ δα) and (κR(D) ≥ δκ)
        then Sir := Sir + {R}  /* include R as an inclusive rule */
        else M := M + {R};
    end
    Li+1 := (a list of all conjunctions of the formulae in M);
  end
end {Induction of Classification Rules};

Fig. 2. An Algorithm for Classification Rules
was important for differential diagnosis. Finally, in the business area, the location of headache was important. According to the domain experts' comments, these attributes are closely related to working environments. This analysis suggests that the differences between the upper and lower limits also carry information that can lead to knowledge discovery.

Table 2. Characterization in Headache

      Location     Important Features in Upper Limit
G1    Countryside  Hypertension = yes
G2    Housing      Nature = chronic, acute
G3    Business     Location = neck, occipital
Rules of CVD. Concerning the CVD database, several interesting rules were derived from both the lower limit and the upper limit. The most interesting results from the lower limit are the following rules for thalamic hemorrhage:

[Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC : positive] → Thalamus
¬[Risk : Hypertension] ∧ ¬[Sensory = no] → ¬Thalamus
Interestingly, LOC (loss of consciousness) under the condition [Sex = Female] ∧ [Hemiparesis = Left] is an important factor in diagnosing thalamic damage. In this domain, strong correlations between these attributes and others, such as those found in the meningitis database, have not yet been identified. It will be our future work to find what factors lie behind these rules.

Rules of Meningitis. In the domain of meningitis, the following rules, which medical experts did not expect, were obtained from the lower limit of characterization:

[WBC < 12000] ∧ [Sex = Female] ∧ [Age < 40] ∧ [CSF CELL < 1000] → Virus
[Age ≥ 40] ∧ [WBC ≥ 8000] ∧ [Sex = Male] ∧ [CSF CELL ≥ 1000] → Bacteria

The most interesting point is that these rules contain information about age and sex, attributes that often seem unimportant for differential diagnosis. The first discovery is that women suffer from bacterial infection less often than men; such a relationship between sex and meningitis had not been discussed in the medical literature [1]. Examining the meningitis database closely, it was found that most of the above patients suffer from chronic diseases, such as DM, LC, and sinusitis, which are risk factors for bacterial meningitis. The second discovery is that [age < 40] is also an important factor against suspecting viral meningitis, which matches the fact that most elderly people suffer from chronic diseases. These results were also re-evaluated in medical practice. Recently, the above two rules were checked against an additional 21 cases of meningitis (15 viral and 6 bacterial) collected at a hospital different from those where the datasets were gathered. Surprisingly, the rules misclassified only three cases (two viral and one bacterial); that is, the total accuracy is 18/21 = 85.7%, and the accuracies for viral and bacterial meningitis are 13/15 = 86.7% and 5/6 = 83.3%, respectively. The reasons for misclassification are the following: the misclassified bacterial case is a patient with severe immunodeficiency, although he is very young; the two misclassified viral cases are patients who also suffered from herpes zoster. It is notable that even these misclassified cases can be explained from the viewpoint of immunodeficiency; that is, it is confirmed that immunodeficiency is a key concept for meningitis. The validation of these rules is still ongoing and will be reported in the near future.
7
Discussion: Conflict Analysis
The relations of independent type and subcategory type are easy to interpret. While the independent type suggests different disease mechanisms, the subcategory type
suggests the same etiology. The difficult case is the boundary type, where several symptoms overlap in each Lδκ(D). In this case, the relations between Lδκ(Di) and Lδκ(Dj) should be examined. One approach to these complicated relations is conflict analysis [5]. In this analysis, several concepts that share attribute-value pairs are analyzed with respect to a qualitative similarity measure that can be viewed as an extension of rough inclusion. Introducing this methodology to analyze boundary-type relations and developing induction algorithms for them is left for future work.
References
1. Adams, R.D. and Victor, M.: Principles of Neurology, 5th edition. McGraw-Hill, New York, 1993.
2. Harris, J.M.: Coronary Angiography and Its Complications - The Search for Risk Factors. Archives of Internal Medicine, 144, 337-341, 1984.
3. Lin, T.Y.: Fuzzy Partitions: Rough Set Theory. In: Proceedings of the Seventh International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'98), Paris, pp. 1167-1174, 1998.
4. Pawlak, Z.: Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.
5. Pawlak, Z.: Conflict analysis. In: Proceedings of the Fifth European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), pp. 1589-1591, Verlag Mainz, Aachen, 1997.
6. Pawlak, Z.: Rough Modus Ponens. Proceedings of IPMU'98, Paris, 1998.
7. Polkowski, L. and Skowron, A.: Rough mereology: a new paradigm for approximate reasoning. Intern. J. Approx. Reasoning 15, 333-365, 1996.
8. Skowron, A. and Grzymala-Busse, J.: From rough set theory to evidence theory. In: Yager, R., Fedrizzi, M. and Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193-236, John Wiley & Sons, New York, 1994.
9. Tsumoto, S.: Automated Induction of Medical Expert System Rules from Clinical Databases based on Rough Set Theory. Information Sciences 112, 67-84, 1998.
10. Tsumoto, S.: Extraction of Experts' Decision Rules from Clinical Databases using Rough Set Model. Journal of Intelligent Data Analysis, 2(3), 1998.
11. Tsumoto, S., Ziarko, W., Shan, N., Tanaka, H.: Knowledge Discovery in Clinical Databases based on Variable Precision Rough Set Model. In: Proceedings of the Eighteenth Annual Symposium on Computer Applications in Medical Care, Journal of the American Medical Informatics Association 2, supplement, pp. 270-274, 1995.
12. Tsumoto, S.: Knowledge Discovery in Clinical Databases – An Experiment with Rule Induction and Statistics. In: Ras, Z. (ed.) Proceedings of the Eleventh International Symposium on Methodologies for Intelligent Systems (ISMIS'99), Springer Verlag, 1999 (in press).
13. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, 111-127, 1997.
14. Ziarko, W.: Variable Precision Rough Set Model. Journal of Computer and System Sciences 46, 39-59, 1993.
Automated Discovery of Rules and Exceptions from Distributed Databases Using Aggregates

Rónán Páircéir, Sally McClean and Bryan Scotney
School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland.
{r.pairceir, si.mcclean, bw.scotney}@ulst.ac.uk
Abstract. Large amounts of data pose special problems for Knowledge Discovery in Databases. More efficient means are required to ease this problem, and one possibility is the use of sufficient statistics or "aggregates", rather than low-level data. This is especially true for Knowledge Discovery from distributed databases. The data of interest is of a similar type to that found in OLAP data cubes and the Data Warehouse. This data is numerical and is described in terms of a number of categorical attributes (Dimensions). Few algorithms to date carry out knowledge discovery on such data. Using aggregate data and accompanying meta-data returned from a number of distributed databases, we use statistical models to identify and highlight relationships between a single numerical attribute and a number of Dimensions. These are initially presented to the user via graphical interactive middleware, which allows drilling down to a more detailed level. On the basis of these relationships, we induce rules in conjunctive normal form. Finally, exceptions to these rules are discovered.
1
Introduction
The evolution of database technology has resulted in the development of efficient tools for manipulating and integrating data. Frequently these data are distributed on different computing systems at various sites. Distributed Database Management Systems provide a superstructure which integrates either homogeneous or heterogeneous DBMS [1]. In recent years, there has been a convergence between Database Technology and Statistics, partly through the emerging field of Knowledge Discovery in Databases. In Europe this development has been particularly encouraged by the EU Framework IV initiative, with DOSIS projects IDARESA [2] and ADDSIA [3], which retrieve aggregate data from distributed statistical databases via the internet. In order to alleviate some of the problems associated with mining large sets of low-level data, one option is to use a set of sufficient statistics in place of the data itself [4]. In this paper we show how the same results can be obtained by replacing the low-level data with our aggregate data. This is especially important in the distributed database situation, where issues associated with slow data transfer and privacy may preclude the transfer of the low-level data [5]. The type of data we deal with here is
very similar to the multidimensional data stored in the Data Warehouse (DW) [6, 7]. These data consist of two attribute-value types: Measures, or numerical data, and Dimensions, or categorical data. Some of the Dimensions may have an associated hierarchy to specify grouping levels. This paper deals with such data in statistical databases, but the approach should be easily adapted to a distributed DW implementation [8]. In our statistical databases, aggregate data is stored in the form of Tandem Objects [9], consisting of two parts: a macro relation and its corresponding meta relations (containing statistical metadata for tasks such as attribute-value re-classification and currency conversion). Using this aggregate data, it is possible, with models taken from the field of statistics, to study the relation between a response attribute and one or more explanatory attributes. We use Analysis of Variance (ANOVA) models [10] to discover rules and exceptions from aggregate data retrieved from a number of distributed statistical databases.
Paper Layout. Section 2 contains an extended example. Section 3 shows how the data are retrieved and integrated for final use. The statistical modelling and computation are discussed in Section 4, along with the method of displaying the resulting discovered knowledge. Section 5 concludes with a summary and possibilities for further work.
2
An Extended Example
Within our statistical database implementation, the user selects a single Measure and a number of Dimensions from a subject domain for inclusion in the modelling process. The user may restrict the attribute values from any attribute domain, for example, GENDER= Male. In this example the Measure selected is COST (of Insurance Claim) and the Dimensions of interest are COUNTRY {Ireland, England, France}, REGION {City, County}, GENDER {Male, Female} and CAR-CLASS {A, B, C}. A separate distributed database exists for each country. Once the Measure and Dimensions have been entered, the query is sent to the domain server, where it is decomposed in order to retrieve the aggregate data from the distributed databases. As part of the IDARESA project [2], operators have been developed to create, retrieve and harmonise the aggregate data in the Tandem Objects (See Section 3). The Macro relation in the Tandem Object consists of the Dimensions and the single Measure (in this case COST), which is summarised within the numerical attributes N, S and SS. S contains the sum of COST values aggregated over the Dimension set, SS is the equivalent for sums of squares of COST values and N is the count of low level tuples involved in the aggregate. Once the retrieved data have been returned to the domain server and integrated into one Macro relation, the final operation on the data before the statistical analysis is the DATA CUBE operator [11]. Some example tuples from the final Macro relation are shown in Table 2.1.
COST_{ijkln} = µ + G_i + P_j + C_k + R(C)_{l(k)} + GP_{ij} + GC_{ik} + GR(C)_{il(k)} + PC_{jk} + PR(C)_{jl(k)} + GPC_{ijk} + ε_{ijkln}      (1)
Table 2.1. Example tuples from Final Macro relation

COUNTRY  REGION  GENDER  CAR-CLASS  COST_N  COST_S  COST_SS
Ireland  City    Male    A           12000   0.730    43.21
England  County  Female  B           10000   0.517    25.08
All      All     Male    A           72000   4.320   261.23
Ireland  City    All     All         54000   2.850   161.41
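The summary attributes N, S and SS in such a macro relation can be produced by a straightforward grouped aggregation over the low-level records. The sketch below is illustrative only (the record layout and field names are assumptions, not taken from the paper):

from collections import defaultdict

def build_macro(records, dims=("COUNTRY", "REGION", "GENDER", "CAR_CLASS"), measure="COST"):
    # accumulate count (N), sum (S) and sum of squares (SS) per Dimension set
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for rec in records:
        key = tuple(rec[d] for d in dims)
        cell = acc[key]
        cell[0] += 1
        cell[1] += rec[measure]
        cell[2] += rec[measure] ** 2
    return {k: {"N": n, "S": s, "SS": ss} for k, (n, s, ss) in acc.items()}

rows = [{"COUNTRY": "Ireland", "REGION": "City", "GENDER": "Male", "CAR_CLASS": "A", "COST": 61.0}]
print(build_macro(rows))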
The relevant Meta-data retrieved indicates that all the Dimensions are fixed variables for the statistical model, and that a hierarchy exists from REGION → COUNTRY. This information is required to automatically fit the correct ANOVA model. For our illustrative example, the model is shown above in (1).
[Fig. 2.1. Significant Effects graph for the Insurance example: a bar chart (scale 0 to 0.05) whose legend lists the effects Gender, Country, Region(Country), Car-class, Gender/Country, Gender/Car-class, Gender/Region(Country), Car-class/Country and Car-class/Region(Country)]

Once the model parameters have been calculated and validated for appropriateness, the results are presented to the user. The first step involves a graph showing attribute-level relationships between the Dimensions and the COST Measure. These relationships (also known as effects) are presented in terms of main Dimension effects and two- and three-way interaction effects. Only those relationships (effects) that are statistically significant are shown in the graph, with the height of each bar representing the significance of the corresponding effect. The legend contains an entry for all effects, so that the user may drill down on any one desired. In the Insurance example, GENDER, COUNTRY and REGION within COUNTRY each show a statistically significant relationship with COST, as can be seen from the Significant Effects graph in Figure 2.1. None of the three-way effects (e.g. GENDER/REGION(COUNTRY)) have a statistically significant relationship with the COST Measure. The user can interact with this graphical representation. By clicking on a particular bar or effect in the legend of the graph, the user can view a breakdown of COST values for that effect, either in a table or a graphical format. This illustrates to the
user, at a more detailed level, the relationship between an attribute's domain values and the COST Measure. These are conveyed in terms of deviations from the overall mean, in descending order. In this way, the user guides which details to look at, from a high-level attribute view down to lower, more detailed levels. A graph of the breakdown of attribute values for GENDER is shown in Figure 2.2. From this it can be seen that there is a large difference between COST claims for Males and Females.

[Fig. 2.2. Deviations from mean for GENDER values: the breakdown for attribute GENDER shows Male +7.59 and Female -5.15 as deviations from the overall mean of 51.34]

On the basis of these relationships, rules in conjunctive normal form (CNF) are constructed. The rules involving GENDER are shown in (2) and (3) below. Based on the records in the databases, we can say statistically, at a 95% level of confidence, that the true COST lies within the values shown in the rule consequent.

GENDER{Male}   → COST between {57.63} and {60.23}   (2)
GENDER{Female} → COST between {44.63} and {47.75}   (3)
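The confidence interval attached to such a rule consequent can be derived from the stored N, S and SS summaries alone. The following is a minimal sketch under a normal approximation (the input numbers are toy values, not those of the example, and the 1.96 critical value is illustrative):

import math

def rule_interval(n, s, ss, z=1.96):
    # mean and sample variance recovered from count, sum and sum of squares
    mean = s / n
    var = (ss - s * s / n) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

low, high = rule_interval(n=100, s=5800.0, ss=340000.0)
print(round(low, 2), round(high, 2))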
The final step involves presenting to the user any attribute-value combinations at aggregate levels which deviate from the high-level rules discovered. For example, a group of 9,000 people represented by the following conjunction of attribute values (4) represents an exception to the high-level rules:

COUNTRY{Ireland} ∧ GENDER{Female} ∧ REGION{City}
   → ACTUAL VALUE:   COST between {50.12} and {57.24}
   → EXPECTED VALUE: COST between {41.00} and {48.12}   (4)
This can be seen to be an exception, as the corresponding Expected and Actual COST ranges do not overlap. The information in this exception rule may be of interest, for example, in setting insurance prices for females. Before making any decisions, this exception should be investigated in detail. We find such exceptions at aggregate levels only. It is not possible at this stage to study exceptions for low-level values, as
these are resident at the different distributed databases, and in many situations privacy issues prevent analysis at this level in any case.
3
Aggregate Data Retrieval and Integration
The data at any one site may consist of low-level "micro" data and/or aggregate "macro" data, along with accompanying statistical metadata (required, for example, for harmonisation of the data at the domain server). This view of micro and macro data is similar to the base data and materialised views held in the Data Warehouse [7]. In addition, textual (passive) metadata for use in documentation are held in an object-oriented database. An integrated relational strategy for micro and macro data is provided by the MIMAD model [9], which is used in our implementation. To retrieve aggregate data from the distributed data sites, IDARESA has developed a complete set of operators to work with Tandem Objects [2]. Within a Tandem Object, a macro relation R describes a set of macro objects (statistical tables) where C1, ..., Cn represent n Dimensions and S1, ..., Sm are m summary attributes (N, S and SS) which summarise an underlying Measure. The IDARESA operators are implemented using SQL which operates simultaneously on a Macro relation and on its accompanying meta relations. In this way, whenever a macro relation is altered by an operator, the accompanying meta relations are always adjusted appropriately. The summary attributes in the macro relation form a set of "sufficient statistics" in the form of count (N), sum (S) and sums of squares (SS) for the desired aggregate function. An important concept is the additive property of these summary attributes [9], defined as follows:

σ(α UNION β) = σ(α) + σ(β)      (5)
where α and β are macro relations which are macro compatible, and σ() is an application of a summary attribute function (e.g. SUM) over the Measure in α and β. Using the three summary attributes, it is possible to compute a large number of statistical procedures [9], including ANOVA models. Thus it is possible to combine aggregates over these summary statistics at a central site for our knowledge discovery purposes. The user query is decomposed by a Query Agent which sends out Tandem Object requests to the relevant distributed sites. If the data at a site is in the micro data format, an IDARESA operator called MIMAC (Micro to Macro Create) is used to construct a Tandem Object with the required Measure and Dimensions, along with accompanying meta relations. If the data are already in a macro data format, the IDARESA operators TAP (Tandem Project) and TASEL (Tandem Select) are used to obtain the required Tandem Object. Once this initial Tandem Object has been created at each site, the operators TAREC (Tandem Reclassify) and TACO (Tandem Convert) may be applied to the macro relations using information in the meta relations. TAREC can be used in two ways: the first is in translating attribute domain values to a single common language for all distributed macro relations (e.g. changing French words for male and female in the GENDER attribute to English); the second use is in reclassifying
a Dimension's domain values so that all the macro relations contain attributes with the same domain set (e.g. the French database might classify Employed as "Part-time" and "Full-time" separately; these need to be reclassified and aggregated to the value "Employed", which is the classification used by the other countries involved in the query). The operator TACO is used to convert the Measure summary attributes to a common scale for all sites (e.g. converting COST from the local currency to ECU for each site using conversion information in the meta relations). The final harmonised Tandem Object from each site is communicated to the Domain Server. The Macro relations are now macro compatible [2] and can therefore be integrated into a single aggregate macro relation using the TANINT (Tandem Integration) operator. The meta relations are also integrated accordingly. The final task is to apply the DATA CUBE operator [11] to the Macro relation. The data is now in a suitable format for the statistical modelling.
3.1
Implementation Issues
In our prototype the micro data and Tandem Objects are stored in MS SQL Server. Access to remote distributed servers is achieved via the Internet in a Java environment. A well-established three-tier architecture has been adopted for the design. The logical structure consists of a front-end tier (the client), a back-end tier (the server), and middleware which maintains communication between the client and the server. The distributed-computing middleware capability called remote method invocation (RMI) is used here. A query is transformed into a series of nested IDARESA operators and passed to the Query Agent for assembly into SQL and execution.
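The additive property (5) is what makes the integration step cheap: site-level summaries can simply be added before any statistics are computed. A minimal sketch of this merging (dictionary-based, with illustrative key and field names; not the TANINT implementation itself):

def merge_macros(site_macros):
    # site_macros: one {dimension_key: {"N": ..., "S": ..., "SS": ...}} relation per site
    merged = {}
    for macro in site_macros:
        for key, cell in macro.items():
            tot = merged.setdefault(key, {"N": 0, "S": 0.0, "SS": 0.0})
            for field in ("N", "S", "SS"):
                tot[field] += cell[field]   # sigma(a UNION b) = sigma(a) + sigma(b)
    return merged

def mean_and_variance(cell):
    n, s, ss = cell["N"], cell["S"], cell["SS"]
    mean = s / n
    var = (ss - s * s / n) / (n - 1) if n > 1 else 0.0
    return mean, var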
4
Statistical Modelling and Results Display
ANOVA models [10] are versatile statistical tools for studying the relation between a numerical attribute and a number of explanatory attributes. Two factors have enabled us to construct these models from our distributed data. The first is the fact that we can combine distributed primitive summary attributes (N, S and SS) from each distributed database seamlessly using the MIMAD model and IDARESA operators described in Section 3. The second factor is that it is possible to use these attributes to compute the coefficients of an ANOVA model in a computationally efficient way. The ANOVA model coefficients also enable us to identify exceptions in the aggregate data. The term "exception" here is defined as an aggregate Measure value which differs in a statistically significant manner from its expected value calculated from the model. While it is not the focus of this paper to detail the ANOVA computations, a brief description follows. The simplest example of an ANOVA model is shown in equation (6); it is similar to the model in equation (1), which contains more Dimensions and a hierarchy between the Dimensions COUNTRY and REGION. In equation (6), Measure_{ijk} represents a numerical Measure value corresponding to value i of Dimension A and value j of Dimension B; k indexes the k-th example or replicate for this Dimension set. The µ term in the model represents the overall aver-
age or mean value for the Measure. The A and B single Dimension terms are used in the model to see if these Dimensions have a relationship (Main effect) with the Measure. The (AB) term, representing a 2-way interaction effect between Dimensions A and B, is used to see if there is a relationship between the Measure and values for Dimension A, which hold only when Dimension B has a certain value. The final term in the model is an error term, which is used to see if any relationships are real in a statistically significant way.
Measure_{ijk} = µ + A_i + B_j + (AB)_{ij} + ε_{ijk}      (6)
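For a balanced layout, the terms in (6) can be estimated directly from cell means, all of which are available from the merged N/S/SS cells. The sketch below is an illustration only (it assumes one aggregate cell per (i, j) combination and is not the authors' implementation):

def fit_two_way(cells):
    # cells: {(i, j): mean_measure} for a balanced two-factor design
    grand = sum(cells.values()) / len(cells)
    a_levels = sorted({i for i, _ in cells})
    b_levels = sorted({j for _, j in cells})
    a_eff = {i: sum(cells[(i, j)] for j in b_levels) / len(b_levels) - grand for i in a_levels}
    b_eff = {j: sum(cells[(i, j)] for i in a_levels) / len(a_levels) - grand for j in b_levels}
    ab_eff = {(i, j): cells[(i, j)] - grand - a_eff[i] - b_eff[j] for (i, j) in cells}
    return grand, a_eff, b_eff, ab_eff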
In order to discover exceptions at aggregate levels, the expected value for a particular Measure value, as calculated from the model, is subtracted from the actual Measure value. If the difference is statistically significant in terms of the model error, this value is deemed to be an exception. When calculating an expected value for Measure_{ijk}, the model reduces nicely to the average of the k values where A = i and B = j, saving considerable computing time in the calculation of exceptions. The model reduces similarly when calculating an exception at any aggregate level (e.g. the expected Measure value for the aggregate GENDER{Male} and COUNTRY{Ireland} is simply the average over all tuples with these attribute values). It is important to note that if an interaction effect (e.g. AB) is deemed to be statistically significant, then the main effects involved in this interaction effect (A and B) are disregarded and all the focus centers on the interaction effect. In such a situation, when effects are converted to CNF rules, main effects involved in a significant interaction effect are not shown. In our ANOVA model implementations, we do not model higher than 3-way interaction effects, as these are seldom if ever significant [10].
4.1 Presentation of Results
The first step in the results presentation is at the attribute level, based on the statistically significant Main and Interaction Effects. Statistical packages present ANOVA results in a complicated table suitable for statisticians. Our approach summarises the main details of this output in a format more suited to a user not overly familiar with statistical modelling and analysis. We present the statistically significant effects in an interactive graphical way, as shown in Figure 2.1. The scale of the graph is the probability that an effect is real. Only those effects significant above a 95% statistical level are shown. The more significant an effect, the stronger the relationship between the Dimensions in the effect and the Measure. As a drill-down step from the attribute level, the user can interact with the graph to obtain a breakdown of Measure value means for any effect. This allows the user to understand an effect's relationship with the Measure in greater detail. The user can view this breakdown either graphically, as shown in Figure 2.2, or in a table format. The breakdown consists of the mean Measure deviation values from the overall Measure mean, for the corresponding effect's Dimension values (e.g. Figure 2.2 shows that the mean COST for GENDER{Male} deviates from the overall mean of 51.34 by +7.59 units). Showing the breakdown as deviations from the overall mean
facilitates easy comparison of the different Measure means. The significant effects are next converted into a set of rules in conjunctive normal form, with an associated range within which we can statistically state that the true Measure value lies. This range is based on a statistical confidence interval. This set of rules in CNF summarises the knowledge discovered using the ANOVA analysis. The final pieces of knowledge which are automatically presented to the user are the exceptions to the discovered rules. These are Measure values corresponding to all the different Dimension sets at the aggregate level which differ in a statistically significant way from their expected ANOVA model values. An example of an exception is (4) in Section 2. These are also presented in CNF, with their expected and actual range values. It is also important to a user interested in finding exceptions to know in what way they are exceptions. This is possible through an examination of the rules which are relevant to the exception. For the example in Section 2, assume that (2) and (3) are the only significant rules induced. In order to see why (4) is an exception, we look at rules which are related to it. We define a rule and an exception to be related if the rule antecedent is nested within the exception antecedent. In this case the antecedent in rule (3), GENDER{Female}, is nested in the exception (4). Comparing the Measure value range for the rule {44.63 - 47.75} with that of the exception {50.12 - 57.24}, it can be seen that they do not overlap. Therefore it can be stated in this simple illustration that GENDER is in some sense a cause of exception (4). This conveys more knowledge to the user about the exception. Further work is required on this last concept to automate the process in some suitable way.
4.2
Related Work
In the area of supervised learning, a lot of research has been carried out on the discovery of rules in CNF, and some work is proceeding on the discovery of exceptions and deviations for this type of data [13, 14]. A lot less work in the knowledge discovery area has been carried out in relation to a numerical attribute described in terms of categorical attributes. Some closely related research involves a paper on exploring exceptions in OLAP data cubes [14]. The authors there use an ANOVA model to enable a user to navigate through exceptions using an OLAP tool, highlighting drilldown options which contain interesting exceptions. Their work bears similarity only to the exception part of our results presentation, whereas we present exceptions to our rules at aggregate levels in CNF. Some work on knowledge discovery in distributed databases has been carried out in [5, 15].
5 Summary and Further Work
Using aggregate data and accompanying meta-data returned from a number of distributed databases, we used ANOVA models to identify and highlight relationships between a single numerical attribute and a number of Dimensions. On the basis of these relationships, which are presented to the user in a graphical fashion, rules were induced in conjunctive normal form and exceptions to these rules were discovered.
Further work can be carried out on the application of aggregate data to other knowledge discovery techniques in the distributed setting, on the conversion of our rules into linguistic summaries of the relationships and exceptions, and on the investigation of models which include a mix of Measures and Dimensions.
References
1. Bell, D., Grimson, J.: Distributed Database Systems. Addison-Wesley, Wokingham (1992)
2. McClean, S., Grossman, W. and Froeschl, K.: Towards Metadata-Guided Distributed Statistical Processing. NTTS'98, Sorrento, Italy (1998): 327-332
3. Lamb, J., Hewer, A., Karali, I., Kurki-Suonio, M., Murtagh, F., Scotney, B., Smart, C., Pragash, K.: The ADDSIA (Access to Distributed Databases for Statistical Information and Analysis) Project. DOSIS project paper 1, NTTS-98, Sorrento, Italy, 1-20 (1998)
4. Graefe, G., Fayyad, U., Chaudhuri, S.: On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. KDD (1998): 204-208
5. Aronis, J., Kolluri, V., Provost, F., and Buchanan, B.: The WoRLD: Knowledge Discovery from Multiple Distributed Databases. In Proc. FLAIRS'97 (1997)
6. Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997)
7. Shoshani, A.: OLAP and Statistical Databases: Similarities and Differences. PODS 97: 185-196 (1997)
8. Albrecht, J. and Lehner, W.: On-Line Analytical Processing in Distributed Data Warehouses. International Database Engineering and Applications Symposium (IDEAS'98), Cardiff, Wales, U.K. (1998)
9. Sadreddini, M.H., Bell, D., and McClean, S.I.: A Model for Integration of Raw Data and Aggregate Views in Heterogeneous Statistical Databases. Database Technology 4(2), 115-127 (1991)
10. Neter, J.: Applied Linear Statistical Models, 3rd ed. Irwin, Chicago, Ill.; London (1996)
11. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996: 152-159 (1996)
12. Liu, H., Lu, H., Feng, L. and Hussain, F.: Efficient Search of Reliable Exceptions. PAKDD 99, Beijing, China (1999)
13. Arning, A., Agrawal, R. and Raghavan, P.: A Linear Method for Deviation Detection in Large Databases. KDD, Portland, Oregon, USA (1996)
14. Sarawagi, S., Agrawal, R., Megiddo, N.: Discovery-Driven Exploration of OLAP Data Cubes. EDBT 98: 168-182 (1998)
15. Ras, Z., Zytkow, J.: Discovery of Equations and the Shared Operational Semantics in Distributed Autonomous Databases. PAKDD99, Beijing, China (1999)
Text Mining via Information Extraction

Ronen Feldman, Yonatan Aumann, Moshe Fresko, Orly Liphstat, Binyamin Rosenfeld, Yonatan Schler
Department of Mathematics and Computer Science, Bar-Ilan University, Ramat-Gan, ISRAEL
Tel: 972-3-5326611, Fax: 972-3-5326612
[email protected]

Abstract. Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining via information extraction, in which knowledge discovery takes place on a more focused collection of events and phrases that are extracted from and label each document. These events plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This approach was implemented in the Textoscope system. Textoscope consists of a document retrieval module which converts retrieved documents from their native formats into SGML documents used by Textoscope; an information extraction engine, which is based on a powerful attribute grammar augmented by rich background knowledge; a taxonomy-creation tool by which the user can help specify higher-level entities that inform the knowledge-discovery process; and a set of knowledge-discovery tools for the resulting event-labeled documents. We evaluate our approach on a collection of newswire stories extracted by Textoscope's own agent. Our results confirm that text mining via information extraction serves as an accurate and powerful technique by which to manage knowledge encapsulated in large document collections.
1 Introduction
Traditional databases store information in the form of structured records and provide methods for querying them to obtain all records whose content satisfies the user's query. More recently, however, researchers in Knowledge Discovery in Databases (KDD) have provided a new family of tools for accessing information in databases. The goal of such work, often called data mining, has been defined as
“the nontrivial extraction of implicit, previously unknown, and potentially useful information from given data.” Work in this area includes applying machine-learning and statistical-analysis techniques towards the automatic discovery of patterns in databases, as well as providing user-guided environments for exploration of data. Most efforts in KDD have focused on data mining from structured databases, despite the tremendous amount of online information that appears only in collections of unstructured text. This paper focuses on the problem of text mining, performing knowledge discovery from collections of unstructured text. One common technique [3,4,5] has been to assume that associated with each document is a set of labels and then perform knowledge-discovery operations on the labels of each document. The most common version of this approach has been to assume that labels correspond to keywords, each of which indicates that a given document is about the topic associated with that keyword. However, to be effective, this requires either: manual labeling of documents, which is infeasible for large collections; hand-coded rules for recognizing when a label applies to a document, which is difficult for a human to specify accurately and must be repeated anew for every new keyword; or automated approaches that learn from labeled documents rules for labeling future documents, for which the state of the art can guarantee only limited accuracy and which also must be repeated anew for every new keyword. A second approach has been to assume that a document is labeled with each of the words that occurs within it. However, as was shown by Rajman and Besançon [6] and is further supported by the results presented here, the results of the mining process are often rediscoveries of compound nouns (such as that "Wall" and "Street" or that "Ronald" and "Reagan" often co-occur) or of patterns that are at too low a level (such as that "shares" and "securities" co-occur). In this paper we instead present a middle ground, in which we perform information extraction on each document to find events and entities that are likely to have meaning in the domain, and then perform mining on the extracted events labeling each document. Unlike word-based approaches, the extracted events are fewer in number and tend to represent more meaningful concepts and relationships in the domain of the document. A possible event can be that a company entered a joint venture with a group of companies or that a person took a position at a company. Unlike keyword approaches, our information-extraction method eliminates much of the difficulty in labeling documents when faced with a new collection or new keywords. While we rely on a generic capability of recognizing proper names which is mostly domain-independent, when the system is to be used in new domains some work is needed for defining additional event schemas. Textoscope provides a complete editing/compiling/debugging environment for defining the new event schemas. This environment enables easy creation and manipulation of information extraction rules. This paper describes Textoscope, a system that embodies this approach to text mining via information extraction. The overall structure of Textoscope is shown in Figure 1. The first step is to convert documents (either internal documents or
external documents fetched by using the Agent) into an SGML format understood by Textoscope. The resulting documents are then processed to provide additional linguistic information about the contents of each document – such as through part-of-speech tagging. Documents are next labeled with terms extracted directly from the documents, based on syntactic analysis of the documents as well as on their patterns of occurrence in the overall collection. The terms and additional higher-level entities are then placed in a taxonomy through interaction with the user as well as via information provided when documents are initially converted into Textoscope's SGML format. Finally, KDD operations are performed on the event-labeled documents.
[Fig. 1 shows the following components: Agent, FTP, Other Online Sources, Reader/SGML Converter, Information Extraction, Taxonomy Editor, Text Mining ToolBox, Visualization Tools]
Fig. 1. Textoscope architecture.

Examples of document collections suitable for text mining are documents on the company's Intranet, patent collections, newswire streams, results returned from a search engine, technical manuals, bug reports, and customer surveys. In the remainder of this paper we describe Textoscope's various components: the linguistic preprocessing steps, Textoscope's information extraction engine, its tool for creating a taxonomic hierarchy for the extracted events, and, finally, a sample of its suite of text mining tools. We give examples of mining results on a collection of newswire stories fetched by our agent.
2 Information Extraction
Information Extraction (IE) aims at extracting instances of predefined templates from textual documents. IE has grown to be a very active field of research thanks to the MUC (Message Understanding Conference) initiative. MUC was initiated by DARPA in the late 80's in response to the information overload of on-line texts. One of the popular uses of IE is proper name extraction, i.e., extraction of company names, personal names, locations, dates, etc. The main components of an IE system are tokenization, zoning (recognizing paragraph and sentence limits),
morphological and lexical processing, parsing and domain semantics [1,7]. Typically, IE systems do not use full parsing of the document, since that is too time consuming and error prone. The methods typically used by IE systems are based on shallow parsing and use a set of predefined parsing rules. This "knowledge-based" approach may be very time consuming, and hence a good support environment for writing the rules is needed. Textoscope preprocesses the documents by using its own internal IE engine. The IE engine makes use of a set of predefined extraction rules. The rules can make use of a rich set of functions that are used for string manipulation, set operations and taxonomy construction. We have three major parts to the rules file. First we define all the events that we want to extract from the text. An example of an event is "Company1 Acquired Company2", or "Person has Position in Company". The second part is word classes, collections of words that have a similar semantic property. Examples of word classes are company extensions (like "inc", "corporation", "gmbh", "ag", etc.) and a list of common personal first names. The third and last part is the rules that are used to extract events out of the documents. There are two types of rules, event-generation rules and auxiliary rules. Each event-generating rule has three parts: a pattern, a set of constraints (on components of the pattern), and a set of events that are generated from the pattern. An auxiliary rule contains just a pattern. The system supports three types of patterns: AND-patterns, sequential patterns (which have semantics similar to a Prolog DCG rule), and skip patterns. Skip patterns enable the IE engine to skip a series of tokens until a member of a word class is found. Here is an example of an event-generating rule that uses an auxiliary rule:

@ListofProducts = ( @ProductList are [ registered ] trademarks of @Company @! ) > ProductList: Products = 0.
@ProductList = ( @Product , @ProductList1 @!).
@ProductList1 = ( @Product, @ProductList1 @!).
@ProductList1 = ( @Product [ , ] and @Product @! ).

In this case we look for a list of entities that is followed by the string "are registered trademarks" or "are trademarks". Each of the entities must conform to the syntax of a @Product. We have used many resources found on the WWW to acquire lists of common objects such as countries, states, cities, business titles (e.g., CEO, VP of Product Development, etc.) and technology terms. Technology terms, for instance, were extracted from publicly available glossaries. We have used our IE engine (with a specially designed rule set) to automatically extract the terms from the HTML source of the glossaries. In addition, we have used word lists of the various part-of-speech categories (nouns, verbs, adjectives, etc.). These word lists are used inside the rules to direct the parsing.
Each document is processed using the IE engine and the generated events are inserted into the document repository. In addition to the events inserted, each document is annotated with terms that are generated by using term extraction algorithms [2,5]. This enables the system to use co-occurrence between terms to infer relations that were missed by the IE engine. The user can select the granularity level of the co-occurrence computation: document level, paragraph level or sentence level. Clearly, if the granularity level is selected to be document-level, the precision will decrease, while the recall will increase. On the other hand, selecting a sentence-level granularity will yield higher precision and lower recall. The default granularity level is the sentence level: terms are considered to be related only if they co-occur within the same sentence. In all the analysis modules of the Textoscope system the user can select whether relationships will be based solely on the events extracted by the IE engine, on the term extraction, or on a combination of the two. One of the major issues that we took into account while designing the IE Rule Language was allowing the specification of common text processing actions within the language rather than resorting to external code written in C/C++. In addition to recognizing events, the IE engine allows additional analysis of text fragments that were identified as being of interest. For instance, if we have identified that a given set of tokens is clearly a company name (by having as a suffix one of the predefined company extensions), we can insert into a dynamic set called DCompanies the full company name and any of its prefixes that still constitute a company name. Consider the string "Microsoft Corporation": we will insert into DCompanies both "Microsoft Corporation" and "Microsoft". Dynamic sets are handled at five levels: system-level sets, corpus-level sets, document-level sets, paragraph-level sets and sentence-level sets. System-level sets enable knowledge transfer between corpora, while corpus-level sets enable knowledge transfer between documents in the same corpus. Document-level sets are used in cases where the knowledge acquired should be used only for the analysis of the rest of the document and is not applicable to other documents. Paragraph- and sentence-level sets are used in discourse analysis and event linking. The IE engine can learn the type of an entity from the context in which the entity appears. As an example, consider a list of entities some of which are unidentified. If the engine can determine the type of at least one of them, then the types of all other entities are determined to be the same. For instance, given the string "joint venture among affiliates of Time Warner, MediaOne Group, Microsoft, Compaq and Advance/Newhouse.", since the system had already identified Microsoft as being a company, it determined that Time Warner, MediaOne Group, Compaq and Advance/Newhouse are companies as well. The use of the list-processing rules provided a considerable boost to the accuracy of the IE engine. For instance, in the experiment described in Section 4, it caused recall to increase from 82.3% to 92.6% while decreasing precision from 96.5% to 96.3%.
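The list-based type propagation described above can be illustrated with a small, self-contained sketch (the splitting heuristic and known-entity table are simplifications, not the actual engine):

def propagate_list_types(list_text, known_types):
    # split a coordinated list such as "A, B, C and D" into candidate entities
    items = [part.strip(" .") for chunk in list_text.split(",") for part in chunk.split(" and ")]
    items = [i for i in items if i]
    # if exactly one type is already known among the items, assign it to the unidentified ones
    found = {known_types[i] for i in items if i in known_types}
    if len(found) == 1:
        inferred = found.pop()
        return {i: known_types.get(i, inferred) for i in items}
    return {i: known_types.get(i) for i in items}

known = {"Microsoft": "COMPANY"}
print(propagate_list_types("Time Warner, MediaOne Group, Microsoft, Compaq and Advance/Newhouse.", known))
# all five names come back tagged COMPANY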
Textoscope provides a rich support environment for editing and debugging the extraction rules. On the editing front, Textoscope provides a visual editor for
170
R Feldman et al.
building the rules that enables the user to create rules without having to memorize the exact syntax. On the debugging front, Textoscope provides two main utilities. First, it provides a visual tool that enables one to see all the events that were extracted from the document. The user can click on any of the events and then see the exact text where this event was extracted from. In addition the system provides an interactive derivation tree of the event, so that the user can explore exactly how the event was generated. An example of such a derivation tree is shown in Figure 2. Here we parsed the sentence “We see the Nucleus Prototype Mart as the missing link to quickly deploying high value business data warehouse solutions, said David Rowe, Director of Data Warehousing Practice at GE Capital Consulting”, and extracted the event that David Rowe is the Director of Data Warehousing Practice at a company called GE Capital Consulting. Each node in the derivation tree is annotated by an icon that symbolizes the nature of the associated grammar feature. The second debugging tool provides the user with the ability to use a tagged training set and rate each of the rules according to their contribution to the precision and recall of the system. Rules that cause precision to be lower and do not contribute towards a higher recall can be either deleted or modified.
Fig. 2. An Interactive Derivation Tree of an Extracted Event

The events that were generated by the IE engine are also used for the automatic construction of the taxonomy. Each field in each of the events is used as a source of values for the corresponding node in the taxonomy. For instance, we use the Company field from the event "Person, Position, Company" to construct the Company node in the taxonomy. The system contains several meta rules that enable the construction of a multi-level taxonomy. Such a rule can be, for instance, that Banks are Companies, and hence the Bank node will be placed under the Company node in the taxonomy.
Textoscope constructs a thesaurus that contains lists of synonyms. The thesaurus is constructed using co-reference and a set of rules for deciding that two terms actually refer to the same entity. An example of a synonym list constructed by the system is { "IBM", "International Business Machines Corp", "Big Blue" }. Textoscope also includes a synonym editor that enables the user to add, modify or delete synonym lists. This enables the user to change the automatically created thesaurus and customize it to her own needs.
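Once pairwise same-entity decisions are available, grouping them into synonym lists is a standard transitive-closure step. A minimal union-find sketch (the pair list is illustrative; the actual co-reference rules are not shown here):

def synonym_lists(pairs):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

pairs = [("IBM", "International Business Machines Corp"), ("Big Blue", "IBM")]
print(synonym_lists(pairs))   # one group containing all three name variants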
3 Results
We tested the accuracy of the IE engine by analyzing collections of documents that were extracted by the Agent from MarketWatch.com. We started by extracting 810 articles from MarketWatch.com which mentioned "ERP". We created 30 different events focused around companies, technologies, products and alliances. We defined more than 250 word classes and used 750 rules to extract those 30 event types. The rule scoring tool described in Section 3 proved to be very useful in the debugging and refinement of the rule set. After the construction of the initial rule set we were able to achieve an F-Score of 89.3%. Using the rule scoring utility enabled us to boost the F-Score to 96.7% in several hours. In order to test the rule set, we used our agent again to extract 2780 articles that mentioned "joint venture" from MarketWatch.com. We were able to extract 15,713 instances of these events. We achieved 96.3% precision and 92.6% recall on the company, people, technology and product categories, and hence an F-Score of 94.4% (β = 1), where

F = (β² + 1)PR / (β²P + R).

These results are on par with the
results achieved by the FASTUS system [1] and the NETOWL system (www.netowl.com). We will now show how Textoscope enables us to analyze the events and terms that were extracted from the 2780 articles. Textoscope provides a set of visual maps that depict the relationship between entities in the corpus. The context graph shown in Figure 3 depicts the relationship between “technologies”. The weights of the edges (number of documents in which the technologies appear in the same context) are coded by the color of the edge, the darker the color, the more frequent the connection. The graph clearly reveals the main technology clusters, which are shown as disconnected components of the graph: a security cluster and internet technologies cluster. We can see strong connections between electronic commerce and internet security, between ERP and data warehousing, and between ActiveX and internet security. In Figure 4, we can view some of the company clusters that were involved in some sort of alliance (“joint venture”, “strategic alliance”, “commercial alliance”, etc. ).
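Returning briefly to the accuracy figures above: with β = 1 the F-Score formula reduces to the harmonic mean of precision and recall, which can be checked directly:

p, r = 0.963, 0.926
f1 = 2 * p * r / (p + r)
print(round(100 * f1, 1))   # 94.4, matching the reported F-Score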
The Context Graph provides a powerful way to visualize relationships encapsulated in thousands of documents.
Fig. 3. Context Graph (technologies)
Fig. 4. Joint Venture Clusters
4 Summary
Text mining based on Information Extraction attempts to hit a midpoint between the two extremes described in the introduction (keyword labeling and pure word-based labeling), reaping some benefits from each while avoiding many of their pitfalls. On the one hand, there is no need for human effort in labeling documents, and we
are not constrained to a smaller set of labels that lose much of the information present in the documents. Thus the system has the ability to work on new collections without any preparation, as well as the ability to merge several distinct collections into one (even though they might have been tagged according to different guidelines which would prohibit their merger in a tag-based system). On the other hand, the number of meaningless results is greatly reduced and the execution time of the mining algorithms is also reduced relative to pure word-based approaches. Text mining using Information Extraction thus hits a useful middle ground on the quest for tools for understanding the information present in the large amount of data that is only available in textual form. The powerful combination of precise analysis of the documents and a set of visualization tools enables the user to easily navigate and utilize very large document collections.
References
1. Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., and Tyson, M.: FASTUS: A Finite-State Processor for Information Extraction from Real-World Text. Proceedings of IJCAI-93, Chambery, France, August 1993.
2. Daille, B., Gaussier, E. and Lange, J.M.: Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING'94, pages 515-521, 1994.
3. Feldman, R. and Hirsh, H.: Exploiting Background Information in Knowledge Discovery from Text. Journal of Intelligent Information Systems, 1996.
4. Feldman, R., Aumann, Y., Amir, A., Klösgen, W. and Zilberstien, A.: Maximal Association Rules: a New Tool for Mining for Keyword Co-occurrences in Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA, 1997.
5. Feldman, R. and Dagan, I.: KDT – Knowledge Discovery in Texts. In Proceedings of the First International Conference on Knowledge Discovery, KDD-95, 1995.
6. Rajman, M. and Besançon, R.: Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series, Leysin, Switzerland, Oct 7-10, 1997.
7. Soderland, S., Fisher, D., Aseltine, J., and Lehnert, W.: Issues in Inductive Learning of Domain-Specific Text Extraction Rules. Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at the Fourteenth International Joint Conference on Artificial Intelligence, 1995.
TopCat: Data Mining for Topic Identification in a Text Corpus

Chris Clifton¹ and Robert Cooley²
¹ The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730-1420 USA, [email protected]
² University of Minnesota, 6-225D EE/CS Building, Minneapolis, MN 55455 USA, [email protected]
Abstract. TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized “ground truth” news corpus showing this technique is effective in identifying topics in collections of news articles.
1
Introduction
Data mining has emerged to address problems of understanding ever-growing volumes of information for structured data, finding patterns within the data that are used to develop useful knowledge. On-line textual data is also growing rapidly, creating needs for automated analysis. There has been some work in this area [14,10,16], focusing on tasks such as: association rules among items in text [9], rules from semi-structured documents [18], and understanding use of language [5,15]. In this paper the desired knowledge is major topics in a collection; data mining is used to discover patterns that disclose those topics. The basic problem is as follows: Given a collection of documents, what topics are frequently discussed in the collection? The goal is to help a human understand the collection, so a good solution must identify topics in some manner that is meaningful to a human. In addition, we want results that can be used for further exploration. This gives a requirement that we be able to identify source texts relevant to a given topic. This is related to document clustering [21], but the requirement for a topic identifier brings it closer to rule discovery mechanisms. The way we apply data mining technology to this problem is to treat a document as a "collection of entities", allowing us to map this into a market
This work supported by the Community Management Staff’s Massive Digital Data Systems Program. This work was performed while the author was at the MITRE Corporation.
˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 174–183, 1999. c Springer-Verlag Berlin Heidelberg 1999
TopCat: Data Mining for Topic Identification in a Text Corpus
175
basket problem. We use natural language technology to extract named entities from a document. We then look for frequent itemsets: groups of named entities that commonly occurred together. Next, we further cluster on the groups of named entities; capturing closely-related entities that may not actually occur in the same document. The result is a refined set of clusters. Each cluster is represented as a set of named entities and corresponds to an ongoing topic in the corpus. An example topic is: ORGANIZATION Justice Department, PERSON Janet Reno, ORGANIZATION Microsoft. This is recognizable as the U.S. antitrust case against Microsoft. Although not as informative as a narrative description of the topic, it is a compact, human-understandable representation. It also meets our “find the original documents” criteria, as the topic can used as a query to find documents containing some or all of the extracted named entities (see Section 3.4).
2
Problem Statement
The TopCat project started with a specific user need. The GeoNODE project at MITRE [12] is developing a system for analysis of news in a geographic context. One goal is to visualize ongoing topics in a geographic context; this requires identifying ongoing topics. We had experience with identifying association rules among entities/concepts in text, and noticed that some of the rules were recognizable as belonging to major news topics. This led to the effort to develop a topic identification mechanism based on data mining techniques. There are related topic-based problems being addressed. The Topic Detection and Tracking (TDT) project [1] looks at clustering and classifying news articles. Our problem is similar to the Topic Detection (clustering) problem, except that we must generate a human-understandable “label” for a topic: a compact identifier that allows a person to quickly see what the topic is about. Even though our goals are slightly different, the test corpus developed for the TDT project (a collection of news articles manually classified into topics) provides a basis for us to evaluate our work. A full description of the corpus can be found in [1]. For this evaluation, we use the topic detection criteria developed for TDT2 (described in Section 4). This requires that we go beyond identifying topics, and also match documents to a topic. One key item missing from the TDT2 evaluation criteria is that the T opicID must be useful to a human. This is harder to evaluate, as not only is it subjective, but there are many notions of “useful”. We later argue that the T opicID produced by TopCat is useful to and understandable by a human.
3
Process
TopCat follows a multi-stage process, first identifying key concepts within a document, then grouping these to find topics, and finally mapping the topics back to documents and using the mapping to find higher-level groupings. We identify key concepts within a document by using natural language techniques to extract
176
C. Clifton and R. Cooley
named people, places, and organizations. This gives us a structure that can be mapped into a market basket style mining problem.1 We then generate frequent itemsets, or groups of named entities that commonly appear together. Further clustering is done using a hypergraph splitting technique to identify groups of frequent itemsets that contain considerable overlap, even though not all of the items may appear together often enough to qualify as a frequent itemset. The generated topics, a set of named entities, can be used as a query to find documents related to the topic (Section 3.4). Using this, we can identify topics that frequently occur in the same document to perform a further clustering step (identifying not only topics, but also topic/subtopic relationships). We will use the following cluster, capturing professional tennis stories, as an example throughout this section. PERSON Andre Agassi PERSON Pete Sampras PERSON Marcelo Rios
PERSON Martina Hingis PERSON Venus Williams PERSON Anna Kournikova
PERSON Mary Pierce PERSON Serena
This is a typical cluster (in terms of size, support, etc.) and allows us to illustrate many of the details of the TopCat process. It comes from merging two subsidiary clusters (described in Section 3.5), formed from clustering seven frequent itemsets (Section 3.3). 3.1
Data Preparation
TopCat starts by identifying named entities in each article (using the Alembic[7] system). This serves several purposes. First, it shrinks the data set for further processing. It also gives structure to the data, allowing us to treat documents as a set of typed and named entities. This gives us a natural database schema for documents that maps into the traditional market basket data mining problem. Third, and perhaps most important, it means that from the start we are working with data that is rich in meaning, improving our chances of getting human understandable results. We eliminate frequently occurring terms (those occurring in over 10% of the articles, such as United States), as these are used across too many topics to be useful in discriminating between topics. We also face a problem with multiple names for the same entity (e.g., Marcelo Rios and Rios). We make use of coreference information from Alembic to identify different references to the same entity within a document. From the group of references for an entity within a document, we use the globally most common version of the name where most groups containing that name contain at least one other name within the current group. Although not perfect, this does give a global identifier for an entity that is both reasonably global and reasonably unique. We eliminate composite articles (those about multiple unrelated topics, such as daily news summaries). We found most composite articles could be identified 1
Treating a document as a “basket of words” did not produce as meaningful topics. Named entities stand alone, but raw words need sequence.
TopCat: Data Mining for Topic Identification in a Text Corpus
177
by periodic recurrence of the same headline; we ignore any article with a headline that occurs at least monthly.
3.2
Frequent Itemsets
The foundation of the topic identification process is frequent itemsets. In our case, a frequent itemset is a group of named entities that occur together in multiple articles. What this really gives us is correlated items, rather than any notion of a topic. However, we found that correlated named entities frequently occurred within a recognizable topic. Discovery of frequent itemsets is a well-understood data mining problem, arising in the market basket association rule problem [4]. A document can be viewed as a market basket of named entities; existing research in this area applies directly to our problem. (We use the query flocks technology of [20] for finding frequent itemsets using the filtering criteria below). One problem with frequent itemsets is that the items must co-occur frequently, causing us to ignore topics that occur in only a few articles. To deal with this, we use a low support threshold of 0.05% (25 occurrences in the TDT corpus). Since we are working with multiple sources, any topic of importance is mentioned multiple times; this level of support captures all topics of any ongoing significance. However, this gives too many frequent itemsets (6028 2-itemsets in the TDT corpus). We need additional filtering criteria to get just the “important” itemsets.2 We use interest[6], a measure of correlation strength (specifically, the ratio of the probability of a frequent itemset occurring in a document to the multiple of the independent probabilities of occurrence of the individual items) as an additional filter. This emphasizes relatively rare items that generally occur together, and de-emphasizes common items. We select all frequent itemsets where either the support or interest are at least one standard deviation above the average, or where both support and interest are above average (note that this is computed independently for 2-itemsets, 3-itemsets, etc.) For 2-itemsets, this brings us from 6028 to 1033. We also use interest to choose between “contained” and “containing” itemsets (i.e., any 3-itemset contains three 2-itemsets with the required support.) An n−1itemset is used only if it has greater interest than the corresponding n-itemset, and an n-itemset is used only if it has greater interest than at least one of its contained n − 1-itemsets. This brings us to 416 (instead of 1033) 2-itemsets. The difficulty with using frequent itemsets for topic identification is that they tend to be over-specific. For example, the “tennis player” frequent itemsets consist of the following: 2
The problems with traditional data mining measures for use with text corpuses have been noted elsewhere as well, see [8] for another approach.
178
C. Clifton and R. Cooley Type1 PERSON PERSON PERSON PERSON PERSON PERSON PERSON
Value1 Andre Agassi Andre Agassi Anna Kournikova Marcelo Rios Martina Hingis Martina Hingis Martina Hingis
Type2 PERSON PERSON PERSON PERSON PERSON PERSON PERSON
Value2 Marcelo Rios Pete Sampras Martina Hingis Pete Sampras Mary Pierce Serena Venus Williams
Support Interest .00063 261 .00100 190 .00070 283 .00076 265 .00057 227 .00054 228 .00063 183
These capture individual matches of significance, but not the topic of “championship tennis” as a whole. 3.3
Clustering
We experimented with different frequent itemset filtering techniques, but were always faced with an unacceptable tradeoff between the number of itemsets and our ability to capture a reasonable breadth of topics. Further investigation showed that some named entities we should group as a topic would not show up as a frequent itemset under any measure; no article contained all of the entities. Therefore, we chose to perform clustering of the named entities in addition to the discovery of frequent itemsets. The hypergraph clustering method of [11] takes a set of association rules and declares the items in the rules to be vertices, and the rules themselves to be hyperedges. Clusters can be quickly found by using a hypergraph partitioning algorithm such as hMETIS [13]. We adapted the hypergraph clustering algorithm described in [11] in several ways to fit our particular domain. Because TopCat discovers frequent itemsets instead of association rules, the rules do not have any directionality and therefore do not need to be combined prior to being used in a hypergraph. The interest of each itemset was used for the weight of each edge. Since interest tends to increase dramatically as the number of items in a frequent itemset increases, the log of the interest was used in the clustering algorithm to prevent the larger itemsets from completely dominating the process. Upon investigation, we found that the stopping criteria presented in [11] only works for domains that form very highly connected hypergraphs. Their algorithm continues to recursively partition a hypergraph until the weight of the edges cut compared to the weight of the edges left in either partition falls below a set ratio (referred to as fitness). This criteria has two fundamental problems: it will never divide a loosely connected hypergraph into the appropriate number of clusters, as it stops as soon as if finds a partition that meets the fitness criteria; and it always performs at least one partition (even if the entire hypergraph should be left together.) To solve these problems, we use the cut-weight ratio (the weight of the cut edges divided by the weight of the uncut edges in a given partition). This is defined as follows. Let P be a partition with a set of m edges e, and c the set of n edges cut in the previous split of the hypergraph: cutweight(P ) =
n W eight(ci ) Σi=1 m W eight(e ) Σj=1 j
TopCat: Data Mining for Topic Identification in a Text Corpus
179
473 David Cone
162 Yankee Stadium
191 George Steinbrenner
Joe Torre
Daryl Strawberry
441
Tampa
161
Fig. 1. Hypergraph of New York Yankees Baseball Frequent Itemsets
A hyperedge remains in a partition if 2 or more vertices from the original edge are in the partition. For example, a cut-weight ratio of 0.5 means that the weight of the cut edges is half of the weight of the remaining edges. The algorithm assumes that natural clusters will be highly connected by edges. Therefore, a low cut-weight ratio indicates that hMETIS made what should be a natural split between the vertices in the hypergraph. A high cut-weight ratio indicates that the hypergraph was a natural cluster of items and should not have been split. Once the stopping criteria has been reached, vertices are “added back in” to clusters if they are contained in an edge that “overlaps” to a significant degree with the vertices in the cluster. The minimum amount of overlap required is defined by the user. This allows items to appear in multiple clusters. For our domain, we found that the results were fairly insensitive to the cutoff criteria. Cut-weight ratios from 0.3 to 0.8 produced similar clusters, with the higher ratios partitioning the data into a few more clusters than the lower ratios. The TDT data produced one huge hypergraph containing half the clusters. Most of the rest are independent hypergraphs that become single clusters. One that does not become a single cluster is shown in Figure 1. Here, the link between Joe Torre and George Steinbrenner (shown dashed) is cut. Even though this is not the weakest link, the attempt to balance the graphs causes this link to be cut, rather than producing a singleton set by cutting a weaker link. This is a sensible distinction. During spring 1999, the Yankees manager (Torre) and players were in Tampa, Florida for spring training, while the owner (Steinbrenner) was handling repairs to a crumbling Yankee Stadium in New York. 3.4
Mapping to Documents
The preceding process gives us reasonable topics. However, to evaluate this with respect to the TDT2 instrumented corpus, we must map the identified topics back to a set of documents. We use the fact that the topic itself, a set of named entities, looks much like a boolean query. We use the TFIDF metric[17] to
180
C. Clifton and R. Cooley
generate a distance measure between a document and a topic, then choose the closest topic for each document. This is a flexible measure; if desired, we can use cutoffs (a document isn’t close to any topic), or allow multiple mappings. 3.5
Combining Clusters Based on Document Mapping
Although the clustered topics appeared reasonable, we were over-segmenting with respect to the TDT “ground truth” criteria. For example, we separated men’s and women’s tennis; the TDT human-defined topics had this as a single topic. We found that the topic-to-document mapping provided a means to deal with this. Many documents were close to multiple topics. In some cases, this overlap was common and repeated; many documents referenced both topics (the tennis example was one of these). We used this to merge topics, giving the final “tennis” topic shown in Section 1. There are two types of merge. In the first (marriage), the majority of documents similar to either topic are similar to both. In the second (parent/child ), the documents similar to the child are also similar to the parent, but the reverse does not necessarily hold. (The tennis clusters were a marriage merge.) The marriage similarity between clusters a and b is defined as: P i∈documents T F IDFia ∗ T F IDFib /N P M arriageab = P i∈documents T F IDFia /N ∗ i∈documents T F IDFib /N Based on the TDT2 training set, we chose a cutoff of 30 (M arriageab ≥ 30) for merging clusters. Similar clusters are merged by taking a union of their named entities. The parent child relationship is calculated as follows: P T F IDFip ∗ T F IDFic /N P P arentChildpc = i∈documents i∈documents T F IDFic /N We calculate the parent/child relationship after the marriage clusters have been merged. In this case, we used a cutoff of 0.3. Merging the groups is again accomplished through a union of the named entities. Note that there is nothing document-specific about these methods. The same approach could be applied to any market basket problem.
4
Experimental Results
The TDT2 evaluation criteria is based on the probability of failing to retrieve a document that belongs with the topic, and the probability of erroneously matching a document to the topic. These are combined to a single number CDet as describe in [3]. The mapping between TopCat-identified topics and reference topics is defined to be the mapping that minimizes CDet for that topic (as specified by the TDT2 evaluation process).
TopCat: Data Mining for Topic Identification in a Text Corpus
181
Using the TDT2 evaluation data (May and June 1998), the CDet score was 0.0055. This was comparable to the results from the TDT2 topic detection participants[2], which ranged from 0.0040 to 0.0129, although they are not directly comparable (as the TDT2 topic detection is on-line, rather than retrospective). Of note is the low false alarm probability we achieved (0.002); further improvement here would be difficult. The primary impediment to a better overall score is the miss probability of 0.17. The primary reason for the high miss probability is the difference in specificity between the human-defined topics and the TopCat-discovered topics. (Only two topics were missed entirely; one contained a single document, the other three documents.) Many TDT2-defined topics matched multiple TopCat topics. Since the TDT2 evaluation process only allows a single system-defined topic to be mapped to the human-defined topic, over half the TopCat-discovered topics were not used (and any document associated with those topics was counted as a “miss” in the scoring). TopCat often identified separate topics, such as (for the conflict with Iraq) Madeleine Albright/Iraq/Middle East/State, in addition to the “best” topic (lowest CDet score) shown at the top of Table 1. Although various TopCat parameters could be changed to merge these, many similar topics that the “ground truth” set considers separate (such as the world ice skating championships and the winter Olympics) would be merged as well. The miss probability is a minor issue for our problem. Our goal is to identify important topics, and to give a user the means to follow up on that topic. The low false alarm probability means that a story selected for follow-up will give good information on the topic. For the purpose of understanding general topics and trends in a corpus, it is more important to get all topics and a few good articles for each topic than to get all articles for a topic.
5
Conclusions and Future Work
We find the identified topics both reasonable in terms of the TDT2 defined accuracy, and understandable identifiers for the subject. For example, the most important three topics (based on the support of the frequent itemsets used to generate the topics) are shown in Table 1. The first (Iraqi arms inspections) also gives information on who is involved (although knowing that Richard Butler was head of the arms inspection team, Bill Richardson is the U.S. Ambassador to the UN, and Saddam Hussein is the leader of Iraq may require looking at the documents; this shows the usefulness of mapping the topic identifier to documents.) The third is also reasonably understandable: Events in and around Yugoslavia. The second is an amusing proof of the first half of the adage “Everybody talks about the weather, but nobody does anything about it.” The clustering methods of TopCat are not limited to topics in text, any market basket style problem is amenable to the same approach. For example, we could use the hypergraph clustering and relationship clustering on mail-order purchase data. This extends association rules to higher-level “related purchase” groups. Association rules provide a few highly-specific actionable items, but are
182
C. Clifton and R. Cooley Table 1. Top 3 Topics for January through June 1998 Topic 1 LOCATION Baghdad LOCATION Britain LOCATION China LOCATION Iraq ORG. Security Council ORG. United Nations PERSON Kofi Annan PERSON Saddam Hussein PERSON Richard Butler PERSON Bill Richardson LOCATION Russia LOCATION Kuwait LOCATION France ORG. U.N.
Topic 2 Topic 3 LOCATION Alaska LOCATION Albania LOCATION Anchorage LOCATION Macedonia LOCATION Caribbean LOCATION Belgrade LOCATION Great Lakes LOCATION Bosnia LOCATION Gulf Coast LOCATION Pristina LOCATION Hawaii LOCATION Yugoslavia LOCATION New England LOCATION Serbia LOCATION Northeast PERSON Slobodan Milosevic LOCATION Northwest PERSON Ibrahim Rugova LOCATION Ohio Valley ORG. Nato LOCATION Pacific Northwest ORG. Kosovo Liberation LOCATION Plains Army LOCATION Southeast LOCATION West PERSON Byron Miranda PERSON Karen Mcginnis PERSON Meteorologist Dave Hennen PERSON Valerie Voss
not as useful for high-level understanding of general patterns. The methods presented here can be used to give an overview of patterns and trends of related purchases, to use (for example) in assembling a targeted specialty catalog. The cluster merging of Section 3.5 defines a topic relationship. We are exploring how this can be used to browse news sources by topic. Another issue is the use of information other than named entities to identify topics. One possibility is to add actions (e.g., particularly meaningful verbs such as “elected”). We have made little use of the type of named entity. However, what the named entity processing really gives us is a typed market basket (e.g., LOCATION or PERSON as types.) Another possibility is to use generalizations (e.g., a geographic “thesaurus” equating Prague and Brno with the Czech Republic) in the mining process[19]. Further work on expanded models for data mining could have significant impact on data mining of text.
References 1. 1998 topic detection and tracking project (TDT-2). http://www.nist.gov/speech/tdt98/tdt98.htm. 2. The topic detection and tracking phase 2 (TDT2) evaluation. ftp://jaguar.ncsl.nist.gov/tdt98/tdt2 dec98 official results 19990204/index.htm. 3. The topic detection and tracking phase 2 (TDT2) evaluation plan. http://www.nist.gov/speech/tdt98/doc/tdt2.eval.plan.98.v3.7.pdf. 4. Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., May 26–28 1993. 5. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Mining in the phrasal frontier. In 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’97), Trondheim, Norway, June 25–27 1997.
TopCat: Data Mining for Topic Identification in a Text Corpus
183
6. Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the 1997 ACM SIGMOD Conference on Management of Data, Tucson, AZ, May 13-15 1997. 7. David Day, John Aberdeen, Lynette Hirschman, Robyn Kozierok, Patricia Robinson, and Marc Vilain. Mixed initiative development of language processing systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 1997. 8. Ronen Feldman, Yonatan Aumann, Amihood Amir, Amir Zilberstein, and Wiolli Kloesgen. Maximal association rules: a new tool for mining for keyword cooccurrences in document collections. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 167–170, August 14– 17 1997. 9. Ronen Feldman and Haym Hirsh. Exploiting background information in knowledge discovery from text. Journal of Intelligent Information Systems, 9(1):83–97, July 1998. 10. Ronen Feldman and Haym Hirsh, editors. IJCAI’99 Workshop on Text Mining, Stockholm, Sweden, August 2 1999. 11. Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Clustering based on association rule hypergraphs. In Proceedings of the SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 1997. 12. Rob Hyland, Chris Clifton, and Rod Holland. GeoNODE: Visualizing news in geospatial context. In Proceedings of the Federal Data Mining Symposium and Exposition ’99, Washington, D.C., March 9-10 1999. AFCEA. 13. George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekar. Multilevel hypergraph partitioning: Applications in VLSI domain. In Proceedings of the ACM/IEEE Design Automation Conference, 1997. 14. Yves Kodratoff, editor. European Conference on Machine Learning Workshop on Text Mining, Chemnitz, Germany, April 1998. 15. Brian Lent, Rakesh Agrawal, and Ramakrishnan Srikant. Discovering trends in text databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 227–230, August 14–17 1997. 16. Dunja Mladeni´c and Marko Grobelnik, editors. ICML-99 Workshop on Machine Learning in Text Data Analysis, Bled, Slovenia, June 30 1999. 17. Gerard Salton, James Allan, and Chris Buckley. Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2):97–108, February 1994. 18. Lisa Singh, Peter Scheuermann, and Bin Chen. Generating association rules from semi-structured documents using an extended concept hierarchy. In Proceedings of the Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada, November 1997. 19. Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases, Zurich, Switzerland, September 23-25 1995. 20. Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, and Arnon Rosenthal. Query flocks: A generalization of association rule mining. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, pages 1–12, Seattle, WA, June 2-4 1998. 21. Oren Zamir, Oren Etzioni, Omid Madan, and Richard M. Karp. Fast and intuitive clustering of web documents. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 287–290, August 14–17 1997.
Vhohfwlrq dqg Vwdwlvwlfdo Ydolgdwlrq ri Ihdwxuhv dqg Surwrw|shv P1 Vheedq4>5 / G1D1 ]ljkhg5 ) V1 Gl Sdopd5
= U1D1S1L1G1 oderudwru| 0 Zhvw Lqglhv dqg Jxldqd Xqlyhuvlw|/ Iudqfh1 Pduf1VheedqCxqly0dj1iu 2 = H1U1L1F1 oderudwru| 0 O|rq 5 Xqlyhuvlw|/ Iudqfh1 ~}ljkhg/vheedq/vglsdopdCxqly0o|rq51iu
Devwudfw1 Ihdwxuhv dqg surw|shv vhohfwlrq duh wzr pdmru sureohpv lq gdwd plqlqj/ hvshfldoo| iru pdfklqh ohduqlqj dojrulwkpv1 Wkh jrdo ri erwk vhohfwlrqv lv wr uhgxfh vwrudjh frpsoh{lw|/ dqg wkxv frpsxwdwlrqdo frvwv/ zlwkrxw vdfulflqj dffxudf|1 Lq wklv duwlfoh/ zh suhvhqw wzr lqfuhphqwdo dojrulwkpv xvlqj jhrphwulfdo qhljkerukrrg judskv dqg d qhz vwdwlvwlfdo whvw wr vhohfw/ vwhs e| vwhs/ uhohydqw ihdwxuhv dqg surwrw|shv iru vxshu0 ylvhg ohduqlqj sureohpv1 Wkh ihdwxuh vhohfwlrq surfhgxuh zh suhvhqw frxog eh dssolhg ehiruh dq| pdfklqh ohduqlqj dojrulwkp lv xvhg1
4
Lqwurgxfwlrq
Zh ghdo lq wklv sdshu zlwk ohduqlqj iurp h{dpsohv $ ghvfulehg e| sdluv ^[+$,> \ +$,`/ zkhuh [+$, lv d yhfwru ri s ihdwxuh ydoxhv dqg \ +$, lv wkh fruuh0 vsrqglqj fodvv odeho1 Wkh jrdo ri d ohduqlqj dojrulwkp lv wr exlog d fodvvlfdwlrq ixqfwlrq * iurp d vdpsoh d ri q h{dpsohv $m >+m @4===q, 1 Iurp d wkhruhwlfdo vwdqgsrlqw/ wkh vhohfwlrq ri d jrrg vxevhw ri ihdwxuhv [ lv ri olwwoh lqwhuhvw = d Ed|hvldq fodvvlhu +edvhg rq wkh wuxh glvwulexwlrqv, lv prqrwrqlf/ l1h1/ dgglqj ihdwxuhv fdq qrw ghfuhdvh wkh prgho*v shuirupdqfh ^43`1 Wklv wdvn kdv krzhyhu uhfhlyhg sohqw| ri dwwhqwlrq iurp vwdwlvwlfldqv dqg uhvhdfkhuv lq Pdfklqh Ohduqlqj vlqfh wkh prqrwrqlflw| dvvxpswlrq uduho| krogv lq sudfwlfdo vlwxdwlrqv zkhuh wkh wuxh glvwulexwlrqv duh xqnrzq1 Luuhohydqw ru zhdno| uhohydqw ihdwxuhv pd| wkxv uhgxfh wkh dffxudf| ri wkh prgho1 Wkuxq hw do1 ^4;` vkrzhg wkdw wkh F718 dojrulwkp jhqhudwhv ghhshu ghflvlrq wuhhv zlwk orzhu shuirupdqfhv zkhq zhdno| uhohydqw ihdwxuhv duh qrw ghohwhg1 Dkd ^4` dovr vkrzhg wkdw wkh vwrudjh ri wkh LE6 dojrulwkp lqfuhdvhv h{srqhqwldoo| zlwk wkh qxpehu ri luuhohydqw ihdwxuhv1 Vhohfwlrq ri uhohydqw surwrw|sh vxevhwv kdv dovr ehhq pxfk vwxglhg lq Pdfklqh Ohduqlqj1 Wklv whfkqltxh lv ri sduwlfxodu lqwhuhvw zkhq xvlqj qrq sdud0 phwulf fodvvlfdwlrq phwkrgv vxfk dv n0qhduhvw0qhljkeruv ^;`/ Sdu}hq*v zlqgrzv ^45` ru pruh jhqhudoo| phwkrgv edvhg rq jhrphwulfdo prghov wkdw kdyh d uhsxwd0 wlrq iru kdylqj kljk frpsxwdwlrqdo dqg vwrudjh frvwv1 Lq idfw/ wkh fodvvlfdwlrq ri d qhz h{dpsoh riwhq uhtxluhv glvwdqfh fdofxodwlrqv zlwk doo srlqwv vwruhg lq •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 184−192, 1999. Springer−Verlag Berlin Heidelberg 1999
Selection and Statistical Validation of Features and Prototypes
185
phpru|1 Wklv ohg uhvhdufkhuv wr exlog vwudwhjlhv wr uhgxfh wkh vl}h ri wkh ohduq0 lqj vdpsoh +vhohfwlqj rqo| wkh ehvw h{dpsohv zklfk zloo eh fdoohg surwrw|shv,/ nhhslqj dqg shukdsv lqfuhdvlqj fodvvlfdwlrq shuirupdqfhv ^;`/ ^:` dqg ^4:`1 Zh suhvhqw lq wklv duwlfoh wzr kloo folpelqj dojrulwkpv wr vhohfw uhohydqw ihdwxuhv dqg surwrw|shv/ xvlqj prghov iurp frpsxwdwlrqdo jhrphwu|1 Wkh uvw dojrulwkp vwhs e| vwhs vhohfwv uhohydqw ihdwxuhv lqghshqghqwo| ri d jlyhq ohduqlqj dojrulwkp +wkh fodvvlfdwlrq dffxudf| lv qrw xvhg wr lghqwli| wkh ehvw ihdwxuhv exw rqo| wr vwrs wkh vhohfwlrq dojrulwkp,1 Wklv ihdwxuh vhohfwlrq whfkqltxh lv edvhg rq wkh lghd wkdw shuirupdqfhv ri d ohduqlqj dojrulwkp/ zkdwhyhu wkh do0 jrulwkp pd| eh/ qhfhvvdulo| ghshqg rq wkh jhrphwulfdo vwuxfwxuhv ri fodvvhv wr ohduq1 Zh sursrvh fkdudfwhul}lqj wkhvh vwuxfwxuhv lq LUs xvlqj prghov lqvsluhg iurp frpsxwdwlrqdo jhrphwu|1 Dw hdfk vwdjh/ zh vwdwlvwlfdoo| phdvxuh wkh vhsd0 udelolw| ri wkhvh vwuxfwxuhv lq wkh fxuuhqw uhsuhvhqwdwlrq vsdfh/ dqg yhuli| li wkh nhsw ihdwxuhv doorz wr exlog d prgho pruh h!flhqw wkdq wkh suhylrxv rqh1 Xqolnh wkh uvw/ wkh vhfrqg dojrulwkp xvhv wkh fodvvlfdwlrq ixqfwlrq wr vhohfw surwrw|shv lq wkh ohduqlqj vdpsoh1 Lw whvwv wkh txdolw| ri vhohfwhg h{dpsohv/ yhuli|lqj rq wkh rqh kdqg wkdw wkh| doorz wr rewdlq rq d ydolgdwlrq vdpsoh d vxffhvv udwh vljqlfdqwo| forvh wr wkh rqh rewdlqhg zlwk wkh ixoo vdpsoh/ dqg rq wkh rwkhu kdqg wkdw wkh| frqvwlwxwh rqh ri wkh ehvw ohduqlqj vxevhwv zlwk wklv vl}h1
5
Ghqlwlrqv lq Frpsxwdwlrqdo Jhrphwu|
Wkh dssurdfk zh sursrvh lq wklv duwlfoh xvhv qhljkerukrrg judskv1 Lqwhuhvwhg uhdghuv zloo qg pdq| prghov ri qhljkerukrrg judskv lq ^46`/ vxfk dv Ghodxqd|*v Wuldqjxodwlrq/ Uhodwlyh Qhljkerukrrg Judsk/ dqg wkh Plqlpxp Vsdqqlqj Wuhh +Ilj1 4,1
Minimum Spanning Tree
Gabriel’s Graph
Relative Neighborhood Graph
Delaunay’s Triangulation
Ilj1 41 Qhljkerukrrg Vwuxfwxuhv
186
M. Sebban, D.A. Zighed, and S. Di Palma
Ghqlwlrq 41 = D judsk J + /D, lv frpsrvhg ri d vhw ri yhuwlfhv qrwhg olqnhg e| d vhw ri hgjhv qrwhg D1 QE = Lq wkh fdvh ri dq rulhqwhg judsk/ D zloo eh wkh vhw ri dufv1 Lq rxu sdshu/ zh rqo| frqvlghu qrq0rulhqwhg judskv/ l1h1 d olqn ehwzhhq wzr srlqwv ghqhv dq hgjh1 Wklv fkrlfh pdnhv hyhu| qhljkerukrrg uhodwlrq v|pphwulfdo1
6 614
Vhohfwlrq ri Uhohydqw Ihdwxuhv Lqwurgxfwlrq
Jlyhq d uhsuhvhqwdwlrq vsdfh [ frqvwlwxwhg e| s ihdwxuhv [4 > [5 > ===> [s / dqg d vdpsoh ri q h{dpsohv qrwhg $4 > $5 > ==> $q = /d ohduqlqj phwkrg doorzv wr exlog d fodvvlfdwlrq ixqfwlrq *4 wr suhglfw wkh vwdwh ri \ 1 3 Frqvlghu qrz d vxevhw [3 @ i[4 > [5 > ===> [s3 j ri doo ihdwxuhv/ zlwk s ? s/ dqg qrwh *5 wkh fodvvlfdwlrq ixqfwlrq exlow lq wklv qhz uhsuhvhqwdwlrq vsdfh1 Li fodvvlfdwlrq shuirupdqfhv ri *4 dqg *5 duh htxlydohqw/ zh zloo dozd|v suhihu wkh prgho xvlqj ihzhu ihdwxuhv iru wkh frqvwuxfwlrq ri * 1 Wzr uhdvrqv mxvwli| wklv fkrlfh= 41 Wkh fkrlfh ri [3 uhgxfhv ryhuwwlqj ulvnv1 51 Wkh fkrlfh ri [3 uhgxfhv frpsxwdwlrqdo dqg vwrudjh frvwv1 Jhqhudol}dwlrq shuirupdqfhv ri *5 pd| vrphwlphv eh ehwwhu wkdq wkrvh re0 wdlqhg zlwk *4 > ehfdxvh vrph ihdwxuhv fdq eh qrlvhg lq wkh ruljlqdo vsdfh1 Qhyhuwkhohvv/ zh fdq qrw whvw doo frpelqdwlrqv ri ihdwxuhv/ l1h1 exlog dqg whvw 5s 4 fodvvlfdwlrq ixqfwlrqv1 Frqvwuxfwlyh phwkrgv +ghflvlrq wuhhv/ ix}}| wuhhv/ lqgxfwlrq judskv/ hwf1, vh0 ohfw ihdwxuhv vwhs e| vwhs zkhq wkh| lpsuryh shuirupdqfhv ri d jlyhq fulwhulrq +fodvvlfdwlrq vxffhvv udwh/ krprjhqhlw| fulwhulrq,1 Lq wkhvh phwkrgv/ wkh frq0 vwuxfwlrq ri wkh * ixqfwlrq lv grqh vlpxowdqhrxvo| zlwk ihdwxuhv fkrlfh1 Dprqj zrunv xvlqj wkh hvwlpdwlrq ri wkh fodvvlfdwlrq vxffhvv udwh/ zh fdq flwh wkh furvv0ydolgdwlrq surfhgxuh ^43`/ dqg Errwvwuds surfhgxuh ^8`1 Qhyhuwkhohvv/ hyhq li wkhvh phwkrgv doorz wr rewdlq dq xqeldvhg hvwlpdwlrq ri wklv udwh/ fdofxodwlrq frvwv vhhp surklelwlyh wr mxvwli| wkhvh surfhgxuhv dw hdfk vwdjh ri wkh ihdwxuh vhohfwlrq surfhvv1 Phwkrgv xvlqj krprjhqhlw| fulwhulrq riwhq sursrvh vlpsoh lqglfdwruv idvw wr frpsxwh/ vxfk dv hqwurs| phdvxuhv/ xqfhuwdlqw| phdvxuhv/ vhsdudelolw| phdvxuhv olnh wkh ri Zlonv ^47` ru Pdkdodqrelv*v glvwdqfh1 Exw uhvxowv dovr ghshqg rq wkh fxuuhqw * ixqfwlrq1 Zh sursrvh lq wkh qh{w vhfwlrq d qhz ihdwxuhv vhohfwlrq dssurdfk/ dssolhg ehiruh wkh frqvwuxfwlrq ri wkh * fodvvlfdwlrq ixqfwlrq/ lqghshqghqwo| ri wkh ohduqlqj phwkrg xvhg1 Wr hvwlpdwh txdolw| ri d ihdwxuh/ zh sursrvh wr hvwlpdwh txdolw| ri wkh uhsuhvhqwdwlrq vsdfh zlwk wklv ihdwxuh1
Selection and Statistical Validation of Features and Prototypes
615
187
Krz wr Hydoxdwh wkh Txdolw| ri d Uhsuhvhqwdwlrq VsdfhB
Zh frqvlghu wkdw p glhuhqw fodvvhv duh zhoo uhsuhvhqwhg e| s ihdwxuhv/ li wkh uhsuhvhqwdwlrq vsdfh +fkdudfwhul}hg e| s glphqvlrqv, vkrzv zlgh jhrphwulfdo vwuxfwxuhv ri srlqwv ehorqjlqj wr wkhvh fodvvhv1 Lq idfw/ zkhq zh exlog d prgho/ zh dozd|v vhdufk iru wkh uhsuhvhqwdwlrq vsdfh iduwkhvw iurp wkh vlwxdwlrq zkhuh hdfk srlqw ri hdfk fodvv frqvwlwxwhv rqh vwuxfwxuh1 Wkxv/ wkh txdolw| ri d uhsuhvhqwdwlrq vsdfh fdq eh hvwlpdwhg e| wkh glvwdqfh wr wkh zruvw vlwxdwlrq fkdudfwhulvhg e| wkh htxdolw| ri ghqvlw| ixqfwlrqv ri fodvvhv1 Wr vroyh wklv sureohp/ zh fdq xvh rqh ri wkh qxphurxv vwdwlvwlfdo whvwv ri srsxodwlrq krprjhqhlw|1 Xqiruwxqdwho|/ qrqh ri wkhvh whvwv lv erwk qrqsdudphwulf dqg dssolfdeoh lq LUs = Lq Vheedq ^48`/ zh exlow d qhz vwdwlvwlfdo whvw +fdoohg whvw ri hgjhv,/ zklfk grhv qrw vxhu iurp wkhvh frqvwudlqwv1 Xqghu wkh qxoo k|srwkhvlv K3 = K3 = I4 +{, @ I5 +{, @ === @ Ip +{, @ I +{, zkhuh Il +{, fruuhvsrqgv wr wkh uhsduwlwlrq ixqfwlrq ri wkh fodvv l Wkh frqvwuxfwlrq ri wklv whvw xvhv vrph frqwulexwlrqv ri frpsxwdwlrqdo jhrp0 hwu|1 Rxu dssurdfk lv edvhg rq wkh vhdufk iru jhrphwulfdo vwuxfwxuhv/ fdoohg kr0 prjhqhrxv vxevhwv/ mrlqlqj srlqwv wkdw ehorqj wr wkh vdph fodvv1 Wr rewdlq wkhvh krprjhqhrxv vxevhwv dqg hydoxdwh wkh txdolw| ri wkh uhsuhvhqwdwlrq vsdfh/ zh sursrvh wkh iroorzlqj surfhgxuh = 41 Frqvwuxfw d uhodwhg jhrphwulfdo judsk/ vxfk dv wkh Ghodxqd| Wuldqjxodwlrq/ wkh Jdeulho*v Judsk/ hwf1 ^46`1 51 Frqvwuxfw krprjhqhrxv vxevhwv/ ghohwlqj hgjhv frqqhfwlqj srlqwv zklfk ehorqj wr gliihuhqw fodvvhv1 61 Frpsduh wkh sursruwlrq ri ghohwhg hgjhv zlwk wkh suredelolw| rewdlqhg xqghu wkh qxoo k|srwkhvlv1 Wkh fulwlfdo wkuhvkrog ri wklv whvw lv xvhg wr vhdufk iru wkh uhsuhvhqwdwlrq vsdfh zklfk lv wkh iduwkhvw iurp wkh K3 k|srwkhvlv1 Dfwxdoo|/ wkh vpdoohu wklv ulvn lv/ wkh ixuwkhu iurp wkh K3 k|srwkhvlv zh duh1 Wzr vwdwhjlhv duh srvvleoh wr qg d jrrg uhsuhvhqwdwlrq vsdfh = 41 Vhdufk iru wkh uhsuhvhqwdwlrq vsdfh zklfk plqlpl}hv wkh fulwlfdo wkuhvkrog ri wkh whvw/ l1h1 zklfk lv wkh iduwkhvw iurp wkh K3 k|srwkhvlv1 Odwhu rq/ zh zloo xvh wklv dssurdfk wr wdfnoh wklv sureohp1 51 Vhdufk iru d zd| wr plqlpl}h wkh vl}h ri wkh uhsuhvhqwdwlrq vsdfh +zlwk wkh dgydqwdjh ri uhgxflqj vwrudjh dqg frpsxwlqj frvwv,/ zlwkrxw uhgxflqj wkh txdolw| ri wkh lqlwldo vsdfh1 616
Dojrulwkp
Ohw [ @ i[4 > [5 > ===> [s j eh wkh uhsuhvhqwdwlrq ri d jlyhq d ohduqlqj vdpsoh1 Dprqj wkhvh s ihdwxuhv/ zh vhdufk iru wkh s prvw glvfulplqdqw rqhv +s ? s, xvlqj wkh iroorzlqj dojrulwkp=
188
M. Sebban, D.A. Zighed, and S. Di Palma
41 Frpsxwh wkh 3 fulwlfdo wkuhvkrog ri wkh whvw ri hgjhv lq wkh lqlwldo uhsuhvhqwdwlrq vsdfh [ 51 Frpsxwh iru hdfk frpelqdwlrq ri s 4 ihdwxuhv wdnhq dprqj wkh s fxuuhqw/ wkh f fulwlfdo wkuhvkrog 61 Vhohfw wkh ihdwxuh zklfk plqlpl}hv wkh f fulwlfdo wkuhvkrog 71 Li f ? 3 wkhq ghohwh wkh vhohfwhg ihdwxuh/ s # s 4/ uhwxuq wr vwhs 4 hovh s @ s dqg vwrs1 4
Wklv dojrulwkp lv d kloo folpelqj phwkrg1 Lw grhv qrw vhdufk iru dq rswlpdo fodvvlfdwlrq ixqfwlrq/ lq dffrugdqfh zlwk d fulwhulrq edvhg rq dq xqfhuwdlqw| phdvxuh/ exw udwkhu dlpv dw qglqj d uhsuhvhqwdwlrq vsdfh wkdw doorzv wr exlog d ehwwhu prgho1 617
Vlpxodwhg H{dpsoh
Wr looxvwudwh rxu dssurdfk/ zh dsso| lq wklv vhfwlrq rxu dojrulwkp wr d vlpxodwhg h{dpsoh1 Ohw d eh d ohduqlqj vdpsoh frpsrvhg ri 433 h{dpsohv ehorqjlqj wr wzr fodvvhv1 Hdfk h{dpsoh lv uhsuhvhqwhg lq LU6 e| 6 ihdwxuhv +qrwhg [4 [5 [6 ,1 Wkh wzr fodvvhv duh vwdwlvwlfdoo| glhuhqw/ l1h1 fkdudfwhulvhg e| wzr glhuhqw sure0 delolw| ghqvlwlhv1 Iru lqvwdqfh/ Qrupdo odz Q +4 > 4 , iru h{dpsohv ri |4 fodvv Qrupdo odz Q +5 > 5 ,> zkhuh 5 A 4 iru h{dpsohv ri |5 fodvv /
/
Wr hvwlpdwh wkh fdsdflw| ri rxu dojrulwkp wr qg wkh ehvw uhsuhvhqwdwlrq vsdfh/ zh jhqhudwh 6 qhz qrlvhg ihdwxuhv +qrwhg [7 [8 [9 ,1 Hdfk ihdwxuh lv jhqhudwhg lghqwlfdoo| iru wkh zkroh vdpsoh1 Wkh uvw 3 ulvn lq LU9 lv derxw 4143; = Dsso|lqj rxu dojrulwkp/ zh rewdlq wkh iroorzlqj uhvxowv +wdeoh 4,1 /
/
Wdeoh 41 Dssolfdwlrq ri wkh ihdwxuh vhohfwlrq dojrulwkp vwhs l 4 5 6 7
f
f2
f
fe
fD
fS
3D 8 1 4 3 3D 5 14 3 3H 5 1 4 3 3 6 1 4 3 32 ; 14 3 32 8 14 3 3S 6 1 4 3 3S 4 14 3 32 - 4 1 4 3 3 7 14 3 3e : 14 3 3S 7 1 4 3 3e 9 14 3 32 - 4 1 4 3 3D 7 14 3 3D 4 1 4 3 3e < 14 3 3e 6 14 3 -
kSW
kf
G h flvlr q
3 4 1 4 3 3H F r q w l q x h 3e 5 14 3 3 F r q w l q x h 7 14 3 3D 7 14 3 3e F r q w l q x h 4 14 3 3e 4 14 3 3D V w r s < 14 3 5 14 3
Gxulqj vwhs 4/ ghohwlrq ri [7 ihdwxuh doorzv wr uhgxfh fulwlfdo wkuhvkrog +iurp 4143; wr 514346 ,1 Vwhsv 5 dqg 6 ohdg wr wkh vxsuhvvlrq ri [9 dqg [8 1 Dw wkh irxuwk vwhs/ wkh ydoxh <14347 +zlwkrxw wkh [6 ihdwxuh, grhv qrw doorz wr uhgxfh wkh 3 ulvn dqg wkxv wkh surfhvv vwrsv1 Xqolnh qxphurxv rwkhu phwkrgv wkdw zrxog dovr vhohfw +[4 / [5 /[6 ,/ rxu dssurdfk grhv qrw xvh wkh * fodvvlfdwlrq ixqfwlrq1 4
Li zh vhdufk iru plqlpl}lqj wkh vl}h ri wkh vsdfh/ zh zloo uhwxuq wr 51
Selection and Statistical Validation of Features and Prototypes
7 714
189
Surwrw|sh Vhohfwlrq Suhvhqwdwlrq
Lqwxlwlyho|/ zh wklqn wkdw d vpdoo qxpehu ri surwrw|shv fdq ohdg wr frpsdudeoh dqg shukdsv kljkhu shuirupdqfhv wkdq wkrvh rewdlqhg zlwk d zkroh vdpsoh1 Zh mxvwli| wklv lghd dv iroorzv = 41 Vrph qrlvh ru uhshwlwlrqv lq gdwd frxog eh ghohwhg/ 51 Hdfk surwrw|sh fdq eh ylhzhg dv d vxssohphqwdu| ghjuhh ri iuhhgrp1 Uhgxf0 lqj wkh qxpehu ri surwrw|shv fdq wkxv vrphwlphv dyrlg ryhuwwlqj vlwxdwlrqv1 Wr uhgxfh vwrudjh frvwv/ vrph dssurdfkhv xvh dq dojrulwkp vhohfwlqj plvfodv0 vlhg h{dpsohv vxfk dv frqghqvhg qhduhvw qhljkeruv ^;` zklfk doorzv wr qg d frqvlvwhqw vxevhw/ l1h1 zklfk fruuhfwo| fodvvlhv doo wkh uhpdlqlqj srlqwv lq wkh vdpsoh vhw1 Lq ^:`/ wkh dxwkru sursrvhv wkh uhgxfhg qhduhvw qhljkeru uxoh zklfk lpsuryhv wkh suhylrxv dojrulwkp e| qglqj wkh plqlpdo frqvlvwhqw vxevhw li lw ehorqjv wr wkh Kduw*v frqvlvwhqw vxevhw1 Vndodn ^4:` sursrvhv wzr glhuhqw surwrw|sh vhohfwlrq dojrulwkpv = wkh uvw lv d Prqwh Fduor vdpsolqj dojrulwkp > wkh vhfrqg dssolhv udqgrp pxwdwlrq kloo folpelqj/ zkhuh wkh wqhvv ixqfwlrq lv wkh fodvvlfdwlrq vxffhvv udwh rq wkh ohduqlqj vdpsoh1 \hw/ wklv dssurdfk lv olplwhg wr vlpsoh sureohpv zkhuh fodvvhv ri sdwwhuqv duh hdvlo| vhsdudeoh/ vlqfh wkh dxwkru d sulrul ghqhv wkh qxpehu ri surwrw|shv dv wkh qxpehu ri fodvvhv1 Zh fdq hdvlo| lpdjlqh vrph sureohpv zkhq fodvvhv duh pl{hg1 Lq rxu plqg/ zh frxog lpsuryh wklv dojrulwkp xvlqj dv wkh qxpehu ri surwrw|shv wkh qxpehu ri krprjhqhrxv vxevhwv ghvfulehg lq wkh suhylrxv vhfwlrq1 Rwkhu zrunv derxw surwrw|sh vhohfwlrq fdq eh irxqg lq ^<` ru ^44`1 Lq wklv vhfwlrq/ zh suhvhqw d qhz ghflvlrq uxoh/ wkh Suredelolvwlf Yrwh/ wkdw xvhv wkh lqirupdwlrq frqwdlqhg lq d frqqhfwhg qhljkerukrrg judsk1 Zh wkhq suhvhqw wkh sulqflsoh ri lwv xvh lq d yduldqw ri wkh surwrw|sh vhohfwlrq phwkrg sursrvhg lq ^;` dqg ^:`1 715
Wkh Suredelolvwlf Yrwh
Rxu dssurdfk xvhv d zhljkwhg yrwh ri qhljkeruv +lq d frqqhfwhg qhljkerukrrg judsk, wr odeho d qhz h{dpsoh1 Wkh zhljkw ri wkh d qhljkeru uhodwlrqvkls lv phdvxuhg e| wkh suredelolw| ri wkh wzr h{dpsohv ehlqj qhljkeruv hyhq li wkh vl}h ri wkh vhw lqfuhdvhv1 Zh suhvhqw lq wklv vhfwlrq wkh wkhruhwlfdo iudph ri wkh Suredelolvwlf Yrwh/ zlwk Jdeulho*v Judsk exw wklv dssurdfk fdq eh h{whqghg wr rwkhu qhljkerukrrg vwuxfwxuhv1 Ghqlwlrq 51 = Zhljkw +$m > $, Ohw +$m > $,> wkh zhljkw ri wkh $m yrwhu/ qhljkeru ri $/ eh ghqhg dv =
190
M. Sebban, D.A. Zighed, and S. Di Palma
= d $ ^3> 4` @ V$m >$ ,> ;$ 3 5 +$m > $, :$ +$m > $, @ Su+$ 3 5 zkhuh V$m >$ lv wkh k|shuvskhuh zlwk wkh gldphwhu +$m > $,1 Ghqlwlrq 61 = Fryhulqj vsdfh Zh ghqh wkh fryhulqj vsdfh G frqwdlqlqj doo srvvleoh phpehuvkls ri wkh
vhw dv wkh k|shufxeh fryhulqj wkh xqlrq ri k|shuvskhuhv ri qhljkeruv lq wkh ohduqlqj vdpsoh1
D
ω1 ω6
ω2
d2=10
R=1.5
ω
ω4
ω3 ω5 d1=8
Ilj1 51 H{dpsoh ri fryhulqj vsdfh1
Iurp G/ zh fdofxodwh wkh suredelolw|
YG YV$m >$ YG zkhuh YG lv wkh yroxph ri G dqg YV$m >$ lv wkh yroxph ri wkh k|shuvskhuh zlwk gldphwhu +$m > $,= Ghqlwlrq 71 = Zh ghqh YV$m >$ > wkh yroxph ri d jlyhq k|shuvskhuh lq LUs zlwk gldphwhu +$m > $, dv =
S u+$ 3 5 @ V$m >$ , @
s
s YV$m >$ @ 5s u$s m >$ +s , 5
zkhuh u$m >$ lv wkh udglxv ri wkh k|shuvskhuh zlwk gldphwhu +$m > $, dqg +{, lv wkh Jdppd ixqfwlrq1 YG lv rewdlqhg e| pxowlsolfdwlrq ri wkh ohqjwkv ri wkh k|shufxeh*v vlghv1 H{dpsoh = Jlyhq d Jdeulho*v Judsk exlow iurp d ohduqlqj vdpsoh
d @ i$4 > $5 > $6 > $7 > $8 > $9 j +Ilj1 5,/ dqg $ d qhz h{dpsoh wr odeho/ zh fdq fdofxodwh wkh zhljkw +$4 > $, ri $4 / YG YV$4 >$ g4 g5 u$5 4 >$ @ @ 3=<44 +$4 > $, @ YG g4 g5
Selection and Statistical Validation of Features and Prototypes
716
191
Surwrw|sh Vhohfwlrq Dojrulwkp
Wzr w|shv ri dojrulwkpv h{lvw iru wkh exloglqj ri jhrphwulfdo judskv ^6`= 41 Wrwdo dojrulwkpv = lq wklv fdvh/ qhljkerukrrg vwuxfwxuhv +Jdeulho/ Uhodwlyh qhljkeruv ru Ghodxqd|*v Wuldqjohv, duh dssolhg rq wkh zkroh vdpsoh1 Wr exlog d qhz hgjh/ vrph frqglwlrqv pxvw eh lpsrvhg rq wkh zkroh vhw1 Wkxv/ zkhq d qhljkerukrrg lv exlow/ lw lv qhyhu vxssuhvvhg1 51 Frqvwuxfwlyh dojrulwkpv = lq wklv fdvh/ wkh judsk lv exlow srlqw e| srlqw/ vwhs e| vwhs1 Hdfk srlqw lv lqvhuwhg/ jhqhudwlqj vrph qhljkerukrrgv/ ghohwlqj rwk0 huv1 Wkxv/ rqo| d orfdo xsgdwh ri wkh judsk lv qhfhvvdu| ^7`1 Iru wkhvh wzr w|shv ri dojrulwkpv/ wkh odeho ri srlqwv wr lqvhuw lv qrw xvhg1 Wkh surwrw|shv vhohfwlrq dojrulwkp suhvhqwhg lq wklv vhfwlrq ehorqjv wr wkh vhfrqg fdwhjru| exw wdnhv lqwr dffrxqw wkh odeho ri srlqwv douhdg| lqvhuwhg lq wkh judsk1 Lw pd| wkxv rqo| eh xvhg zlwk vxshuylvhg ohduqlqj1 Lwv sulqflsoh lv vxppxdul}hg e| wkh iroorzlqj svhxgr0frgh1 Ohw d eh wkh ruljlqdo wudlqlqj vdpsoh dqg eh wkh vhw ri vhohfwhg surswrw|shv Lqlwldoo|/ frqwdlqv rqh udqgrpo| vhohfwhg h{dpsoh Uhshdw Fodvvli| d zlwk wkh Suredelolvwlf Yrwh xvlqj wkh h{dpsohv lq 1 Pryh plvfodvvlilhg h{dpsohv lqwr 1 xqwlo doo h{dpsohv uhpdlqlqj lq d duh zhoo fodvvlilhg1 Wkxv/ wkh shuwlqhqfh ri dq h{dpsoh lv ghqhg dv iroorzlqj = d srlqw lv shu0 wlqhqw li lw eulqjv lqirupdwlrq derxw lwv fodvv1 Lqwhuhvwhg uhdghuv pd| qg wkh uhvxowv ri dq dssolfdwlrq ri rxu surwrw|sh vhohfwlrq whfkqltxh rq wkh zhoo0nqrzq Euhlpdq zdyh irupv sureohp ^5` lq ^49`1 Wklv uhvxowv vkrz wkdw wkh vhohfwlrq whfkqltxh doorzv /rq wklv sureohp/ wr fxw e| pruh wkdq kdoi wkh vl}h ri ohduqlqj vdpsoh zlwkrxw orzhulqj wkh jhqhudolvdwlrq dffxudf| ri wkh exlow fodvvlfdwlrq ixqfwlrq1
8
Frqfoxvlrq
Wkh jurzlqj vl}h ri prghuq gdwdedvhv pdnhv ihdwxuh vhohfwlrq dqg surwrw|sh vhohfwlrq fuxfldo lvvxhv1 Zh kdyh sursrvhg lq wklv duwlfoh wzr dojrulwkpv wr uhgxfh wkh glphqvlrqdolw| ri wkh uhsuhvhqwdwlrq vsdfh dqg wr uhgxfh wkh qxpehu ri h{dpsohv ri d ohduqlqj vdpsoh1 Rxu dssurdfk lv fxuuhqwo| olplwhg lq wkdw lw vxssrvhv wkdw h{dpsohv duh rqo| ghvfulehg e| qxphulfdo ihdwxuhv1 Zh duh qrz zrunlqj rq qhz qhljkerukrrg vwuxfwxuhv wr wdnh lqwr dffrxqw v|perolf gdwd/ zlwkrxw xvlqj hxfolghdq glvwdqfhv1
192
M. Sebban, D.A. Zighed, and S. Di Palma
Uhihuhqfhv 41 Dkd/ G1Z1/ Nleohu/ G1/ ) Doehuw/ P1N1 Lqvwdqfh0edvhg ohduqlqj dojrulwkpv1 Pd0 fklqh Ohduqlqj 9+4,=6:099/ 4<<41 51 Euhlpdq/ O1/ Iulhgpdq/ M1K1/ Rovkhq/ U1D1/ ) Vwrqh/ F1M1 Fodvvlfdwlrq Dqg Uh0 juhvvlrq Wuhhv1 Fkdspdq ) Kdoo/ 4<;71 61 Fkdvvhu|/ M1P1/ ) Prqwdqyhuw/ D1 Jìrpìwulh Glvfuëwh hq Dqdo|vh g*Lpdjhv1 Khu0 pëv/ 4<<41 71 Ghylmyhu/ S1D1/ ) Ghnhvho/ P1 Frpsxwlqj Pxowlglphqvlrqdo Ghodxqd| whvvhohwlrqv1 Uhvhdufk Uhsruw/4<;51 81 Hiurq/ E1/ ) Wlelvkludql/ U1 Dq Lqwurgxfwlrq wr wkh Errwvwuds1 Fkdspdq ) Kdoo/4<<61 91 Jdeulho/ N1U1/ ) Vrndo/ U1U1 D qhz Vwdwlvwlfdo Dssurdfk wr Jhrjudsklf Yduldwlrq Dqdo|vlv/ V|vwhpdwlf ]rrorj|/ 58<05:;1 4<9<1 :1 Jdwhv/ J1Z1 Wkh Uhgxfhg Qhduhvw Qhljkeru Uxoh1 LHHH Wudqv1 Lqirup1 Wkhru|/ 7640766/ 4<:51 ;1 Kduw/ S1H1 Wkh Frqghqvhg Qhduhvw Qhljkeru Uxoh1 LHHH Wudqv1 Lqirup1 Wkhru|/ 8480849/ 4<9;1 <1 Lfklqr/ P1/ ) Vnodqv|/ M1 Wkh Uhodwlyh Qhljkerukrrg Judsk iru Pl{hg Ihdwxuh Yduldeohv/ Sdwwhuq uhfrjqlwlrq > LVVQ 336406536 > XVD > GD/ +4;,=494049:/ 4<;81 431 Nrkdyl/ N1 Ihdwxuh Ixevhw Vhohfwlrq dv Vhdufk zlwk Suredelolvwlf Hvwlpdwhv/ DDDL Idoo V|psrvlxp rq Uhohydqfh/4<<71 441 OhErxujhrlv/ I1 ) Hpswr}/ K1 Suhwrsrorjlfdo Dssurdfk Iru Vxshuylvhg Ohduqlqj / 46wk LFSU <9/ Ylhqqd Dxvwuld Dxjxvw 5805 5890593/ 4<<91 451 Sdu}hq/ H1 Rq Hvwlpdwlrq ri d Suredelolw| Ghqvlw| Ixqfwlrq dqg Prgh1 Dqq1 Pdwk1 Vwdw/ +66,=4398043:9/ 4<951 461 Suhsdudwd/ I1S1/ ) Vkdprv/ P1L1 Sdwwhuq Uhfrjqlwlrq dqg Vfhqh Dqdo|vlv1 Vsulqjhu0 Yhuodj14<;81 471 Udr/ F1 Olqhdu Vwdwlvwlfdo Lqihuhqfh dqg lwv Dssolfdwlrqv1 Zloh| Qhz \run1 4<981 481 Vheedq/ P1 Prgëohv wkìrultxhv hq Uhfrqqdlvvdqfh gh Iruphv hw Dufklwhfwxuh K|0 eulgh srxu Pdfklqh Shufhswlyh1 Wkëvh gh grfwrudw gh o*Xqlyhuvlwì O|rq414<<91 491 ]ljkhg/ G1D1 dqg Vheedq/ P1 Vìohfwlrq hw Ydolgdwlrq Vwdwlvwltxh gh Yduldeohv hw gh Surwrw|shv1 Uhyxh Hohfwurqltxh vxu o*Dssuhqwlvvdjh sdu ohv Grqqìhv/ +5,/ 4<<;1 4:1 Vndodn/ G1E1 Surwrw|sh dqg Ihdwxuh Vhohfwlrq e| Vdpsolqj dqg Udqgrp Pxwd0 wlrq Kloo Folpelqj Dojrulwkpv1 Surfhhglqjv ri 44wk Lqwhuqdwlrqdo Frqihuhqfh rq PDfklqh Ohduqlqj/ Prujdq Ndxipdqq/ 5<6063414<<71 4;1 Wkuxq hwdo1 Wkh Prqn*v Sureohp= d Shuirupdqfh Frpsdulvrq ri Glhuhqw Ohduqlqj Dojrulwkpv1 Whfkqlfdo Uhsruw FPX0FV0<404<:/ Fduqhjlh Phoorq Xqlyhuvlw|/ 4<<41
Taming Large Rule Models in Rough Set Approaches Thomas Agotnes1 , Jan Komorowski1, and Terje Lken2 1
Knowledge Systems Group, Department of Information and Computer Science, Norwegian University of Science and Technology, N-7491 Trondheim, Norway 2
fagotnes, [email protected]
Department of Computer Science, SINTEF Telecom and Informatics, N-7465 Trondheim, Norway [email protected]
In knowledge discovery from uncertain data we usually wish to obtain models that have good predictive properties when applied to unseen objects. In several applications, it is also desirable to synthesize models that in addition have good descriptive properties. The ultimate goal therefore, is to maximize both properties, i.e. to obtain models that are amenable to human inspection and that have high predictive performance. Models consisting of decision or classi cation rules, such as those produced with rough sets [19], can exhibit both properties. In practice, however, the induced models are often too large to be inspected. This paper reports on two basic approaches to obtaining manageable rule-based models that do not sacri ce their predictive qualities: a priori and a posteriori pruning. The methods are discussed in the context of rough sets, but several of the results are applicable to rule-based models in general. Algorithms realizing these approaches have been implemented in the Rosetta system. Predictive performance of the models has been estimated using accuracy and receiver operating characteristics (ROC). The methods has been tested on real-world data sets, with encouraging results. Abstract.
1
Introduction
Rough set theory, introduced by Pawlak [19], provides a theoretically sound framework for extracting models, in the form of propositional decision rules, from data. There are two main tasks for which a model is useful: prediction and description. When performing knowledge discovery from databases (KDD), we are interested in nding as good a model as possible from a set of data. However, what constitutes a good model may vary, depending on the goals of the particular KDD process. If the goal is to build a model that is able to classify unseen objects as accurately as possible then the predictive quality is all-important. If the goal of the KDD process is description, we need to formalize what it means that a model displays good descriptive qualities. This is a diÆcult task, but the size of the model is of fundamental importance. A model consisting of thousands of •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 193−203, 1999. Springer−Verlag Berlin Heidelberg 1999
194
T. Ågotnes, J. Komorowski, and T. Løken
rules, or utilizing several hundred attributes is incomprehensible, while a small model, for example containing in the neighborhood of twenty relatively short rules or less, is easily understood by a human. There are two fundamentally dierent approaches to nding models displaying good descriptive characteristics. The rst approach starts out by generating as good a predictive model as possible, ignoring the descriptive quality. It is then believed that within this model there are some patterns which are of fundamental importance, and others which are redundant or only apply to a small number of objects in the data. Each model consisting of rules which are contained in the original model is called a submodel, and using various ltering strategies, one accepts a small drop in predictive performance in exchange for a submodel which is signi cantly smaller than the original model. The other approach aims at locating the most important patterns directly from the data. This approach is based on a generalization of Kowalczyk's Rough Data Modelling [12]. The rst approach may be computationally expensive, because the problem of nding all models (i.e. reducts) is NP-hard [23], even if it may, at times, be alleviated through the use of appropriate heuristics. In the second approach, the cost of computing is much smaller, for the price of, possibly, not nding the best model. However, the models that are found may be suÆciently good for the problem at hand. It may also be that the computational cost of the rst approach is indeed unacceptable. Clearly, there are advantages and disadvantages with both approaches and there is an apparent need to explore both dimensions in depth. The models considered here are induced from data and are on the form RUL = r1 ; : : : ; rn , ri = i i , where each i is a conjunction of descriptors over the condition attributes, and each i a descriptor over the decision attribute, of a decision system. Such models can both be induced from data and used for classi cation purposes using the Rosetta system [18, 11] which is a tool-kit for data mining and knowledge discovery within the framework of rough set theory. Predictive performance of the models is estimated using ROC (Relative Operating Characteristics) analysis; a method with origins in signal theory that is commonly used in medical diagnosis and is gaining popularity in machine learning. For comparative purposes, the area under the ROC curve (the AUC) is recommended as the appropriate measure of classi er accuracy compared to traditional accuracy measures [25]. Dierences in calculated AUC values for two classi ers may be due to chance. Hanley and McNeil [7] provide a statistical hypothesis test for detecting statistically signi cant dierence in two correlated (calculated from the same data) AUC values. We used this test with a 5% signi cance level for the two-tailed test using Pearson's correlation measure. Both approaches are illustrated here with experiments on real-world data sets, and the results are encouraging. The methods and experiments are further discussed in [1] and [13]. In the following it is assumed that the reader has some familiarity with the use of the rough set framework for synthesizing models from data. A tutorial
f
g
!
Taming Large Rule Models in Rough Set Approaches
195
introduction to rough sets may be found in [10]. Kowalczyk's method is, however, brie y explained, as it may be less known.
2 2.1
Filtering Strategies A Posteriori Filtering
The general problem of post-pruning of models is nding submodels with lower complexity but without signi cantly lower predictive performance. Thus, two general properties of models are performance and complexity. For rule ltering, submodels correspond to subsets and complexity is equal to rule count. The rule ltering problem thus consists of nding a high performance subset RUL0 of a given rule set RUL. A genetic algorithm for rule ltering was implemented in the Rosetta framework, based on an implementation by Vinterbo [26]. Given an initial rule set RUL = fr1 ; : : : ; rn g the search space fR : R RULg is represented by bit strings of length n, where the semantics of a bit string b1 bn is the rule set fri : bi = 1g. The tness function is a weighted sum of the performance and the inverse of rule complexity. Both the weight and the termination criteria for the algorithm is set by the user. Genetic Filtering
Numerical measures of the quality of individual decision rules are generally derived from the contingency tables. The contingency table for the rule r = ! tabulates the number of objects from the rule's originating decision system that matches the antecedent and/or the consequent. For each 2 f; :g and 2 f ; : g, n; denotes the number of objects matching both and . Also, n = n ; + n ;: and n = n; + n:; . The total number of objects in the originating decision system is denoted jU j. Relative frequencies are often used: f; = n; =jU j, f = n =jU j and f = n =jU j for 2 f; :g and 2 f ; : g. The most commonly referenced numerical properties for a rule r = ! are accuracy (r) = n; =n and coverage (r) = n; =n . Coverage was used as a measure for rule ltering in [16]. An overview of rule quality formulae is given by Bruha in [3] and is summarized in Table 1. We used these quality functions for rule ltering by plotting the performance of the model versus the number of the individually best rules included from the corresponding orderings of the un ltered rule set, and selecting models according to given problem speci c criteria. Quality-based Filtering
2.2
A Priori Filtering
Several approaches to a priori rule ltering exist. In this paper, Rough Data Modelling, and its generalization, Rough Modelling, have been investigated. Rough Data Modelling is a method introduced in [12] which attempts to address two common problems found in traditional data mining methods: the
196
T. Ågotnes, J. Komorowski, and T. Løken
Quality measure Formula Michalski Torgo Brazdil Pearson G2 J Cohen Coleman C1 C2 Kononenko
accuracy (r) + (1 ) coverage(r) 1 1 Michalski with = 2 + 4 accuracy (r ) coverage (r ) 1 accuracy (r) e (n; n:;: n;: n:; )=(n n: n n: ) 2 (n; ln(n; jU j=(n n )) + n;: ln(n;: jU j=(n n: ))) qualityG2 (r)=(2jU j) (jU j n; + jU j n:;: n n )=(jU j2 n n n: n: ) (jU j n; n n )=(n n: ) qualityColeman (r)((2 + qualityCohen (r))=3) qualityColeman (r)((1 + coverage(r))=2) log 2 f + log 2 accuracy (r ) Table 1. Rule quality formulae [3]
computation cost of model generation and the inability to tailor the method to the speci c needs of each data mining session. Kowalczyk argues, as do many others, that knowledge discovery should be looked upon as an iterative process where the goals of the process may vary. With the large number of alternatives found at each stage in the knowledge discovery process (feature selection, discretization, data mining, etc.), it is impossible to nd general guidelines which will always produce the best model. This means that it is desirable to generate as large a number of models as possible and search among these in order to nd a satisfying model. However, the computational cost of many commonly used algorithms prohibits the generation of numerous models, and in addition, most algorithms are speci cally designed to maximize (or minimize) a certain measure, or a prede ned combination of measures (accuracy, speci city, misclassi cation cost, etc.). Rough data modelling simpli es the model generation process and makes it feasible to search among a large number of models. In addition, it allows the user to tailor the data mining process to suit his/her own needs, by allowing the user to specify in detail how to evaluate a model. From a decision system A = (U; A [ fdg), a rough data model is a triple M =< B; dB ; >, where B is an attribute subset of A that induces a set of equivalence classes E B : dB is a class decision function, which maps the equivalence classes E B to a single decision value for each class, and a linear ordering of the equivalence classes in B E : Both the class decision function and the ordering on the classes are decided by the user. The process of generating high-quality models using rough data modelling is a three-step process [12]: 1. Specify the performance measure that is to be optimized, and specify the measure used to rank classes in a model. 2. Determine the search space, i. e. the collection of models that should be searched for a model maximizing the measure found in step 1. This step also involves determining the class decision function dB .
3. Determine the search procedure.

Now, each equivalence class in a particular rough data model is uniquely identified by the values of the attributes in B. It is thus equivalent to a single rule of the form

  (a_1(x) = v_1) ∧ ... ∧ (a_n(x) = v_n) → (d(x) = d_B([x]_B))    (1)
where B = {a_1, ..., a_n}, and {v_1, ..., v_n} are the characteristic values of the equivalence class of x, [x]_B. The complexity of rough data modelling is linear with respect to the number of objects in the data set [12]. The size of the search space is decided by the user-specified upper and lower bounds on how many attributes to include in the model, and it is only feasible to search for models which use relatively few (5-10) attributes. Rough data modelling is closely related to several other strategies for extracting small, yet accurate models from possibly large data sets, for example rough classifiers [20] and feature subset search [9]. It is desirable to develop a unified framework for mining compact, yet accurate models which is able to incorporate new developments in rough set theory, such as the replacement of the discernibility relation with a similarity or tolerance relation [22, 24], or the incorporation of ordinal properties of attribute values obtained through the use of a dominance relation [5]. As a step towards such a framework, we introduce the concept of rough models, which is a generalization of rough data models. While a rough data model partitions the universe U using the indiscernibility relation, a rough model is only required to have classes that cover U, imposing no restrictions on how these are found. For a decision system A = (U, A ∪ {d}), a rough model is a tuple
  M = <B, d̂_B, E, ≺, R>    (2)
where B ⊆ A is a set of attributes, and d̂_B : U → V_d is an object decision function, not a class decision function as used in rough data models. This allows us the added flexibility of being able to assign different decision values to objects which belong to the same decision class, if we so desire. E is a set of object classes which cover the universe U, ≺ is a linear ordering on the classes in E, and R is a set of reducts for the decision system A′, which is obtained by replacing the decision value of each object x ∈ U with d̂_B(x). Rough data models are a kind of rough model where R = {B} and E = U/IND(B). In addition, one can easily define, for example, rough similarity models, where the classes E are induced by a form of similarity relation [24]¹, rough dominance models, where the dominance relation, explained in [5], is used to partition the universe into classes E, or rough Holte models, where each attribute in B is regarded as a reduct, yielding a model containing univariate rules.
¹ In this case, the classes will overlap.
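As a concrete illustration of how a rough data model reduces to one rule per equivalence class, the sketch below groups objects by their values on B, assigns each class the dominating decision value, and emits rules of the form (1). It is a simplified reading of the definitions above, not the ROSETTA implementation, and all names are invented for the example.

```python
from collections import Counter, defaultdict

def rough_data_model(objects, decisions, B):
    """Group objects by their values on the attributes in B and derive one
    rule per equivalence class, labelled with the dominating decision value.

    objects   : list of dicts, attribute name -> value
    decisions : list of decision values, one per object
    B         : list of attribute names defining the indiscernibility relation
    """
    classes = defaultdict(list)            # description tuple -> object indices
    for i, obj in enumerate(objects):
        key = tuple(obj[a] for a in B)
        classes[key].append(i)

    rules = []
    for key, members in classes.items():
        votes = Counter(decisions[i] for i in members)
        d_B, votes_for_d_B = votes.most_common(1)[0]   # class decision function
        antecedent = " AND ".join(f"{a} = {v}" for a, v in zip(B, key))
        rules.append((f"{antecedent} -> d = {d_B}", len(members), votes_for_d_B))
    # order classes, here simply by size (the ordering is user-defined in general)
    return sorted(rules, key=lambda r: -r[1])

objs = [{"a1": 1, "a2": 0}, {"a1": 1, "a2": 0}, {"a1": 0, "a2": 1}]
print(rough_data_model(objs, ["yes", "no", "no"], ["a1", "a2"]))
```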
3 Experimental Results

3.1 Rule Filtering
Two preliminary experiments using the rule filtering schemes presented herein (i.e. genetic filtering and quality-based filtering) have been carried out. In the first experiment, henceforth called "the acute appendicitis experiment", we used a data set describing 257 patients with suspected acute appendicitis collected at Innherred Hospital in Norway [6]. This data set has previously been mined in [4] using methods from rough set theory. In the second experiment, henceforth called "the Cleveland experiment", we used the Cleveland heart disease database, available from the UCI repository [14], consisting of 303 patients with suspected coronary artery disease. 6 of the objects had missing values for one or more attributes and were removed. Both datasets were split into equally sized learning, hold-out and testing sets three times. After preprocessing, rule induction (using dynamic reducts [2]) was performed using the learning sets. Rule filtering was done using the hold-out sets, and the performance was assessed and compared to the unfiltered sets using the testing sets. The two data sets were used to illustrate two slightly different rule filtering applications. In the acute appendicitis experiment, the goal was to find descriptive models. These were subjectively defined as models with no more than 20 rules. In the Cleveland experiment, the goal was to filter down the rule sets as much as possible without any constraint on the maximal size of a rule set. The execution of the experiments differed in the selected parameters (e.g. the weight for the fitness function in the genetic algorithm), and in the procedure used to select a particular filtered model from several alternatives. The estimated performance of the pruned models is shown in Table 2.

Table 2. Performance of filtered versus unfiltered rule sets on previously unseen data. Only the best quality functions are shown. All values are averages over three different splits of the data set. p-values outside the Hanley/McNeil lookup table are specified as n/a.

                        Acute Appendicitis                 Cleveland
  Method              Size   AUC (SE)         p-value   Size      AUC (SE)         p-value
  Unfiltered          447    0.9043 (0.0363)  -         6949.67   0.9035 (0.0374)  -
  Genetic             6.67   0.8930 (0.3318)  0.6209    34        0.8776 (0.0377)  0.4475
  Michalski (π = 0)   9.67   0.8890 (0.0392)  0.6195    65        0.8920 (0.0353)  n/a
  Pearson             8.33   0.8661 (0.0426)  0.3253    92.33     0.8846 (0.0366)  n/a
  J                   7      0.8734 (0.0411)  0.4813    119.33    0.8753 (0.0380)  n/a
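A minimal sketch of the quality-based filtering loop used in these experiments: rules are ordered by a chosen quality function and the models consisting of the k individually best rules are scored on a hold-out set. Here evaluate_on_holdout stands in for whatever hold-out performance estimate (AUC in the paper) is available; the function and its names are illustrative, not the authors' code.

```python
def filter_by_quality(rules, quality, evaluate_on_holdout, sizes):
    """Rank rules by a quality function and score the top-k prefixes.

    rules               : list of rule objects
    quality             : function rule -> float (e.g. Michalski with pi = 0)
    evaluate_on_holdout : function list-of-rules -> performance (e.g. AUC)
    sizes               : candidate numbers of rules to keep
    """
    ordered = sorted(rules, key=quality, reverse=True)
    results = []
    for k in sizes:
        model = ordered[:k]
        results.append((k, evaluate_on_holdout(model)))
    # the user then picks a model according to problem-specific criteria
    return results
```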
3.2 Rough Modelling
In order to investigate the performance of the rough modelling approach within the Rosetta toolkit, several data sets were analyzed using rough modelling, as well as traditional rough set methods for comparison. All results shown below were computed using a genetic algorithm for rough model search, as initial experiments showed that this algorithm on average used only 20-30% of the time used by an exhaustive search and still consistently returned the same models. Rough models were generated from the Pima Indian diabetes data set from the UCI repository, and the acute appendicitis data set described earlier. In order to investigate the effect of excluding the smallest equivalence classes from the rough data models, the size threshold for inclusion into the rough model was varied and a rough model generated from two thirds of the objects in the data set. The remaining objects were used to test the performance of the resulting model, using ROC analysis. The results for both data sets are shown in Table 3, and the numbers are labeled A_O and P_O, for the appendicitis and diabetes data, respectively. In addition to a decision function d̂_B which sets the decision value for each object to the dominating decision value for the equivalence class of that object, a decision function which copies the original decision value for each object was implemented. This breaks with the principle put forth by Kowalczyk that each equivalence class should have a single decision value associated with it, and thereby produces a model which may contain indeterministic rules. Table 3 also shows the results using this decision function, labeled A_D and P_D for the two data sets used. The splits were unaltered to facilitate comparison of the AUC values. The AUC values were compared using the Hanley-McNeil test. This was only done for the appendicitis data, as no known benchmark existed for the diabetes data. The numerical results indicate that rough data models perform somewhat worse than the best known RS model, but the difference is not significant if the correct threshold for class size is selected. This means that one may accept a slight drop in performance, but the drop is often insignificant. No good measure of descriptiveness exists, but while the RS models for the appendicitis data contained between 850 and 900 rules, the rough data models contained 15 rules or fewer if the smallest classes (less than 5 objects) were filtered out from the model. On all other data sets examined, the size of the models found by a rough model search was comparable to the results reported for the appendicitis set (between 5 and 20 rules, if a small cut-off on class size is used). The performance of rough Holte models was only briefly investigated on the appendicitis data. The complete set of univariate rules for the data set, denoted 1R, was used as a benchmark, and a 3-fold split-validation was carried out, with different lower limits on the classes to be included in the model. The results are labeled A_H in Table 3. No significant differences were found between the different rough Holte models and the 1R rule set, or between the different rough Holte models. This means that it is possible to mimic the performance of the 1R
rule set (which in turn has been found to be comparable to the performance of the best reduct-based models) using only a handful of attributes.

Table 3. Results from analysis on the acute appendicitis data and diabetes data, using a variety of rough models.

                               Model AUC (SE)
  Size limit   A_D            A_O            A_H            P_D            P_O
  RS           0.907 (0.031)  0.907 (0.031)  0.907 (0.031)  -              -
  1R           0.908 (0.032)  0.908 (0.032)  0.908 (0.032)  -              -
  0            0.783 (0.049)  0.794 (0.048)  0.925 (0.028)  0.692 (0.036)  0.748 (0.034)
  5            0.835 (0.043)  0.897 (0.033)  0.894 (0.034)  0.694 (0.036)  0.730 (0.034)
  10           0.816 (0.045)  0.857 (0.039)  0.884 (0.035)  0.732 (0.035)  0.763 (0.033)
  15           0.771 (0.050)  0.806 (0.047)  0.903 (0.032)  0.715 (0.035)  0.767 (0.033)
4 Discussion

4.1 Rule Filtering
The results from both experiments are rather encouraging. In the acute appendicitis experiment, dramatically smaller rule sets without significantly poorer (in fact, sometimes insignificantly better) performance were found. In the Cleveland experiment, however, the selected models for one of the splits (split 3) generally had significantly poorer performance compared to the unfiltered models. The performance of the quality functions for rule filtering purposes was diverse. It seems that the Michalski formula with π = 0 (see Table 1) can be recommended. Since this corresponds to filtering according to coverage only (similar to [16]), this is a rather surprising result. In addition to the Michalski formula, the Pearson χ² statistic and the J-measure seem to perform well. The results back up [3] in that the theoretically based quality formulae generally do not perform better than the empirically based ones. The genetic algorithm performs slightly (insignificantly) better than the quality formulae in the case with relatively small rule sets, but slightly poorer in the case with comparatively large sets. For practical purposes, rule filtering with the genetic algorithm is much more time consuming than quality-based filtering.

4.2 Rough Modelling
As with a posteriori pruning, the results obtained so far using rough modelling are not a solid enough foundation on which to make strong claims about the performance of rough models versus models mined using traditional rough set
methods. There is some indication that the performance of rough models falls slightly short of the performance of larger rough set induced models, but the evidence is not conclusive. However, there is strong evidence supporting the conjecture that rough models of comparable performance to traditional RS models are far more descriptive. No universally agreed upon measure of descriptiveness exists, but a decrease in size from thousands of rules to between five and twenty, few attributes used, as well as the option of determinism, add up to a very large improvement in descriptive capability. The results of the experiments performed support the observations made by Holte [8] that very simple rules often perform just as well as more complex rules. For the appendicitis data, it may seem that the predictive capabilities of univariate rules match and possibly exceed those of rough data models as well as the best known rough set-based models, which both contain rules with several attributes in the antecedent. However, univariate rules are less interesting from a knowledge discovery standpoint, as they only represent simple correlations between individual attribute values and a decision value. For a related discussion, see [17] in this volume.
5 A Comparison and Future Research
The results on rule filtering are rather encouraging. It is possible to obtain small models that preserve predictive quality. While the computational costs may be high, the method guarantees that no significant combination of features will be excluded from the original set. Applications of rule filtering vary widely in their goals, and it may be impossible to construct an automatic method serving all needs. The simplicity of rough model generation means that rough modelling is applicable in particular to large databases; it is also well suited as an initial approach to mining data. By searching for various forms of rough models, models which are of a high descriptive and predictive quality may be generated quickly. The insight gained from inspecting the rough models may then be used to process the data before using reduct-based model inducers to mine the best possible predictive model, if this is of interest. The experiments presented here are still preliminary and not sufficient to draw general and definite conclusions regarding the applicability of the proposed methods. A further investigation should include a wider range of data sets, induction algorithms, and, in the case of genetic algorithms, various parameter settings. Also, more sophisticated experiment designs, such as cross-validation, should be considered. We also plan to investigate the relationship of our approaches to the so-called templates currently being developed at the Logic Section of Warsaw University [15]. Not surprisingly, we can conclude that a knowledge discoverer will be best served if both approaches are present in his/her arsenal. The choice of tool will depend on the task at hand. We hope that this work will help him/her decide which method to choose.
References

1. Thomas Ågotnes. Filtering large propositional rule sets while retaining classifier performance. Master's thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, 1999.
2. Jan G. Bazan, Andrzej Skowron, and Piotr Synak. Dynamic reducts as a tool for extracting laws from decision tables. In Proc. International Symposium on Methodologies for Intelligent Systems, number 869 in Lecture Notes in Artificial Intelligence, pages 346-355. Springer-Verlag, 1994.
3. I. Bruha. Quality of decision rules: Definitions and classification schemes for multiple rules. In G. Nakhaeizadeh and C. C. Taylor, editors, Machine Learning and Statistics, The Interface, chapter 5. John Wiley and Sons, Inc., 1997.
4. U. Carlin, J. Komorowski, and A. Øhrn. Rough set analysis of medical datasets in a case of patients with suspected acute appendicitis. In Proc. ECAI'98 Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP'98), pages 18-28, 1998.
5. S. Greco, B. Matarazzo, and R. Slowinski. New developments in the rough set approach to multi-attribute decision analysis. Bulletin of International Rough Set Society, 2(2/3):57-87, 1998.
6. S. Hallan, A. Asberg, and T.-H. Edna. Estimating the probability of acute appendicitis using clinical criteria of a structured record sheet: The physician against the computer. European Journal of Surgery, 163(6):427-432, 1997.
7. James A. Hanley and Barbara J. McNeil. A method for comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148:839-843, September 1983.
8. R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91, 1993.
9. R. Kohavi and B. Frasca. Useful feature subsets and rough set reducts. In T. Y. Lin and A. M. Wildberger, editors, 3rd International Workshop on Rough Sets and Soft Computing (RSSC '94), San Jose, USA, 1994.
10. J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron. Rough sets: A tutorial. In S. K. Pal and A. Skowron, editors, Rough Fuzzy Hybridization - A New Trend in Decision-Making, pages 3-98. Springer, 1999.
11. Jan Komorowski, Aleksander Øhrn, and Andrzej Skowron. ROSETTA and other software systems for rough sets. In Willi Klösgen and Jan Zytkow, editors, Handbook of Data Mining and Knowledge Discovery. Oxford University Press, 2000.
12. W. Kowalczyk. Rough data modelling: a new technique for analyzing data. In Rough Sets and Knowledge Discovery 1: Methodology and Applications [21], chapter 20, pages 400-421. Physica-Verlag, 1998.
13. Terje Løken. Rough modeling: Extracting compact models from large databases. Master's thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, 1999.
14. P. M. Murphy and D. W. Aha. UCI Repository of Machine Learning Databases. Machine-readable collection, Dept. of Information and Computer Science, University of California, Irvine, 1995. [Available by anonymous ftp from ics.uci.edu in directory pub/machine-learning-databases].
15. Hoa S. Nguyen. Data regularity analysis and applications in data mining. PhD thesis, Warsaw University, 1999.
16. A. Øhrn, L. Ohno-Machado, and T. Rowland. Building manageable rough set classifiers. In Proc. AMIA Annual Fall Symposium, pages 543-547, Orlando, FL, USA, 1998.
17. Aleksander Øhrn and Jan Komorowski. Diagnosing acute appendicitis with very simple classification rules. In Jan Rauch and Jan Zytkow, editors, Proceedings of the Third European Symposium on Principles and Practice of Knowledge Discovery in Databases (PKDD'99), Prague, Czech Republic, September 1999. Springer-Verlag.
18. Aleksander Øhrn, Jan Komorowski, Andrzej Skowron, and Piotr Synak. The design and implementation of a knowledge discovery toolkit based on rough sets: The ROSETTA system. In Lech Polkowski and Andrzej Skowron, editors, Rough Sets in Knowledge Discovery 1: Methodology and Applications, number 18 in Studies in Fuzziness and Soft Computing, chapter 19, pages 376-399. Physica-Verlag, Heidelberg, Germany, 1998.
19. Z. Pawlak. Rough sets. International Journal of Information and Computer Science, 11(5):341-356, 1982.
20. Z. Piasta and A. Lenarcik. Rule induction with probabilistic rough classifiers. Technical report, Warsaw University of Technology, 1996. ICS Research Report 24/96.
21. L. Polkowski and A. Skowron, editors. Rough Sets in Knowledge Discovery 1: Methodology and Applications, volume 1 of Studies in Fuzziness and Soft Computing. Physica-Verlag, 1998.
22. A. Skowron, L. Polkowski, and J. Komorowski. Learning tolerance relations by boolean descriptors, automatic feature extraction from data tables. In S. Tsumoto, S. Kobayashi, T. Yokomori, H. Tanaka, and A. Nakamura, editors, Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, RSFD'96, Tokyo, Japan, pages 11-17, 1996.
23. A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems. In R. Slowinski, editor, Intelligent Decision Support Systems - Handbook of Applications and Advances in Rough Set Theory, pages 331-362. Kluwer Academic Publishers, 1991.
24. R. Slowinski and D. Vanderpooten. Similarity relation as a basis for rough approximations. In P. P. Wang, editor, Advances in Machine Intelligence & Soft Computing, volume 4, pages 17-33. Duke University Press, 1997.
25. J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240:1285-1293, 1988.
26. Staal Vinterbo. Finding minimal cost hitting sets: A genetic approach. Technical report, Department of Computer and Information Science, Norwegian University of Science and Technology, 1999.
Optimizing Disjunctive Association Rules

Dmitry Zelenko
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
[email protected]
Abstract. We analyze several problems of optimizing disjunctive association rules. The problems have important applications in data mining, allowing users to focus on interesting rules extracted from databases. We consider association rules of the form ∧_{j=1}^{n} (A_{i_j} = v_j) → C_2, where {A_{i_1}, A_{i_2}, ..., A_{i_n}} is a subset of the categorical attributes of the underlying relation R, and C_2 is any fixed condition defined over the attributes of the relation R. An instantiation of the rule binds the variables v_j to values from the corresponding attribute domains. We study several problems in which we seek a collection of instantiations of a given rule that satisfy certain optimality constraints. Each of the problems can be re-interpreted as looking for one optimized disjunctive association rule. We exhibit efficient algorithms for solving the optimized support and optimized confidence problems, the weighted support/confidence problem, and the shortest rule problem. We discuss the time and space complexity of the designed algorithms and show how they can be improved by allowing for approximate solutions.
1 Introduction
Over the last decade, computer technology coupled with improved data acquisition techniques has resulted in a significant increase in the rate of growth of large databases. The bulk of the data accumulated in such databases encompasses every sphere of human life. Having eliminated paper-based technology, modern businesses are now recognizing the necessity of removing the burden of analyzing the data from their employees. The amount of data accrued within any major company obviously precludes its manual treatment. Thus, a new generation of data processing techniques is called for. The techniques have to assist users with analysis of the data and extraction of useful knowledge. These techniques and tools are the subject of the field of knowledge discovery in databases (KDD) or data mining [3]. Knowledge extracted from databases can be represented in many possible ways. A representation that has received much attention within the community of KDD researchers is that of association rules [1,11,13]. An association rule X, in its most general form, is an implication C_1 → C_2, where C_1, C_2 are conditions defined over the attributes of an underlying database
relation R. For example, for a relation R(Student, CourseType, Grade), we can define the association rule

  X: (CourseType = elective) → (Grade = A)
The fraction of the relation tuples satisfying C_1 is termed the support of the rule (and denoted sup(X)). The ratio of the number of tuples satisfying C_1 ∧ C_2 to the number of tuples satisfying C_1 is termed the confidence of the rule (and denoted conf(X)). For the above example, the support of the rule is the percentage of elective course tuples, and the confidence is the fraction of the elective tuples with the grade A. Association rules have been introduced as an attempt to capture correlations and regularities in the underlying database relation, with the hope that the regularities prove useful for future decision making. Data mining algorithms aim at extracting such useful association rules. Extracting association rules usually means finding all rules satisfying pre-specified lower bounds on their confidence and support. For example, we can ask a data mining system to produce all rules for the above relation R with support greater than 20% and confidence greater than 80%. If the rule X satisfies the support and confidence restrictions, then it should be output by the system. However, if we do not place additional constraints on the conditions C_1 and C_2 of an association rule, the number of feasible rules may be very large. Moreover, most of the rules satisfying the conditions may be useless. Therefore, we have to introduce a representational bias to restrict the form of a rule. The most popular bias in the literature requires that the condition C_1 be a conjunction of binary-valued attributes and C_2 be a single binary-valued attribute [1,11]. Recently, Rastogi and Shim [12] introduced a new kind of association rule of the form

  ∧_{j=1}^{n} (A_{i_j} = v_j) → C_2,    (1)
where {A_{i_1}, A_{i_2}, ..., A_{i_n}} is a subset of the relation attributes, v_j is a variable to be bound to a value d_j from the domain of A_{i_j}, and the consequent C_2 of a rule is any fixed condition defined over the attributes of the relation R. For this kind of rule, extracting association rules means finding a tuple of d_j's so that the rule instantiated with the d_j's satisfies the given minimum support and confidence conditions. A tuple of d_j's is called an instantiation of the rule. For example, consider the uninstantiated version of the above rule X:
  (CourseType = v_1) → (Grade = A)
The goal is to find a value d_1 (e.g., elective) for the variable v_1 so that the rule, instantiated with this value, will satisfy the minimum support and confidence conditions. We can generalize the setting by allowing more than one instantiation for an association rule. More precisely, we seek a collection I of instantiations
(I = {I_1, ..., I_k}) so that the cumulative support and confidence¹ of the association rule are above the given minimum values. We can view the collection of instantiations as a disjunctive association rule, because the rule antecedent can be written as a disjunction of conjunctions, where each conjunction corresponds to a different instantiation of (1). In our example, we are looking for several course types that cover a sufficient number of tuples in the relation R as well as have the grade A in a sufficient proportion of the covered tuples. We study three optimization problems introduced in [12], and another natural optimization problem, the shortest rule problem. Note that each of the problems can be recast as seeking one disjunctive association rule.

1. The optimized support problem: given a relation R and a rule (1), find a set I of instantiations maximizing the cumulative support of the rule with its confidence greater than the given minimum confidence.
2. The optimized confidence problem: given a relation R and a rule (1), find a set I of instantiations maximizing the cumulative confidence of the rule with its support greater than the given minimum support.
3. The weighted support/confidence problem: given a relation R, a rule (1), and two positive integers u, v, find a set I of instantiations maximizing u·sup + v·conf, where sup and conf are the cumulative support and confidence of the instantiated rule (1), respectively.
4. The shortest rule problem: given a relation R and a rule (1), find a minimum set I of instantiations, so that the cumulative support and confidence of the rule are greater than the given minimum support and confidence, respectively.

¹ "Cumulative" means that we count tuples that satisfy at least one instantiation of the rule. We formally define the cumulative support and confidence in Section 4.

2 Related Work

Association rules were first studied in [1]. The paper led to a plethora of research aimed at enumerating conjunctive association rules defined over boolean attributes (e.g., [11,14,8,13,6]). Association rule enumeration is linked to finding minimal hypergraph transversals [7], learning monotone DNF from membership queries [7], and minimizing the number of failing subqueries [5]. The related computational hardness results are presented in [6,?]. The problem of finding optimized association rules was introduced in [4]. The problem was defined for numeric attributes, and allowed at most two uninstantiated attributes to be present in the rule antecedent. Moreover, the algorithms presented in the paper produced a single instantiation of the rule; therefore, the rules were essentially conjunctive. Rastogi and Shim [12] generalized the setting by allowing more than two attributes and multiple instantiations, thereby introducing the notion of disjunctive association rules (they did not use the terminology, though). They posed three of the problems that we study in this paper, and proved that the optimized support and optimized confidence problems are
NP-hard. However, the proof implicitly relied on an unnatural complexity parameter for both of the problems. Namely, if M is the number of records in the relation R, then the problems are NP-hard if log M (instead of M) is used as a complexity parameter for the problems. We take M to be the natural complexity parameter for all of the problems (that is, we allow scanning all of the relation), and design efficient algorithms for all of the above problems. Our algorithms run in time linear in n, in contrast to the algorithms in [12], which are exponential in n. However, [12] also considers rules with numeric attributes, and exhibits algorithms for finding optimized rules in that setting, while we deal only with categorical attributes.
3 The Summary of the Results

Let M be the number of records in the relation R, and N be the number of attributes. Also, given a rule (1) and a relation R, let m be the number of distinct tuples (d_{i_1}, d_{i_2}, ..., d_{i_n}) ∈ π_{A_{i_1}, A_{i_2}, ..., A_{i_n}}(R). Then we have the following results:

1. The optimized support problem can be solved in time O(M(N + m(log M + size(θ)))) and space O(M(log M + size(θ))), where size(θ) is the number of bits used to encode the given minimum confidence value θ.
2. There is a fully polynomial time approximation scheme for the optimized confidence problem with time complexity O(M(N + m log(1/ε)(log M + log(1/ε)))), where ε is the absolute error. The approximation scheme can be converted into an exact algorithm that takes O(M(N + m log² M)) time and O(M log M) space.
3. The weighted support/confidence problem can be solved in time O(M(N + m log M)) and space O(M log M).
4. The shortest rule problem can be solved in time O(M(N + M(p + q)m log m)) and space O(Mm log m), where p/q = θ is the given minimum confidence value.
5. By allowing approximate solutions, we can reduce both the time and space complexity of all of the algorithms.

In the following sections, Z denotes the set of integers, Z+ stands for the set of nonnegative integers, and Q denotes the set of rational numbers. Also, for a finite set S, |S| is the number of elements in S.
4 Preliminaries

In this section we introduce the notation and definitions needed in order to reformulate all of the stated problems in terms of mathematical programming. The reformulation highlights the relationship between them and classical problems of combinatorial optimization. Let R be a relation over the attributes A_1, A_2, ..., A_N with M tuples. Each of the attributes is categorical, that is, there is no partial order imposed on the domain dom(A_i) of an attribute A_i, for each i = 1, 2, ..., N. Let A be a subset of {A_1, A_2, ..., A_N}, A = {A_{i_1}, A_{i_2}, ..., A_{i_n}}.
Definition 1. An (uninstantiated) association rule X(v_1, v_2, ..., v_n) over A is an implication

  ∧_{j=1}^{n} (A_{i_j} = v_j) → C_2,
where the v_j are variables, and C_2(A_1, A_2, ..., A_N) is any fixed formula defined over A_1, A_2, ..., A_N.
We also assume that it takes linear (in N) time to evaluate C_2 for a tuple of R. Let X be an uninstantiated association rule over A. Then we define the set of all instantiations Inst of X as the projection π_A(R) of R on A. Note that |Inst| ≤ |R| = M. We instantiate an association rule X by picking an instantiation from Inst and binding the variables of X to the elements of the instantiation. More precisely, take I = (d_1, d_2, ..., d_n) ∈ Inst. Then, X(I) = X(d_1, d_2, ..., d_n) is the association rule X instantiated with I. Let Inst = {I_1, I_2, ..., I_m}. Denote ∧_{j=1}^{n} (A_{i_j} = v_j) by U(A_1, A_2, ..., A_N, v_1, v_2, ..., v_n). For a fixed instantiation I_i ∈ Inst let

  s_i = |{t ∈ R : U(t, I_i)}|   and   c_i = |{t ∈ R : U(t, I_i) ∧ C_2(t)}|

Intuitively, s_i is the count of the tuples contributing to the support of X instantiated with I_i (note that Σ_{i=1}^{m} s_i = M), while c_i is the count of tuples contributing to the confidence of X instantiated with I_i. Note that we can compute all c_i's and s_i's in one pass over the relation R, which takes O(NM) time. This is the preprocessing time for all our algorithms. Since all attributes are categorical, a tuple can contribute to the counts of only one instantiation for a fixed association rule. Therefore, it is possible to express the cumulative support and confidence for a set of instantiations using the quantities s_i and c_i. For a set I ⊆ Inst of instantiations, the cumulative support of I is defined as:

  sup(I) = (Σ_{I_i ∈ I} s_i) / M

The cumulative confidence of I is defined similarly:

  conf(I) = (Σ_{I_i ∈ I} c_i) / (Σ_{I_i ∈ I} s_i)

Finally, we introduce a set of 0-1 valued variables x_i, i = 1, 2, ..., m, and equate each subset I of Inst with a boolean assignment to the x_i's. That is, x_i = 1 if I_i ∈ I, and x_i = 0 otherwise. Now all four introduced problems can be reformulated in terms of 0-1 mathematical programming. In the following sections we state each of the problems as a 0-1 programming problem, and then give efficient algorithms for solving the problems.
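As claimed above, the quantities s_i and c_i, and from them the cumulative support and confidence of any set of instantiations, can be computed in a single pass over R. The sketch below does this for a toy relation stored as a list of dicts, with the consequent C_2 given as a Python predicate; all names are illustrative.

```python
from collections import defaultdict

def instantiation_counts(R, antecedent_attrs, C2):
    """One pass over R: for every instantiation I of the antecedent attributes,
    count s_I = #{t : t matches I} and c_I = #{t : t matches I and C2(t)}."""
    s = defaultdict(int)
    c = defaultdict(int)
    for t in R:
        I = tuple(t[a] for a in antecedent_attrs)
        s[I] += 1
        if C2(t):
            c[I] += 1
    return s, c

def cumulative(s, c, chosen):
    """Cumulative support and confidence of a set of instantiations."""
    M = sum(s.values())
    S = sum(s[I] for I in chosen)
    C = sum(c[I] for I in chosen)
    return S / M, (C / S if S else 0.0)

R = [{"CourseType": "elective", "Grade": "A"},
     {"CourseType": "elective", "Grade": "B"},
     {"CourseType": "core", "Grade": "A"}]
s, c = instantiation_counts(R, ["CourseType"], lambda t: t["Grade"] == "A")
print(cumulative(s, c, [("elective",)]))   # support 2/3, confidence 1/2
```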
5 Optimizing Support

The optimized support problem for an association rule X, a relation R, and a minimum confidence value θ can be stated as follows:

  Σ_{i=1}^{m} s_i x_i → max
  (Σ_{i=1}^{m} c_i x_i) / (Σ_{i=1}^{m} s_i x_i) ≥ θ    (2)
  s_i ≥ c_i ≥ 0; c_i, s_i ∈ Z; θ ∈ Q ∩ [0, 1]; x_i ∈ {0, 1}; i = 1, 2, ..., m

Let the minimum confidence θ be represented as a rational number p/q, where p, q ∈ Z+. Then, (2) can be rewritten as

  Σ_{i=1}^{m} (p s_i - q c_i) x_i ≤ 0
After denoting a_i = p s_i - q c_i, i = 1, 2, ..., m, the optimized support problem becomes:

  Σ_{i=1}^{m} s_i x_i → max
  Σ_{i=1}^{m} a_i x_i ≤ 0
  s_i ≥ 0; a_i, s_i ∈ Z; x_i ∈ {0, 1}; i = 1, 2, ..., m

The above problem is a variant of knapsack, a classical optimization problem that can be solved in fewer than mM steps (M = Σ_{i=1}^{m} s_i) by dynamic programming (see, for example, [2]). Usually the complexity of each step for knapsack is treated as constant. In our case, however, the complexity depends on the value of the minimum confidence parameter θ for the problem. Since the time complexity of one step is roughly equal to the time of writing down the value of Σ_{i=1}^{m} a_i x_i, we can bound it by bounding the number of bits used to encode Σ_{i=1}^{m} a_i x_i for any (x_1, x_2, ..., x_m) ∈ {0, 1}^m:

  log(|Σ_{i=1}^{m} a_i x_i|) ≤ log(Σ_{i=1}^{m} |p s_i - q c_i|) ≤ log(p + q) + log(Σ_{i=1}^{m} s_i) ≤ size(θ) + log M,

where size(θ) = log p + log q is the number of bits used to encode θ. Thus, the classical dynamic programming algorithm solves the optimized support problem in O(mM(log M + size(θ))). Adding the preprocessing time of O(NM) gives O(M(N + m(log M + size(θ)))).
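The reduction above can be realized with a knapsack-style dynamic program. The sketch below indexes states by the achieved total support and keeps, for each support value, the smallest attainable Σ a_i x_i; this is one straightforward way to implement the O(mM)-step idea, not necessarily the paper's exact formulation, and the function name and example counts are ours.

```python
import math

def optimized_support(s, c, p, q):
    """Maximize cumulative support subject to cumulative confidence >= p/q.

    s, c : per-instantiation counts with s[i] >= c[i] >= 0.
    Returns (maximal total support, 0/1 selection vector).
    """
    m, M = len(s), sum(s)
    a = [p * s[i] - q * c[i] for i in range(m)]      # feasibility: sum a_i x_i <= 0
    INF = math.inf
    # dp[i][B] = minimal sum of a_j x_j over x_1..x_i with sum s_j x_j = B
    dp = [[INF] * (M + 1) for _ in range(m + 1)]
    dp[0][0] = 0
    for i in range(1, m + 1):
        for B in range(M + 1):
            dp[i][B] = dp[i - 1][B]                              # x_i = 0
            if B >= s[i - 1] and dp[i - 1][B - s[i - 1]] + a[i - 1] < dp[i][B]:
                dp[i][B] = dp[i - 1][B - s[i - 1]] + a[i - 1]    # x_i = 1
    B_opt = max(B for B in range(M + 1) if dp[m][B] <= 0)
    # backtrack one optimal selection
    x, B = [0] * m, B_opt
    for i in range(m, 0, -1):
        if dp[i][B] != dp[i - 1][B]:                 # item i must have been taken
            x[i - 1], B = 1, B - s[i - 1]
    return B_opt, x

# three instantiations with (s_i, c_i) = (3,3), (2,0), (5,2); confidence >= 1/2
print(optimized_support([3, 2, 5], [3, 0, 2], p=1, q=2))
```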
6 Optimizing Confidence

The optimized confidence problem for an association rule X, a relation R, and a minimum support σ = B/M, B ∈ Z+, can be stated as follows:

  (Σ_{i=1}^{m} c_i x_i) / (Σ_{i=1}^{m} s_i x_i) → max    (3)
  Σ_{i=1}^{m} s_i x_i ≥ B
  s_i ≥ c_i ≥ 0; c_i, s_i ∈ Z; B ∈ Z+; x_i ∈ {0, 1}; i = 1, 2, ..., m

Consider the decision problem corresponding to the optimized confidence problem:

  Σ_{i=1}^{m} s_i x_i ≥ B,   (Σ_{i=1}^{m} c_i x_i) / (Σ_{i=1}^{m} s_i x_i) ≥ θ
  s_i ≥ c_i ≥ 0; c_i, s_i ∈ Z; B ∈ Z+; θ ∈ Q ∩ [0, 1]; x_i ∈ {0, 1}; i = 1, 2, ..., m

We can treat the problem as a decision version of the optimized support problem and solve it by dynamic programming, as described in the previous section. We consider this procedure as an oracle DecConf(θ) returning ∅ if the decision problem does not have a solution, and returning a solution x otherwise. Then, given the bound ε on the absolute error of the required solution, we start with an initial interval [0, 1]. The interval is repeatedly halved in a way that guarantees that the optimal confidence value lies within the halved interval. When the interval becomes smaller than ε, we can be sure that any feasible solution lying within the interval has an absolute error less than ε. Thus, DecConf(θ) returns an approximate solution x to the optimized confidence problem with absolute error less than ε. The number of iterations in the algorithm is log(1/ε). The maximum size of θ is also O(log(1/ε)). Therefore, the running time of the algorithm is O(mM log(1/ε)(log M + log(1/ε))). Adding the preprocessing time of O(NM) gives O(M(N + m log(1/ε)(log M + log(1/ε)))). Since the running time of the algorithm is polynomial in all complexity parameters (including ε), we have designed a fully polynomial approximation scheme [9] for the optimized confidence problem. Observe that we can convert the approximation scheme into an exact algorithm for the optimized confidence problem by choosing the absolute error ε to be δ/(Σ_{i=1}^{m} s_i)², where δ < 1. Then, log(1/ε) ≤ 2 log M - log δ, and the running time of the algorithm is O(M(N + m log² M)).
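The halving scheme can be written directly on top of a decision oracle. The sketch below assumes a function dec_conf(theta, B) that returns a feasible selection with support at least B and confidence at least theta, or None, in the spirit of the DecConf oracle above; the oracle itself can be obtained from the optimized-support dynamic program, and all names here are illustrative.

```python
def optimized_confidence(dec_conf, B, eps):
    """Bisection for the optimized confidence problem.

    dec_conf : oracle(theta, B) -> feasible 0/1 selection, or None
    B        : minimum support count
    eps      : absolute error tolerated on the confidence
    Returns a selection whose confidence is within eps of the optimum,
    assuming a feasible solution exists at confidence threshold 0.
    """
    lo, hi = 0.0, 1.0
    best = dec_conf(lo, B)          # feasible at the trivial threshold
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        sol = dec_conf(mid, B)
        if sol is not None:         # optimum lies in [mid, hi]
            best, lo = sol, mid
        else:                       # optimum lies in [lo, mid]
            hi = mid
    return best
```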
7 Optimizing Weighted Support/Confidence

The weighted support/confidence problem for an association rule X, a relation R, and weights u, v ∈ Z+, can be stated as follows:

  u Σ_{i=1}^{m} s_i x_i + v (Σ_{i=1}^{m} c_i x_i) / (Σ_{i=1}^{m} s_i x_i) → max    (4)
  s_i ≥ c_i ≥ 0; c_i, s_i, u, v ∈ Z+; x_i ∈ {0, 1}; i = 1, 2, ..., m

We can solve the problem by considering the following set of optimization problems:

  Σ_{i=1}^{m} s_i x_i = B
  u B + (v/B) Σ_{i=1}^{m} c_i x_i → max
  s_i ≥ c_i ≥ 0; c_i, s_i, u, v ∈ Z+; x_i ∈ {0, 1}; i = 1, 2, ..., m

Each of the problems corresponds to a different value of B, where B ∈ Z+, 0 < B ≤ M. In order to produce an optimal solution to the weighted support/confidence problem, we take the best solution from the M solutions to the above M problems. Each of the problems is again a restricted variant of knapsack, and it can be solved by dynamic programming in time O(mB log M). Therefore, the naive algorithm that solves all M problems separately takes O(mM² log M) time. We can reduce the running time by a factor of M by solving all of the B problems at the same time. More precisely, let W(k, B) be the maximum value of Σ_{i=1}^{m} c_i x_i produced using the variables x_1, x_2, ..., x_k (i.e., x_{k+1} = ... = x_m = 0), such that Σ_{i=1}^{k} s_i x_i = B. If for all (x_1, ..., x_k) ∈ {0, 1}^k, Σ_{i=1}^{k} s_i x_i ≠ B, then we set W(k, B) = -∞. Now we can specify the recurrence relation for W(k, B) as follows:

  W(k, 0) = 0,  0 ≤ k ≤ m
  W(0, B) = -∞,  0 < B ≤ M
  W(k, B) = max(W(k - 1, B - s_k) + c_k, W(k - 1, B)),  1 ≤ k ≤ m, 0 < B ≤ M

We can fill the table for W(k, B) in O(mM) steps, where each step takes at most O(log M) time (since |Σ_{i=1}^{m} c_i x_i| ≤ M). An optimal solution to the weighted support/confidence problem corresponds to the maximum value among u B + (v/B) W(m, B), 0 < B ≤ M. We can find this value in M steps and then find the solution in fewer than m steps by using the above recurrence relation and going backwards from the cell of the table corresponding to the maximum value. Hence, the total running time of the algorithm is O(mM log M). Adding the preprocessing time of O(NM) gives O(M(N + m log M)).
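A direct transcription of the W(k, B) recurrence as a sketch: the weights are called u and v, matching the reconstruction of objective (4), and W is kept as a full table (one could keep only two rows, as the discussion section notes). Names and the toy example are ours.

```python
import math

def weighted_support_confidence(s, c, u, v):
    """Maximize u * sum(s_i x_i) + v * (sum(c_i x_i) / sum(s_i x_i)), x_i in {0,1}.

    W[k][B] is the maximal sum of c_i x_i over x_1..x_k with sum s_i x_i = B,
    or -inf if no such selection exists.
    """
    m, M = len(s), sum(s)
    NEG = -math.inf
    W = [[NEG] * (M + 1) for _ in range(m + 1)]
    for k in range(m + 1):
        W[k][0] = 0
    for k in range(1, m + 1):
        for B in range(1, M + 1):
            W[k][B] = W[k - 1][B]
            if B >= s[k - 1] and W[k - 1][B - s[k - 1]] != NEG:
                W[k][B] = max(W[k][B], W[k - 1][B - s[k - 1]] + c[k - 1])
    # best objective over all attainable support levels B > 0
    best_B, best_val = None, NEG
    for B in range(1, M + 1):
        if W[m][B] != NEG:
            val = u * B + (v / B) * W[m][B]
            if val > best_val:
                best_B, best_val = B, val
    return best_B, best_val

print(weighted_support_confidence([3, 2, 5], [3, 0, 2], u=1, v=10))
```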
8 Optimizing the Rule Length

The shortest rule problem for an association rule X, a relation R, a minimum support σ = B/M, B ∈ Z+, and a minimum confidence θ = p/q, p, q ∈ Z+, can be stated as follows:

  Σ_{i=1}^{m} x_i → min
  (Σ_{i=1}^{m} c_i x_i) / (Σ_{i=1}^{m} s_i x_i) ≥ p/q,   Σ_{i=1}^{m} s_i x_i ≥ B
  s_i ≥ c_i ≥ 0; c_i, s_i ∈ Z; B ∈ Z+; x_i ∈ {0, 1}; i = 1, 2, ..., m

which is equivalent to

  Σ_{i=1}^{m} x_i → min
  Σ_{i=1}^{m} a_i x_i ≥ 0,   Σ_{i=1}^{m} s_i x_i ≥ B    (5)
  s_i ≥ 0; a_i, s_i ∈ Z; B ∈ Z+; x_i ∈ {0, 1}; i = 1, 2, ..., m

where a_i = q c_i - p s_i, i = 1, 2, ..., m. Recall that |Σ_{i=1}^{m} a_i x_i| ≤ (p + q)M. We again use the ubiquitous dynamic programming to solve (5). Denote C = (p + q)M. Let L(k, D, A) be the minimum number of nonzero variables in (x_1, ..., x_k) ∈ {0, 1}^k such that Σ_{i=1}^{k} s_i x_i ≥ D and Σ_{i=1}^{k} a_i x_i ≥ A, where 0 ≤ k ≤ m, 0 ≤ D ≤ B and |A| ≤ C. Again, we set L(k, D, A) = ∞ if for all (x_1, ..., x_k) ∈ {0, 1}^k either Σ_{i=1}^{k} s_i x_i ≥ D or Σ_{i=1}^{k} a_i x_i ≥ A is not satisfied. Then,

  L(k, 0, A) = 0,  0 ≤ k ≤ m, -C ≤ A ≤ 0
  L(k, 0, A) = ∞,  0 < A ≤ C
  L(0, D, A) = ∞,  0 < D ≤ B, |A| ≤ C
  L(k, D, A) = min(L(k - 1, D - s_k, A - a_k) + 1, L(k - 1, D, A)),  1 ≤ k ≤ m, |A| ≤ C

We fill the 3-dimensional table for L in O(m(p + q)M²) steps, where each step takes O(log m) time. Hence, the time complexity of filling in the table is O(m log m (p + q)M²). Then, L(m, B, 0) is the minimum length of a rule satisfying the support and confidence constraints. We can find the actual optimal solution (rule) in O(m) steps by backtracking from L(m, B, 0). The total time complexity of the shortest rule problem (including the preprocessing time) is O(M(N + M(p + q)m log m)).
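The same computation can be phrased as a forward dynamic program over reachable states, which is often easier to sketch than the backward L(k, D, A) table; the state keeps the support reached (clipped at B, since surplus support never hurts) and the exact value of Σ a_i x_i. This is only an illustration of the recurrence under those assumptions, not the paper's exact indexing, and it returns the optimal length only.

```python
def shortest_rule_length(s, c, B, p, q):
    """Minimum number of instantiations whose cumulative support is >= B and
    cumulative confidence is >= p/q, or None if no such selection exists.

    Forward dynamic program over states (min(B, sum s_i x_i), sum a_i x_i),
    where a_i = q*c_i - p*s_i and feasibility requires sum a_i x_i >= 0.
    """
    m = len(s)
    a = [q * c[i] - p * s[i] for i in range(m)]
    best = {(0, 0): 0}                         # state -> fewest instantiations used
    for i in range(m):
        # extend every selection over items < i with item i (snapshot avoids reuse)
        for (D, A), cnt in list(best.items()):
            key = (min(B, D + s[i]), A + a[i])
            if cnt + 1 < best.get(key, float("inf")):
                best[key] = cnt + 1
    lengths = [cnt for (D, A), cnt in best.items() if D >= B and A >= 0]
    return min(lengths) if lengths else None

# smallest disjunction reaching support 4 with confidence >= 1/2
print(shortest_rule_length([3, 2, 5], [3, 0, 2], B=4, p=1, q=2))
```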
9 Discussion
We used dynamic programming to solve all of the above problems. Since a dynamic programming algorithm only needs to know the values in the previous row of the table while filling in the current row, the memory complexity of the algorithm is the size of one table row. For the first three problems, the size is O(M log M) (M cells and O(log M) bits for a cell value). For the shortest rule problem, the size of the row is O(mM log m) (mM cells and O(log m) bits for a cell value). We can reduce the time and space complexity of the above algorithms by allowing for suboptimal solutions. There are a number of standard approximation schemes for knapsack [10], and any of them can be used to find approximate solutions for the above problems. One of the most common methods is to scale down each of the objective function coefficients by a factor k. This will lead to both
the running time and the space taken by the algorithms being reduced by a factor of k. The produced solution, however, will be suboptimal, with an absolute error of less than kL, where L is the length (number of instantiations) of the optimal rule.
Conclusions and Further Research
We analyzed several problems for optimizing association rules. The problems have important application in data mining, allowing users to focus at interesting rules extracted from databases. We exhibited ecient algorithms for solving all of the problems. For further research, we plan to apply the algorithms to nding association rules in real world databases. We also intend to study and try to develop ecient algorithms for optimizing other variants of association rules. References 1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, 22(2), 1993. 2. G. B. Danzig. Linear Programming and Extensions. Princeton University Press, 1963. 3. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Anvances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, Menlo Park, California, 1996. 4. T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. In PODS, 1996. 5. P. Godfrey. Minimization in cooperative response to failing database queries. Technical Report CS-TR-3348, University of Maryland, College Park, 1994. 6. D. Gunopoulos, H. Mannila, and S. Saluja. Discovering all most speci c sentences by randomized algorithms. In ICDT, 1997. 7. D. Gunopulos, H. Mannila, R. Khardon, and H. Toivonen. Data mining, hypergraph transversals, and machine learning (extended abstract). In PODS, 1997. 8. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In VLDB, 1995. 9. D. S. Hochbaum, editor. Approximation Algorithms for NP-hard problems. PWS Publishing Company, 1997. 10. E. Lawler. Fast approximation algorithms for knapsack problems. Mathematics of Operations Research, 4, 1979. 11. H. Mannila, H. Toivonen, and A. I. Verkamo. Ecient algorithms for discovering association rules. In KDD, 1994. 12. R. Rastogi and K. Shim. Mining optimized association rules for categorical and numeric attributes. In ICDE, 1998. 13. R.Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ICMD, 1996. 14. A. Savasere, E. Omiecinski, and S. Navathe. An ecient algorithm for mining association rules in large databases. In VLDB, 1995.
Contribution of Boosting in Wrapper Models

Marc Sebban, Richard Nock
TRIVIA, West Indies and Guiana University, Campus de Fouillole, 95159 Pointe-à-Pitre (France)
{msebban,rnock}@univ-ag.fr
Abstract. We describe a new way to deal with feature selection when boosting is used to assess the relevancy of feature subsets. In the context of wrapper models, the accuracy is here replaced as a performance function by a particular exponential criterion, usually optimized in boosting algorithms. A first experimental study brings to the fore the relevance of our approach. However, this new "boosted" strategy requires the construction of many learners at each step, leading to high computational costs. In a second part, we therefore focus on how to speed up boosting convergence to reduce this complexity. We propose a new update of the instance distribution, which is the core of a boosting algorithm. We exploit these results to implement a new forward selection algorithm which converges much faster using overbiased distributions over learning instances. Speed-up is achieved by reducing the number of weak hypotheses when many identical observations are shared by different classes. A second experimental study on the UCI repository shows significant speed improvements with our new update, without altering the feature subset selection.
1 Introduction
While increasing the number of descriptors for a machine learning domain would not intuitively make it harder for "perfect" learners, machine learning algorithms are quite sensitive to the addition of irrelevant features. Actually, the presence of attributes not directly necessary for prediction can have serious consequences for the performance of classifiers. That is why feature selection has become a central problem in machine learning. This trend will certainly continue because of the huge quantities of data (not always relevant) collected thanks to new acquisition technologies (the World Wide Web for instance). In addition, the selection of a good feature subset may not only improve the performance of the deduced model, but may also allow building simpler classifiers that are easier to understand. To achieve feature selection, we generally use one of the two following approaches, respectively called filter and wrapper [2, 6, 7]. Filter models use a preprocessing step, before the induction process, to select relevant features. The parameter to optimize is often a statistical criterion or an information measure: interclass distance, probabilistic distance [11], entropy
[8, 13], etc. The main argument for these methods is that they try to estimate feature relevance regardless of the classifier, as an intrinsic property of the represented concept. The second main trend uses wrapper models. These methods assess alternative feature subsets using a given induction algorithm; the criterion to optimize is often the accuracy. In spite of high computational costs, wrapper models have the advantage of providing better accuracy estimates (by holdout, cross-validation or bootstrap) than a statistical criterion or an information measure as used in the filter approach. In this paper, we propose to challenge the criterion to optimize in wrapper models, replacing the accuracy by Schapire and Singer's criterion [12], which had not previously been tested in the field of feature selection. This approach is motivated by recent theoretical results on the general performance of the algorithm AdaBoost [12]. Boosting consists in training and combining the output of T various base learners. Each of them can return various formulas (decision trees, rules, k-Nearest-Neighbors (kNN), etc.). We show through experimental results that the optimization of this criterion in a forward feature selection algorithm, called FS2Boost, allows selecting more relevant features and achieves higher classification performance, compared to the classical accuracy criterion. Despite its interesting properties, using boosting in a feature selection algorithm (notably in a wrapper model) requires coping with high computational costs. Actually, wrapper models are already known to have high complexity. The worst case of a forward selection algorithm requires O(p²) estimates, each of them requiring O(|LS|²) comparisons (using for instance a kNN classifier), where LS is the learning sample and p is the number of features. The use of the boosting procedure increases this complexity, requiring T steps at each stage. However, arguing for the use of Boosting, Quinlan [10] points out that even if Boosting is costly, the additional complexity factor (T) is known in advance and can be controlled. Moreover, it can be useful to choose a fast classifier (such as the kNN) to further decrease this complexity. Nevertheless, this parameter T in FS2Boost deserves investigation. In the second part of this paper, we focus on how to speed up boosting convergence in FS2Boost, to reduce this complexity. We propose a particular update of the instance distribution during boosting. It consists in balancing the distribution not only toward hard examples (as in the original AdaBoost algorithm [12]), but also toward examples for which the conditional class distribution is highly in favor of some class against the others. The main point of this new update, which is particularly suited to feature selection, is that as the description becomes poorer (e.g. by removing features), many examples of different classes may match the same description. Ultimately, descriptions with evenly balanced examples among classes are somewhat useless and can be "forgotten" by the learning algorithm. Applying this new principle, we can avoid building many base learners, while selecting almost the same feature subsets. We propose an improved extension of our first algorithm, called iFS2Boost, and we compare the performances of the two presented algorithms on several benchmarks of the
UCI repository1 . Our experimental results highlight significant speed-up factors during the selection, without altering the selected feature subsets.
2 A Wrapper Model using Boosting
Wrapper models evaluate alternative feature subsets using a performance criterion which is usually the accuracy over the learning sample; the goal is to find which feature subset allows increasing the prediction accuracy. The accuracy is estimated using an induction algorithm as a core procedure, which builds formulae such as decision trees, induction graphs, neural networks, kNN, etc. Once this core procedure is chosen, there remains to choose a search heuristic among all the possible subspaces. Here we consider a forward selection algorithm, which often allows reducing computational costs, avoiding calculations in high dimensional spaces. It is an a priori choice, but selecting a backward instead of a forward algorithm would not challenge the framework of our approach. Its principle can be summarized as follows: at each step, add to the current feature set (initialized to ∅) the feature which most increases the accuracy of a formula built using the core algorithm. If no addition of a new feature increases the accuracy, then stop and return the current feature subset. While the wrapper approach is accurate when the core algorithm is the same as the subsequent induction algorithm, it may suffer from the drawback that the core step introduces a representational bias. Indeed, not only do we measure the potential of improvement a feature represents, but also the bias according to which the feature could improve the accuracy of a formula built from the concept class of the core algorithm. Such a problem appears because functional dependencies of various natures exist between features, themselves understandable by means of representational biases [2]. For these reasons, recent works have chosen to investigate the properties of a novel kind of algorithm: boosting [12]. Boosting as presented in AdaBoost [12] is related to the stepwise construction of a linear separator in a high dimensional space, using a base learner to provide each functional dimension. Decision tree learning algorithms are well suited for such a base-learner task, but other kinds of algorithms can be chosen. The main idea of boosting is to repetitively query the base learner on a learning sample biased to increase the weights of the misclassified examples; by this means, each new hypothesis is built on a learning sample which was hard to classify for its predecessor. Figure 1 presents the AdaBoost learning algorithm [12] in the two-class case. When there are k > 2 classes, k binary classifiers are built, each of them used for the discrimination of one class against all others. The classifier returning
1 http://www.ics.uci.edu/~mlearn/MLRepository.html
AdaBoost(LS = {(x_i, y(x_i))}_{i=1}^{|LS|})
  Initialize distribution D_1(x_i) = 1/|LS|;
  For t = 1, 2, ..., T
    Build weak hypothesis h_t using D_t;
    Compute the confidence α_t:
      α_t = (1/2) log((1 + r_t) / (1 - r_t))    (1)
      r_t = Σ_{i=1}^{m} D_t(x_i) y(x_i) h_t(x_i)    (2)
    Update: D_{t+1}(x_i) = D_t(x_i) exp(-α_t y(x_i) h_t(x_i)) / Z_t
      /* Z_t is a normalization coefficient */
  endFor
  Return the classifier H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

Fig. 1. Pseudocode for AdaBoost.
the greatest value gives the class of the observation. Boosting has been shown, theoretically and empirically, to satisfy particularly interesting properties. Among them, it was remarked [5] that boosting is sometimes immune to overfitting, a classical problem in machine learning. Moreover, it allows a considerable reduction of the representational bias in relevance estimation that we pointed out before. Define the function F(x) = Σ_{t=1}^{T} α_t h_t(x) to avoid problems with the "sign" expression in H(x). [12] have proven that using AdaBoost is equivalent to optimizing a criterion which is not the accuracy, but precisely the normalization factor Z_t as presented in Figure 1. Using a more synthetic notation, [5] have proven that AdaBoost repetitively optimizes the following criterion:

  Z = E_{(x, y(x))}(e^{-y(x) F(x)})

In a first step, we then decided to use this criterion in a forward selection algorithm that we called FS2Boost (Figure 2). We show in the next section the interest of this new optimized criterion thanks to experimental results.
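A compact Python rendering of Figure 1 above, assuming labels in {-1, +1} and a one-feature threshold stump as the weak learner; the stump is only one possible base learner, not necessarily the one used by the authors, and the clipping of r_t is a numerical guard added for this sketch.

```python
import math
import numpy as np

def stump(X, y, D):
    """Weak learner: best threshold stump h(x) = sign(s * (x[:, j] - thr)) under weights D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (+1, -1):
                h = np.where(sgn * (X[:, j] - thr) >= 0, 1, -1)
                r = float(np.sum(D * y * h))        # edge of this hypothesis
                if best is None or r > best[0]:
                    best = (r, j, thr, sgn)
    r, j, thr, sgn = best
    return (lambda Z, j=j, thr=thr, sgn=sgn:
            np.where(sgn * (Z[:, j] - thr) >= 0, 1, -1)), r

def adaboost(X, y, T=10):
    """AdaBoost as in Fig. 1: y in {-1, +1}; returns the list of (alpha_t, h_t)."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    model = []
    for _ in range(T):
        h, r = stump(X, y, D)
        r = min(max(r, -0.999999), 0.999999)        # numerical guard
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        D = D * np.exp(-alpha * y * h(X))
        D = D / D.sum()                             # Z_t normalization
        model.append((alpha, h))
    return model

def predict(model, X):
    F = sum(alpha * h(X) for alpha, h in model)
    return np.where(F >= 0, 1, -1)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y, T=5), X))
```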
3 Experimental Results: Z versus Accuracy
In this section, the goal is to test the effect of the criterion optimized in the wrapper model. We propose to compare the selected feature relevance using either Z or the accuracy, on synthetic or natural databases. Nineteen problems
FS2Boost(LS = {(x_i, y(x_i))}_{i=1}^{|LS|})
  1  Z_0 ← +∞; E ← ∅; S ← {s_1, s_2, ..., s_p};
  2  ForEach s_i ∈ S
       H ← AdaBoost(LS, E ∪ s_i);
       Z_i ← Z_{E ∪ s_i}(H);
     endFor
     select s_min for which Z_min = min_i Z_i;
  3  If Z_min < Z_0 then S ← S \ {s_min}; E ← E ∪ {s_min}; Z_0 ← Z_min; Goto step 2;
     Else return E;

Fig. 2. Pseudocode for FS2Boost. S is the set of features.
were chosen; the majority of them were taken from the UCI repository. A database was generated synthetically with some irrelevant features (called Artificial). Hard is a hard problem consisting of two classes and 10 features per instance. There are five irrelevant features. The class is given by the XOR of the five relevant features. Finally, each feature has 10% noise. The Xd6 problem was previously used by [3]: it is composed of 10 attributes, one of which is irrelevant. The target concept is a disjunctive normal form over the nine other attributes. There is also classification noise. Since we know the relevance degree of each feature for the artificial problems, we can easily evaluate the effectiveness of our selection method. The problem is more difficult for natural domains. An adequate solution consists in running an induction algorithm (kNN in our study) on each feature subset, and comparing the "qualities" of the feature subsets with respect to the a posteriori accuracies. Accuracies are estimated by a leave-one-out cross-validation. On each dataset, we used the following experimental set-up:

1. The Simple Forward Selection (SFS) algorithm is applied, optimizing the accuracy during the selection. We then compute the accuracy by cross-validation in the selected subspace.
2. FS2Boost is run (T = 50). We also compute the a posteriori accuracy.
3. We compute the accuracy in the original space with all the attributes.

Results are presented in Table 1. First, FS2Boost works well on datasets for which we knew the nature of the features: relevant attributes are almost always selected, even if irrelevant attributes are sometimes also selected. On these problems, the expected effects of FS2Boost are thus confirmed. Second, FS2Boost almost always obtains a better accuracy rate on the selected subset than on the subset chosen by the simple forward selection algorithm. Third, in the majority of cases, accuracy estimates on feature subsets after FS2Boost are better than on the whole set of attributes. Despite these interesting results, FS2Boost has a shortcoming: its computational cost. In the next section, after some definitions, we will show that instead
Database        SFS    FS2Boost  All Attributes
Monks 1         97.9   97.9      81
Monks 2         67.2   67.2      68.3
Monks 3         94.4   99.0      99.0
Artificial      84.7   86.4      84
LED             81.4   90.2      90.2
LED24           81.4   87.2      77.9
Credit          86.1   87.1      76.8
EchoCardio      73     74.8      66.9
Glass2          62.5   73.2      72.0
Heart           82.2   81.7      82.8
Hepatitis       78.7   81.9      82.4
Horse           77.6   86.3      72.2
Breast Cancer   96.4   96.4      96.5
Xd6             79.9   79.9      78.1
Australian      83.8   81.6      78.7
White House     91.5   95.7      95.7
Pima            73.2   73.3      73.0
Hard            58.7   58.7      59.0
Vehicle         72.9   73.7      71.6
Table 1. Accuracy comparisons between three feature sets: (i) the subset obtained by optimizing the accuracy, (ii) the subset deduced by FS2Boost, and (iii) the whole set of features. Best results are underlined.
In the next section, after some definitions, we will show that instead of minimizing Z, we can speed up the convergence of boosting by optimizing another criterion, Z'.
4  Speeding-up Boosting Convergence
Let S = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_m, y(x_m))} be a sequence of training examples, where each observation belongs to X and each label y(x_i) belongs to a finite label space Y. In order to handle observations which can belong to different classes, for any description x_p over X, define |x_p^+| (resp. |x_p^-|) to be the number of positive (resp. negative) examples having the description x_p; note that |x_p| = |x_p^-| + |x_p^+|. We make extensive use of three quantities: |x_p^max| = max(|x_p^+|, |x_p^-|), |x_p^min| = min(|x_p^+|, |x_p^-|) and ∆(x_p) = |x_p^max| − |x_p^min|. The optimal prediction for some description x_p is the class hidden in the "max" of |x_p^max|, which we write y(x_p) for short. Finally, for some predicate P, define [[P]] to be 1 if P holds, and 0 otherwise; define π(x, x') to be the predicate "x' and x share identical descriptions", for arbitrary descriptions x and x'. We give here indications on speeding up boosting convergence in the two-class setting. In the multiclass case, the strategy remains the same. The idea is to replace Schapire-Singer's Z criterion by another one, which integrates the notion
of similar descriptions belonging to different classes. This kind of situation often appears in feature selection, notably at the beginning of the SFS algorithm, or when the number of features is small relative to the high cardinality of the learning set. More precisely, we use
$$E_{x' \sim D'_t}\left[e^{-y(x')\,\alpha_t h_t(x')}\right]$$
with
$$D'_t(x') = \frac{\sum_{x_p} D_t(x')\,[[\pi(x_p, x')]]\,\frac{\Delta(x_p)}{|x_p|}}{\sum_{x''} \sum_{x_p} D_t(x'')\,[[\pi(x_p, x'')]]\,\frac{\Delta(x_p)}{|x_p|}}$$
In other words, we minimize a weighted expectation whose distribution favors the examples whose description has a conditional class distribution strongly in favor of one class against the others. Note that when each possible observation belongs to one class (i.e., no information is lost among the examples), the expectation is exactly Schapire-Singer's Z. As [12] suggest, for the sake of simplicity, we can temporarily fold α_t into h_t so that the weak learner scales its votes over ℝ. Removing the t subscript, we obtain the following criterion, which the weak learner should strive to optimize:
$$Z' = E_{x' \sim D'}\left[e^{-y(x')h(x')}\right]$$
Optimizing Z' instead of Z at each round of boosting is equivalent to either (i) keeping AdaBoost's algorithm strictly unchanged while optimizing Z', or (ii) modifying AdaBoost's initial distribution, or its update rule. With the new Z' criterion, our extended algorithm iFS2Boost has to choose
$$\alpha'_t = \frac{1}{2}\log\frac{1 + r'_t}{1 - r'_t}$$
where
$$r'_t = \sum_{x'} D'_t(x')\,y(x')\,h_t(x') = E_{x' \sim D'_t}\left[y(x')h_t(x')\right]$$
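To make the reweighting concrete, here is a small, hypothetical Python sketch of the quantities defined above for the two-class case: examples whose shared description x_p carries mixed labels are down-weighted by ∆(x_p)/|x_p|, and Z' and α'_t are then computed on the resulting distribution. The array layout and function names are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of the reweighting behind iFS2Boost (two-class case).
import numpy as np
from collections import defaultdict

def reweight(D, X_desc, y):
    """D: current AdaBoost weights; X_desc: hashable descriptions; y in {-1,+1}."""
    pos, neg = defaultdict(int), defaultdict(int)
    for d, label in zip(X_desc, y):
        (pos if label == 1 else neg)[d] += 1
    factor = np.array([
        abs(pos[d] - neg[d]) / (pos[d] + neg[d])      # Delta(x_p) / |x_p|
        for d in X_desc
    ])
    D_prime = D * factor
    return D_prime / D_prime.sum()                    # renormalise to a distribution

def z_prime_and_alpha(D_prime, y, h):
    """h: weak hypothesis outputs in [-1,+1]; returns (Z', alpha'_t)."""
    r = np.sum(D_prime * y * h)                       # r'_t = E_{D'}[y h]
    alpha = 0.5 * np.log((1 + r) / (1 - r))
    z = np.sum(D_prime * np.exp(-y * alpha * h))
    return z, alpha
```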
5  Experimental Results: Z' versus Z
We tested 12 datasets here with the following experimental set-up:
1. The FS2Boost algorithm is run with T base learners. We test the algorithm with different values of T (T = 1, ..., 100), and we search for the minimal number T_Z which provides a stabilized feature subset FS_stabilized, i.e., for which the selected feature subset is the same for T = T_Z, ..., 100.
Fig. 3. Relative Gain Grel of weak learners. The dotted line presents the average gain
2. iFS2Boost is also run with different values of T, and we search for the number T_{Z'} which provides a FS'_stabilized feature subset.
For ten datasets, we note that the use of Z' in iFS2Boost allows us to save some weak hypotheses without modifying the selected features (i.e., FS_stabilized = FS'_stabilized). On average, our new algorithm requires 3.5 fewer learners than FS2Boost. These results confirm the faster convergence of iFS2Boost, without alteration of the selected subspace. What is more surprising is that for two databases (Glass2 and LED24), iFS2Boost needs more learners than FS2Boost and we obtain FS_stabilized ≠ FS'_stabilized. We could intuitively think that, Z' converging faster than Z, we should not meet such a situation. In fact, we can explain this phenomenon by analyzing the speed-up factor of iFS2Boost. The number |x_p| of instances sharing the same description while belonging to different classes varies from one feature subset to another, and the gain G = T_Z − T_{Z'} depends directly on |x_p|. Thus, at a given step of the selection, iFS2Boost can exceptionally select a weakly relevant feature for which the speed-up factor is higher than for a strongly relevant one. In that case, iFS2Boost will require additional weak hypotheses to correctly update the instance distribution. Nevertheless, this phenomenon seems to be quite marginal. The improvements of iFS2Boost can be presented more strikingly by computing the relative gain of weak learners $G_{rel} = \frac{T_Z - T_{Z'}}{T_{Z'}}$. Results are presented in
figure 3. In that case, we notice that iFS2Boost requires on average 22.5% fewer learners than FS2Boost, which confirms the positive effect of our new approach without calling into question the subset selected by FS2Boost.
6  Conclusion
In this article, we linked two central problems in machine learning and data mining: feature selection and boosting. Even if these two fields have the common
aim of deducing powerful classifiers from feature sets, few works, to our knowledge, have tried to combine their interesting properties. Replacing the accuracy by another performance criterion Z, optimized by a boosting algorithm, we obtained better results for feature selection, despite high computational costs. To reduce this complexity we improved the proposed FS2Boost algorithm by introducing a speed-up factor into the selection. In the majority of cases, the improvements are significant, allowing us to save some weak learners. The experimental gain represents on average more than 20% of the running time. Following a remark of [10] on boosting, improvements of this magnitude without degradation of the solution would be well worth the choice of iFS2Boost, particularly on large domains where feature selection becomes essential. We still think, however, that further time improvements are possible, though possibly with slight modifications of the solutions. In particular, computationally efficient estimators of the boosting coefficients are sought. This will be the subject of future work in the framework of feature selection.
References
1. D. Aha and R. Bankert. A comparative evaluation of sequential feature selection algorithms. In Fisher and Lenz (Eds.), Artificial Intelligence and Statistics, 1996.
2. A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 1997.
3. W. Buntine and T. Niblett. A further comparison of splitting rules for decision tree induction. Machine Learning, pages 75–85, 1992.
4. Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pages 119–139, 1997.
5. J. Friedman, T. Hastie, and R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Draft, July 1998.
6. G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Eleventh ICML Conference, pages 121–129, 1994.
7. R. Kohavi. Feature subset selection as search with probabilistic estimates. AAAI Fall Symposium on Relevance, 1994.
8. D. Koller and R. Sahami. Toward optimal feature selection. In Thirteenth International Conference on Machine Learning (Bari, Italy), pages 284–292, 1996.
9. P. Langley and S. Sage. Oblivious decision trees and abstract cases. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 113–117, 1994.
10. J. Quinlan. Bagging, boosting and C4.5. In AAAI-96, pages 725–730, 1996.
11. C. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 1965.
12. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual ACM Conference on Computational Learning Theory, pages 80–91, 1998.
13. M. Sebban. On feature selection: a new filter model. In Twelfth International Florida AI Research Society Conference, pages 230–234, 1999.
14. D. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In 11th International Conference on Machine Learning, pages 293–301, 1994.
Experiments on a Representation-Independent “Top-Down and Prune” Induction Scheme Richard Nock1 , Marc Sebban1 , and Pascal Jappy2
1 Univ. Antilles-Guyane, Dept of Maths and CS, Campus de Fouillole, 97159 Pointe-à-Pitre, France
{rnock,msebban}@univ-ag.fr
2 Leonard's Logic, 20 rue Thérèse, 75001 Paris, France
[email protected]
Abstract. Recently, some methods for the induction of Decision Trees have received much theoretical attention. While some of these works focused on efficient top-down induction algorithms, others investigated the pruning of large trees to obtain small and accurate formulae. This paper discusses the practical possibility of combining and generalizing both approaches, to use them on various classes of concept representations, not strictly restricted to decision trees or formulae built from decision trees. The algorithm, Wirei, is able to produce decision trees, decision lists, simple rules, disjunctive normal form formulae, a variant of multilinear polynomials, and more. This shifting ability reduces the risk of deviating from valuable concepts during the induction. As an example, on a previously used simulated noisy dataset, the algorithm systematically managed to find the target concept itself when using an adequate concept representation. Further experiments on twenty-two readily available datasets show the ability of Wirei to build small and accurate concept representations, which lets the user choose the formalism that best suits his interpretation needs, in particular for mining purposes.
1  Introduction
Many of the classical problems in designing machine learning (ML) algorithms can be understood by means of accuracy, time/space complexity, size and intelligibility issues. Generally, satisfying most of them is essentially a matter of compromises. In such cases, the problem is to transform the dataset, rapidly enough, into a useful compact representation that, while capturing most of the generalizable knowledge of the original data, stays sufficiently small to be intelligible and interpretable. While the rapid increase in computer performance has somewhat de-emphasized the time requirements to obtain the algorithm's outputs [Qui96], the other requirements cannot be easily solved. As an example, it was recently observed that the end user of ML algorithms is likely to prefer some output types over others; that is, no single concept representation class fits all users, and the ability to shift the output type in practice is of great importance. This is also important from a theoretical viewpoint.
Some problems admit a small coding in some concept representation, but lead to overly large representations in other classes. There has recently been much work to establish sound theoretical bases for the induction of decision trees, to explain and improve the behavior of algorithms such as C4.5 [KM96, KM98, SS98]. Such algorithms proceed by a "top-down and prune" scheme: a large formula is induced, which is pruned in a later step to obtain a small and accurate final output. While [KM96, SS98] have focused on improving the top-down induction, [KM98] have established the theoretical bases of a new pruning scheme, with theoretically proven near-optimal behavior. These schemes, though initially focused on decision trees, have remarkably general properties, which can be applied outside the class of decision trees. A previous study [NJ98] shows that the class of decision lists, which shares close properties with decision trees, can benefit from principles closely related to this top-down induction. In this paper, we are concerned with the generalization of the whole "top-down and prune" scheme to a very large scope of concept representations. More precisely, we propose a general principle derived from the weak learning framework of [SS98] and the pruning framework of [KM98], which we refer to as Wirei (for Weak Induction REpresentation-independent). Wirei is able to induce, on any problem, formulae such as Decision Lists (DL), Decision Committees (DC, a variant of multilinear polynomials), Decision Trees (DT), Disjunctive Normal Form formulae (DNF), simple monomials, and more. Wirei is much different from approaches such as C4.5rules, which proposes to induce rules from DT. Indeed, in C4.5rules, a DT is always induced first, and in a subsequent step it is transformed into a set of rules. Wirei, on the other hand, directly builds formulae inside the chosen class. Experiments carried out on twenty-two publicly available domains reveal that, on each dataset, concept representations built from various classes can be quite different from each other while still being small and accurate. Wirei was also able to exhibit, on runs over noisy domains, the target formula itself, thus achieving an optimal compromise between accuracy and size. The time complexity of Wirei compares favorably to that of classical approaches such as C4.5. After a general presentation of Wirei and its applications to a large scope of concept representation classes, we relate experiments conducted using Wirei on twenty-two domains, almost all of which can be found on the UCI repository of machine learning databases [BKM98].
2  Wirei
Throughout the paper, the following notations are used: LS denotes the set of examples used for training, each of which is described with n attributes and belongs to one class among c. The following subsections present the basis of the growing and pruning algorithms. For the sake of clarity, an illustrative example (generally on DT) is provided for each, and the specific applications to other classes are presented in a subsequent dedicated part.
2.1  A General Top-down Induction Algorithm
The principle is to repeatedly optimize, in a top-down manner, a particular Z criterion over the partition induced by the current formula f on LS. This partition into subsets LS_1, LS_2, ..., LS_k satisfies the following two axioms:
1. ∀ 1 ≤ j ≤ k, any two examples of LS_j are classified exactly in the same fashion,
2. ∀ 1 ≤ i < j ≤ k, any two examples of respectively LS_i and LS_j are not classified in the same fashion.
It is important to note the term "fashion" instead of "class". Two examples classified in the same fashion follow exactly the same path in the formula, e.g., the same leaf in a DT. To each example is associated a weight, which mimics its appearance probability inside LS (if uniform, all weights equal 1/|LS|, where |.| is the cardinality function). We adopt the convention that examples are described using couples of the type (o, c_o), where o is an observation and c_o its corresponding class; its weight is written w((o, c_o)). Fix as [[π]] the function returning the truth value of a predicate π. Define, for any class 1 ≤ l ≤ c and any subset LS_j of the partition, the following quantities:
$$W_+^{j,l} = \sum_{(o,c_o)\in LS_j} w((o,c_o))\,[[c_o = l]] \qquad ; \qquad W_-^{j,l} = \sum_{(o,c_o)\in LS_j} w((o,c_o))\,[[c_o \neq l]]$$
In other words, $W_+^{j,l}$ represents the fraction of examples of class l present in subset LS_j, and $W_-^{j,l}$ represents the fraction of examples of classes ≠ l present in subset LS_j. The Z criterion of [SS98] is the following:
$$Z = 2 \sum_{j} \sum_{l} \sqrt{W_+^{j,l}\,W_-^{j,l}}$$
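As an illustration only, here is a small hypothetical Python helper computing this Z over a given partition of LS; the data layout (a list of subsets of (weight, class) pairs) is an assumption made for the sketch.

```python
# Hypothetical helper computing the [SS98]-style Z criterion used by TDbuild,
# given the partition of LS induced by the current formula.
import numpy as np

def z_criterion(partition, num_classes):
    """partition: list of subsets, each a list of (weight, class_label) pairs."""
    z = 0.0
    for subset in partition:
        for l in range(num_classes):
            w_plus = sum(w for w, c in subset if c == l)
            w_minus = sum(w for w, c in subset if c != l)
            z += np.sqrt(w_plus * w_minus)
    return 2.0 * z
```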
The core of the procedure TDbuild simply consists in repeatedly optimizing the decrease of the current Z, until either no decrease is possible or some upper bound I_max on the formula's size is reached. In order to keep the procedure fast, in any rule-based formula (e.g., DL, DNF), the search is focused on the rule currently being grown; a new rule is grown only when no addition to the current rule decreases Z.
2.2  A General Pruning Algorithm
The objective is to test exactly once the removal of each of the subparts of the formula f obtained from TDbuild. The test is bottom-up for all formulae based on literal or rule ordering, such as decision trees or decision lists. For other formulae without ordering, such as DNF, it "simulates" a bottom-up scanning of the formulae. The ordering of the former formulae implies that some parts Q of the formula may only be reached after having reached another part P. In the case of decision trees, P is an internal node, and all possible Q are internal
nodes belonging to the subtree rooted at P. The test evaluates the possible removal of all Q before testing P, and whenever P is removed, all dependent Q are also removed, leading to the entire pruning of the subtree rooted at P, itself replaced by the best leaf over the examples reaching P. Tests for other formulae will be detailed in their dedicated subsections. Algorithm 1 presents
Algorithm 1: BUprune(LS, f, δ, s_eq)
Input: sample LS, formula f, real 0 < δ < 1, integer s_eq
Output: a formula f
foreach P ∈ f scanned bottom-up do
    C := Ways(P); H := Sons(P);
    s_loc := |Reach(P, f, LS)| × s_eq / (|LS|(c − 1)²);
    α := sqrt((log C + log H + 2 log(s_eq/δ)) / s_loc);
    ε_P := lError(f, Reach(P, f, LS)); ε_∅ := lError(f \ P, Reach(P, f, LS));
    if ε_P + α ≥ ε_∅ then Remove(P, f);
return f
BUprune. We emphasize the fact that BUprune is an application of the theoretical results of [KM98]. The parameters used are the following ones. Ways(.) returns the number of distinct formulae which could replace in f the series of tests needed to reach P. In a decision tree, this represents the number of distinct monomials whose length equals the depth of P. Sons(.) returns the number of distinct subformulae in f that could be placed after P without changing the size of f. In a decision tree, this represents the number of distinct subtrees that can be rooted at P without changing the whole number of internal nodes of f. Reach(.,.,.) returns the subset of examples from LS reaching P in f. In a decision tree, this represents the subset of LS reaching the internal node P. lError(.,.) returns the local error over Reach(.,.,.), in the formula f (for ε_P), or in f from which P and all subformulae of P are removed (for ε_∅). In the case of a decision tree, the latter quantity corresponds to the local error of the best leaf rooted at P. The term "local error" is very important: in particular, the distribution used to calculate lError(.,.) is such that all examples from LS \ Reach(P, f, LS) have zero weight. s_eq is a correction factor, which is not in [KM98]. We now explain its use. The test to remove P is optimistic, in that we face the possibility of overpruning the formula, all the more so if LS is not sufficiently large. For example, consider the case c = 2, |LS| = 2000, |Reach(P, f, LS)| = 100, δ < .20 and s_eq = |LS|. Then we obtain α > .40, even when considering C = H = 1. Experimentally, this shortcoming may lead to an empty formula, by pruning all parts of the initial formula. In order to overcome this difficulty, we have chosen to "mimic" the re-sampling of LS into another set of size s_eq > |LS|, in which examples would have exactly the same distribution as in LS. In our experiments, the values of C and H, whose exact calculation is hard to do quickly, were approximated with upper bounds as large as possible, again in order not to face this possibility of overpruning. The bounds are not as tight as one could expect, yet they gave
experimentally good results. We now go on to detail the algorithms TDbuild and BUprune for various kinds of formalisms.
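As a quick, purely illustrative check of the pruning penalty α from Algorithm 1, the hypothetical snippet below reproduces the α > .40 observation quoted above (assuming natural logarithms and C = H = 1), which is what motivates the s_eq correction factor.

```python
# Numeric check of the pruning penalty alpha from Algorithm 1, using the worked
# case in the text (c = 2, |LS| = 2000, |Reach| = 100, delta = 0.20, s_eq = |LS|).
import math

def pruning_penalty(C, H, s_eq, delta, reach, ls_size, c):
    s_loc = reach * s_eq / (ls_size * (c - 1) ** 2)
    return math.sqrt((math.log(C) + math.log(H) + 2 * math.log(s_eq / delta)) / s_loc)

print(pruning_penalty(C=1, H=1, s_eq=2000, delta=0.20, reach=100, ls_size=2000, c=2))
# ~0.43, i.e. alpha > 0.40, hence the inflation of s_eq in practice.
```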
3  Applications of Wirei to Specific Classes
Fix u(k) = 2^k × n!/((n − k)!k!) (fast approximations of u(.) can be obtained by Stirling's formula). This represents the number of Boolean monomials of length k over n variables. The application of Wirei to DT mainly follows from our preceding comments and the results of [SS98, KM98, Qui96]. Due to the lack of space, we only detail results on other formalisms. The simplest is for monomials. When a single monomial f is needed, associated with a fixed class to which we refer as the positive class, only algorithm TDbuild is used. There are only two subsets LS_1 and LS_2 in the partition of LS, containing respectively the examples satisfying the monomial and those which do not satisfy the monomial. We additionally put the following constraint: each test added keeps the positive class as the majority class for the examples satisfying f. This gives the algorithm Wirei(Rule).
Decision Lists: Wirei(DL). TDbuild: for a DL with m monomials, the partition of LS contains m + 1 subsets. The first m subsets are those corresponding to a monomial, and the (m + 1)th corresponds to the default class. Optimize(.) proceeds as follows. Each possible test is added to the last rule of the decision list. When no further addition of a test decreases the Z value, a new rule, created in the last position, is investigated. BUprune: for a DL with m monomials, each P is a monomial, and the monomials are tested from the last monomial of the DL to the first one. Reach(.,.,.) returns the subset of examples reaching P. When pruning a monomial P, all monomials following P (that were not pruned) are removed with P. The best default class over the training sample replaces P. Fix as l the position of P inside the DL. We then choose C = u(l − 1). Fix as t the average number of literals of each monomial following P. Then, we fix H = (m − l)u(t).
DNF: Wirei(DNF) is used when c = 2. TDbuild: for a DNF with m monomials, the partition can contain up to min{|LS|, 2^m} subsets (this quantity is never greater than |LS|, which guarantees efficient processing time). Each subset contains the examples satisfying exactly the same subset of monomials. While there is no ordering on monomials, algorithm TDbuild is still bottom-up. Each test is added to a current monomial. When no further addition of a test into this monomial decreases the Z value, a new monomial is created, initialized to ∅, and treated as the current monomial. The same constraint as for monomials is used when minimizing Z: each test added to a monomial keeps the positive class as the majority class for all examples satisfying this monomial. BUprune: while there is no ordering on monomials, the bottom-up fashion is still preserved for the formula f. Each P represents a monomial of the DNF, and when removing P, no other monomial is removed. Reach(.,.,.) returns the
subset of examples satisfying P. Fix as l the total number of monomials ≠ P inside f that could be satisfied while satisfying P, and t their average length. In other words, each of these monomials must not have a test contradicting P. Fix as |P| the number of literals of P. We choose C = u(|P|) and H = l × u(t). In addition, s_loc is the number of examples satisfying P.
Decision Committees: Wirei(DC). We use DC with constrained vectors. Such a DC [NG95, NJ99] contains two parts:
– A set of unordered couples (or rules) {(m_i, v_i)} where each m_i is a monomial, and each v_i is a vector in {−1, 0, 1}^c (the values correspond to the natural interpretation "is in disfavor of", "is neutral w.r.t.", "is in favor of" one class).
– A Default Vector D in [0, 1]^c.
For any observation o we calculate V_o, the sum of all vectors whose monomials are satisfied by o. The index of the maximal component of V_o gives the class assigned to o. If it is not unique, we take, among the maximal components of V_o, the index of the corresponding maximal component of D. Algorithm BUprune has the same structure as for DNF. However, in order not to artificially increase the power of the vectors by multiplying the appearance of some monomials, we do not authorize the addition of multiple copies of a single monomial, a case which can only occur when the current Z is not decreased. TDbuild: it is the same as for DNF, except that we remove the constraint on choosing monomials discriminating the positive class. Before executing algorithm BUprune, we calculate the components of each v_i. To do so, we use the algorithm of [NJ99], which proceeds by minimizing Ranking Loss as defined by [SS98].
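The classification rule of a constrained-vector DC can be sketched as follows; this is a hypothetical Python illustration of the rule just described (sum the vectors of satisfied monomials, take the argmax, break ties with the default vector D), with an assumed data layout.

```python
# Hypothetical sketch of how a constrained-vector Decision Committee classifies
# an observation; monomials are modelled as boolean predicates over the observation.
import numpy as np

def dc_predict(observation, rules, D):
    """rules: list of (monomial, vector) pairs, where monomial(observation) -> bool
    and vector is an array in {-1, 0, +1}^c; D: default vector in [0, 1]^c."""
    D = np.asarray(D, dtype=float)
    V = np.zeros_like(D)
    for monomial, vector in rules:
        if monomial(observation):
            V += vector
    winners = np.flatnonzero(V == V.max())     # classes tied at the maximum of V_o
    if len(winners) == 1:
        return winners[0]
    return winners[np.argmax(D[winners])]      # tie-break with the default vector
```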
4  Experimental Results
Wirei was evaluated on a representative collection of twenty-two problems, most of which can be found on the UCI repository [BKM98]. The only exceptions were the "LEDeven" and "XD6" domains. "LEDeven" consists of the noisy ten-class problem "LED10" with the classes merged into odd and even classes. "XD6" consists of a two-class problem with ten description variables for each example. The target concept, from which all examples are uniformly sampled, is a DNF with three variables in each of its monomials, described over the first nine variables. The tenth variable is irrelevant in the strongest sense. A 10% classification noise is added, which also represents the Bayes optimum. References for all the datasets, omitted due to space constraints, can be found in [BN92, Hol93, Qui96], or on the UCI repository [BKM98]. All algorithms are run using δ = 15%, |s_eq| = 10000 and I_max = 40, in order to make clear comparisons. Ten complete 10-fold stratified cross-validations were carried out with each database. In a ten-fold cross-validation, the training instances are partitioned into 10 equal-sized subsets with similar class distributions. Each subset in turn is used for testing
while the remaining nine are used for training. Due to the lack of space, only experiments with Wirei(DC), Wirei(DNF) and Wirei(DL) are shown in table 1. For each algorithm, the first column shows error rates averaged over the 10-fold cross-validations, the second column the average number of monomials, and the third column the average total number of literals (if a literal appears k times, it is counted k times). Column "Others" reports various results among the best we know of, for which the experiments were carried out under a setting similar to ours. Over the 22 datasets, Wirei outperforms many of the traditional approaches. Comparing the errors already gives an advantage to Wirei (particularly Wirei(DL)), but the improvements become more apparent when the errors are compared in the light of the corresponding formulae's sizes. Size reductions, while still preserving in many cases a better error, can reach an order of magnitude of twenty or more. In particular, Wirei(DL) is a clear winner against CN2 when considering both errors and sizes. But there is more to say when comparing approaches head to head. Beyond a comparison between accuracies, we performed a comparison between the classifiers themselves on specific problems. On "XD6", we observed that the classifiers built by both Wirei(DNF) and Wirei(DL) are always exactly the target formula, beating classical DT induction approaches [BN92] in both accuracy and size. However, coding the target formula with a DT requires the creation of comparatively large trees, which is more risky when building formulae in a top-down fashion: the chances are indeed larger that the formula built deviates from the optimal one. This clearly accounts for the representation shift Wirei proposes. On "Vote0", we still obtained exactly the same classifiers for Wirei(DNF) and Wirei(DL), with one literal. This problem is known to have one attribute which makes a very reliable test [BN92], an attribute which is precisely the one always selected by Wirei(DNF) and Wirei(DL). In order to cope with this problem, [BN92] propose to remove this attribute, which gives the "Vote1" problem. While DT induction algorithms give much larger formulae, Wirei(DL) always manages to find a two-test rule which still gives very good results, and might contain useful information for data mining purposes. However, the problem seems indeed more difficult, since Wirei(DC) finds more complex formulae with an average error slightly below 10%, a rare result if we refer to the collection of reported studies in [Hol93], none of which break the 10% barrier. This stability property was also remarked on the "Horse-Co" problem, where both Wirei(DL) and Wirei(DNF) even surpassed DT approaches using a very simple concept. On the "LED10" domain, Wirei(DC) obtained on average a result a little above the 24% Bayes error rate, but Wirei(DL) performed very poorly (while DT give intermediate results). Interestingly, when transforming the problem into "LEDeven", Wirei(DL) achieved near-optimal prediction, with a completely stable classifier, but Wirei(DC)'s prediction degraded with respect to Bayes. A simple explanation for this behavior is that "LED10" is a problem which can be encoded very efficiently using simple linear frontiers around classes [NG95], and it was proven that linear separators, while being DC with one-rule monomials
(remark that Wirei(DC)'s monomials contain on average 1.43 literals), are very difficult to encode using simple DLs [NG95]. On the other hand, "LEDeven" can be related to much simpler concepts, which can be very efficiently coded using DL. We have remarked that the DLs obtained by Wirei(DL) were very accurate in that, among their two rules, the first contained a test which discriminates all but one (the "4") of the even digits against all odd digits, and the second, coupled with the default class, led to an efficient test to discriminate under noise the "4" digit against all odd digits. When comparing "Glass" and "Glass2", which is a modified version of "Glass" [CB91], the interest of the hypothesis concept shift between the two problems is clear, as Wirei(DC) performed well on "Glass2", while Wirei(DL) gave the best results on "Glass". Following all these observations, we can say that Wirei is experimental evidence of the power of simple induction schemes such as "top-down and prune", which have recently received much attention aimed at establishing sound theoretical foundations. Though many works were primarily based on decision trees [KM96, KM98], the theoretical results seem to scale in practice to various different classes of representation formalisms, three of which were explored in depth in our experiments. Additionally, the experimental results reveal that applications of the generic algorithm Wirei to specific classes can exhibit even better behavior than algorithms specifically dedicated to the same classes.
References
[BKM98] C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases. 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[BN92] W. Buntine and T. Niblett. A further comparison of splitting rules for Decision-Tree induction. Machine Learning, pages 75–85, 1992.
[CB91] P. Clark and R. Boswell. Rule induction with CN2: some recent improvements. In Proc. of the 6th European Working Session in Learning, pages 151–161, 1991.
[Dom98] P. Domingos. A Process-oriented Heuristic for Model selection. In Proc. of the 15th International Conference on Machine Learning, pages 127–135, 1998.
[FW98] E. Franck and I. Witten. Using a Permutation Test for Attribute selection in Decision Trees. In Proc. of the 15th International Conference on Machine Learning, pages 152–160, 1998.
[Hol93] R.C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, pages 63–91, 1993.
[KM96] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459–468, 1996.
[KM98] M.J. Kearns and Y. Mansour. A Fast, Bottom-up Decision Tree Pruning algorithm with Near-Optimal generalization. In Proc. of the 15th International Conference on Machine Learning, 1998.
[NG95] R. Nock and O. Gascuel. On learning decision committees. In Proc. of the 12th International Conference on Machine Learning, pages 413–420, 1995.
[NJ98] R. Nock and P. Jappy. On the power of decision lists. In Proc. of the 15th International Conference on Machine Learning, pages 413–420, 1998.
Table 1. Comparisons of various approaches of Wirei (whose least errors are underlined for each domain; a “/” mark for DNF denotes a domain with c > 2 classes). Wirei(DC) Wirei(DNF) Wirei(DL) Domain err (%) mDC lDC err (%) mDN F lDN F err (%) mDL lDL Balance 22.24 6.2 13.7 / / / 23.01 4.2 13.5 Breast-W 4.08 5.4 22.8 13.80 1.6 2.9 6.90 1.7 6.7 Echo 28.57 2.0 3.9 36.42 1.3 2.3 30.00 0.7 1.6 Glass 53.91 1.3 1.8 / / / 38.69 6.0 19.2 Glass2 21.10 6.6 18.2 23.52 5.7 15.4 22.35 11.2 27.1 Heart-St 22.96 3.9 11.7 34.44 2.0 6.6 23.53 9.4 22.6 Heart-C 24.87 4.4 14.3 23.87 2.0 5.7 21.93 2.8 7.7 Heart-H 20.67 5.2 13.8 24.00 1.3 5.1 23.00 1.2 4.5 Hepatitis 21.76 3.9 10.1 24.70 3.2 6.5 18.82 1.7 4.2 Horse-Co 22.31 9.5 27.0 4 13.68 4 1.0 4 2.0 4 13.68 4 1.0 4 2.0 Iris 5.33 1.9 4.6 / / / 2.67 2.0 4.8 Labor 15.00 4.1 9.5 43.33 1.2 1.9 16.67 3.1 5.6 Lung 42.50 1.3 3.8 / / / 47.50 2.0 5.9 † LED10 26.95 10.1 14.4 / / / 58.26 4.1 9.1 LEDeven 19.91 5.7 10.3 42.50 1.9 3.9 †,4 12.08 4 2.0 4 6.0 Monk1 15.00 4.1 9.5 35.44 2.8 3.9 4.11 9.4 18.3 Monk2 24.26 8.3 35.5 48.69 2.4 5.3 21.97 10.8 44.4 Monk3 3.93 4.3 6.4 44.11 4.0 4.3 4 3.57 4 2.0 4 2.0 Pima 28.52 3.3 7.1 38.44 0.7 1.8 25.71 2.4 6.5 Vote0 7.70 2.9 4.4 7.73 1.4 1.7 4 5.68 4 1.0 4 1.0 Vote1 9.95 4.5 10.8 18.86 2.4 3.2 4 10.23 4 1.0 4 2.0 †,4 †,4 †,4 XD6 20.34 9.4 18.4 9.84 3.0 9.0 †,4 9.84 †,4 3.0 †,4 9.0
Others 32.1 b 4.0 b 32.335.4 c 41.5032.8 c 20.3 b 21.5 b 22.552.0 c 21.860.3 c 19.234.0 c 14.92 b 4.93.5 a 32.8913.0 a
25.9 b 4.349.6 c 12.798.9 a 22.0614.8 a
† : near-optimal or optimal results. 4 : the same classifier is produced at each fold. mF : average number of monomials (see text). lF : average number of literals (see text). References: a [BN92], best reported results on DT induction. The small number indicates the least number of leaves (equiv. to a number of monomials). b [FW98, Qui96], C4.5’s error. c [Dom98, CB91], various improved CN2’s error (small numbers indicate the whole number of literals).
[NJ99] R. Nock and P. Jappy. A top-down and prune induction scheme for constrained decision committees. In Proc. of the 3rd International Symposium on Intelligent Data Analysis, 1999. Accepted.
[Qui96] J.R. Quinlan. Bagging, Boosting and C4.5. In Proc. of AAAI-96, pages 725–730, 1996.
[SS98] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual ACM Conference on Computational Learning Theory, pages 80–91, 1998.
Heuristic Measures of Interestingness
Robert J. Hilderman and Howard J. Hamilton
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
{hilder,hamilton}@cs.uregina.ca
Abstract. The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.
1  Introduction
Techniques for determining the interestingness of discovered knowledge have previously received some attention in the literature. For example, in [5], a measure is proposed that determines the interestingness (called surprise there) of discovered knowledge via the explicit detection of Simpson's paradox. Also, in [22], information-theoretic measures for evaluating the importance of attributes are described. And in previous work, we proposed and evaluated four heuristics, based upon measures from information theory and statistics, for ranking the interestingness of summaries generated from databases [8,9]. Ranking summaries generated from databases is useful in the context of descriptive data mining tasks where a single data set can be generalized in many different ways and to many levels of granularity. Our approach to generating summaries is based upon a data structure called a domain generalization graph (DGG) [7,10]. A DGG for an attribute is a directed graph where each node represents a domain of values created by partitioning the original domain for the attribute, and each edge represents a generalization relation between these domains. Given a set of DGGs corresponding to a set of attributes, a generalization space can be defined as all possible combinations of domains, where one
domain is selected from each DGG for each combination. This generalization space describes, then, all possible summaries consistent with the DGGs that can be generated from the selected attributes. When the number of attributes to be generalized is large or the DGGs associated with the attributes are complex, the generalization space can be very large, resulting in the generation of many summaries. If the user must manually evaluate each summary to determine whether it contains an interesting result, inefficiency results. Thus, techniques are needed to assist the user in identifying the most interesting summaries. In this paper, we introduce and evaluate twelve new heuristics based upon measures from economics, ecology, and information theory, in addition to the four previously mentioned in [8] and [9], and present additional experimental results describing the behaviour of these heuristics when used to rank the interestingness of summaries. Together, we refer to these sixteen measures as the HMI set (i.e., heuristic measures of interestingness). Although our measures were developed and utilized for ranking the interestingness of generalized relations using DGGs, they are more generally applicable to other problem domains. For example, alternative methods could be used to guide the generation of summaries, such as Galois lattices [6], conceptual graphs [3], or formal concept analysis [19]. Also, summaries could more generally include views generated from databases or summary tables generated from data cubes. However, we do not dwell here on the methods or technical aspects of deriving summaries, views, or summary tables. Instead, we simply refer collectively to these objects as summaries, and assume that some collection of them is available for ranking. The heuristics in the HMI set were chosen for evaluation because they are well-known measures of diversity, dispersion, dominance, and inequality that have previously been successfully applied in several areas of the physical, social, ecological, management, information, and computer sciences. They share three important properties. First, each heuristic depends only on the probability distribution of the data to which it is being applied. Second, each heuristic allows a value to be generated with at most one pass through the data. And third, each heuristic is independent of any specific units of measure. Since the tuples in a summary are unique, they can be considered to be a population with a structure that can be described by some probability distribution. Thus, utilizing the heuristics in the HMI set for ranking the interestingness of summaries generated from databases is a natural and useful extension into a new application domain.
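As a purely illustrative sketch of the generalization space just described (one domain chosen from each attribute's DGG per combination), the hypothetical Python snippet below enumerates the combinations; the toy attribute and domain names are assumptions, not taken from the paper.

```python
# Hypothetical sketch: the generalization space is the cross-product of the
# domain sets offered by each attribute's DGG; every combination identifies
# one candidate summary.
from itertools import product

dgg_domains = {
    "Office":  ["City", "Province", "Country", "ANY"],   # toy DGG node domains
    "Product": ["Item", "Category", "ANY"],
}

generalization_space = list(product(*dgg_domains.values()))
# 4 x 3 = 12 combinations, e.g. ('City', 'Item'), ('Province', 'ANY'), ...
```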
2  The HMI Set
A number of variables will be used in describing the HMI set, which we define as follows. Let m be the total number of tuples in a summary. Let n_i be the value contained in the Count attribute for tuple t_i (all summaries contain a derived attribute called Count; see [8] or [9] for more details). Let $N = \sum_{i=1}^{m} n_i$ be the total count. Let p be the actual probability distribution of the tuples based upon the values n_i. Let p_i = n_i/N be the actual probability for tuple t_i. Let q be a
uniform probability distribution of the tuples. Let ū = N/m be the count for tuple t_i, i = 1, 2, ..., m according to the uniform distribution q. Let q̄ = 1/m be the probability for tuple t_i, for all i = 1, 2, ..., m according to the uniform distribution q. Let r be the probability distribution obtained by combining the values n_i and ū. Let r_i = (n_i + ū)/2N be the probability for tuple t_i, for all i = 1, 2, ..., m according to the distribution r. So, given the sample summary shown in Table 1, for example, we have m = 4, n_1 = 3, n_2 = 1, n_3 = 1, n_4 = 2, ū = 1.75, q̄ = 0.25, N = 7, p_1 = 0.429, p_2 = 0.143, p_3 = 0.143, p_4 = 0.286, r_1 = 0.339, r_2 = 0.196, r_3 = 0.196, and r_4 = 0.268.

Table 1. A sample summary
Tuple ID   Colour   Shape    Count
t1         red      round    3
t2         red      square   1
t3         blue     square   1
t4         green    round    2
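For concreteness, a small hypothetical Python helper computing these distributions for the Table 1 summary is given below; it simply reproduces the values quoted in the text.

```python
# Hypothetical helper computing the distributions used by the HMI measures
# for the sample summary of Table 1.
import numpy as np

counts = np.array([3, 1, 1, 2])        # Count column of Table 1
m, N = len(counts), counts.sum()       # m = 4, N = 7
p = counts / N                         # actual distribution: 0.429, 0.143, 0.143, 0.286
q_bar, u_bar = 1 / m, N / m            # uniform probability 0.25, uniform count 1.75
r = (counts + u_bar) / (2 * N)         # combined distribution: 0.339, 0.196, 0.196, 0.268
```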
We now describe the sixteen heuristics in the HMI set. Examples showing the calculation of each heuristic are not provided due to space limitations.
IVariance. Based upon sample variance from classical statistics [15], IVariance measures the weighted average of the squared deviations of the probabilities p_i from the mean probability q̄, where the weight assigned to each squared deviation is 1/(m − 1).
$$I_{Variance} = \frac{\sum_{i=1}^{m} (p_i - \bar{q})^2}{m - 1}$$
ISimpson. A variance-like measure based upon the Simpson index [18], ISimpson measures the extent to which the counts are distributed over the tuples in a summary, rather than being concentrated in any single one of them.
$$I_{Simpson} = \sum_{i=1}^{m} p_i^2$$
IShannon. Based upon a relative entropy measure from information theory (known as the Shannon index) [17], IShannon measures the average information content in the tuples of a summary.
$$I_{Shannon} = -\sum_{i=1}^{m} p_i \log_2 p_i$$
ITotal. Based upon the Shannon index from information theory [23], ITotal measures the total information content in a summary.
$$I_{Total} = m \cdot I_{Shannon}$$
IMax. Based upon the Shannon index from information theory [23], IMax measures the maximum possible information content in a summary.
$$I_{Max} = \log_2 m$$
IMcIntosh. Based upon a heterogeneity index from ecology [14], IMcIntosh views the counts in a summary as the coordinates of a point in a multidimensional space and measures the modified Euclidean distance from this point to the origin.
$$I_{McIntosh} = \frac{N - \sqrt{\sum_{i=1}^{m} n_i^2}}{N - \sqrt{N}}$$
ILorenz. Based upon the Lorenz curve from statistics, economics, and social science [20], ILorenz measures the average value of the Lorenz curve derived from the probabilities p_i associated with the tuples in a summary. The Lorenz curve is a series of straight lines in a square of unit length, starting from the origin and going successively to points (p_1, q_1), (p_1 + p_2, q_1 + q_2), .... When the p_i's are all equal, the Lorenz curve coincides with the diagonal that cuts the unit square into equal halves. When the p_i's are not all equal, the Lorenz curve is below the diagonal.
$$I_{Lorenz} = \bar{q} \sum_{i=1}^{m} (m - i + 1) p_i$$
IGini. Based upon the Gini coefficient [20], which is defined in terms of the Lorenz curve, IGini measures the ratio of the area between the diagonal (i.e., the line of equality) and the Lorenz curve, and the total area below the diagonal.
$$I_{Gini} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} |p_i \bar{q} - p_j \bar{q}|}{2 m^2 \bar{q}}$$
IBerger. Based upon a dominance index from ecology [2], IBerger measures the proportional dominance of the tuple in a summary with the highest probability p_i.
$$I_{Berger} = \max_i(p_i)$$
ISchutz. Based upon an inequality measure from economics and social science [16], ISchutz measures the relative mean deviation of the actual distribution of the counts in a summary from a uniform distribution of the counts.
$$I_{Schutz} = \frac{\sum_{i=1}^{m} |p_i - \bar{q}|}{2 m \bar{q}}$$
IBray. Based upon a community similarity index from ecology [4], IBray measures the percentage of similarity between the actual distribution of the counts in a summary and a uniform distribution of the counts.
$$I_{Bray} = \frac{\sum_{i=1}^{m} \min(n_i, \bar{u})}{N}$$
IWhittaker. Based upon a community similarity index from ecology [21], IWhittaker measures the percentage of similarity between the actual distribution of the counts in a summary and a uniform distribution of the counts.
$$I_{Whittaker} = 1 - 0.5 \left(\sum_{i=1}^{m} |p_i - \bar{q}|\right)$$
IKullback. Based upon a distance measure from information theory [11], IKullback measures the distance between the actual distribution of the counts in a summary and a uniform distribution of the counts.
$$I_{Kullback} = \log_2 m - \left(\sum_{i=1}^{m} p_i \log_2 \frac{p_i}{\bar{q}}\right)$$
IMacArthur. Based upon the Shannon index from information theory [13], IMacArthur combines two summaries, and then measures the difference between the amount of information contained in the combined distribution and the amount contained in the average of the two original distributions.
$$I_{MacArthur} = \left(-\sum_{i=1}^{m} r_i \log_2 r_i\right) - \frac{\left(-\sum_{i=1}^{m} p_i \log_2 p_i\right) + \log_2 m}{2}$$
ITheil. Based upon a distance measure from information theory [20], ITheil measures the distance between the actual distribution of the counts in a summary and a uniform distribution of the counts.
$$I_{Theil} = \frac{\sum_{i=1}^{m} |p_i \log_2 p_i - \bar{q} \log_2 \bar{q}|}{m \bar{q}}$$
IAtkinson. Based upon a measure of inequality from economics [1], IAtkinson measures the percentage to which the population in a summary would have to be increased to achieve the same level of interestingness if the counts in the summary were uniformly distributed.
$$I_{Atkinson} = 1 - \prod_{i=1}^{m} \left(\frac{p_i}{\bar{q}}\right)^{\bar{q}}$$
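Since each heuristic depends only on the counts of a summary, several of them can be coded directly from the formulas above. The following hypothetical Python sketch implements a representative subset; the dictionary keys and function name are illustrative only.

```python
# Hypothetical sketch of a few HMI measures, written directly from the formulas
# above; `counts` is the Count column of a summary.
import numpy as np

def hmi_measures(counts):
    counts = np.asarray(counts, dtype=float)
    m, N = len(counts), counts.sum()
    p, q_bar, u_bar = counts / N, 1.0 / m, N / m
    return {
        "variance": np.sum((p - q_bar) ** 2) / (m - 1),
        "simpson":  np.sum(p ** 2),
        "shannon":  -np.sum(p * np.log2(p)),
        "mcintosh": (N - np.sqrt(np.sum(counts ** 2))) / (N - np.sqrt(N)),
        "berger":   p.max(),
        "schutz":   np.sum(np.abs(p - q_bar)) / (2 * m * q_bar),
        "bray":     np.sum(np.minimum(counts, u_bar)) / N,
        "kullback": np.log2(m) - np.sum(p * np.log2(p / q_bar)),
        "atkinson": 1 - np.prod((p / q_bar) ** q_bar),
    }

# For the Table 1 summary, hmi_measures([3, 1, 1, 2]) gives, for example,
# a Simpson index of about 0.306 and a Shannon index of about 1.84.
```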
3  Experimental Results
To generate summaries, a series of seven discovery tasks were run: three on the NSERC Research Awards Database (a database available in the public domain) and four on the Customer Database (a confidential database supplied by an industrial partner). These databases have been frequently used in previous data mining research [8,9,12] and will not be described again here. We present the results of the three NSERC discovery tasks, which we refer to as N-2, N-3, and N-4, where 2, 3, and 4 correspond to the number of attributes selected in each discovery task. Similar results were obtained from the Customer Database. Typical results are shown in Tables 2 through 5, where the 22 summaries generated from the N-2 discovery task are ranked by the various measures. In Tables 2 through 5, the Summary ID column describes a unique summary identifier (for reference purposes), the Non-ANY Attributes column describes the number of non-ANY attributes in the summary (i.e., attributes that have not
Table 2. Ranks assigned by IV ariance , ISimpson , IShannon , and IT otal from N-2 Summary Non-ANY No. of ID Attributes Tuples 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
IV ariance Score Rank 0.377595 0.128641 0.208346 0.024569 0.018374 0.017788 0.041606 0.377595 0.208346 0.079693 0.018715 0.050770 0.041606 0.013534 0.010611 0.012575 0.008896 0.011547 0.006470 0.002986 0.002078 0.001582
1.5 5.0 3.5 10.0 12.0 13.0 8.5 1.5 3.5 6.0 11.0 7.0 8.5 14.0 17.0 15.0 18.0 16.0 19.0 20.0 21.0 22.0
ISimpson Score Rank 0.877595 0.590615 0.875039 0.298277 0.258539 0.253419 0.474451 0.877595 0.875039 0.518772 0.260833 0.517271 0.474451 0.226253 0.221664 0.260017 0.225542 0.278568 0.220962 0.141445 0.121836 0.119351
1.5 5.0 3.5 10.0 14.0 15.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 13.0 17.0 11.0 19.0 20.0 21.0 22.0
IShannon Score Rank 0.348869 0.866330 0.443306 1.846288 2.125994 2.268893 1.419260 0.348869 0.443306 1.215166 2.194598 1.309049 1.419260 2.473949 2.616697 2.288068 2.567410 2.282864 2.710100 3.259974 3.538550 3.679394
1.5 5.0 3.5 10.0 11.0 13.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 15.0 17.0 14.0 19.0 20.0 21.0 22.0
IT otal Score Rank 0.697738 2.598990 1.773225 9.231440 12.755962 20.420033 14.192604 0.697738 1.773225 6.075830 19.751385 11.781437 14.192604 27.213436 41.867161 38.897160 53.915619 47.940136 81.302986 130.39897 176.92749 246.51939
1.5 5.0 3.5 7.0 9.0 13.0 10.5 1.5 3.5 6.0 12.0 8.0 10.5 14.0 16.0 15.0 18.0 17.0 19.0 20.0 21.0 22.0
Table 3. Ranks assigned by IM ax , IM cIntosh , ILorenz , and IBerger from N-2 Summary Non-ANY No. of ID Attributes Tuples 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
IM ax Score Rank 1.000000 1.584963 2.000000 2.321928 2.584963 3.169925 3.321928 1.000000 2.000000 2.321928 3.169925 3.169925 3.321928 3.459432 4.000000 4.087463 4.392317 4.392317 4.906891 5.321928 5.643856 6.066089
1.5 3.0 4.5 6.5 8.0 10.0 12.5 1.5 4.5 6.5 10.0 10.0 12.5 14.0 15.0 16.0 17.5 17.5 19.0 20.0 21.0 22.0
IM cIntosh Score Rank 0.063874 0.233956 0.065254 0.458697 0.496780 0.501894 0.314518 0.063874 0.065254 0.282728 0.494505 0.283782 0.314518 0.529937 0.534837 0.495313 0.530693 0.477246 0.535592 0.630569 0.657900 0.661515
1.5 5.0 3.5 10.0 14.0 15.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 13.0 17.0 11.0 19.0 20.0 21.0 22.0
ILorenz Score Rank 0.532746 0.429060 0.277279 0.402945 0.379616 0.261123 0.165982 0.532746 0.277279 0.283677 0.253015 0.166537 0.165982 0.236883 0.175297 0.142521 0.132651 0.118036 0.100625 0.108058 0.102211 0.083496
1.5 3.0 7.5 4.0 5.0 9.0 14.5 1.5 7.5 6.0 10.0 13.0 14.5 11.0 12.0 16.0 17.0 18.0 21.0 19.0 20.0 22.0
IBerger Score Rank 0.934509 0.712931 0.934509 0.393841 0.393841 0.393841 0.603704 0.934509 0.934509 0.666853 0.365614 0.666853 0.603704 0.365614 0.365614 0.365614 0.365614 0.420841 0.365614 0.234297 0.234297 0.234297
2.5 5.0 2.5 12.0 12.0 12.0 8.5 2.5 2.5 6.5 16.5 6.5 8.5 16.5 16.5 16.5 16.5 10.0 16.5 21.0 21.0 21.0
been generalized to the level of the most general node in the associated DGG that contains the default description “ANY”), the No. of Tuples column describes the number of tuples in the summary, and the Score and Rank columns describe the calculated interestingness and the assigned rank, respectively, as determined by the corresponding measure. Some measures are ranked by score in descending order and some in ascending order (this is easily determined by examining the ranks assigned in Tables 2 through 5). This is done so that each measure ranks the less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as more interesting. Tables 2 through 5 do not show any single-tuple summaries (e.g., a single-tuple summary where both attributes are generalized to ANY and a single-tuple summary that was an artifact of the DGGs used), as these summaries are considered to contain no information and are, therefore, uninteresting by definition. The summaries in Tables 2 through 5 are shown in increasing order of the number of non-ANY attributes and the number of tuples in each summary, respectively.
Table 4. Ranks assigned by ISchutz, IBray, IWhittaker, and IKullback from N-2
1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
ISchutz Score Rank 0.434509 0.379598 0.684509 0.310744 0.294042 0.466300 0.734509 0.434509 0.684509 0.534397 0.516940 0.712175 0.734509 0.486637 0.600273 0.699103 0.696302 0.743921 0.723102 0.734397 0.734397 0.742610
4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
IBray IW hittaker Score Rank Score Rank 0.565491 0.620402 0.315491 0.689256 0.705958 0.533700 0.265491 0.565491 0.315491 0.465603 0.483060 0.287825 0.265491 0.513363 0.399727 0.300897 0.303698 0.256079 0.276898 0.265603 0.265603 0.25739
4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
0.565491 0.620402 0.315491 0.689256 0.705958 0.533700 0.265491 0.565491 0.315491 0.465603 0.483060 0.287825 0.265491 0.513363 0.399727 0.300897 0.303698 0.256079 0.276898 0.265603 0.265603 0.257390
4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
IKullback Score Rank 0.348869 0.866330 0.443306 1.846288 2.125994 2.268893 1.419260 0.348869 0.443306 1.215166 2.194598 1.309049 1.419260 2.473949 2.616697 2.288068 2.567410 2.282864 2.710100 3.259974 3.538550 3.679394
1.5 5.0 3.5 10.0 11.0 13.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 15.0 17.0 14.0 19.0 20.0 21.0 22.0
Table 5. Ranks assigned by IM acArthur , IT heil , IAtkinson , and IGini from N-2 Summary Non-ANY No. of IM acArthur ID Attributes Tuples Score Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
0.184731 0.218074 0.399511 0.144729 0.132377 0.243857 0.457814 0.184731 0.399511 0.298402 0.264620 0.452998 0.457814 0.260255 0.342143 0.441534 0.440642 0.487441 0.494412 0.479347 0.482560 0.515363
3.5 5.0 11.5 2.0 1.0 6.0 16.5 3.5 11.5 9.0 8.0 15.0 16.5 7.0 10.0 14.0 13.0 20.0 21.0 18.0 19.0 22.0
IT heil Score Rank 0.651131 0.718633 1.556694 0.757153 0.777902 1.710559 2.508888 0.651131 1.556694 1.195810 1.898130 2.249471 2.508888 2.025527 2.939297 3.512838 3.890191 3.982314 4.485426 5.317662 5.751495 6.181546
1.5 3.0 7.5 4.0 5.0 9.0 13.5 1.5 7.5 6.0 10.0 12.0 13.5 11.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0
IAtkinson Score Rank 0.505218 0.914901 0.792127 0.759314 0.693136 0.765973 0.821439 0.505218 0.792127 0.859044 0.759162 0.884562 0.821439 0.727091 0.797472 0.860465 0.852812 0.862917 0.894697 0.854864 0.854329 0.885877
1.5 22.0 8.5 6.0 3.0 7.0 11.5 1.5 8.5 16.0 5.0 19.0 11.5 4.0 10.0 17.0 13.0 18.0 21.0 15.0 14.0 20.0
IGini Score Rank 0.217254 0.158404 0.173861 0.078822 0.067906 0.065429 0.076804 0.217254 0.173861 0.126529 0.067231 0.086449 0.076804 0.056104 0.044494 0.045517 0.037253 0.038645 0.027736 0.020222 0.016312 0.012656
1.5 5.0 3.5 8.0 11.0 13.0 9.5 1.5 3.5 6.0 12.0 7.0 9.5 14.0 16.0 15.0 18.0 17.0 19.0 20.0 21.0 22.0
Tables 2 through 5 show similarities in how some of the sixteen measures rank summaries. For example, the six most interesting summaries (i.e., 1, 2, 3, 8, 9, and 10) are ranked identically by IV ariance , ISimpson , IShannon , IT otal , IM cIntosh , and IKullback , while the four least interesting summaries (i.e., 19, 20, 21, and 22) are ranked identically by IV ariance , ISimpson , IShannon , IT otal , IM ax , IM cIntosh , IKullback , IT heil , and IGini . To quantify the extent of the ranking similarities between the sixteen measures across all seven discovery tasks, we calculated the Gamma correlation coefficient for each pair of measures and found that 86.4% of the coefficients are highly significant with a p-value below 0.005. We also found the ranks assigned to the summaries have a high positive correlation for some pairs of measures. For the purpose of this discussion, we considered a pair of measures to be highly correlated when the average coefficient is greater than 0.85. Thus, 35% of the pairs (i.e., 42 of 120 pairs) are highly correlated using the 0.85 threshold. Following careful examination of the 42 highly correlated pairs, we found two distinct groups of measures within which summaries are ranked similarly. One group
Heuristic Measures of Interestingness
239
consists of the measures IV ariance , ISimpson , IShannon , IT otal , IM ax , IM cIntosh , IBerger , IKullback , and IGini . The other group consists of the measures ISchutz , IBray , IW hittaker , and IM acArthur . There are no similarities (i.e., no high positive correlations) shared between the two groups. Of the remaining three measures, IT heil , ILorenz , and IAtkinson , IT heil is only highly correlated with IM ax , while ILorenz and IAtkinson are not highly correlated with any of the other measures. There were no highly negative correlations between any of the pairs of measures. One way to analyze the measures is to determine the complexity of summaries considered to be of high, moderate, and low interest (i.e., the relative interestingness). These results are shown in Table 6. In Table 6, the values in the H, M, and L columns describe the complexity index for a group of summaries considered to be of high, moderate, and low interest, respectively. The complexity index for a group of summaries is defined as the product of the average number of tuples and the average number of non-ANY attributes contained in the group of summaries. For example, the complexity index for summaries determined to be of high interest by the IV ariance index for discovery task N-2, is 4.5 (i.e., 3 × 1.5, where 3 and 1.5 are the average number of tuples and average number of non-ANY attributes, respectively). High, moderate, and low interest summaries were considered to be the top, middle, and bottom 20%, respectively, of summaries. The N-2, N-3, and N-4 discovery tasks generated sets containing 22, 70, and 214 summaries, respectively. Thus, the complexity index of the summaries from the N-2, N-3, and N-4 discovery tasks is based upon the averages for four, 14, and 43 summaries, respectively. Table 6. Relative interestingness of summaries from the NSERC discovery tasks Interestingness Measure
Interestingness              N-2                        N-3                          N-4
Measure             H      M      L          H       M       L          H        M        L
IVariance          4.5   11.3   93.6        9.0    64.7    520.3       34.6    430.5   3212.9
ISimpson           4.5   20.3   93.6        9.0    72.9    477.4       38.0    447.8   3163.1
IShannon           4.5   11.3   93.6        9.0    72.9    520.3       29.8    430.2   3210.2
ITotal             4.5   13.2   93.6        8.1    65.8    545.5       27.2    423.6   3220.5
IMax               3.6   14.0   93.6        8.3    63.7    545.5       27.0    424.2   3221.6
IMcIntosh          4.5   20.3   93.6        9.0    72.9    477.4       38.0    447.8   3163.1
ILorenz            3.9   20.3   93.6       21.1   104.8    249.3      133.6   1373.9    482.6
IBerger            4.5   15.8   93.6        9.6    86.6    457.5       48.8    587.8   2807.2
ISchutz            4.0   13.1   48.6       23.4   367.9    146.7      289.8   1242.2    227.0
IBray              4.0   13.1   48.6       23.4   367.9    146.7      289.8   1242.2    227.0
IWhittaker         4.0   13.1   48.6       23.4   367.9    146.7      289.8   1242.2    227.0
IKullback          4.5   11.3   93.6        9.0    72.9    520.3       29.8    430.2   3210.2
IMacArthur         4.9   13.1   84.0       23.2   251.4    220.8      249.5   1210.3    233.2
ITheil             3.9   17.1   93.6        9.1    66.2    533.3       33.8    558.9   2668.4
IAtkinson          8.0   18.0   49.1       31.5   270.5    103.7      531.1    555.6   1611.1
IGini              4.5   13.2   93.6        9.0    60.5    537.7       27.9    425.1   3220.5
Table 6 shows that in most cases the complexity index is lowest for the most interesting summaries and highest for the least interesting summaries. For example, the complexity indexes for summaries determined by the IVariance index to be of high, moderate, and low interest are 4.5, 11.3, and 93.6 from N-2, respectively, 9.0, 64.7, and 520.3 from N-3, respectively, and 34.6, 430.5, and 3212.9 from N-4, respectively. The only exceptions occurred in the results for the ILorenz, ISchutz, IBray, IWhittaker, IMacArthur, and IAtkinson indexes from the N-3 and N-4 discovery tasks.
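To make the complexity index concrete, the following sketch computes it for a group of summaries. The dictionary keys and the example group are illustrative (chosen only so that the averages reproduce the 3 × 1.5 = 4.5 case quoted above); this is not code or data from the NSERC discovery tasks.

def complexity_index(summaries):
    # product of the average number of tuples and the average number
    # of non-ANY attributes over the group of summaries
    if not summaries:
        return 0.0
    avg_tuples = sum(s["tuples"] for s in summaries) / len(summaries)
    avg_non_any = sum(s["non_any_attributes"] for s in summaries) / len(summaries)
    return avg_tuples * avg_non_any

# Hypothetical high-interest group averaging 3 tuples and 1.5 non-ANY attributes:
group = [{"tuples": 2, "non_any_attributes": 1},
         {"tuples": 4, "non_any_attributes": 2}]
print(complexity_index(group))  # 3.0 * 1.5 = 4.5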
A comparison of the summaries with high relative interestingness from the N-2, N-3, and N-4 discovery tasks is shown in the graph of Figure 1. In Figure 1, the horizontal and vertical axes describe the measures and the complexity indexes, respectively. Horizontal rows of bars correspond to the complexity indexes of summaries from a particular discovery task. The backmost horizontal row of bars corresponds to the average complexity index for a particular measure. Figure 1 shows a maximum complexity index on the vertical axis of 60.0 (although the complexity indexes for ILorenz, ISchutz, IBray, IWhittaker, IMacArthur, and IAtkinson from the N-4 discovery task each exceed this value by a minimum of 189.5). The measures, listed in ascending order of the complexity index, are (position in parentheses): IMax (1), ITotal (2), IGini (3), IShannon and IKullback (4), ITheil (5), IVariance (6), ISimpson and IMcIntosh (7), IBerger (8), ILorenz (9), IMacArthur (10), ISchutz, IBray, and IWhittaker (11), and IAtkinson (12).
[Figure 1: bar chart of the complexity index (vertical axis, 0.0 to 60.0) of the high-interest summaries for each interestingness measure (horizontal axis), with one row of bars per discovery task (N-2, N-3, N-4) and one row for the average.]
Fig. 1. Relative complexity of summaries from the NSERC discovery tasks
4
Conclusion and Future Research
We described the HMI set of heuristics for ranking the interestingness of summaries generated from databases. Although the heuristics have previously been applied in several areas of the physical, social, ecological, management, information, and computer sciences, their use for ranking summaries generated from databases is a new application area. The preliminary results presented here show that the order in which some of the measures rank summaries is highly correlated, resulting in two distinct groups of measures in which summaries are ranked similarly. Highly ranked, concise summaries provide a reasonable starting point for further analysis of discovered knowledge. That is, other highly ranked summaries that are nearby in the generalization space will probably contain information at useful and appropriate levels of detail. Future research will focus on determining the specific response of each measure to different population structures.
Enhancing Rule Interestingness for Neuro-fuzzy Systems Thomas Wittmann, Johannes Ruhland, and Matthias Eichholz Lehrstuhl für Wirtschaftsinformatik, Friedrich Schiller Universität Jena Carl-Zeiß-Str. 3, 07743 Jena, Germany [email protected], [email protected], [email protected]
Abstract. Data Mining Algorithms extract patterns from large amounts of data. But these patterns will yield knowledge only if they are interesting, i.e. valid, new, potentially useful, and understandable. Unfortunately, during pattern search most Data Mining Algorithms focus on validity only, which also holds true for Neuro-Fuzzy Systems. In this paper we introduce a method to enhance the interestingness of a rule base as a whole. In the first step, we aggregate the rule base through amalgamation of adjacent rules and elimination of redundant attributes. Supplementing this rather technical approach, we next sort rules with regard to their performance, as measured by their evidence. Finally, we compute reduced evidences, which penalize rules that are very similar to rules with a higher evidence. Rules sorted on reduced evidence are fed into an integrated rulebrowser, to allow for manual rule selection according to personal and situation-dependent preference. This method was applied successfully to two real-life classification problems, the target group selection for a retail bank, and fault diagnosis for a large car manufacturer. Explicit reference is taken to the NEFCLASS algorithm, but the procedure is easily generalized to other systems.
1
Introduction
Data Mining Algorithms extract patterns from large amounts of data. But patterns are only interesting if they are valid, new, potentially useful, and understandable. Unfortunately, most Data Mining Algorithms refer only to validity in their search for patterns. This also holds for Neuro-Fuzzy Systems, a promising new development for classification learning. Empirical studies (e.g. [11]) have proven their ability to combine automatic learning, as attributed to neural networks, with classification quality comparable to other data mining methods. When it comes to analyzing real-life problems, however, the main advantage over pure neural networks, the ease of understandability and interpretation, remains a much-acclaimed desire rather than a proven fact. Hence, pattern post-processing is necessary to identify interesting rules in the rule base output of a Neuro-Fuzzy System. This can also be called „data mining of second order“ [6]. Different methods can be used to enhance the interestingness of rules
according to the four criteria mentioned above. The aim is to concentrate on the most valid, new and useful rules and thus aggregate the rule base. Finally, a lean rule base is easier to understand. In this paper we refer especially to the NEFCLASS algorithm, but the procedure we develop can be generalized easily. In the first step, we aggregate the rule base with respect to adjacent rules and redundant attributes, in the second step we order the rules with regard to their performance. These sorted rules are fed into an integrated rulebrowser, which gives the user the opportunity to select rules according to his/her interests. As a first test, we successfully applied this method to two real-life classification problems, the target group selection for a bank, and the fault diagnosis for a big car manufacturer.
2
The Neuro-fuzzy System NEFCLASS
In NEFCLASS (NEuroFuzzyCLASSification) a fuzzy system is mapped on a neural network, a feed-forward three-layered multilayer perceptron. The (crisp) input pattern is presented to the neurons of the first layer. The fuzzification takes place when the input signals are propagated to the hidden layer, because the weights of the connections are modeled as membership functions of linguistic terms. The neurons of the hidden layer represent the rules. They are fully connected to the input layer with connection weights being interpretable as fuzzy sets. A hidden neuron’s response is connected to one output neuron only. With output neurons being associated with an output class on a 1:1 basis, each hidden neuron serves as a classification detector for exactly one output class. Hidden to output layer connections do not carry connection weights reflecting the lack of rule confidences in the approach. The learning phase is divided into two steps. In the first step the system learns the number of rule units and their connections, i.e. the rules and their antecedents, and in the second step it learns the optimal fuzzy sets, that is their membership functions [9]. NEFCLASS is available as freeware from the Institute of Knowledge Processing and Language Engineering, University of Magdeburg, Germany, http://fuzzy.cs.uni-magdeburg.de/welcome.html.
3
Rule Aggregation
Next, we describe a prototype for rule post-processing, called Rulebrowser. In the first step, Rulebrowser aggregates the rule base with respect to redundant rules and attributes. It joins adjacent rules, i.e. those containing the same conclusion while being based on adjacent fuzzy sets, and eliminates attributes that become irrelevant afterwards (i.e., attributes for which all fuzzy sets of the variable have been joined). Figure 1 shows the basic idea for a simple case with two attributes (each containing three fuzzy sets) and nine rules classifying the cases into two classes (class 1: c1 and class 2: c2). In figure 1 we are able to join the rules r1 to r3, thus making attribute x1 superfluous for the resulting rule ra. Furthermore, e.g. r5 and r8 can be joined, but this will not allow us to eliminate any attribute.
Before rule aggregation (rows: attribute x1; columns: attribute x2):

              x2 = small    x2 = medium    x2 = large
x1 = large      r1: c1         r4: c1        r7: c2
x1 = medium     r2: c1         r5: c2        r8: c2
x1 = small      r3: c1         r6: c2        r9: c1

After rule aggregation (r1 to r3 joined into ra; r5 and r8 joined into rb):

              x2 = small    x2 = medium    x2 = large
x1 = large      ra: c1         r4: c1        r7: c2
x1 = medium     ra: c1         rb: c2        rb: c2
x1 = small      ra: c1         r6: c2        r9: c1
Fig. 1. Rule Aggregation. Example
As seen from figure 1, joining rules in one dimension can inhibit rule aggregation along another attribute. Hence, when confronted with a high-dimensional rule space, rule aggregation is no trivial task, but a complex search problem [7]. Different algorithms can be used to search for suitable candidate rules for aggregation. Rulebrowser relies on a similarity measure to be defined below. For each rule we determine the most similar rule and check the position of both rules in rule space. If we can join them, the first rule's premise is extended to the range of the second rule, eliminating the second rule. If premises cannot be joined, we examine the next similar rules, until we find two adjacent rules or all rules are examined. An attribute becomes superfluous if, after rule aggregation, the antecedent containing this attribute covers all possible terms of this attribute, e.g. ra: 'x1 = small or medium or large' in figure 1. In pseudo-code notation this is:

program Rule Aggregation
  determine similarities between rules
  for all classes do begin
    for all rules in a class do begin
      eliminate superfluous attributes in rules
      repeat
        for all rules in a class in order of similarity to the rule in focus
          if both rules can be joined then
            extend first rule's premise
            eliminate second rule
      until adjacent rule found
    end
  end

The similarity of two rules sim_{r1,r2} has been defined in [3] as
sim_{r_1,r_2} = 1 - \frac{\sum_{i=1}^{n} dist_i}{n}                    (1)

where
r1, r2 = the compared rules,
n = total number of attributes in r1 and r2,
dist_i = distance measure for attribute i:
dist_i = |x_{r1}, x_{r2}| if attribute i exists in both rules, and 1 else.
|x_{r1}, x_{r2}| = distance between the fuzzy sets in r1 and r2, measured by their rank, i.e. for an attribute with 5 fuzzy sets {very small, small, medium, large, very large}, |large, very small| = 4 - 1 = 3. For compound fuzzy sets like 'low or medium' the mean of the ranks is used for computation.
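A minimal sketch of formula (1), assuming rules are represented as dictionaries mapping attribute names to lists of fuzzy-set ranks (a compound set such as 'low or medium' is a list of several ranks). The representation and function name are illustrative, not the actual Rulebrowser data structures.

def rule_similarity(r1, r2, all_attributes):
    # sim_{r1,r2} = 1 - (sum_i dist_i) / n, where dist_i is the rank distance
    # if attribute i occurs in both rules and 1 otherwise (formula (1))
    def rank(fuzzy_set):
        return sum(fuzzy_set) / len(fuzzy_set)  # mean rank for compound sets
    n = len(all_attributes)
    total = 0.0
    for attr in all_attributes:
        if attr in r1 and attr in r2:
            total += abs(rank(r1[attr]) - rank(r2[attr]))
        else:
            total += 1.0
    return 1.0 - total / n

# ranks 1..5 stand for {very small, small, medium, large, very large}
r1 = {"x1": [4], "x2": [2]}      # x1 = large, x2 = small
r2 = {"x2": [2, 3]}              # x2 = 'small or medium', x1 not present
print(rule_similarity(r1, r2, ["x1", "x2"]))  # 1 - (1 + 0.5)/2 = 0.25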
4
Rule Sorting and Browsing
Rule aggregation is a more or less technical approach, aiming only at understandability (lean rule base) and utility of rules (relevant attributes). In the second step, Rulebrowser sorts the rules with regard to a user-defined performance criterion and gives the user the opportunity to select the rules that comply with his/her interests. According to the user's data mining target, different foci on the same rule base may be realized by such a system. Possible aims of analysis are, for example:
1. In database marketing the user looks for a specified number of addresses, which have a high potential for becoming customers. The goal is a number of rules with a high coverage, ordered by decreasing validity. Not a single justification of a decision is important but reliable selection of a high number of potential customers.
2. In checking creditworthiness the user searches for rules with a high validity, but not necessarily a high coverage, that tell him/her whether the applicant is a high- or a low-risk customer.
This is to illustrate our belief that there is no single 'most interesting' rule base, but that the user's interests are the prime criteria for interestingness [1][2]. In detail, this approach takes into consideration the single rule confidence, as well as extended utility and novelty aspects.
4.1
Selecting High-Performance Rules
First, we determine the high-performance rules and sort the rules according to their power. This will be operationalized in various ways. Besides the well-known criteria, rule confidence and rule support, rule evidence is a relevant feature. A rule’s support is measured by the number of patterns that accomplish the rule’s premise. In other words, it is the number of times the rule ‘fires’. For this we emulate the signal propagation within NEFCLASS. Rules with compound antecedents have to be split up into ‘pure’ rules combined by a disjunction. The proportion of patterns, in which classification by a rule is done correctly, determines its confidence. Evidence is a composite measurement, depending on rule confidence and rule support. The following equation is based on suggestions of Gebhardt [3] and the ‘average error’, proposed by Jim/ Wuthrich [5]:
evidence_i = \begin{cases} 1 - \frac{a + b}{n} & \text{if } support > 0 \\ 0 & \text{else} \end{cases}                    (2)
a = recognized patterns of the wrong class, b = not recognized patterns of the right class and n = number of patterns.
                     class 1    class 2
rule fires              c          a        support = c + a
rule does not fire      b          d        n = number of patterns

confidence = c/(c+a)          evidence = 1 - (a+b)/n  or  (c+d)/n
Fig. 2. Interrelation between the concepts of confidence, support and evidence.
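The three measures of Fig. 2 and equation (2) can be restated directly in code; the counts a, b, c, d are assumed to be given (in NEFCLASS they would come from emulating the signal propagation, which is not reproduced here), and the function name is illustrative.

def rule_performance(c, a, b, d):
    # c: right-class patterns the rule fires on, a: wrong-class patterns it fires on,
    # b: right-class patterns it misses, d: wrong-class patterns it correctly ignores
    n = a + b + c + d
    support = c + a                                       # how often the rule fires
    confidence = c / support if support > 0 else 0.0
    evidence = 1 - (a + b) / n if support > 0 else 0.0    # equals (c + d) / n
    return support, confidence, evidence

print(rule_performance(c=40, a=10, b=5, d=45))  # (50, 0.8, 0.85)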
Figure 2 visualizes the interrelation between confidence, support and evidence for a simple example with two classes. The rule’s conclusion is class 1. While confidence measures how valid, and support measures how often a rule ‘fires’, evidence makes a more complex proposition. It penalizes rules that misfire (rule fires, although it should not) as well as rules that fire too infrequently (rule does not fire, although it should). Rulebrowser sorts the rules based on the chosen performance measure (evidence is the default option, but confidence and support can be chosen). In each iteration these measures are calculated only for the patterns not already covered by rules that have entered the rulebase in previous steps. Hence, just the incremental value of the rule to the rulebase is taken into account. This is based on the idea of the ‘rule cover’-algorithm by Toivonen [13]. The user defines a mimimum support of the rules as a stop-criterion (if this support is not reached, the algorithm stops after 100 iteration): This leads to a selection of the most relevant rules. Defining a high support threshold will often shrink a rulebase drastically, but must be weighted against ensuing information loss. program Rule Sort repeat determine performance for each rule in search space for all classes do begin select rule with maximum evidence and support > 0 delete all cases covered by this rule delete rule from search space end until stop-criterion fulfilled
4.2
Devaluation of Similar Rules
To this point, we have only judged rules based on their performance considered in isolation. We may now manipulate the evidence formula to account for novelty of
the entering candidate to the rulebase. We compute reduced evidences, which penalize rules that are very similar to rules with a higher evidence [3].

program Devaluate Similar Rules
  for all rules do begin
    reduced evidence := evidence
    rule := not marked
  end
  repeat
    for all classes do begin
      determine not-marked rule with highest reduced evidence and mark it
      for all not-marked rules do begin
        compute new reduced evidence according to (3)
      end
    end
  until all rules are marked
V^{new}_{red}(R_1) = \min\left\{ V^{red}(R_1),\; V(R_1) \cdot \left[ \frac{V(R_1)}{V^{red}(R_2)} \right]^{d \cdot sim_{R_1,R_2}^{k}} \right\}                    (3)

where
V^{new}_{red}(R_i) = new reduced evidence of rule i,
V^{red}(R_i) = reduced evidence of rule i,
V(R_i) = evidence of rule i,
d = strength of devaluation,
sim_{R_1,R_2} = similarity of rule 1 and 2 according to formula (1),
k = relevance of similarity.
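A sketch of the devaluation loop, following the reconstruction of equation (3) above; rules are dictionaries with an 'evidence' entry, similarity is any callable such as the rule_similarity sketch, and the list is assumed to contain the rules of one class. All names and the data layout are illustrative.

def devaluate_similar_rules(rules, similarity, d=1.0, k=1.0):
    # initialise reduced evidence with the evidence, then mark rules one by one
    for r in rules:
        r["reduced"] = r["evidence"]
        r["marked"] = False
    while any(not r["marked"] for r in rules):
        unmarked = [r for r in rules if not r["marked"]]
        r2 = max(unmarked, key=lambda r: r["reduced"])  # highest reduced evidence
        r2["marked"] = True
        for r1 in unmarked:
            if r1 is r2 or r2["reduced"] <= 0:
                continue
            sim = similarity(r1, r2)
            factor = (r1["evidence"] / r2["reduced"]) ** (d * sim ** k)
            r1["reduced"] = min(r1["reduced"], r1["evidence"] * factor)
    return sorted(rules, key=lambda r: r["reduced"], reverse=True)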
This reduced evidence is again used to sort the rules [3]. In the resulting, rather user-friendly overview of the rules, the user can determine the parameters of devaluation by a scroll-bar, re- and devaluate rules manually, and cut off rules with low reduced evidence. In addition he/she can change the sort criterion. In doing so the user can find his/her optimal position in the trade-off between the different facets of interestingness, especially validity and simplicity of the rule base.
5
Empirical Evaluation of Rulebrowser
Rulebrowser has been successfully applied to two real-life classification problems: fault diagnosis for a large car manufacturer and the target group selection for a bank. The first database (car data) contains 18,000 records on cars recently sold, their respective characteristics (21 standard equipment characteristics such as number of cylinders plus 221 on optional equipment such as ABS) and data on faults detected (e.g. engine breakdown). The analysis aim was: how do car characteristics influence certain fault frequencies? The second database (bank data) is based on a mailing campaign to
convince customers of a bank to buy a credit card. It consists of about 180,000 cases, 21 attributes and 2 classes (respondents / non-respondents). For the car data, figure 3 shows the considerable reduction in the number of rules and the average number of attributes in one rule after rule aggregation. Figure 4 shows the further reduction in the number of rules for a minimum support of 95%, 90% and 80%.
[Figure 3: two bar charts over the nine car faults w01 to w82, showing the number of rules (left) and the average number of attributes per rule (right), each before and after rule aggregation.]
Fig. 3. Reduction of the number of rules and average number of attributes in one rule.
[Figure 4: bar chart of the number of rules for the nine types of car fault w01 to w82, after rule aggregation and with minimum support thresholds of 95%, 90% and 80%.]
Fig. 4. Further reduction in the number of rules.

Looking at the bank data, which, in contrast to the car data, were only moderately preprocessed with respect to the number of attributes, we can see even better results. Rule aggregation reduced the average number of attributes per rule from 21 to 17.5, with the smallest rule containing only 11 attributes. The number of rules sank from 242 to 48 after rule aggregation. With a minimum support of 90% we even managed to decrease the number of rules to 16. In both cases, rule aggregation leads to no loss of validity, as no information is deleted. Looking at the minimum support rule, we are confronted with a validity versus simplicity trade-off. But the loss of validity is small due to the elimination of the least powerful rules, which cover only a few cases and describe no new structures.
6
Related work
Pattern postprocessing, in contrast to the various preprocessing tasks, like attribute selection or treating missing values, is a stepchild of research. Only a few approaches have been developed to solve the problem of enhancing rule interestingness. Most of them are found in the field of association analysis, where the phenomenon of 'exploding' rule bases is fundamental. Due to space restrictions we can only mention a selection of the approaches. Most methods rely on objective measures of rule interestingness, like the 'Neighborhood-Based Unexpectedness' of Dong/ Li [2], the use of rule-covers by Toivonen [13] or the rule aggregation approach of Major/ Mangano [8]. Only a few approaches integrate the user, usually by requiring predefined rule templates (Klemettinen et al. [6]) or belief systems (Silberschatz/ Tuzhilin [12]). But this is often unfeasible in practice, due to the strong involvement of the user. Few authors propose combined approaches for the evaluation of interestingness. Gebhardt suggests evidence and affinity of rules for a composed measure of interestingness. He proposes several ways to quantify them and discusses pros and cons [3]. Hausdorf/ Müller have developed a complex system for evaluating interestingness based on different facets [4]. The developers of NEFCLASS themselves have proposed a rule aggregation method, based on four different steps of attribute, rule and fuzzy set elimination [10]. But these steps mainly aim at the validity of the rules and contain serious problems. Another algorithm by Klose et al. aggregates the rule base in three steps, 'input pruning on data set level' (attribute selection), 'input pruning on rule level' and 'simple rule merging' [7]. The last two steps correspond to the rule aggregation proposed in this paper.
7
Conclusions
Rule post-processing is an essential step in the Knowledge Discovery in Databases process. This holds true for Neuro-Fuzzy Systems in particular, too. In this paper we have proposed a prototype for a tool that aggregates and sorts rules according to different facets of interestingness. The results are promising, but further research is needed. For example, more sophisticated search strategies to identify candidate rules for aggregation might improve the resulting rule base. Visualization techniques, which have not been mentioned in this paper, could enhance the understandability of the rules. However, the quality of a post-processing method is very hard to quantify, as it strongly depends on the system’s user. Hence, only further practical applications of our methods can evaluate their effectiveness. Research in this subject was funded in part by the Thuringian Ministry for Science, Research and Culture. The authors are responsible for the content of this publication.
References 1. Brachman, R. J., Anand, T., The Process of Knowledge Discovery in Databases, In: Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. (Eds.), Advances in Knowledge Discovery and Data Mining, Menlo Park, CA 1996, pp. 37-58 2. Dong, G., Li, J., Interestingness of Discovered Association Rules in terms of NeighborhoodBased Unexpectedness, In: Wu, X., Kotagiri, R., Korb, E. (Eds.), Research and Development in Knowledge Discovery and Data Mining (Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining), Heidelberg 1998, pp. 72-86 3. Gebhardt, F., Discovering interesting statements from a database, In: Applied Stochastic Models And Data Analysis 1/1994, pp. 1-14 4. Hausdorf, C. Müller, M., A Theory of Interestingness for Knowledge Discovery in Databases Exemplified in Medicine, In: Lavrac, C. Keravnou, E., Zupan, B. (Eds.), First International Workshop on Intelligent Data Analysis in Medicine and Pharmacology 5. Jim, K., Wuthrich, B., Rule Discovery: Error measures and Conditional Rule Probabilities, In: KDD: Techniques and Applications. Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapur u.a. 1997, pp. 82-89 6. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H.; Verkamo, A. I., Finding Interesting Rules from Large Sets of Discovered Association Rules, In: Adam, N. R., Bhargava, B. K., Yesha, Y. (Eds.), Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), Gaithersburg, Maryland, November 29 - December 2, 1994, (ACM Press) 1994, pp. 401-407 7. Klose, A., Nürnberger, A., Nauck, D., Some approaches to improve the interpretability of Neuro-Fuzzy Classifiers, In: Zimmermann, H.-J. (Eds.), 6th European Congress on Intelligent Techniques and Soft Computing. Aachen, Germany, September 7-10, 1998, Aachen 1998, pp. 629-633 8. Major, J. A., Mangano, J. J., Selecting among rules induced from a hurricane Database, In: Piatetsky-Shapiro, G. (Eds.), Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop, Menlo Park, CA 1993, pp. 28-44 9. Nauck, D., Kruse, R., NEFCLASS - A Neuro-Fuzzy Approach for the Classification of Data. Paper of Symposium on Applied Computing 1995 (SAC'95) in Nashville 10. Nauck, D., Kruse. R., New Learning Strategies for NEFCLASS, In: Proc. Seventh International Fuzzy Systems Association World Congress IFSA'97, Vol. IV, Academia Prague, 1997, pp. 50-55 11. Ruhland, J., Wittmann, T., Neurofuzzy Systems In Large Databases - A comparison of Alternative Algorithms for a real-life Classification Problem, in: Proceedings 5th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, September 8-11 1997 (EUFIT' 97), pp. 1517-1521 12. Siberschatz, A., Tuzhilin, A., On subjective measures of interestingness in knowledge discovery, In: Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, Montreal, August 1995, pp. 275-281 13. Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., Mannila, H., Pruning and Grouping Discovered Associations Rules, In: Kodratoff, Y., Nakhaeizadeh, G., Taylor, C. (Eds.), Workshop Notes Statistics, Machine Learning, and Knowledge Discovery in Databases. MLNet Familiarization Workshop, Heraklion, Crete, April 1995, 1995, pp. 47-52
Unsupervised Profiling for Identifying Superimposed Fraud Uzi Murad and Gadi Pinkas Tel Aviv University, Ramat-Aviv 69978, Israel Amdocs (Israel) Ltd., 8 Hapnina St., Ra’anana 43000, Israel {uzimu, gadip}@amdocs.com
Abstract. Many fraud analysis applications try to detect “probably fraudulent” usage patterns, and to discover these patterns in historical data. This paper builds on a different detection concept; there are no fixed “probably fraudulent” patterns, but any significant deviation from the normal behavior indicates a potential fraud. In order to detect such deviations, a comprehensive representation of “customer behavior” must be used. This paper presents such representation, and discusses issues derived from it: a distance function and a clustering algorithm for probability distributions.
1 Introduction The telecommunications industry regularly suffers major losses due to fraud. The various types of fraud may be classified into two categories:
Subscription fraud. Fraudsters obtain an account without intention to pay the bill. In such cases, abnormal usage occurs throughout the active period of the account.
Superimposed fraud [3]. Fraudsters „take over“ a legitimate account. In such cases, the abnormal usage is „superimposed“ upon the normal usage of the legitimate customers. Examples of such cases include cellular cloning and calling card theft.
Data mining can be used to learn what situations constitute fraud. However, the nature of the problem makes standard pattern recognition algorithms impractical. Following are the unique characteristics of the superimposed fraud detection problem.
Context. Customers differ in calling behaviors (usage patterns and volume). A usage pattern may be normal for one customer and abnormal for another. For example, calls from New York are suspicious if the customer lives and works in Boston, but perfectly normal for New York residents. Sometimes a single customer demonstrates different types of behavior on different occasions. For example, business customers make many calls on business days, but no calls on weekends. Finally, normal behavior may change over time. It is impossible to define global fraud criteria that would be valid for all customers, all the time.
Changing fraud patterns. Following the progress of technology, fraudsters adopt new fraud techniques, which may result in new usage patterns. A set of previously observed fraud-related patterns might not suffice to detect new instances of fraud.
Following this discussion, a fraud detection system should be (1) sensitive to customer-unique behavior, (2) adaptable to changes in customer behavior, (3) sensitive to rare yet normal customer behavior and (4) adaptable to new fraud patterns.
2 Related Work Several techniques concerning telecommunication and credit card fraud detection are described in literature. The majority of techniques monitor customer behavior regarding specific usage patterns. [10] describes a rule based system, which accumulates number or duration of calls that match specific patterns (e.g., international calls) in one day and calculates the average and standard deviation of the daily values. Then it compares new values against a user-defined threshold in terms of standard deviations from the average. [9] describes a neural network based system, which uses similar parameters, but learns from known cases what situations (combinations of current value, average value and standard deviation) are fraudulent. In this approach, the threshold is determined from known cases, rather than being user-defined. [3] uses supervised learning to discover from historical data what usage patterns are „probably fraudulent“, but still deals with specific patterns. In general, any supervised learning algorithm has to receive a significant number of cases of a pattern in order to learn it. It might be difficult to obtain enough known cases of each pattern. In addition, new patterns of fraud must be discovered by other means (such as customer complaints), before being presented to the learning algorithm. A long time may pass from the first occurrence of the pattern until it is used for detection. Finally, accurately classified data may be hard to get. A different approach, the one this paper follows, is to create a general model of behavior, without using predefined patterns. [7] deals with credit card fraud, and profiles customers by their typical transactions. A new transaction is examined against the profile, to see whether similar transactions have appeared in the past. There is no element of usage volume, or frequency of transactions. [1] represents short-term behavior by the probability distribution of calls made by a customer in one day. This „Current User Profile“ (CUP) is examined against the „User Profile History“ (UPH), which is the average of the CUPs generated by that user. Call volume is not taken into consideration, and, since the UPH is an average, it does not reflect rare yet normal patterns.
3 Three Level Profiling The technique for fraud detection suggested in this paper does not search for specific fraud patterns; instead, it points out any significant deviation from the customer’s normal behavior as a potential fraud. In order to detect such deviations, a comprehensive representation of behavior, which can capture a variety of behavior patterns, must be used, and the definition of „deviation“ should reflect the actual degree of dissimilarity between „behavior instances“. To meet these goals, three profile levels are used:
[Figure 1: schematic of the three profile levels. Call profiles (duration, start time, destination type, call type, ...) are mapped by first-level prototyping to the daily profile, which combines a qualitative profile (distribution over call prototypes, e.g. 10% / 70% / 20%, spanning call duration, time of day, destination types such as local, international, PRS, toll-free, and voice/data) with a quantitative profile (usage volume). Second-level prototyping maps daily profiles into the space of daily prototypes (separately for week-days and weekends), from which the overall profile is built.]
Fig. 1. Three Level Profiling
Call Profile – represents a single call and includes all fields of the Call Detail Record (CDR) that are relevant to behavior. Daily profile – represents the short-term behavior of a customer. The daily profile consists of two parts: The qualitative profile describes the kind of calls (when, from where, to where) the customer made during the day, and the quantitative profile describes the usage volume. Each attribute in the call profile is seen as a random variable, and the qualitative profile is the empirical multi-dimensional probability distribution of calls on a given day. Since the space of call profiles is huge, it is partitioned into a finite number of subspaces, each of which is represented by a „prototypical call“. The qualitative profile is a vector, which contains an entry for each „call prototype“ with the percentage of calls (on a given day) of that prototype. Overall Profile – represents the long-term „normal“ behavior of a customer, and reflects the daily behaviors normally exhibited by the customer. Since the space of Daily Profiles is infinite, a clustering algorithm is applied to the daily profiles, which extracts „prototypical days“. The overall profile is a vector with entries for all „daily prototypes“, containing information about the „normal“ usage level for that customer in days of that type. Since customers may behave differently on different „types of
days“, a separate vector is kept for each type. In the current framework there are two types: weekdays and weekends. Fig. 1 illustrates the concept of Three Level Profiling.
4
The Call Profile
The call profile c = (c1, c2, ..., cd) includes all those features of the Call Detail Record (CDR) that are relevant to behavior. For the data set used for the first implementation (wireline data), the following attributes were used: call start time, call duration, destination type (local, international, premium rate service or toll-free) and call type (voice or data). Other attributes should be considered for different data sets. In a cellular context, for example, the originating location would probably be included. Each attribute ci corresponds to a random variable Xi, which gets values from domain Di. The domain of call profiles D = D1 x D2 x ... x Dd is huge; call duration may be any number of seconds, and start time may be any time (in seconds) in the day. The huge domain of CDR profiles should be represented by a relatively small number of call prototypes. For our purpose, a good set of prototypes will be such that any new call has a prototype similar enough to it, and different prototypes are dissimilar enough. We extract call prototypes as follows. For continuous or ordinal attributes (call duration and start time) the domain Di of attribute i is split into ni ranges, resulting in a discrete domain D_i^* = {x_1^i, x_2^i, ..., x_{n_i}^i}. For discrete non-ordinal attributes D_i^* = Di. In our framework, we partitioned the start time into 12 two-hour time windows. Call duration was partitioned to 12 five-minute ranges, which cover calls up to an hour long. There is, of course, inaccuracy in calls longer than one hour. However, since 99.9% of the calls are shorter than one hour, and long duration calls can be monitored easily by other means, this compromise is reasonable. The resulting discrete domain contains l = n1 · n2 · ... · nd values.
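A sketch of this discretization for the wireline attribute set; the bin widths follow the text (12 two-hour start-time windows, 12 five-minute duration ranges, 4 destination types, 2 call types), while the function name, the category order and the mixed-radix encoding are assumptions made for illustration.

DEST_TYPES = ["local", "international", "premium_rate", "toll_free"]
CALL_TYPES = ["voice", "data"]

def call_prototype(start_time_s, duration_s, dest_type, call_type):
    # map one CDR to its prototype index in the 12 * 12 * 4 * 2 = 1152-value domain
    time_bin = min(int(start_time_s // 7200), 11)    # two-hour windows
    duration_bin = min(int(duration_s // 300), 11)   # five-minute ranges, >1h clipped
    dest_bin = DEST_TYPES.index(dest_type)
    type_bin = CALL_TYPES.index(call_type)
    return ((time_bin * 12 + duration_bin) * 4 + dest_bin) * 2 + type_bin

# a 7-minute local voice call starting at 14:05
print(call_prototype(14 * 3600 + 300, 420, "local", "voice"))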
5 The Daily Profile and the Curse of Dimensionality The daily profile DP consists of a quantitative profile DP.n (the number of calls made in that day) and a qualitative profile DP.q = (q1, q2, ..., ql), where 0 ≤ qi ≤ 1, Σqi = 1 and l is the number of call prototypes. Each entry qi corresponds to a call prototype, and contains the percentage of calls of that type in the given day. The huge dimension of the vector (12 · 12 · 4 · 2 = 1152, in our case) seems to be problematic in terms of both storage space and computation time. However, only a few prototypes actually appear in a single daily profile. The average number of non-zero call prototypes in a daily profile (calculated using over 400,000 Daily Profiles) is 2.8, and 99% of the daily profiles have 13 or fewer non-zero call prototypes. If time and space complexity depend on the number of non-zero prototypes only, then the dimensionality problem ceases to exist. To achieve this, daily profiles are saved in variable-length arrays. Computations with daily profiles are discussed next.
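A sketch of the sparse representation suggested above, storing only the non-zero call prototypes of a day; the class name and fields are illustrative, with DP.n and DP.q following the notation in the text.

from collections import Counter

class DailyProfile:
    def __init__(self, prototype_ids):
        # quantitative profile: number of calls made that day
        self.n = len(prototype_ids)
        # qualitative profile: fractions over the non-zero prototypes only
        counts = Counter(prototype_ids)
        self.q = {p: c / self.n for p, c in counts.items()} if self.n else {}

    def fraction(self, prototype_id):
        return self.q.get(prototype_id, 0.0)   # zero entries stay implicit

dp = DailyProfile([680, 680, 680, 912, 17])    # five calls, three distinct prototypes
print(dp.n, dp.q)                              # 5 {680: 0.6, 912: 0.2, 17: 0.2}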
6 Distances Between Qualitative Profiles A distance function between daily profiles is required for clustering daily profiles and for detection of deviations. It is extremely important that the distance function reflect the actual level of similarity between two daily profiles. We examined several known distance functions. The Euclidean distance is not adequate for the problem. Consider, for example, the three following call prototypes:
Call prototype 1: 5-minute local voice call at 2:00PM
Call prototype 2: 10-minute local voice call at 2:00PM
Call prototype 3: 60-minute international data call at 2:00AM
and three daily profiles, each of which contains only calls of prototype 1, 2 and 3, respectively (i.e., each daily profile has value 1 in one entry and 0 in all others). While it is obvious that the first daily profile is similar to the second, and absolutely different from the third, the Euclidean distances are identical and equal to √2, which is the maximal distance. The reason for this distortion is that Euclidean distance does not take into account the similarity between attributes. Each attribute is treated as totally different in meaning from all other attributes. In our case, however, the attributes represent points in the d-dimensional space of call profiles and, as such, some attributes are „closer“ (in meaning) to each other than others. The Hellinger distance (suggested by [1]) and the Mahalanobis distance [5] suffer from the same distortion. Since the qualitative profiles represent multidimensional probability distributions, distance functions between probability distributions should be considered. Distance functions discussed in [6] (for example, Patrick-Fisher, Matusita and Divergence) share the same problem. An obvious solution is to compare the probability for neighborhoods of values. However, the performance depends on the neighborhood size. Small neighborhoods will not be sensitive enough in cases where the values are distant, and large neighborhoods will lose information. Moreover, large neighborhoods may require many more prototypes to be handled than the non-zero ones. Cumulative Distribution Based Distance. We propose the CD-distance, which is based on cumulative distribution. The cumulative distribution enables capturing the „closeness“ of values, in addition to their probabilities. The distance function sums the squared differences of the cumulative distribution functions (instead of the density functions). Formally, in the one-dimensional case we define the distance as follows. Let f1(x) and f2(x) be two continuous probability distributions of a random variable X. The distance between f1(x) and f2(x), denoted by d(f1, f2), is
d(f_1, f_2) = \frac{1}{x_{max} - x_{min}} \int_{x_{min}}^{x_{max}} (F_1(x) - F_2(x))^2 \, dx                    (1)
where F(x) is the cumulative distribution ( F ( x ) = P( X ≤ x ) ), and xmax and xmin are the maximum and minimum values, respectively, of the random variable X. We define xmax
and xmin such that the probability of values outside (xmin, xmax) is negligible. xmax and xmin do not depend on the two density functions being compared, but solely on the random variable X. In order to obtain distance values on a normalized scale, the sum is divided by (xmax - xmin), so we get 0 ≤ d(f1, f2) ≤ 1. In the discrete, ordinal case, the domain of the random variable X is an ascending list of values (D = {x_1, x_2, ..., x_n}, x_i > x_j ⇔ i > j), and the distance between two probability distributions is
d(f_1, f_2) = \frac{1}{x_n - x_1} \sum_{j=1}^{n-1} (F_1(x_j) - F_2(x_j))^2 \, \delta_j                    (2)

where δ_j = x_{j+1} − x_j. In this case, the cumulative distribution is a step-function, and (2) is a straightforward simplification of (1) for the subset of discrete probability distributions. x_1 and x_n, the smallest and largest discrete values, respectively, serve as x_min and x_max. The δ_j are the differences between consecutive discrete values. Note that the δ_j are not necessarily equal; to improve computation efficiency, we can consider only the non-zero entries (entries k such that P_i(X = x_k) ≠ 0, i = 1, 2). Binary random variables are treated as discrete ordinal random variables, with the domain {0, 1}. In the discrete, non-ordinal case (e.g., call type), the domain consists of non-numeric values. We assume that the similarity between each pair of values is equal. In this case, we replace the call profile attribute i with n_i binary attributes, where n_i is the cardinality of domain D_i. The generalized distance function in the multidimensional case is:
d(f_1, f_2) = \sum_{i=1}^{d} w_i \, \frac{1}{x^i_{n_i} - x^i_1} \sum_{j=1}^{n_i - 1} \left(F_1(x^i_j) - F_2(x^i_j)\right)^2 \delta^i_j                    (3)

where x^i_j is the jth prototype of call profile attribute i, F(x^i_j) = P(X_i ≤ x^i_j), and w_i is a weight for call profile attribute i, with Σ w_i = 1. It can be shown [11] that the CD-distance function maintains, for any probability distribution functions f_1, f_2 and f_3:
(1) 0 ≤ d(f_1, f_2) ≤ 1
(2) d(f_1, f_1) = 0
(3) d(f_1, f_2) = d(f_2, f_1)   (symmetry)
(4) d(f_1, f_2) + d(f_2, f_3) ≥ d(f_1, f_3)   (triangle inequality)
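A sketch of the discrete one-dimensional CD-distance of equation (2) on sparse probability dictionaries; the weighted multi-dimensional form (3) is obtained by applying it once per attribute and summing with the weights w_i. Function and variable names are illustrative.

def cd_distance(f1, f2, domain):
    # equation (2): sum_j (F1(x_j) - F2(x_j))^2 * delta_j, divided by (x_n - x_1);
    # f1, f2 map ordinal values to probabilities, missing values count as 0
    span = domain[-1] - domain[0]
    if span == 0:
        return 0.0
    cum1 = cum2 = dist = 0.0
    for j in range(len(domain) - 1):
        cum1 += f1.get(domain[j], 0.0)           # F1(x_j)
        cum2 += f2.get(domain[j], 0.0)           # F2(x_j)
        dist += (cum1 - cum2) ** 2 * (domain[j + 1] - domain[j])
    return dist / span

bins = [2.5, 7.5, 12.5, 17.5, 22.5]              # call-duration bin centres in minutes
print(cd_distance({2.5: 1.0}, {7.5: 1.0}, bins))    # nearby values  -> 0.25
print(cd_distance({2.5: 1.0}, {22.5: 1.0}, bins))   # distant values -> 1.0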
7 Extracting Daily Prototypes The space of daily qualitative profiles is infinite, and we need to represent it by K „prototypical“ qualitative profiles, which represent prototypical behaviors. This
discretization is important not only for performance, but also to facilitate the investigation of alerts by the human analyst. We use a clustering algorithm to extract such prototypes. Since the distance function is not Euclidean, a proximity-matrix-based algorithm seems to be essential. However, such algorithms are restricted to small sample sets, due to their space and time complexity. In our case, however, extracting a sufficiently small sample may result in losing unique prototypes. The partitional algorithm K-means is based on Euclidean distance and does not require a proximity matrix. The algorithm seeks to minimize the sum of squared distances between samples and their associated cluster:

\sum_{i=1}^{N} d^2(p_i, c_{p_i})                    (4)
Recalculating the new cluster center as the Euclidean centroid of samples assigned to it locally minimizes this criterion. It can be shown [11] that this method of recalculating cluster centers also minimizes the criterion function with the CD-distance; therefore, the K-means algorithm can be used, with the CD-distance replacing the Euclidean distance. An „adaptive“ version of K-means [8] is used in order to determine the number of clusters dynamically.
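A sketch of plain K-means with a pluggable distance callable (for instance a closure over the CD-distance) and centres recomputed as component-wise averages of the assigned sparse profiles, which, as argued above, also minimizes the criterion under the CD-distance. The adaptive choice of the number of clusters used in the actual system [8] is not sketched, and all names are illustrative.

import random

def kmeans(profiles, k, distance, n_iter=20, seed=0):
    # profiles: list of sparse dicts prototype_id -> fraction
    rng = random.Random(seed)
    centers = [dict(p) for p in rng.sample(profiles, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda c: distance(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if not members:
                continue                      # keep the old centre of an empty cluster
            keys = set().union(*members)
            centers[c] = {key: sum(m.get(key, 0.0) for m in members) / len(members)
                          for key in keys}
    return centers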
8 The Overall Profile The overall profile OP is an array containing an entry for each daily prototype. Each entry OPi contains the number of days of prototype i observed for the account (OPi.n), the sum (OPi .sn) and sum of squares (OPi.ssn) of number of calls in these days. These components are later used to calculate the average Mi and standard deviation σi of the number of calls per day of a prototype, as described in [10]. In order to adapt to changes in customer behavior, the components OPi.x with x = n, sn, ssn may be updated using a decay function, like the one used in [10]. In our current prototype, two such arrays are used; one for business days and another for weekends.
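A sketch of the per-prototype bookkeeping just described; the fields mirror OP_i.n, OP_i.sn and OP_i.ssn from the text, while the use of the sample variance and the omission of the decay function are simplifying assumptions.

import math

class OverallProfileEntry:
    def __init__(self):
        self.n = 0        # days of this prototype observed
        self.sn = 0.0     # sum of daily call counts
        self.ssn = 0.0    # sum of squared daily call counts

    def update(self, calls_today):
        self.n += 1
        self.sn += calls_today
        self.ssn += calls_today ** 2

    def mean(self):
        return self.sn / self.n if self.n else 0.0

    def std(self):
        if self.n < 2:
            return 0.0
        var = (self.ssn - self.sn ** 2 / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))

entry = OverallProfileEntry()
for calls in (12, 15, 9, 14):
    entry.update(calls)
print(round(entry.mean(), 2), round(entry.std(), 2))   # 12.5 2.65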
9 Deviation Detection Matching a daily profile to the overall profile includes qualitative and quantitative checks. The qualitative profile matches the overall profile if it is closer than threshold T_qualitative to the nearest non-zero daily prototype of that customer. The quantitative profile matches the overall profile if (DP.n − M_i) / σ_i ≤ T_quantitative (where i is the prototype closest to DP), and T_quantitative is a threshold in terms of standard deviations. In order to reduce false alerts on daily profiles in which the deviating activity is low, we wish to ignore such profiles, which are of no interest to investigate. We assign a value to each daily profile based on call durations and destinations, which represents the „interestingness level“ of that day (not necessarily the cost of calls). Daily profiles
with a value smaller than Tvalue are not checked, but still update the overall profiles. Finally, quantitative deviations on days with less than Tncalls calls do not issue alerts. A detection process constantly reads CDRs and updates the daily profiles. Once a day it performs the deviation detection and updates the overall profiles.
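A sketch of the daily check just described, assuming the DailyProfile and OverallProfileEntry classes from the earlier sketches, a precomputed nearest daily prototype and its CD-distance to the day's qualitative profile; the threshold names follow the text, everything else (including handling only the checking, not the profile updates) is illustrative.

def check_daily_profile(dp, overall, nearest_id, qual_distance,
                        t_qualitative, t_quantitative, t_value, t_ncalls, day_value):
    # returns the list of deviations detected for one daily profile
    alerts = []
    if day_value < t_value:                      # low-value days are never alerted on
        return alerts
    if qual_distance > t_qualitative:
        alerts.append("qualitative deviation")
    entry = overall.get(nearest_id)
    if entry and entry.std() > 0 and dp.n >= t_ncalls:
        if (dp.n - entry.mean()) / entry.std() > t_quantitative:
            alerts.append("quantitative deviation")
    return alerts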
10 Evaluation To test the technique, we used the data set of wireline CDRs. The data set covered three months’ usage of about 7,000 accounts. We used the first two months’ calls only to learn overall profiles. The third month’s data was checked for behavioral changes and used to update the profiles. We considered only customers with more than 20 active days in the first two months and without fraudulent usage during this period. A total of 6,334 accounts maintained these conditions, out of which 82 accounts (1.3%) included superimposed fraudulent usage in the third month. We ran the system with 240 combinations of the four thresholds. Alerts on one of the first two fraudulent days are considered as „hits“. For comparison we took the widely used rule-based method (for example, [10]). This method is also based on unsupervised learning, and attempts to detect significant changes (increase) in usage. This method accumulates usage (number or duration of calls of certain types) and calculates the average and standard deviation of the daily values of each accumulator. The average and standard deviation are updated with each newly introduced accumulator value. An alert is issued whenever the value of a certain accumulator exceeds a threshold Tstdevs, defined in terms of a standard deviation from the average. We used accumulators of voice, data, international, PRS, toll-free and nightly calls. We accumulated both the number and duration of calls of each type. Here also, only the first two months’ calls were used for learning. In order to prevent alerts on days with low activity, we also used Tvalue (where the daily value was calculated as in our method), Tncalls and Tduration thresholds. We ran the algorithm with 240 different combinations of values for these four thresholds. Evaluation Technique. For our purpose, we cannot compare classifiers using accuracy alone, since the class distribution is not constant and the costs of false negative errors and false positive errors are not equal. In addition, the class distribution is very skewed [4]; in our case, a „do nothing“ strategy will give 98.7% accuracy. In both methods, the thresholds control the number of alerts. The lower the thresholds, the more true positives and more false positives are produced. Therefore, to evaluate the system’s performance, we use two measures: • True positive (hit) rate. the ratio of detected fraud cases of all fraud cases • False positive (false alarm) rate. the ratio of nofraud cases classified as fraud of all nofraud cases. Then we compare the hit rates achieved by each method, given a fixed false alarm rate.
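The two evaluation measures in code; the example counts are hypothetical, chosen only to land near the published 2% false-alarm / 93.90% hit operating point for the 82 fraud and 6,252 fraud-free accounts.

def hit_and_false_alarm_rates(detected_fraud, missed_fraud,
                              alerted_nofraud, quiet_nofraud):
    hit_rate = detected_fraud / (detected_fraud + missed_fraud)
    false_alarm_rate = alerted_nofraud / (alerted_nofraud + quiet_nofraud)
    return hit_rate, false_alarm_rate

print(hit_and_false_alarm_rates(77, 5, 125, 6127))   # about (0.939, 0.020)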
[Figure 2 (graph): % true positive vs. % false positive (0–50%) for all 240 runs of the 3L and RB methods, with their non-decreasing hulls.]

Allowed False Positive Rate    3L         RB
 1%                          65.85%     25.61%
 2%                          93.90%     50.00%
 3%                          95.12%     59.76%
 4%                          95.12%     63.41%
 5%                          96.34%     68.29%
10%                          97.56%     81.71%
15%                         100.00%     86.59%
Fig. 2. Performance comparison on real data. Hit rate vs. false alarm rate of all 240 runs of each method, with the non-decreasing hulls (graph), and a summary of best results (table).
The graph in Fig. 2 depicts the results of the 240 runs of each algorithm, with the corresponding non-decreasing hulls. The non-decreasing hulls reflect the best results of each method. The best results are also summarized in the table in Fig. 2. It can be seen that the 3L method outperforms the naive RB approach, as even the worst results of 3L dominate the best results of RB. 3L reaches a high detection rate given low false alarm rates, where RB performs poorly. Semi-Synthetic Data. In order to evaluate the 3L method on a larger set of fraud cases, and to analyze its sensitivity to various fraud patterns and to different volumes of fraud, we use semi-synthetic data. For this purpose, we added a synthetically generated fraudulent usage to the original usage of 10% (625) of the fraud-free accounts. To each of the selected accounts we added „fraudulent“ usage covering three consecutive days and matching one of six fraud patterns. The patterns differ in quantity of calls and in structure and are distributed equally over the 625 cases. Table 1 shows the total hit rate, as well as the rates of the detected cases of 4 selected patterns, given various false alarm rates. The hit rates correspond to the configuration which generated the best total hit rate (rather than best hit rate for each pattern separately), therefore the hit rates of the patterns are not always increasing.

Table 1. Performance comparison on semi-synthetic data. The average daily number of added CDRs is shown in parentheses in the titles, followed by the pattern's description.
Allowed False Positive Rate: 1% 1% 2% 3% 4% 5% 10%
Possible True Positive Rate:
Pattern 1 (7), Long PRS — 3L / RB: 90.7% 91.6% 23.4% 91.6% 100.0% 91.6% 100.0% 91.6% 100.0% 91.6% 100.0% 92.5% 100.0%
Pattern 2 (100), Call reselling; Pattern 3 (4), Intl. in evenings; Pattern 4 (25), No pattern — 3L / RB: 100.0% 68.7% 70.5% 100.0% 100.0% 71.7% 1.0% 72.4% 56.2% 100.0% 100.0% 71.7% 1.0% 81.0% 99.0% 100.0% 100.0% 72.7% 3.0% 89.5% 99.0% 100.0% 100.0% 72.7% 3.0% 89.5% 100.0% 100.0% 100.0% 73.7% 25.3% 90.5% 99.0% 100.0% 100.0% 73.7% 41.4% 95.2% 100.0%
Total — 3L: 78.6% 81.9% 85.1% 88.0% 88.0% 90.9% 93.4%
Total — RB: 50.1% 77.6% 82.2% 83.5% 86.4% 89.8%
[Figure 3: four panels — (a) pattern 2, 1% false alarm rate; (b) pattern 2, 3% false alarm rate; (c) pattern 3, 1% false alarm rate; (d) pattern 3, 3% false alarm rate — each plotting the best hit rate (0–100%) of 3L and RB against the average number of added CDRs per day (0–60).]
Fig. 3. Sensitivity to pattern size. The points on each line represent the highest hit rate achieved under the given false alarm rate
On the „intensive“ patterns (especially pattern 2) both systems perform well, with the rule-based method performing slightly better. On the other hand, the 3L method is more sensitive to „less obvious“ patterns (such as pattern 4). The 3L method performs better under a lower false positive rate, but gradually loses its superiority when given higher false alarm rates. We have also examined the sensitivity of the technique to the „size“ of fraud (i.e., the number of fraudulent calls). To do this, we used fraud patterns 2 and 3. Pattern 2 („call reselling“) is „general“: all types of voice calls, throughout the day, with various durations. Pattern 3 is „specific“: international data calls in the evenings. For each pattern, we added fraudulent calls to the selected accounts, increasing gradually the number of added calls per day from 3 to 60. For each number of added calls, we ran both methods, each with 240 combinations of thresholds. We then checked how the hit rate increases with the number of added calls, given a fixed rate of false alarm rate. Fig. 3 shows the results. On the general pattern (Fig. 3(a) and 3(b)), the 3L method is more sensitive to small cases of fraud, and gradually looses its superiority when the number of added calls increases. The difference is more significant under lower false alarm rates. On the specific pattern (Fig. 3(c) and 3(d)) 3L reaches relatively high hit rate for small numbers of added calls, significantly better than that of RB.
11 Conclusion and Future Work
Three Level Profiling provides a comprehensive representation of customer behavior. Using this profiling method, rather than profiles based on predefined usage patterns, the system can cope with the dynamic nature of telecommunication fraud.
Initial experiments show the superiority of this method over the rule-based method. In the future, we intend to apply the profiling technique to identify changes in calling behavior for other purposes, such as identifying marketing opportunities and offering new incentives or price plans to customers following a behavioral change.
Acknowledgements
This paper is based on an M.Sc. thesis written by Uzi Murad under the supervision of Professor Victor Brailovsky at Tel Aviv University. We thank Victor Brailovsky for insightful discussions and Saharon Rosset for helpful comments and suggestions.
OPTICS-OF: Identifying Local Outliers Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng1, Jörg Sander Institute for Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich, Germany {breunig | kriegel | ng | sander}@dbs.informatik.uni-muenchen.de phone: +49-89-2178-2225 fax: +49-89-2178-2192
Abstract: For many KDD applications finding the outliers, i.e. the rare events, is more interesting and useful than finding the common cases, e.g. detecting criminal activities in E-commerce. Being an outlier, however, is not just a binary property. Instead, it is a property that applies to a certain degree to each object in a data set, depending on how ‘isolated’ this object is, with respect to the surrounding clustering structure. In this paper, we formally introduce a new notion of outliers which bases outlier detection on the same theoretical foundation as density-based cluster analysis. Our notion of an outlier is ‘local’ in the sense that the outlier-degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. We demonstrate that this notion of an outlier is more appropriate for detecting different types of outliers than previous approaches, and we also present an algorithm for finding them. Furthermore, we show that by combining the outlier detection with a density-based method to analyze the clustering structure, we can get the outliers almost for free if we already want to perform a cluster analysis on a data set.
1. Introduction
Larger and larger amounts of data are collected and stored in databases, increasing the need for efficient and effective analysis methods to make use of the information contained implicitly in the data. Knowledge discovery in databases (KDD) has been defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [9]. Corresponding to the kind of patterns to be discovered, several KDD tasks can be distinguished. Most research in KDD and data mining is concerned with identifying patterns that apply to a large percentage of objects in a data set. For example, the goal of clustering is to identify a set of categories or clusters that describes the structure of the whole data set. The goal of classification is to find a function that maps each data object into one of several given classes. On the other hand, there is another important KDD task applying only to very few objects deviating from the majority of the objects in a data set. Finding exceptions and outliers has not yet received much attention in the KDD area (cf. section 2). However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), finding rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. 1. On sabbatical from: Dept. of CS, University of British Columbia, Vancouver, Canada, [email protected].
Outliers and clusters in a data set are closely related: outliers are objects deviating from the major distribution of the data set; in other words: being an outlier means not being in or close to a cluster. However, being an outlier is not just a binary property. Instead, it is a property that applies to a certain degree to each object, depending on how ‘isolated’ the object is. Formalizing this intuition leads to a new notion of outliers which is ‘local’ in the sense that the outlier-degree of an object takes into account the clustering structure in a bounded neighborhood of the object. Thus, our notion of outliers is strongly connected to the notion of the density-based clustering structure of a data set. We show that both the cluster-analysis method OPTICS (“Ordering Points To Identify the Clustering Structure”), which has been proposed recently [1], as well as our new approach to outlier detection, called OPTICS-OF (“OPTICS with Outlier Factors”), are based on a common theoretical foundation. The paper is organized as follows. In section 2, we will review related work. In section 3, we show that global definitions of outliers are inadequate for finding all points that we wish to consider as outliers. This observation leads to a formal and novel definition of outliers in section 4. In section 5, we give an extensive example illustrating the notion of local outliers. We propose an algorithm to mine these outliers in section 6 including a comprehensive discussion of performance issues. Conclusions and future work are given in section 7.
2. Related Work
Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [4]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed (e.g. [13], [15]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([13], [11]), depth-based approaches become inefficient for large data sets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls, which has a lower bound complexity of Ω(n^(k/2)). Recently, Knorr and Ng proposed the notion of distance-based outliers [12]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth-based approaches for larger values of k. Later in section 3, we will discuss in detail how their notion is different from the notion of local outliers proposed in this paper. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application
domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models.
3. Problems of Current (non-local) Approaches
As we have seen in section 2, most of the existing work in outlier detection lies in the field of statistics. Intuitively, outliers can be defined as given by Hawkins [10].
Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
This notion is formalized by Knorr and Ng [12] in the following definition of outliers.
Definition 2: (DB(p,d)-Outlier) An object o in a data set D is a DB(p,d)-outlier if at least fraction p of the objects in D lies greater than distance d from o.
Below, we will show that definition 2 captures only certain kinds of outliers. Its shortcoming is that it takes a global view of the data set. The fact that many interesting real-world data sets exhibit a more complex structure, in which objects are only outliers relative to their local, surrounding object distribution, is ignored. We give an example of a data set containing objects that are outliers according to Hawkins' definition for which no values for p and d exist such that they are DB(p,d)-outliers. Figure 1 shows a 2-d dataset containing 43 objects. It consists of 2 clusters C1 and C2, each consisting of 20 objects, and there are 3 additional objects o1, o2 and o3.
Fig. 1. 2-d dataset DS1 (clusters C1 and C2, and objects o1, o2, o3)
Intuitively, and according to definition 1, o1, o2 and o3 are outliers, and the points belonging to the clusters C1 and C2 are not. For an object o and a set of objects S, let d(o,S) = min{ d(o,s) | s ∈ S }. Let us consider the notion of outliers according to definition 2:
• o1: For every d ≤ d(o1,C1) and p ≤ 42/43, o1 is a DB(p,d)-outlier. For smaller values of p, d can be even larger.
• o2: For every d ≤ d(o2,C1) and p ≤ 42/43, o2 is a DB(p,d)-outlier. Again, for smaller values of p, d can be even larger.
• o3: Assume that for every point q in C1, the distance from q to its nearest neighbor is larger than d(o3,C2). In this case, no combination of p and d exists such that o3 is a DB(p,d)-outlier and the points in C1 are not:
- For every d ≤ d(o3,C2), a fraction p = 42/43 of all points are further away from o3 than d. However, this condition also holds for every point q ∈ C1. Thus, o3 and all q ∈ C1 are DB(p,d)-outliers.
- For every d > d(o3,C2), the fraction of points further away from o3 is always smaller than for any q ∈ C1, so either o3 and all q ∈ C1 will be considered outliers or (even worse) o3 is not an outlier and all q ∈ C1 are outliers.
From this example, we infer that definition 2 is only adequate under certain, limited conditions, but not for the general case that clusters of different densities exist. In these cases definition 2 will fail to find the local outliers, i.e. outliers that are outliers relative to their local surrounding data space.
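As an aside, Definition 2 is simple to operationalize, and doing so makes the failure mode explicit. The following sketch is ours, not the authors' implementation; it assumes points are given as coordinate tuples and uses Euclidean distance, and the function names are our own.

import math

def dist(a, b):
    # Euclidean distance between two points given as coordinate tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_db_outlier(o, data, p, d):
    # Definition 2: o is a DB(p,d)-outlier if at least a fraction p of the
    # objects in the data set lies at distance greater than d from o.
    far = sum(1 for x in data if x is not o and dist(o, x) > d)
    return far >= p * len(data)

Because the test uses a single global threshold d, any admissible (p,d) that flags a point like o3 also flags every member of a sparse cluster whose internal distances exceed d, mirroring the case analysis above.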
4. Formal Definition of Local Outliers
In this section, we develop a formal definition of outliers that more truly corresponds to the intuitive notion of definition 1, avoiding the shortcomings presented in section 3. Our definition will correctly identify local outliers, such as o3 in figure 1. To achieve this, we do not explicitly label the objects as “outlier” or “not outlier”; instead we compute the level of outlier-ness for every object by assigning an outlier factor. Definition 3: (ε-neighborhood and k-distance of an object p) Let p be an object from a database D, let ε be a distance value, let k be a natural number and let d be a distance metric on D. Then: • the ε-neighborhood Nε(p) are the objects x with d(p,x)≤ε: Nε(p) = { x∈D | d(p,x)≤ε}, • the k-distance of p, k-distance(p), is the distance d(p,o) between p and an object o ∈D such that at least for k objects o’∈D it holds that d(p,o’) ≤ d(p,o), and for at most k-1 objects o’∈D it holds that d(p,o’) < d(p,o). Note that k-distance(p) is unique, although the object o which is called ‘the’ k-nearest neighbor of p may not be unique. When it is clear from the context, we write Nk(p) as a shorthand for Nk-distance(p)(p), i.e. Nk(p) = { x ∈ D | d(p,x) ≤ k-distance(p)}. The objects in the set Nk(p) are called the “k-nearest-neighbors of p” (although there may be more than k objects in Nk(p) if the k-nearest neighbor of p is not unique). Before we can formally introduce our notion of outliers, we have to introduce some basic notions related to the density-based cluster structure of the data set. In [7] a formal notion of clusters based on point density is introduced. The point density is measured by the number of objects within a given area. The basic idea of the clustering algorithm DBSCAN is that for each object of a cluster the neighborhood of a given radius (ε) has to contain at least a minimum number of objects (MinPts). An object p whose ε-neighborhood contains at least MinPts objects is said to be a core object. Clusters are formally defined as maximal sets of density-connected objects. An object p is density-connected to an object q if there exists an object o such that both p and q are density-reachable from o (directly or transitively). An object p is said to be directly density-reachable from o if p lies in the neighborhood of o and o is a core object [7]. A ‘flat’ partitioning of a data set into a set of clusters is useful for many applications. However, an important property of many real-world data sets is that their intrinsic cluster structure cannot be characterized by global density parameters. Very different local densities may be needed to reveal and describe clusters in different regions of the data space. Therefore, in [1] the density-based clustering approach is extended and generalized to compute not a single flat density-based clustering of a data set, but to create an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. This cluster-ordering of a data set is based on the notions of core-distance and reachability-distance.
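Before turning to these density-based notions, the k-distance and k-nearest-neighborhood of Definition 3 can be made concrete with a short sketch. This is our own rendering, not the authors' code: the distance function is passed in as a parameter, and ties are handled by keeping every object within the k-distance, so the neighborhood may contain more than k objects.

def k_distance(p, data, k, dist):
    # k-distance(p): the k-th smallest distance from p to another object
    # (unique even if the k-nearest neighbor itself is not).
    return sorted(dist(p, o) for o in data if o is not p)[k - 1]

def k_neighborhood(p, data, k, dist):
    # N_k(p): every object within k-distance(p); ties can make this larger than k.
    kd = k_distance(p, data, k, dist)
    return [o for o in data if o is not p and dist(p, o) <= kd]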
Definition 4: (core-distance of an object p) Let p be an object from a database D, let ε be a distance value and let MinPts be a natural number. Then, the core-distance of p is defined as

core-distanceε,MinPts(p) =
    UNDEFINED,            if |Nε(p)| < MinPts
    MinPts-distance(p),   otherwise

The core-distance of object p is the smallest distance ε' ≤ ε such that p is a core object with respect to ε' and MinPts if such an ε' exists, i.e. if there are at least MinPts objects within the ε-neighborhood of p. Otherwise, the core-distance is UNDEFINED.
Definition 5: (reachability-distance of an object p w.r.t. object o) Let p and o be objects from a database D, p ∈ Nε(o), let ε be a distance value and let MinPts be a natural number. Then, the reachability-distance of p with respect to o is defined as

reachability-distanceε,MinPts(p, o) =
    UNDEFINED,                                      if |Nε(o)| < MinPts
    max(core-distanceε,MinPts(o), d(o, p)),         otherwise
The reachability-distance of an object p with respect to object o is the smallest distance such that p is directly density-reachable from o if o is a core object within p's ε-neighborhood. To capture this idea, the reachability-distance of p with respect to o cannot be smaller than the core-distance of o since for smaller distances no object is directly density-reachable from o. Otherwise, if o is not a core object, the reachability-distance is UNDEFINED. Figure 2 illustrates the core-distance and the reachability-distance.
Fig. 2. Core-distance(o) and reachability-distances r(p1,o), r(p2,o) for MinPts=4
The core-distance and reachability-distance were originally introduced for the OPTICS algorithm [1]. The OPTICS algorithm computes a "walk" through the data set, and calculates for each object o the core-distance and the smallest reachability-distance with respect to an object considered before o in the walk. Such a walk through the data satisfies the following condition: Whenever a set of objects C is a density-based cluster with respect to MinPts and a value ε' smaller than the value ε used in the OPTICS algorithm, then a permutation of C (possibly without a few border objects) is a subsequence in the walk. Therefore, the reachability-plot (i.e. the reachability values of all objects plotted in the OPTICS ordering) yields an easy to understand visualization of the clustering structure of the data set. Roughly speaking, a low reachability-distance indicates an object within a cluster, and a high reachability-distance indicates a noise object or a jump from one cluster to another cluster. The reachability-plot for our dataset DS1 is depicted in figure 3 (top). The global structure revealed shows that there are the two clusters, one of which is more dense than the other, and a few objects outside the clusters. Another example of a reachability-plot for the more complex data set DS2 (figure 4) containing hierarchical clusters is depicted in figure 5.
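Read operationally, Definitions 4 and 5 translate into a few lines of code. The sketch below is our reading of the definitions, not the paper's implementation: None stands in for UNDEFINED, the neighborhood excludes p itself (one of several possible conventions), and the k_distance helper from the previous sketch is reused.

def core_distance(p, data, eps, min_pts, dist):
    # UNDEFINED (None) unless the eps-neighborhood of p contains at least MinPts
    # objects; otherwise the MinPts-distance of p (Definition 4).
    eps_neighbors = [o for o in data if o is not p and dist(p, o) <= eps]
    if len(eps_neighbors) < min_pts:
        return None
    return k_distance(p, data, min_pts, dist)

def reachability_distance(p, o, data, eps, min_pts, dist):
    # Definition 5: max(core-distance(o), d(o, p)); UNDEFINED if o is not a core object.
    cd = core_distance(o, data, eps, min_pts, dist)
    if cd is None:
        return None
    return max(cd, dist(o, p))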
Definition 6: (local reachability density of an object p) Let p be an object from a database D and let MinPts be a natural number. Then, the local reachability density of p is defined as

lrdMinPts(p) = 1 / ( ( Σ over o ∈ NMinPts(p) of reachability-distance∞,MinPts(p, o) ) / |NMinPts(p)| )
The local reachability density of an object p is the inverse of the average reachability-distance from the MinPts-nearest-neighbors of p. The reachability-distances occurring in this definition are all defined, because ε=∞. The lrd is ∞ if all reachability-distances are 0. This may occur for an object p if there are at least MinPts objects, different from p, but sharing the same spatial coordinates, i.e. if there are at least MinPts duplicates of p in the data set. For simplicity, we will not handle this case explicitly but simply assume that there are no duplicates. (To deal with duplicates, we can base our notion of neighborhood on a k-distinct-distance, defined analogously to k-distance in definition 3 with the additional requirement that there be at least k different objects.) The reason for using the reachability-distance instead of simply the distance between p and its neighbors o is that it will significantly weaken statistical fluctuations of the inter-object distances: lrds for objects which are close to each other in the data space (whether in clusters or noise) will in general be equaled by using the reachability-distance because it is at least as large as the core-distance of the respective object o. The strength of the effect can be controlled by the parameter MinPts. The higher the value for MinPts the more similar the reachability-distances for objects within the same area of the space. Note that there is a similar 'smoothing' effect for the reachability-plot produced by the OPTICS algorithm, but in this case of clustering we also weaken the so-called 'single-link effect' [14].
Fig. 3. Reachability-plot and outlier factors for DS1
Definition 7: (outlier factor of an object p) Let p be an object from a database D and let MinPts be a natural number. Then, the outlier factor of p is defined as

OFMinPts(p) = ( Σ over o ∈ NMinPts(p) of lrdMinPts(o) / lrdMinPts(p) ) / |NMinPts(p)|
The outlier factor of the object p captures the degree to which we call p an outlier. It is the average of the ratios of the lrds of the MinPts-nearest-neighbors and p. If these are identical, which we expect for objects in clusters of uniform density, the outlier factor is 1. If the lrd of p is only half of the lrds of p’s MinPts-nearest-neighbors, the outlier factor of p is 2. Thus, the lower p’s lrd is and the higher the lrds of p’s MinPts-nearestneighbors are, the higher is p’s outlier factor.
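Combining Definitions 6 and 7 gives a direct, if naive, implementation. The following sketch is ours, not the authors' code: ε is taken as ∞ as in Definition 6, duplicate objects are assumed absent, and the k_distance and k_neighborhood helpers from the earlier sketches are reused.

def lrd(p, data, min_pts, dist):
    # Local reachability density (Definition 6): inverse of the average
    # reachability-distance from p's MinPts-nearest neighbors; with eps = infinity,
    # core-distance(o) reduces to the MinPts-distance of o.
    neighbors = k_neighborhood(p, data, min_pts, dist)
    reach = [max(k_distance(o, data, min_pts, dist), dist(o, p)) for o in neighbors]
    return len(neighbors) / sum(reach)

def outlier_factor(p, data, min_pts, dist):
    # OF(p) (Definition 7): average ratio of the neighbors' lrd to the lrd of p.
    neighbors = k_neighborhood(p, data, min_pts, dist)
    ratio_sum = sum(lrd(o, data, min_pts, dist) for o in neighbors)
    return ratio_sum / (len(neighbors) * lrd(p, data, min_pts, dist))

Values near 1 indicate an object embedded in a cluster of uniform density; the larger the value, the more isolated the object is relative to its own neighborhood, which is exactly the behavior reported for DS1 in the next paragraph.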
Figure 3 (top) shows the reachability-plot for DS1 generated by OPTICS [1]. Two clusters are visible: first the dense cluster C2, then points o3 and o1 (larger reachability values) and - after the large reachability indicating a jump - all of cluster C1 and finally o2. Depicted below the reachability-plot are the corresponding outlier factors (the objects are in the same order as in the reachability-plot). Object o1 has the largest outlier factor (3.6), followed by o2 (2.0) and o3 (1.4). All other objects are assigned outlier factors between 0.993 and 1.003. Thus, our technique successfully highlights not only the global outliers o1 and o2 (which are also DB(p,d)-outliers), but also the local outlier o3 (which is not a reasonable DB(p,d)-outlier).
5. An Extensive Example
Fig. 4. Example dataset DS2 (objects o1, o2, o3, o4 and o5 are marked)
In this section, we demonstrate the effectiveness of the given definition using a complex 2-d example data set (DS2, figure 4, 473 points), containing most characteristics of real-world data sets, i.e. hierarchical/overlapping clusters and clusters of widely differing densities and arbitrary shapes. We give a small 2-d example to make it easier to understand the concepts. Our approach, however, works equally well in higher dimensional spaces. DS2 consists of 3 clusters of uniform (but different) densities and one hierarchical cluster of a low density containing 2 small and 1 bigger subcluster. The data set also contains 12 outliers.
Fig. 5. Reachability-plot (ε=50, MinPts=10) and outlier factors OF10 for DS2
Figure 5 (top) shows the reachability-plot generated by OPTICS. We see 3 clusters with different, but uniform densities in areas 1, 2 and 3, a large, hierarchical cluster in area 4 and its subclusters in areas 4.1, 4.2 and 4.3. The noise points (outliers) have to be located in areas N1, N2, N3 and N4. Figure 5 (bottom) shows the outlier factors for MinPts=10 (objects in the same order). Most objects are assigned outlier factors of around 1. In areas N3 and N4 there is one point each (o1 and o2) with outlier factors of 3.0 and 2.7 respectively, characterizing outliers with local reachability densities about half to one third of the surrounding space. The most interesting area is N1. The outlier factors are between 1.7 and 6.3. The first two points with high outlier factors 5.4 and 6.3 are o3 and o4. Both only have one close neighbor (the other one) and all other neighbors are far away in the cluster in area 3, which has a high density (recall that for MinPts=10 we are looking at the 10-nearest-neighbors). Thus, o3 and o4 are assigned large outlier factors. The other points in N1, however, are assigned much smaller
(but still significantly larger than 1) outlier factors between 1.7 and 2.4. These are the points surrounding o5 which can either be considered a small, low density cluster or outliers, depending on one's viewpoint. Object o5 as the center point of this low density cluster is assigned the lowest outlier factor of 1.7, because it is surrounded by points of equal local reachability density. We also see that, from the reachability-plot, we can only infer that all points in N1 are in an area of low density, because the reachability values are high. However, no evaluation concerning their outlierness is possible.
6. Mining Local Outliers - Performance Considerations
To compute the outlier factors OFMinPts(p) for all objects p in a database, we have to perform three passes over the data. In the first pass, we compute NMinPts(p) and core-distance∞,MinPts(p). In the second pass, we calculate reachability-distance∞,MinPts(p,o) of p with respect to its neighboring objects o∈NMinPts(p) and lrdMinPts(p) of p. In the third pass, we can compute the outlier factors OF(p). The runtime of the whole procedure is heavily dominated by the first pass over the data since we have to perform k-nearest-neighbor queries in a multidimensional database, i.e. the runtime of the algorithm is O(n * runtime of a MinPts-nearest-neighbor query). Obviously, the total runtime depends on the runtime of the k-nearest-neighbor query. Without any index support, answering a k-nearest-neighbor query requires a scan through the whole database. In this case, the runtime of our outlier detection algorithm would be O(n^2). If a tree-based spatial index can effectively be used, the runtime is reduced to O(n log n) since k-nearest-neighbor queries are supported efficiently by spatial access methods such as the R*-tree [3] or the X-tree [2] for data from a vector space or the M-tree [5] for data from a metric space. The height of such a tree-based index is O(log n) for a database of n objects in the worst case and, at least in low-dimensional spaces, a query with a reasonable value for k has to traverse only a limited number of paths. If the algorithm OPTICS is also applied to the data set, i.e. if we also want to perform some kind of cluster analysis, we can drastically reduce the cost of the outlier detection. The algorithm OPTICS retrieves the ε-neighborhood Nε(p) for each object p in the database, where ε is an input parameter. These ε-neighborhoods can be utilized in the first pass of our outlier detection algorithm: we have to perform a MinPts-nearest-neighbor query for p only if the neighborhood Nε(p) of p does not already contain MinPts objects. Otherwise, we can retrieve NMinPts(p) from Nε(p), since then it holds that NMinPts(p) ⊆ Nε(p). Our experiments indicate that in real applications, for reasonable values of ε and MinPts, this second case is much more frequent than the first case.
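A naive O(n^2) driver for the three passes might look as follows; this is our sketch without any index support, reusing the k_distance and k_neighborhood helpers defined earlier and keying intermediate results by object identity.

def optics_of(data, min_pts, dist):
    # Pass 1: materialize the MinPts-neighborhood and MinPts-distance of every object.
    neighbors = {id(p): k_neighborhood(p, data, min_pts, dist) for p in data}
    kdist = {id(p): k_distance(p, data, min_pts, dist) for p in data}
    # Pass 2: reachability-distances (eps = infinity) and local reachability densities.
    lrd_of = {}
    for p in data:
        reach = [max(kdist[id(o)], dist(o, p)) for o in neighbors[id(p)]]
        lrd_of[id(p)] = len(reach) / sum(reach)
    # Pass 3: outlier factors as the average lrd-ratio over each neighborhood.
    return {id(p): sum(lrd_of[id(o)] for o in neighbors[id(p)])
                   / (len(neighbors[id(p)]) * lrd_of[id(p)])
            for p in data}

With a spatial index, only the neighborhood queries in the first pass change; the second and third passes already run in time linear in the total neighborhood size.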
7. Conclusions
Finding outliers is an important task for many KDD applications. All proposals so far considered ‘being an outlier’ a binary property. We argue instead, that it is a property that applies to a certain degree to each object in a data set, depending on how ‘isolated’ this object is, with respect to the surrounding clustering structure. We formally defined
the notion of an outlier factor, which captures exactly this relative degree of isolation. The outlier factor is local by taking into account the clustering structure in a bounded neighborhood of the object. We demonstrated that this notion is more appropriate for detecting different types of outliers than previous approaches. Our definitions are based on the same theoretical foundation as density-based cluster analysis and we show how to analyze the cluster structure and the outlier factors efficiently at the same time. In ongoing work, we are investigating the properties of our approach in a more formal framework, especially with regard to the influence of the MinPts value. Future work will include the development of a more efficient and an incremental version of the algorithm based on the results of this analysis.
References
1. Ankerst M., Breunig M. M., Kriegel H.-P., Sander J.: "OPTICS: Ordering Points To Identify the Clustering Structure", Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, 1999.
2. Berchtold S., Keim D., Kriegel H.-P.: "The X-Tree: An Index Structure for High-Dimensional Data", Proc. 22nd Int. Conf. on Very Large Data Bases, Bombay, India, 1996, pp. 28-39.
3. Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, ACM Press, New York, 1990, pp. 322-331.
4. Barnett V., Lewis T.: "Outliers in Statistical Data", John Wiley, 1994.
5. Ciaccia P., Patella M., Zezula P.: "M-tree: An Efficient Access Method for Similarity Search in Metric Spaces", Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997, pp. 426-435.
6. DuMouchel W., Schonlau M.: "A Fast Computer Intrusion Detection Algorithm based on Hypothesis Testing of Command Transition Probabilities", Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 189-193.
7. Ester M., Kriegel H.-P., Sander J., Xu X.: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
8. Fawcett T., Provost F.: "Adaptive Fraud Detection", Data Mining and Knowledge Discovery Journal, Kluwer Academic Publishers, Vol. 1, No. 3, pp. 291-316.
9. Fayyad U., Piatetsky-Shapiro G., Smyth P.: "Knowledge Discovery and Data Mining: Towards a Unifying Framework", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 82-88.
10. Hawkins, D.: "Identification of Outliers", Chapman and Hall, London, 1980.
11. Johnson T., Kwok I., Ng R.: "Fast Computation of 2-Dimensional Depth Contours", Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 224-228.
12. Knorr E. M., Ng R. T.: "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. 24th Int. Conf. on Very Large Data Bases, New York, NY, 1998, pp. 392-403.
13. Preparata F., Shamos M.: "Computational Geometry: An Introduction", Springer, 1988.
14. Sibson R.: "SLINK: an optimally efficient algorithm for the single-link cluster method", The Computer Journal, Vol. 16, No. 1, 1973, pp. 30-34.
15. Tukey J. W.: "Exploratory Data Analysis", Addison-Wesley, 1977.
Selective Propositionalization for Relational Learning
Érick Alphonse, Céline Rouveirol
Inference and Learning Group, LRI - UMR 8623 CNRS
Bâtiment 490, Université Paris-Sud, 91405 Orsay Cedex (France)
{alphonse,celine}@lri.fr
Abstract. A number of Inductive Logic Programming (ILP) systems have addressed the problem of learning First Order Logic (FOL) discriminant definitions by first reformulating the problem expressed in a FOL framework into an attribute-value problem and then applying efficient algebraic learning techniques. The complexity of such propositionalization methods is now in the size of the reformulated problem, which can be exponential. We propose a method that selectively propositionalizes the FOL training set by interleaving boolean reformulation and algebraic resolution. It avoids, as much as possible, the generation of redundant boolean examples, and still ensures that explicit correct and complete definitions are learned.
1 Introduction
Learning relational concepts from examples stored in a multi-relational database has been identified as a challenge for Inductive Logic Programming (ILP) techniques by both the KDD and ILP communities [3]. However, it is a well-known fact that the counterpart of learning in restrictions of FOL, even relational ones, is the dramatic complexity of the coverage test between a hypothesis and an example. Here, we address discriminant concept learning in Datalog target concept languages1. In such languages, the exponential complexity of subsumption (classically θ-subsumption [9]) is inherent to the non-determinacy of the computation of "matching" substitutions between a hypothesis and an example. This can happen when the Entity-Relationship schema of the target relational database contains 1-n or n-n associations. While a number of specific biases have been developed directly in an FOL framework to control this indeterminacy by restricting the target concept language (see for instance ij-determination [8]), a family of ILP methods (among others, LINUS [6], STILL [13], REPART [15], SP [5]) have addressed this problem by propositionalizing the ILP problem, i.e., by reformulating the ILP learning problem into an attribute-value or even boolean one, which can then be
⋆ This work has been partially supported by ESPRIT through LTR ILP 2 n. 20237.
1 Horn clause languages without function symbols other than constants.
handled by learning techniques dedicated to this simpler formalism. Once the representation change has been performed, robust and efficient algorithms can be successfully applied, provided that the discriminant features of the FOL learning problem are preserved by propositionalization. Propositionalizations in those systems all adopt the same schema: given a pattern P , FOL examples are reformulated into their (potentially multiple) matchings with P , yielding a tabular representation. Of course, the subsumption test being of exponential complexity in an unrestricted Datalog language, the size of the reformulated problem can be exponential [1] as well as highly redundant, and cannot be directly addressed as such for complex relational learning problems. This paper presents a selective propositionalization that controls the size of the reformulated problem: instead of generating the whole boolean reformulation of the FOL problem before resolution, this method interleaves boolean reformulation and algebraic resolution. Information gathered during algebraic resolution is used to constrain the generation of the reformulated boolean problem to the boolean vectors that are useful for next refinement step(s) only. In doing so, it avoids, as much as possible, the generation of redundant boolean examples, enables partial storing of positive boolean instances only and still ensures that correct and complete definitions are learned.
2 Background
After [12, 15, 11], a learning problem can be decomposed into two subproblems, a relational (or structural) one and a functional one. To illustrate this decomposition, consider learning from examples stored in a multi-relational database. Here, learning from literals representing the multiple foreign key links [14] among tuples of different relations is a structural learning problem, whereas learning on the other (mono-valued) attributes of those relations is a functional one. Consequently, this paper focuses on relational learning, which is typically a non-determinate learning problem, within a Datalog target concept language without constants and without restriction on the depth or level of "indeterminacy" of existential variables [8]. In such a language, the propositionalization process is described as follows:
Definition 1. The pattern P is built from a seed positive example e, as the maximal generalization of e plus equality constraints between pairs of variables in the pattern which are satisfied by e (see example below). Each training example is then translated into a set of boolean vectors. For each matching σi of P variables onto constants of e, the attributes of the boolean vector associated to a FOL example indicate which constraints of the pattern (presence/absence of a literal, links between variables of the pattern) are satisfied by σi.
Thus, the FOL search space is shifted to a boolean lattice ordered by boolean inclusion, denoted ≺b. The search space of the reformulated problem is then that of concepts more general than or equal to the seed example: a partial mapping of P literals to FOL example literals yields a more general boolean vector than P, whereas a complete mapping yields a boolean vector equivalent to P. For instance, if E, CE are a positive and a negative example of the target concept and E′ is the seed example, the obtained tabular representation is:
E  : c(a) ← p(a, b), p(b, c), q(c), q(a).
E′ : c(a) ← p(a, b), q(b), q(a), r(c).
CE : c(a) ← p(a, b), p(b, c), q(b), q(c).

P        c(U)  p(V,W)  q(X)  q(Y)  r(Z)  U=V  U=Y  V=Y  W=X
θE,1       1     1      1     1     0     1    0    0    0
θE,2       1     1      1     1     0     1    1    1    0
θE,i       1     1      1     1     0     0    0    0    1
θCE,1      1     1      1     1     0     1    0    0    1
θCE,2      1     1      1     1     0     1    0    0    1
θCE,j      1     1      1     1     0     0    0    1    1

Fig. 1. Excerpt of boolean representation of a FOL problem
Moreover, as pointed out in [15], the learning task is no longer to induce a Datalog concept consistent with all boolean vectors, but what is referred to as the multi-part problem2 : Definition 2. (after [15]) The multi-part problem consists of finding a description that covers, for all FOL positive examples, at least one of their associated boolean vectors (completeness) and none of the boolean vectors associated to any FOL negative example (correctness).
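To make the reformulation tangible, here is a deliberately naive sketch; the encoding is our own, not that of the systems cited below, and real propositionalizers enumerate matchings far more selectively. It generates the boolean vectors of a FOL example against a pattern given as a list of literals plus equality constraints.

from itertools import product

def boolean_vectors(pattern_lits, equalities, example_lits, constants):
    # Enumerate every assignment of the pattern variables to the example's constants;
    # each assignment yields one boolean vector recording which pattern literals occur
    # in the example and which equality constraints between pattern variables hold.
    variables = sorted({v for _, args in pattern_lits for v in args})
    vectors = set()
    for values in product(sorted(constants), repeat=len(variables)):
        sigma = dict(zip(variables, values))
        row = tuple((pred, tuple(sigma[v] for v in args)) in example_lits
                    for pred, args in pattern_lits)
        row += tuple(sigma[a] == sigma[b] for a, b in equalities)
        vectors.add(row)
    return vectors

# Hypothetical encoding of the running example (pattern built from the seed E'):
pattern = [("c", ("U",)), ("p", ("V", "W")), ("q", ("X",)), ("q", ("Y",)), ("r", ("Z",))]
equalities = [("U", "V"), ("U", "Y"), ("V", "Y"), ("W", "X")]
E = {("c", ("a",)), ("p", ("a", "b")), ("p", ("b", "c")), ("q", ("c",)), ("q", ("a",))}
rows = boolean_vectors(pattern, equalities, E, constants={"a", "b", "c"})

The rows of Fig. 1 appear among the generated vectors; the point of the selective method described below is precisely to avoid materializing this full, largely redundant set.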
3 State of the Art
Although the reformulated problem can be delegated to efficient and robust algorithms (SP [5] with C4.5 [10], REPART with CHARADE [4]), the space complexity becomes intractable, as does the time complexity. As pointed out by [1], boolean learners working on the reformulated problem must deal with data of exponential size wrt the FOL problem. Indeed, following definition 1, each positive or negative FOL example is described by a set of boolean vectors (termed in the remainder of the paper positive and negative boolean vectors respectively), the cardinality of which is equal to its multiple matchings with the propositionalization pattern. As far as we know, two learning systems have addressed this problem: STILL [13] and REPART [15]. For the former, propositionalization is performed through a stochastic selection of η example matchings (η is a system parameter) with the pattern, which allows for bounding the size of the reformulated problem, yielding a polynomial generalization process. To offset the imperfection of such generalizations as ”standalone” classifiers, STILL learns a committee of them, that classifies unseen examples in a nearest neighbor-like way. For the latter, restriction of the reformulated problem is performed through the choice of a relevant propositionalization pattern. The user/expert must provide a pattern as a (strong) bias which allows him to drastically decrease the matching space. The validity of the method relies on the assumption that the selected pattern preserves the discrimination information sought for. As the FOL learning problem is propositionalized before resolution, this system nevertheless has to cope with the size of the reformulated data. We propose one alternative method in order to both reduce the data size of the reformulated problem and avoid data storing as much as possible. 2
As noted by the authors, this learning problem is closely related to what Dietterich termed the multi-instance problem [2].
4 Selective Propositionalization
build P from a seed positive example (see def. 1)
initialize G as the universal element of the search space
For each ce ∈ CE do
  Repeat
    Select g ∈ G
    (1) Compute a boolean vector b from ce   (* P ≺b b ≺b g *)
    If b is equal to P Then
      no structural discrimination is possible
    Else
      Specialize G to discriminate b   (* algebraic resolution *)
      (2) Evaluate each element of G wrt positive example coverage
      Update G   (* beam search strategy *)
    Endif
  Until all elements of G are correct
Endfor
return(G)

Fig. 2. Computation of n elements of G
The overall structure of our algorithm is quite classical. It is based on the Candidate Elimination Algorithm [7] and implements a covering method for learning disjunctive concepts. The algorithm computes a set of maximally general and correct solutions by a top-down search in a boolean search space. The two original ideas of the algorithm stem from the fact that the boolean examples handled by the algorithm are not generated before learning proceeds, but during resolution. Therefore, as opposed to classical propositionalization methods (see sec. 1) which compute as many boolean examples as the number of matchings between the pattern and FOL examples, this algorithm constructively exploits: i) information gathered during resolution to only generate boolean examples that are useful for (in)validating the current specialization step; ii) the partial ordering on the instance space in order to generate useful examples, that is, the "close to" most specific ones.
4.1 Exploiting Current Resolution Information
In classical propositionalization techniques, all (positive and negative) FOL examples are reformulated into their multiple matchings with a given pattern P . In contrast, our method only looks for boolean vectors that may invalidate the current hypothesis g and therefore yields a specialization of g. At each search step, it therefore attempts to build a negative boolean vector more specific than g, i.e., which contains at least all boolean attributes of g. For a Datalog language, several tentative matchings may be necessary (in the worst case, an exponential number) in order to build a matching substitution σ. However, the benefit of selective propositionalization wrt a classical propositionalization, in terms of the matching space explored, is theoretically (and empirically, as shown in our first experiments, sec. 5) substantial: the space of matching substitutions to be searched is induced by literals belonging to g as opposed to P , that is, by relevant predicates wrt the current discriminant task.
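A brute-force rendering of this step is sketched below; it is ours, not the system's implementation, and a real implementation interleaves the constraint checks with the search and prunes aggressively, whereas this version simply backtracks over complete assignments. The property it illustrates is that only the literals and variable equalities present in the current hypothesis g drive the matching, not the full pattern P.

def find_covering_vector(g_literals, g_equalities, example_lits, constants):
    # Search for one substitution of g's variables under which every literal of g
    # occurs in the (negative) example and every equality constraint of g holds.
    # Returning None means the example cannot invalidate g structurally.
    variables = sorted({v for _, args in g_literals for v in args} |
                       {v for pair in g_equalities for v in pair})

    def extend(idx, sigma):
        if idx == len(variables):
            lits_ok = all((pred, tuple(sigma[v] for v in args)) in example_lits
                          for pred, args in g_literals)
            eqs_ok = all(sigma[a] == sigma[b] for a, b in g_equalities)
            return dict(sigma) if lits_ok and eqs_ok else None
        for c in sorted(constants):
            sigma[variables[idx]] = c
            found = extend(idx + 1, sigma)
            if found is not None:
                return found
        return None

    return extend(0, {})

If such a substitution exists, the corresponding boolean vector is at least as specific as g and triggers a specialization step; the matching space searched is induced by g alone.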
4.2 Partial Ordering on the Instance Space
The size of the reformulated boolean problem is upper-bounded by the number of matching substitutions between the pattern and the FOL examples (positive and negative), but a large fraction of these boolean vectors are redundant (see in fig. 1 θE,1 wrt θE,2 and θCE,1 wrt θCE,2) in that they do not directly take part in the process of building a correct and complete discriminant solution. Such redundant data occur when propositionalizing both negative and positive examples (respectively, steps 1 and 2 of the algorithm). Indeed, as far as negative examples are concerned, and after [12], there is a partial ordering (nearest-miss) of the negative instance space and it has been shown that only maximally specific negative examples wrt this partial ordering are sufficient for solving the discriminant learning problem. For positive examples, if we refer to definition 2, a FOL example is covered in the boolean search space if at least one of its corresponding boolean vectors is covered. Therefore, only the most specific ones wrt boolean inclusion are sufficient for our learning problem. After computing the matching substitution σ as stated above, the propositionalization will be as efficient as the extracted boolean vector is specific. We therefore complete σ by deterministically (hence with polynomial time complexity) matching literals of P with the FOL example. In doing so, we cannot ensure that the extracted boolean vector is a most specific one, which would require exponential complexity, but only that it is a "close to" most specific one.
5 Experiments
The efficiency of our approach is evaluated by two figures: the percentage of boolean vectors computed and the percentage of the matching space explored by our approach, relative to the classical propositionalization methods presented in section 2. The former reflects the amount of non-redundant boolean vectors empirically computed by the selective propositionalization, that is, the complexity of solving the learning problem. The latter reflects the complexity of the selective propositionalization itself. As a learning database, we have used a hard artificial problem derived from Michalski's trains, involving an amount of data (about one hundred million) that is intractable for classical propositionalization methods. As a result, we have obtained an amount of 0.0018% of boolean vectors computed (with a standard deviation of 0.0017%), by exploring 1.62% of the whole matching space (with a standard deviation of 1.69%). As a corollary, learning methods implementing selective propositionalization are empirically about 62 times faster than classical propositionalization methods.
6 Conclusion
We have proposed an original propositionalization method which benefits from the advantages of both generate and test methods, using a ”more general than or equal to” partial ordering, and from a sound and efficient algebraic specialization 3
operator. The selective propositionalization method has been validated on an artificial, yet complex relational problem involving a huge matching space, and seems well-suited for handling highly indeterminate FOL learning problems. The generation of a large amount of redundant data is avoided. On the other hand, the Version Space approach allows for storing just a few boolean vectors computed from positive FOL examples only. Finally, this selective propositionalization technique can be adapted to any subsumption relation in the original FOL search space, and it can be combined with additional biases that can further improve the overall efficiency. For instance, user bias [15] can be incorporated in the pattern definition to further decrease the size of the matching space.
References
1. L. De Raedt. Attribute-value learning versus inductive logic programming: The missing link. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 1-8. Springer Verlag, 1998.
2. T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multi-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1996.
3. U. Fayyad. Knowledge discovery in databases: An overview. In N. Lavrač and S. Džeroski, editors, Proc. of the 7th International Workshop on ILP, pages 1-16. Springer Verlag, 1997.
4. J.-G. Ganascia. A rule system learning system. In R. Bajcsy, editor, Proc. of the International Joint Conference on Artificial Intelligence, pages 432-438. Morgan Kaufmann, 1993.
5. S. Kramer, B. Pfahringer, and C. Helma. Stochastic propositionalization of non-determinate background knowledge. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 80-94. Springer Verlag, 1998.
6. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
7. T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
8. S. Muggleton and C. Feng. Efficient induction in logic programs. In S. Muggleton, editor, International Workshop on ILP, pages 281-298. Academic Press, 1992.
9. G. Plotkin. A note on inductive generalization. In Machine Intelligence, volume 5. Edinburgh University Press, 1970.
10. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
11. M. Sebag. Resource bounded induction and deduction in FOL. In Proc. on Multi Strategy Learning, 1998.
12. M. Sebag and C. Rouveirol. Induction of maximally general clauses consistent with integrity constraints. In S. Wrobel, editor, Proc. of the 4th International Workshop on ILP, pages 195-216, 1994.
13. M. Sebag and C. Rouveirol. Tractable induction and classification in first order logic via stochastic matching. In Proc. of the 15th Int. Joint Conf. on Artificial Intelligence (IJCAI'97), Nagoya, Japan, pages 888-893. Morgan Kaufmann, 1997.
14. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proc. of PKDD, pages 78-87. Springer Verlag, 1997.
15. J.-D. Zucker and J.-G. Ganascia. Learning structurally indeterminate clauses. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 235-244. Springer Verlag, 1998.
Circle Graphs: New Visualization Tools for Text-Mining Yonatan Aumann, Ronen Feldman, Yaron Ben Yehuda, David Landau, Orly Liphstat, Yonatan Schler Department of Mathematics and Computer Science Bar-Ilan University Ramat-Gan, ISRAEL Tel: 972-3-5326611 Fax: 972-3-5326612 [email protected] Abstract. The proliferation of digitally available textual data necessitates automatic tools for analyzing large textual collections. Thus, in analogy to data mining for structured databases, text mining is defined for textual collections. A central tool in text-mining is the analysis of concept relationship, which discovers connections between different concepts, as reflected in the collection. However, discovering these relationships is not sufficient, as they also have to be presented to the user in a meaningful and manageable way. In this paper we introduce a new family of visualization tools, which we coin circle graphs, which provide means for visualizing concept relationships mined from large collections. Circle graphs allow for instant appreciation of multiple relationships gathered from the entire collection. A special type of circle-graphs, called Trend Graphs, allows
tracking of the evolution of relationships over time.
1 Introduction
Most informal definitions [2] introduce knowledge discovery in databases (KDD) as the extraction of useful information from databases by large-scale search for interesting patterns. The vast majority of existing KDD applications and methods deal with structured databases, for example, client data stored in a relational database, and thus exploit data organized in records structured by categorical, ordinal, and continuous variables. However, a tremendous amount of information is stored in documents that are essentially unstructured. The availability of document collections and especially of online information is rapidly growing, so that an analysis bottleneck often arises also in this area. Thus, in analogy to data mining for structured data, text mining is defined for textual data. Text mining is the science of extracting information from hidden patterns in large textual collections. Text mining shares many characteristics with classical data mining, but also differs in some. Thus, it is necessary to provide special tools geared specifically to text mining. A central tool, found valuable in text mining, is the analysis of concept
relationship [3,4], defined as follows. Large textual corpora are most commonly composed of a collection of separate documents (e.g. news articles, web pages). Each document refers to a set of concepts (terms). Text mining operations consider the distribution of concepts on the inter-document level, seeking to discover the nature and relationships of concepts as reflected in the collection as a whole. For example, in a collection of news articles, a large number of articles on politician X and "scandal" may indicate a negative image of the character, and call for a new PR campaign. Or, for another example, a growing number of articles on both company Y and product Z may indicate a shift of focus in the company's interests, a shift which should be noted by its competitors. Notice that in both of these cases, the information is not provided by any single document, but rather by the totality of the collection. Thus, concept relationship analysis seeks to discover the relationships between concepts, as reflected by the totality of the corpus at hand. Clearly, discovering the concept relationships is only useful insofar as this information can be conveyed to the end-user. In practice, even medium-sized collections tend to give rise to a very large number of relationships. Thus, a mere listing of the discovered relationships is of little practical use for the end-user, as it is too large to comprehend. In addition, a linear list fails to show the structure arising from the entirety of relationships. Thus, we find that in order for mining of concept relationships to be a useful tool for text analysis, proper visualization techniques are necessary for presenting the results to the end user in a meaningful and manageable form. In this paper we introduce a new family of visualization tools for text-mining, which we call Circle Graphs. Circle graphs prove to be an effective tool for visualizing concept relationships discovered in text-mining. The graphs provide the user with an instant overall view of many relationships at once. Thus, circle graphs provide the extra benefit of surfacing the overall structure emerging from the multitude of relationships. We describe two specific types of circle graphs:
1. Category Connection Graphs: provide a graphic representation of relationships between concepts in different categories.
2. Context Connection Circle Graphs: provide the user with a graphic representation of the connections between entities in a given context.
2 Circle Graphs
We now describe the circle graphs visualization. We first give some basic definitions and notations.
2.1 Definitions and Notations
Let T be a taxonomy. T is represented as a DAG (Directed Acyclic Graph), with the terms at the leaves. For a given node v∈T, we denote by Terms(v) the terms which are descendants of v. Let D be a collection of documents. For terms e1 and e2 we denote by supD(e1,e2) the number of documents in D which indicate a relationship between e1 and e2. The nature of the indication can be defined individually according to the context. In the current implementation we say that a document indicates a relationship if both terms appear in the document in the same sentence. This has proved to be a strong indicator. Similarly, for a collection D and terms e1, e2 and c, we denote by supD(e1,e2,c) the number of documents which indicate a relationship between e1 and e2 in the context of c (e.g. a relationship between the UK and Ireland in the context of peace talks). Again, the nature of the indication may be determined in many ways. In the current implementation we require that they all appear in the same sentence.
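In code, the support counts reduce to sentence-level co-occurrence. The sketch below is our own rendering, not the system's implementation; it assumes each document has already been segmented into sentences, with each sentence represented as a set of normalized terms.

def sup(documents, e1, e2, context_term=None):
    # sup_D(e1, e2) or sup_D(e1, e2, c): number of documents containing at least one
    # sentence in which e1, e2 (and, if given, the context term c) co-occur.
    count = 0
    for sentences in documents:
        if any(e1 in s and e2 in s and (context_term is None or context_term in s)
               for s in sentences):
            count += 1
    return count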
2.2 Category Connection Maps
Category Connection Maps provide a means for concise visual representation of connections between different categories, e.g. between companies and technologies, countries and people, or drugs and diseases. In order to define a category connection map, the user chooses any number of categories from the taxonomy. The system finds all the connections between the terms in the different categories. To visualize the output, all the terms in the chosen categories are depicted on a circle, with each category placed on a separate part of the circle. A line is depicted between terms of different categories which are related. A color coding scheme represents stronger links with darker colors.
Formally, given a set C={v1,v2,…,vk} of taxonomy nodes and a document collection D, the category connection map is the weighted graph G defined as follows. The nodes of the graph are the set V = terms(v1) ∪ terms(v2) ∪ … ∪ terms(vk). Nodes u,w∈V are connected by an edge if:
1. u and w are from different categories, and
2. supD(u,w) > 0.
The weight of the edge (u,w) is supD(u,w).
An important point to notice regarding Category Connection Maps is that the map presents in a single picture information from the entire collection of documents. In the specific example, there is no single document that contains the relationships between all the companies and the technologies. Rather, the graph depicts aggregate knowledge from hundreds of documents. Thus, the user is provided with a bird's-eye summary view of data from across the collection.
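Following that definition, a category connection map can be assembled with a couple of nested loops. In this sketch of ours, terms_of stands for an assumed accessor returning Terms(v) for a taxonomy node, and sup is the co-occurrence counter from the previous sketch.

def category_connection_map(categories, documents, terms_of):
    # Weighted graph of Section 2.2: one node per term of the chosen taxonomy nodes,
    # an edge between terms of different categories whenever sup_D(u, w) > 0,
    # weighted by sup_D(u, w).
    edges = {}
    for i, v1 in enumerate(categories):
        for v2 in categories[i + 1:]:
            for u in terms_of(v1):
                for w in terms_of(v2):
                    weight = sup(documents, u, w)
                    if weight > 0:
                        edges[(u, w)] = weight
    return edges

Darker edge colors in the visualization then simply correspond to larger weights.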
The Category Connection Maps are dynamic in several ways. Firstly, the user can choose any node in the graph and the links from this node are highlighted. In addition, a double-click on any of the edges brings up the list of documents which support the given relationship, together with the most relevant sentence in each document. Thus, in a way, the system is the opposite of search engines. Search engines point to documents, in the hope that the user will be able to find the necessary information. Circle category connection maps present the user with the information itself, which can then be backed by a list of documents.
Figure 1 – Context Circle Graph. The graph presents the connections between companies in the context of "joint venture". Clusters are depicted separately. Color coded lines represent the strength of the connection. The information is based on 5,413 news articles obtained from Marketwatch.com. Only connections with weight 2 or more are depicted.
2.3 Context Circle-Graphs Context Circle-Graphs provide a visual means for concise representation of the relationship between many terms in a given context. In order to define a context circle graph the user defines: 1. A taxonomy category (e.g. “companies”), which determines the nodes of the circle graph (e.g. companies) 2. An optional context node (e.g. “joint venture”): which will determine the type of connection we wish to find among the graph nodes. Formally, for a set of taxonomy nodes vs, and a context node C, the Context Circle Graph is a weighted graph on the node set V=terms(vs). For each pair u,w∈V there is an edge between u and w, if there exists a context term c∈C, such that supD(u,w,c)>0. In this case the weight of the edge is Σc∈C supD(u,w,c). If no
context node is defined, then the connection can be in any context. Formally, in this case the root of the taxonomy is considered as the context. A Context Circle Graph for “companies” in the context of “joint venture” is depicted in Figure 1. In this case, the graph is clustered, as described below. The graph is based on 5,413 news documents downloaded from Marketwatch.com. The graph gives the user a summary of the entire collection in one visualization. The user can appreciate the overall structure of the connections between companies in this context, even before reading a single document!
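The same co-occurrence counting extends to the context case. The sketch below uses the same illustrative assumptions as the earlier one (documents as lists of sentences, sentences as sets of terms); when no context node is given, the set of context terms would simply contain every term, mirroring the use of the taxonomy root as the context.

from itertools import combinations
from collections import defaultdict

def context_circle_graph(documents, nodes, context_terms):
    # Edge (u, w), for u and w in `nodes`, with weight sum over c of sup_D(u, w, c),
    # where a document supports (u, w, c) if u, w and c appear in one sentence.
    weights = defaultdict(int)
    node_set, ctx = set(nodes), set(context_terms)
    for doc in documents:
        triples = set()
        for sentence in doc:
            present_nodes = sorted(node_set & sentence)
            present_ctx = ctx & sentence
            for u, w in combinations(present_nodes, 2):
                for c in present_ctx:
                    triples.add((u, w, c))
        for (u, w, _c) in triples:
            weights[(u, w)] += 1
    return dict(weights)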
2.3.1 Clustering
For Context Circle Graphs we use clustering to identify clusters of nodes which are strongly inter-related in the given context. In the example of Figure 1, the system identified six separate clusters. The edges between members of each cluster are depicted in a separate small Context Circle Graph, adjacent to the center graph. The center graph shows connections between terms of different clusters, and those with terms which are not in any cluster. We now describe the algorithm for determining the clusters.
Note that the clustering problem here is different from the classic clustering problem. In the classic problem we are given points in some space, and seek to find clusters of points which are close to each other. Here, we are given a graph in which we are seeking to find dense sub-graphs. Thus, a different type of clustering algorithm is necessary. The algorithm is composed of two main steps. In the first step we assign weights to the edges in the graph. The weight of an edge reflects the strength of the connection between its vertices. Edges incident to vertices which are in the same cluster should be associated with high weights. In the next step we identify sets of vertices which form dense sub-graphs. This step uses the weights assigned to the edges in the previous one.
We first describe the weight-assignment method. In order to evaluate the strength of a link between a pair of vertices u and v, we consider two criteria. Let u be a vertex in the graph. We use the notation Γ(u) to represent the neighborhood of u. The cluster weight of (u,v) is affected by the similarity of Γ(u) and Γ(v). We assume that vertices within the same clusters have many common neighbors. Existence of many common neighbors is not a sufficient condition, since in dense graphs any two vertices may have some common neighbors. Thus, we emphasize the neighbors which are close to u and v in the sense of cluster weight. Suppose x∈Γ(u)∩Γ(v); if the cluster weights of (x,u) and (x,v) are high, there is a good chance that x belongs to the same cluster as u and v.
We can now define an update operation on an edge (u,v) which takes into account both criteria:

w(u,v) = ∑_{x ∈ Γ(u) ∩ Γ(v)} w(x,u) + ∑_{x ∈ Γ(u) ∩ Γ(v)} w(x,v)
The algorithm starts by initializing all weights to be equal, w(u,v)=1 for all u,v. Next, the update operation is applied to all edges iteratively. After a small number of iterations (set to 5 in the current implementation) the algorithm stops and outputs the values associated with each edge. We call this the cluster weight of the edge. The cluster weight has the following characteristic. Consider two vertices u and v within the same dense sub-graph. The edges within this sub-graph mutually affect each other. Thus the iterations drive the cluster weight w(u,v) up. If, however, u and v do not belong to the same dense sub-graph, the majority of edges affecting w(u,v) will have lower weights, resulting in a low cluster weight assigned to (u,v).
After computing the weights, the second step of the algorithm finds the clusters. We define a new graph with the same set of vertices. In the new graph we consider only a small subset of the original edges, those whose weights are the highest. In our experiments we took the top 10% of the edges. Since almost all of the remaining edges are likely to connect vertices within the same dense sub-graph, we separate the vertices into clusters by computing the connected components of the new graph and considering each component as a cluster.
Figure 1 shows a context circle graph with six clusters. The clusters are depicted around the center graph. Each cluster is depicted in a different color. Nodes which are not in any cluster are colored gray. Note that the nodes of a cluster appear both in the central circle and in the separate cluster graph. Edges within a cluster are depicted in the separate cluster graph. Edges between clusters are depicted in the central circle.
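A compact sketch of the two clustering steps, assuming the graph is given as a list of edges plus a dictionary mapping each vertex to its neighborhood Γ. The synchronous update, the five iterations and the 10% cut-off follow the description above; everything else (names, data layout, tie handling) is an illustrative assumption.

from collections import defaultdict, deque

def cluster_weights(edges, neighbours, iterations=5):
    # Start from uniform weights and repeatedly recompute w(u,v) as the sum of
    # w(x,u) + w(x,v) over the common neighbours x of u and v.
    w = {e: 1.0 for e in edges}
    def weight(a, b):
        return w.get((a, b), w.get((b, a), 0.0))
    for _ in range(iterations):
        new_w = {}
        for (u, v) in edges:
            common = neighbours[u] & neighbours[v]
            new_w[(u, v)] = sum(weight(x, u) + weight(x, v) for x in common)
        w = new_w
    return w

def clusters_from_weights(w, keep_fraction=0.10):
    # Keep the highest-weighted fraction of edges and return the connected
    # components of the resulting graph as the clusters.
    ranked = sorted(w.items(), key=lambda item: item[1], reverse=True)
    kept = [e for e, _ in ranked[:max(1, int(len(ranked) * keep_fraction))]]
    adj = defaultdict(set)
    for u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components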
References
1. Eick, S.G. and Wills, G.J.: Navigating Large Networks with Hierarchies. Visualization '93, pp. 204-210, 1993.
2. Fayyad, U.; Piatetsky-Shapiro, G.; and Smyth, P.: Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 82-88, 1996.
3. Feldman, R. and Dagan, I.: KDT - Knowledge Discovery in Texts. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, 1995.
4. Feldman, R.; Klosgen, W.; and Zilberstein, A.: Visualization Techniques to Explore Data Mining Results for Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 16-23, 1997.
5. Hendley, R.J., Drew, N.S., Wood, A.M., and Beale, R.: Narcissus: Visualizing Information. In Proc. Int. Symp. on Information Visualization, pp. 90-94, 1995.
On the Consistency of Information Filters for Lazy Learning Algorithms
Henry Brighton 1 and Chris Mellish 2
1 SHARP Laboratories of Europe Ltd., Oxford Science Park, Oxford, England, UK [email protected]
2 Department of Artificial Intelligence, The University of Edinburgh, Scotland, UK [email protected]
Abstract. A common practice when filtering a case-base is to employ a filtering scheme that decides which cases to delete, as well as how many cases to delete, such that the storage requirements are minimized and the classification competence is preserved or improved. We introduce an algorithm that rivals the most successful existing algorithm in the average case when filtering 30 classification problems. Neither algorithm consistently outperforms the other, with each performing well on different problems. Consistency over many domains, we argue, is very hard to achieve when deploying a filtering algorithm.
1
Introduction
Information filtering is an attractive proposition when working with lazy learning algorithms. The lazy learning paradigm is characterized by the indiscriminate storage of training cases during the training stage. To classify an unseen query case a lazy learner applies the nearest neighbor algorithm [4]. We focus on how the size of the database holding the cases can be minimised such that the classification response time can be improved. By removing harmful cases we can also increase the overall classification competence of the learner. In this paper we introduce a new algorithm for filtering case-bases used by lazy learning algorithms. After comparing the algorithm with the most successful existing filter on 30 datasets from the UCI repository for machine learning databases [5], we conclude that neither approach is consistently superior.
2
Issues in Case Filtering
By removing a set of cases from a case-base the response time for classification decisions will decrease, as fewer cases are examined when a query case is presented. The removal can also lead to either an increase or decrease in classification competence. Therefore, when applying a filtering scheme to a case-base we must be clear about the degree to which we are willing to let the original classification competence depreciate. Typically, the principal objective of a filtering algorithm is unintrusive storage reduction. Here, classification competence is primary: we
Fig. 1. (a) The 2d-dataset. (b) The cases remaining from the 2d-dataset after 5 iterations of the ICF algorithm.
desire the same (or higher) learning competence but we require it faster and taking up less space. Ideally, competence should not suffer at the expense of improved performance. If our filtering decisions are not to harm the classification competence of the learner, we must be clear about the kind of deletion decisions that introduce misclassifications. Consider the following reasons why a k-nearest neighbor classifier might misclassify an unseen query case:
1. When noise is present in the locality of the query case. The noisy case(s) win the majority vote, resulting in the incorrect class being predicted.
2. When the query case occupies a position close to an inter-class border, where discrimination is harder due to the presence of multiple classes.
3. When the region defining the class, or fragment of the class, is so small that cases belonging to the class that surrounds the fragment win the majority vote. This situation depends on the value of k being large.
4. When the problem is unsolvable by a lazy learner. This may be due to the nature of the underlying function, or due to the sparse data problem.
In the context of filtering, we can address point (1) and try to improve classification competence by removing noise. We can do nothing about (4) as this situation is a given and defines the intrinsic difficulty of the problem. However, issues (2) and (3) should guide our removal decisions. Removing cases close to borders is not recommended as these cases are relevant to discrimination between classes. We should be aware of point (3), but as k is typically small, the occurrence of such a problem is likely to be rare. Consider our example dataset shown in Figure 1(a): we can imagine that removing the interior of the class regions would not lead to a misclassification of a query case at these points; the border cases still supply the required information.
3
Review
Filtering the set of stored instances has been an issue since the early work on nearest neighbor (NN) classification [4]. The early schemes typically concentrate on either competence enhancement (noise removal) [8] or competence preservation [6]. More recent schemes attempt both [1,9]. A novel approach to competence preservation is the Footprint Deletion policy of Smyth and Keane [6], which is a filtering scheme designed for use within the paradigm of Case-Based Reasoning (CBR). In previous work [3] we have shown that some of the concepts introduced by Smyth and Keane transfer to the simpler context of lazy learning. Much of Smyth and Keane's work relies on the notion of case adaptation. They use the property Adaptable(c, c′) to mean that case c can be adapted to c′. Generally speaking, we can delete a case for which there are many other cases that can be adapted to it. In our previous work we introduced a lazy learning parallel termed the Local-Set of a case c to capture this property [2]. We define the Local-Set of a case c as: the set of cases contained in the largest hypersphere centered on c such that only cases in the same class as c are contained in the hypersphere. The novelty of Smyth and Keane's work stems from their proposed taxonomy of case groups. By defining four case categories, which reflect the contribution to overall competence each case provides, we gain an insight into the effect of removing a case. We define these categories in terms of two properties: Reachability and Coverage. These properties are important, as the relationship between them has been used in crucial work which we discuss later. For a case-base CB = {c1, c2, ..., cn}, we define Coverage and Reachability as follows:

Coverage(c) = {c′ ∈ CB : Adaptable(c, c′)}   (1)

Reachable(c) = {c′ ∈ CB : Adaptable(c′, c)}   (2)
Using these two properties we can define the four groups using set theory. For example, a case in the pivotal group is defined as a case with an empty reachable set. For a more thorough definition we refer the reader to the original article. Our investigation into the lazy learning parallel of Footprint Deletion differs only in the replacement of Adaptable with the Local-Set property. Whether a case c can be adapted to a case c′ relies on whether c is relevant to the solution of c′. In lazy learning this means that c is a member of the nearest neighbors of c′. However, we cannot assume that a case of a differing class is relevant to the solution (correct prediction) of c′. We therefore bound the neighborhood of c′ by the first case of a differing class. Armed with this parallel we found that Footprint Deletion performed well [2]. Perhaps more interestingly, we found that a simpler method which uses only the local-set property, and not the case taxonomies, performs just as well. With local-set deletion, we choose to delete cases with large local-sets, as these are cases located at the interior of class regions. Local-set deletion has subsequently been employed in the context of natural language processing [7].
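As an illustration only, the local-set reading of Adaptable can be computed directly; the sketch below assumes numeric feature vectors, Euclidean distance and a numpy array representation, none of which is prescribed by the paper.

import numpy as np

def local_set(index, X, y):
    # Cases strictly closer to case `index` than its nearest case of a different
    # class -- the lazy-learning stand-in for "relevant to the solution" used here.
    dists = np.linalg.norm(X - X[index], axis=1)
    enemy_dists = dists[y != y[index]]
    radius = enemy_dists.min() if enemy_dists.size else np.inf
    inside = np.where((dists < radius) & (np.arange(len(X)) != index))[0]
    return set(inside.tolist())

def reachable_and_coverage(X, y):
    # reachable(c) is approximated by local_set(c); coverage(c) collects the
    # cases c' such that c lies in local_set(c').
    n = len(X)
    reachable = {i: local_set(i, X, y) for i in range(n)}
    coverage = {i: set() for i in range(n)}
    for i in range(n):
        for j in reachable[i]:
            coverage[j].add(i)
    return reachable, coverage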
ICF(T)
1  /* Perform Wilson Editing */
2  for all x ∈ T do
3    if x classified incorrectly by k nearest neighbours then
4      flag x for removal
5  for all x ∈ T do
6    if x flagged for removal then T = T − {x}
7  /* Iterate until no cases flagged for removal */
8  repeat
9    for all x ∈ T do
10     compute reachable(x)
11     compute coverage(x)
12   progress = false
13   for all x ∈ T do
14     if |reachable(x)| > |coverage(x)| then
15       flag x for removal
16       progress = true
17   for all x ∈ T do
18     if x flagged for removal then T = T − {x}
19 until not progress
20 return T
Fig. 2. The Iterative Case Filtering Algorithm.
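A rough Python rendering of Figure 2, reusing the reachable_and_coverage helper sketched in the previous section; the simple k-nearest-neighbour vote for the Wilson editing step and the array representation are again illustrative assumptions, not the authors' implementation.

import numpy as np

def wilson_edit(X, y, k=3):
    # Lines 2-6: remove cases misclassified by their k nearest neighbours.
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        order = [j for j in np.argsort(dists) if j != i][:k]
        votes = list(y[order])
        predicted = max(set(votes), key=votes.count)
        if predicted == y[i]:
            keep.append(i)
    return np.array(keep, dtype=int)

def icf(X, y, k=3):
    # Lines 8-19: repeatedly delete cases whose reachable set is larger than
    # their coverage set, until no case is flagged.
    keep = wilson_edit(X, y, k)
    X, y = X[keep], y[keep]
    progress = True
    while progress:
        reachable, coverage = reachable_and_coverage(X, y)
        flagged = [i for i in range(len(X))
                   if len(reachable[i]) > len(coverage[i])]
        progress = len(flagged) > 0
        if progress:
            mask = np.ones(len(X), dtype=bool)
            mask[flagged] = False
            X, y = X[mask], y[mask]
    return X, y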
4
An Iterative Case Filtering Algorithm
We now present a new algorithm which uses an iterative approach to case deletion. We term the algorithm the Iterative Case Filtering Algorithm (ICF). The ICF algorithm uses the lazy learning parallels of case coverage and reachability we developed when transferring the CBR footprint deletion policy, discussed above. We apply a rule which identifies cases that should be deleted. These cases are then removed, and the rule is applied again, iteratively, until no more cases fulfil the pre-conditions of the rule. The ICF algorithm uses the reachable and coverage sets described above, which we can liken to the neighborhood and associate sets used by [9]. An important difference is that the reachable set is not fixed in size but rather bounded by the nearest case of different class. This difference is crucial as our algorithm relies on the relative sizes of these sets. Our deletion rule is simple: we remove cases which have a reachable set size greater than the coverage set size. A more intuitive reading of this rule is that a case c is removed when more cases can solve c than c can solve itself. After removing these cases the case-space will typically contain thick bands of cases either side of class borders. The algorithm is depicted in Figure 2. We also employ the noise filtering scheme based on Wilson Editing and adopted by [9]. Lines 2-6 of the algorithm perform this task. Figure 1(b) depicts the 2d-dataset, introduced earlier, after 5 iterations of the
ICF algorithm. This is the deletion criterion the algorithm uses; the algorithm proceeds by repeatedly computing these properties after filtering has occurred. Usually, additional cases will begin to fulfil the criterion as thinning proceeds and the bands surrounding the class boundaries narrow. After a few iterations of removing cases and recomputing, the criterion no longer holds. We evaluated the ICF algorithm on 30 datasets found at the UCI repository of machine learning databases [5]. The maximum number of iterations performed, over the 30 datasets, was 17. This number of iterations was required for the switzerland database, where the algorithm removed an average of 98% of cases. However, a number of the datasets consistently require as few as 3 iterations. Examining each iteration of the algorithm, specifically the percentage of cases removed after each iteration, provides us with an important insight into how the algorithm is working. We call this the reduction profile; it is a characteristic of the case-base. Profiles exhibiting a short series of iterations, each one removing a large number of cases, would indicate a simple case-base structure containing little inter-dependency between regions. The most problematic of case-base structures would be characterised by a long series resulting in few cases being removed. Comparing the ICF algorithm with RT3, the most successful of Wilson and Martinez's algorithms, we found that the average case behaviors over the 30 datasets were very similar (see Table 1). Neither algorithm consistently outperformed the other. More interestingly, the behavior of the two algorithms differs considerably on some problems. We also found that the domains which suffer a competence degradation as a result of filtering using ICF and RT3 are exactly those in which competence degrades as a result of noise removal. This would indicate that noise removal is sometimes harmful, and both ICF and RT3 suffer as a consequence. To summarize, we have presented an algorithm which iteratively filters a case-base using a lazy learning parallel of the two case properties used in the CBR Footprint Deletion policy. Due to the iterative nature of the algorithm, we have gained an insight into how the deletion of regions depends on each other. The point at which our deletion criterion ceases to hold can result in improved generalization accuracy and storage reduction.
5
Conclusion
We have introduced the ICF algorithm, which supports the argument that consistency is hard. We compared the ICF algorithm with a recent successful algorithm, RT3, and found that their average case performance is very similar, but on individual problems they can differ considerably. Each algorithm can outperform the other on certain problems, both in terms of competence and storage reduction. Consistency is therefore a problem in the deployment of filtering schemes. One advantage of our algorithm is that it provides us with a reduction profile. The profile tells us how different regions are dependent on each other. This provides us with a useful degree of perspicuity in understanding the structure of the case-base. Ultimately, the choice of filter we deploy must be informed by any insights we have into the structure of the case-space.
Table 1. The classification accuracy and storage requirements for a sample of the datasets mentioned. The benchmark competence, which is the accuracy achieved without any filtering, is compared with Wilson Editing, RT3, and ICF.

Dataset          Benchmark       Wilson Editing   RT3              ICF
                 Acc.    Stor.   Acc.    Stor.    Acc.    Stor.    Acc.    Stor.
abalone          19.53   100     22.01   19.64    22.11   40.95    22.74   15.11
balance-scale    77.36   100     86.04   77.48    83.40   18.23    81.47   14.67
cleveland        77.67   100     78.67   77.39    78.89   20.92    72.08   15.60
ecoli            81.94   100     86.27   81.77    82.84   15.76    81.34   14.06
glass            71.43   100     69.05   70.17    69.05   23.26    69.64   31.40
hungarian        76.55   100     79.91   77.03    80.17    9.81    78.30   12.15
led              63.77   100     68.27   66.11    69.62   18.04    71.74   41.81
led-17           42.82   100     43.00   43.09    41.48   46.78    42.33   27.50
lymphography     77.59   100     76.38   79.41    72.70   26.73    77.59   25.63
pima-indians     69.54   100     71.27   69.20    71.08   22.38    69.17   17.22
primary-tumor    36.57   100     36.57   35.81    39.43   30.76    37.06   18.32
switzerland      92.08   100     93.54   90.45    91.67    2.15    92.28    2.02
thyroid          90.93   100     89.30   91.48    77.91   16.23    86.63   21.85
waveform         75.36   100     76.62   76.37    76.14   22.79    73.93   18.98
wine             84.57   100     86.43   85.17    86.43   15.37    83.81   12.00
yeast            52.70   100     55.39   52.97    55.32   27.03    52.25   16.62
zoo              95.50   100     96.25   95.31    87.08   26.13    92.42   52.78
average          75.75   100     77.52   75.98    76.59   19.29    76.13   19.73
References 1. D. W. Aha, D. Kibler, and M. K. Albert. Instance based learning algorithms. Machine Learning, 6(1):37–66, 1991. 2. H. Brighton. Experiments in case-based learning. Undergraduate Dissertation, Department of Artificial Intelligence, University of Edinburgh, Scotland, 1996. 3. H. Brighton. Information filtering for lazy learning algorithms. Masters Thesis, Centre for Cognitive Science, University of Edinburgh, Scotland, 1997. 4. T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory, IT-13:21 – 27, 1967. 5. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. [http://www.ics.uci.edu/∼mlearn/MLRepository.html], 1996. Irvine, CA: University of California, Department of Information and Computer Science. 6. B. Smyth and M. T. Keane. Remembering to forget. In C. S. Mellish, editor, IJCAI95: Proceedings of the Fourteenth International Conference on Artificial Intelligence, volume 1, pages 377 – 382. Morgan Kaufmann Publishers, 1995. 7. Antal van den Bosch and Walter Daelemans. Do not forget: Full memory in memorybased learning of word pronunciation. In Proceedings of NeMLaP3/CoNLL98, pages 195 – 204, Sydney, Australia, 1998. 8. D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3):408 – 421, Jun 1972. 9. D. R. Wilson and A. R. Martinez. Instance pruning techniques. In D. Fisher, editor, Machine Learning: Proceedings of the Fourteenth International Conference, San Francisco, CA, 1997. Morgan Kaufmann.
Using Genetic Algorithms to Evolve a Rule Hierarchy
Robert Cattral, Franz Oppacher, and Dwight Deugo
Intelligent Systems Research Unit, School of Computer Science, Carleton University, Ottawa, On K1S 5B6
{rcattral,oppacher,deugo}@scs.carleton.ca
Abstract. This paper describes the implementation and the functioning of RAGA (Rule Acquisition with a Genetic Algorithm), a genetic-algorithm-based data mining system suitable for both supervised and certain types of unsupervised knowledge extraction from large and possibly noisy databases. The genetic engine is modified through the addition of several methods tuned specifically for the task of association rule discovery. A set of genetic operators and techniques are employed to efficiently search the space of potential rules. During this process, RAGA evolves a default hierarchy of rules, where the emphasis is placed on the group rather than each individual rule. Rule sets of this type are kept simple in both individual rule complexity and the total number of rules that are required. In addition, the default hierarchy deals with the problem of overfitting, particularly in classification tasks. Several data mining experiments using RAGA are described.
1 Introduction
Data mining, also known as KDD, or Knowledge Discovery in Databases, refers to the attempt to extract previously unknown and potentially useful relations and other information from databases and to present the acquired knowledge in a form that is easily comprehensible to humans (for example, see [1]). It differs from classical machine learning mainly in the fact that the training set is a database stored for purposes unrelated to training a learning algorithm. Consequently, data mining algorithms must cope with large amounts of data, various forms of noise and often unfavorable representations. Because of the requirement of comprehensibility, i.e., that the system be able to communicate the results of its learning in operationally effective and easily understood symbolic form, many approaches to data mining favor symbolic machine learning techniques, typically variants of AQ learning and decision tree induction [2]. RAGA meets the comprehensibility requirement by working with a population of variable-length, symbolic rule structures that can accommodate not just feature-value
pairs but arbitrary n-place predicates (n ≥ 0), while exploiting the proven ability of the Genetic Algorithm (e.g. [3]) to efficiently search large spaces. Most extant data mining systems perform supervised learning, where the system attempts to find concept descriptions for classes that are, together with preclassified examples, supplied to it by a teacher. The task of unsupervised learning is much more demanding because here the system is only directed to search the data for interesting associations, and attempts to find the classes by itself by postulating class descriptions for sufficiently many classes to cover all items in the database. We would like to point out that the usual characterization of unsupervised learning as learning without preclassified examples conflates a variety of increasingly difficult learning tasks. These tasks range from detecting potentially useful regularities among the data couched in the provided description language to the discovery of concepts through conceptual clustering and constructive induction, and to the further discovery of empirical laws relating concepts constructed by the system. As will be shown in section 6 below, RAGA is capable of both supervised and (the simplest type of) unsupervised learning.1 However, since we wish to compare our system to others, we emphasize in this paper its use in supervised learning. Sections 2 and 3 briefly describe the rules acquired by RAGA and its major parameters, respectively. Section 5 characterizes the system's peculiar type of evolution, section 6 reports some experimental results and Section 7 concludes.
2 Representation of Data: If-Then Rules
An important type of knowledge acquired by many data mining systems takes the form of if-then rules. Such rules state that the presence of one or more items implies or predicts the presence of other items. A typical rule has the form: If X1 ∧ X2 ∧ … ∧ Xn, then Y1 ∧ Y2 ∧ … ∧ Yn. The data stored internally by RAGA represents rules of this type, with a varying number of conjuncts in the antecedent and/or the consequent. (See section 3). Each part of the antecedent as well as the expression in the consequent can contain n-place predicates. If n = 0, the expression is a propositional constant; if n = 1, the expression has the widely used form of attribute-value pairs. However, RAGA can handle predicates of any arity. The antecedents and consequents in association rules can be conjunctions (∧) and negations (¬) of expressions that are built up from predicates and comparison operators (=, ≠, <, ≤, >, ≥). Component expressions can involve boolean variables (e.g. If X ∧ Y, then ¬Z), integer and real variables and constants (e.g. If X > 98.6, then Y = 1), and percentiles and percentage constants (e.g. If X > 85% ∧ Y < X, then Z > 10%).
1 This is the only type of unsupervised learning with which we have experimented thus far.
The ability to generate negated expressions is not enabled by default because uncontrolled use of negations not only increases the search space but also often leads to the production of useless rules. For example, while If X ∧ ¬Y, then Z may be a good rule, the fitness function should penalize If X, then ¬Y ∧ ¬Z as useless in many situations where most items are absent in any given transaction. Although the introduction of constraints and rules governing the use of negation may improve the quality and speed of learning, this has not been explored for the experiments reported here.
2.1 Confidence and Support Factors In general, a rule will be more relevant and useful the higher its confidence and support are. Considered in isolation, i.e., outside a default hierarchy, rules with low confidence are useless because they are frequently wrong, and rules with low support are useless because they report uncommon combinations of items and are frequently inapplicable. It is important to note, however, that if the targets for support and confidence are set too high in unsupervised data mining, useful rules will be missed. Thus, when the confidence target is set too high, redundancies and tautologies will crowd out potentially useful rules. The proper support level is best determined by varying the settings over several analyses on the same data.
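One common way of computing these two factors for a rule over a table of records is sketched below; the triple-based conjunct encoding and the helper names are assumptions made for the example, not RAGA's internal representation.

import operator

OPS = {'=': operator.eq, '!=': operator.ne, '<': operator.lt,
       '<=': operator.le, '>': operator.gt, '>=': operator.ge}

def matches(record, conjuncts):
    # A conjunct is an (attribute, comparison, value) triple, e.g. ('X', '>', 98.6).
    return all(OPS[op](record[attr], value) for attr, op, value in conjuncts)

def support_and_confidence(records, antecedent, consequent):
    # support: fraction of records satisfying antecedent and consequent;
    # confidence: that count divided by the number satisfying the antecedent.
    n = len(records)
    n_ante = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence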
3 User Interface and System Configuration RAGA is configured through a graphical interface, and can be operated by a novice computer user. It is important, however, that a data mining technician familiar with the domain specifics configure each project. Before RAGA can perform a rule analysis, the variables and predicates with which rules will be built must be defined, and a number of options controlling the Genetic Algorithm component of the system must be chosen. Unlike other classification algorithms, RAGA supports comparison of attributes. An example of this would be comparing the length of a rectangle with its width. Queries of this type enable the classification algorithm to search beyond single dimensional vectors. However, comparison of attributes is not always desirable. An example of this would be the comparison of dollar amounts in a sales transaction. While it may be useful to compare the prices of different items in a purchase, it is probably not useful to compare the price of a single item to the cost of the entire order (e.g. is the cost of the soda less than the total bill). In cases where the comparison of attributes is deemed fruitless, variable class restrictions can be imposed to prevent variables of certain types from being compared to one another. This has the additional benefit of further narrowing search spaces. Rule position conditions fix numbers or types of elements in the antecedent or consequent. This is used in classification tasks, where the attributes are always in the
antecedent and the class is alone in the consequent. In the absence of these conditions the search is undirected.
4 Evolving a Default Hierarchy
A default hierarchy is a collection of rules that are executed in a particular order. When testing a particular data item against a hierarchy of rules, the rule at the top of the list is tried first. If its antecedent correctly matches the conditions in the element being tested, this top rule is used. If a rule does not apply, then the element is matched against the rule at the next lower level of the hierarchy. This continues until the element matches a rule or the bottom of the hierarchy is reached. Rules that are incorrect by themselves can be protected by rules preceding them in the default hierarchy, and play a useful coverage-extending role, as in the following example:
If (numberOfSides = 4) ∧ (length = width) then class = square
If (numberOfSides = 3) then class = triangle
If (numberOfSides > 2) ∧ (numberOfSides < 5) then class = rectangle
If the last rule were used out of order, many instances would be improperly classified. In its current position it covers the remaining data items accurately. Experimentation has shown that rules at the top of the evolved hierarchy cover most of the data, and rules near the bottom often handle exceptional cases. There is a certain amount of overlap between members of the hierarchy, as opposed to having a mutually exclusive set of rules. The evolving hierarchy tends to produce fewer and less complex rules. In RAGA, the problem of overfitting is addressed by deciding how to handle the rules at the bottom of the hierarchy. Often these special cases handle only 1 or 2 out of perhaps 5000 data elements. It is important to note, however, that this type of overfitting is harmless because these rules are only tried as a last resort. If improper classification is considered more costly than leaving the class unknown, the user would simply ignore, after visual inspection, the lower levels of the rule hierarchy.
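How such a hierarchy is applied can be sketched in a few lines; antecedents are written here as plain predicates (rather than the conjunct triples used earlier) so that attribute-to-attribute comparisons such as length = width stay short. The encoding is an illustrative assumption, not RAGA's internal representation.

def classify(record, hierarchy, default='unknown'):
    # Walk the ordered rule list; the first rule whose antecedent holds for the
    # record determines the class, otherwise fall back to `default`.
    for antecedent, predicted_class in hierarchy:
        if antecedent(record):
            return predicted_class
    return default

# The shape example from the text, top of the hierarchy first:
hierarchy = [
    (lambda r: r['numberOfSides'] == 4 and r['length'] == r['width'], 'square'),
    (lambda r: r['numberOfSides'] == 3, 'triangle'),
    (lambda r: 2 < r['numberOfSides'] < 5, 'rectangle'),
]

print(classify({'numberOfSides': 4, 'length': 2, 'width': 3}, hierarchy))  # rectangle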
5 The Genetic Engine Used in RAGA In order to apply the Genetic Algorithm (GA) to the task of data mining for rules we found it desirable to modify the traditional GA in a number of respects. Accordingly, the genetic engine used by RAGA is a hybrid GA. Perhaps the most drastic modification concerns our choice of representation. Unlike the traditional GA whose chromosomes are fixed-length binary strings, the GA in our system accommodates rules of varying length and complexity. These rules are expressed in a nonbinary alphabet of user-defined symbols.
The system operates differently depending on whether the current task is classification or otherwise. The primary difference is in determining how useful a rule might be, namely its fitness. Processing of one generation in RAGA involves three steps:
(i) Controlled by two parameters, ordinary elitism copies one or more of the current best individuals into the next population to guarantee that the top fitness levels will not drop between generations. Classification elitism copies every rule that uniquely covers at least one data item and thus contributes, even if only in a small way, to the set of final rules.
(ii) Next, fitness proportional selection, crossover and (macro and micro) mutations are applied. Until the new population is complete, rule pairs are repeatedly selected and possibly crossed over. Crossover splits rules only between conjunctions. Because of this (and also because of macromutations) rules can grow or shrink during this process. Before the two child rules enter the next phase, they - like all other rules except those copied under elitism - are subjected to micro and macro mutation with bounded rates. Since all rules with positive confidence are macro-mutated, the population size grows across generations.
(iii) Finally, intergeneration processing takes place to ensure validity and nonredundancy. Rules may have several comparisons deleted before conforming to what is allowed. If after this point a rule has become invalid or identical to one that already exists in the new population, it is discarded. After enough valid rules have been selected, modified, and inserted into the new population, the evolution for the current generation is complete.
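The skeleton below shows only the ordering of the three steps; fitness, unique-coverage testing, crossover, mutation and validity checking are all assumed to be supplied by the caller, and the growth limit on the population is an arbitrary placeholder rather than RAGA's actual policy.

import random

def one_generation(population, fitness, covers_uniquely, crossover, mutate,
                   is_valid, elite_count=1):
    ranked = sorted(population, key=fitness, reverse=True)
    # (i) ordinary elitism plus classification elitism
    next_pop = ranked[:elite_count]
    next_pop += [rule for rule in population if covers_uniquely(rule)]
    # (ii) fitness-proportional selection, crossover and mutation;
    #      crossover is assumed to return one or more child rules
    weights = [max(fitness(rule), 1e-9) for rule in population]
    while len(next_pop) < 2 * len(population):
        parent_a, parent_b = random.choices(population, weights=weights, k=2)
        for child in crossover(parent_a, parent_b):
            next_pop.append(mutate(child))
    # (iii) intergeneration processing: drop invalid or duplicate rules
    unique, seen = [], set()
    for rule in next_pop:
        key = repr(rule)
        if is_valid(rule) and key not in seen:
            seen.add(key)
            unique.append(rule)
    return unique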
6 Some Experimental Results
The data set tested contains 8124 sample descriptions of 23 species of gilled mushrooms in the Agaricus and Lepiota family (drawn from [4] and presented in [5]). The data set uses 22 attributes, and classifies each mushroom as either edible (51.8%) or poisonous. Each of 9 test runs2 produced between 14 and 25 rules. Each rule set yields 100% accuracy for the entire set. This compares favorably with STAGGER [5] and HILLARY [6], which approach 95% classification accuracy after training on 1000 instances. Several unsupervised tests were also run on the mushroom data set in an attempt to discover information that is not necessarily related to the predefined classes. When looking for domain specific information that may have nothing to do with edibility (by using rules with 100% confidence and 100% support), we found several rules like the following:
2 Five test runs used 1000 training instances and the remaining four runs used 7124 instances.
If the Gill Attachment is not Descending (either: attached, free, or notched), then the Veil Type is Partial. Unfortunately, we lack the expertise in the given domain to distinguish between interesting domain specific information and well-known facts. In an attempt to automatically discover some facts about edibility, we reduced support to 50%. These tests are difficult to interpret because we lack a tool to compare the results of an undirected and a directed search. We did notice, however, that many of the same attributes used to describe classes are used similarly in the two sets of rules.
7 Conclusion We have described a flexible new data mining system based on a modified GA. Preliminary experiments show that RAGA’s performance compares favorably with that of other approaches to data mining. Unlike the latter, RAGA is also capable of simple forms of unsupervised learning. In the space of evolutionary approaches, RAGA seems to lie ‘half way’ between Genetic Algorithms and Genetic Programming: like GP, it uses a variable-length, albeit restricted, representation with a non-binary alphabet, a typed crossover and a macromutation that shares some of the effects with GP crossover; like GA, it uses mutation, and it does not evolve programs. Unlike both GP and GA, it promotes validity and nonredundancy by intergenerational processing on fluctuating numbers of individuals, it implements a form of elitism that causes a wide exploration of the data set, and, by making data coverage a component of fitness, it automatically evolves default hierarchies of rules.
References
1. Berry, Michael J.A., Linoff, Gordon: Data Mining Techniques. J. Wiley & Sons (1997).
2. Michalski, R., Bratko, I., Kubat, M.: Machine Learning and Data Mining. Wiley, New York (1998).
3. Mitchell, Melanie: An Introduction to Genetic Algorithms. MIT Press, Mass (1996).
4. Lincoff, G. H.: The Audubon Society Field Guide to North American Mushrooms. Alfred A. Knopf, New York (1981).
5. Schlimmer, J. S.: Concept Acquisition through Representational Adjustment (TR87-19). Computer Science, University of California, Irvine (1987).
6. Iba, W., Wogulis, J., Langley, P.: Trading off Simplicity and Coverage in Incremental Concept Learning. Proceedings of the 5th International Conference on Machine Learning, 73-79. Morgan Kaufmann, Ann Arbor, Michigan (1988).
Mining Temporal Features in Association Rules
Xiaodong Chen 1 and Ilias Petrounias 2
1 Department of Computing & Mathematics, Manchester Metropolitan University, Manchester M1 5GD, U.K., e-mail: [email protected]
2 Department of Computation, UMIST, PO Box 88, Manchester M60 1QD, U.K., e-mail: [email protected]
Abstract: In real world applications, the knowledge that is used for aiding decision-making is always time-varying. However, most of the existing data mining approaches rely on the assumption that discovered knowledge is valid indefinitely. People who expect to use the discovered knowledge may not know when it became valid, or whether it still is valid in the present, or if it will be valid sometime in the future. For supporting better decision making, it is desirable to be able to actually identify the temporal features with the interesting patterns or rules. The major concerns in this paper are the identification of the valid period and periodicity of patterns and more specifically association rules.
1. Introduction
The problem of association rules was introduced in [1] and has been extended in different ways. Most existing work overlooks any time components, which are usually attached to transactions in databases. Without this knowledge most of the information resulting from data mining activities is not of great use. For example, it is not useful to look at all supermarket transactions that have taken place over the years in order to identify patterns. Most of this information will be outdated. Temporal issues of association rules have been recently addressed in [2] and [4]. [2] focuses on the discovery of association rules with known valid periods and periodicities. The valid period shows the absolute time interval during which an association is valid, while the periodicity conveys when and how often an association is repeated. Valid period and periodicity are specified by calendar time expressions in [2]. In [4], the concept of calendric association rules is defined, where the rule is combined with a calendar that is a set of time intervals and is described by a calendar algebra. Here we focus on two mining problems for temporal features of some known/given association: 1) finding all interesting contiguous time intervals during which a specific association holds (section 2); 2) finding all interesting periodicities that a specific association has (section 3).
2. Discovery of Longest Intervals
Given a time-stamped database and a known association, one of our interests is to find all possible time intervals during which this association holds. Those intervals are composed of a totally ordered set of contiguous constructive intervals (called granular intervals) with a given granularity representing a non-decomposable interval of some fixed length. The interval granularity is the size of each granular interval (e.g. Hour, Day, etc.). Each expected time interval is denoted by {Gi, Gi+1, ..., Gj}, where Gk (i ≤ k ≤ j) is a granular interval, and the time domain can also be represented by a totally ordered set of all contiguous granular intervals. We define LENGTH(ITVL, GC) as the number of intervals of granularity GC in ITVL.
Definition 2.1: Given an association AR, an interval ITVL is valid with respect to AR if the temporal association rule (AR, ITVL) satisfies min_supp and min_conf.
More often than not people are just interested in intervals the duration of which is long enough, since some short intervals may not be periods of particular interest.
Definition 2.2: Given an association AR and an interval granularity GC, an interval ITVL is long with respect to AR if ITVL is valid with respect to AR and LENGTH(ITVL, GC) ≥ min_ilen (the minimal interval length).
Consider a long interval ITVL with respect to AR. It is possible that there exists ITVL' ⊂ ITVL with LENGTH(ITVL', GC) ≥ min_ilen such that ITVL' is not a long interval with respect to AR, since AR may have low support and/or confidence during ITVL', but very high support and confidence during the rest of the period(s) in ITVL.
Definition 2.3: Given an association AR and an interval granularity GC, an interval ITVL is strictly long with respect to AR if every ITVL' ⊂ ITVL with LENGTH(ITVL', GC) ≥ min_ilen is long with respect to AR.
With respect to a given association AR, for any two strictly long intervals ITVL1 and ITVL2, if ITVL1 ⊂ ITVL2, we say that ITVL2 is strictly longer than ITVL1.
Definition 2.4: Given an association AR and an interval granularity GC, an interval ITVL is longest with respect to AR if: 1) the interval ITVL is strictly long with respect to AR, and 2) there is no ITVL'' ⊃ ITVL such that ITVL'' is strictly long with respect to AR.
With respect to a given association AR, there may be a series of different longest intervals existing along the time line.
Definition 2.5: Given a set of time-stamped transactions (D) over a time domain (T), a known association (AR), minimum support (min_supp), minimum confidence (min_conf), and minimum interval length (min_ilen), the problem of mining valid time periods is to find all possible longest intervals with respect to the association AR.
Suppose time domain T = {G1, G2, ..., Gn}, where Gi (1 ≤ i ≤ n) is a granular interval. The set of time-stamped transactions D is ordered by timestamps and is partitioned into {D(G1), D(G2), ..., D(Gn)}. The search problem can be considered as successively looking for all longest intervals along the time domain sequence {G1, G2, ..., Gn}. For each possible longest interval, the search can be performed in two steps: 1) find its seed interval; 2) extend this seed interval to the corresponding longest interval.
Definition 2.6: Let an interval ITVL = {Gi, Gi+1, ..., Gj}. ITVL is called a seed interval if it satisfies the following conditions: 1) it is a strictly long interval; 2) no strictly long interval starting before Gi covers ITVL; and 3) no other interval covered by ITVL satisfies the previous two conditions.
For example, let min_ilen be 3 and assume that ITVL1 = {G5, G6, G7, G8, G9} and ITVL2 = {G7, G8, G9, G10, G11, G12} are two longest intervals; then {G7, G8, G9, G10} could be a seed interval of ITVL2 if there is no other strictly long interval covering it. However, although {G7, G8, G9} is strictly long, it cannot be a seed interval of any longest interval since ITVL1 covers it (condition 2). Also, {G7, G8, G9, G10, G11} is not regarded as a seed interval because {G7, G8, G9, G10} is a seed one (condition 3).
Proposition 2.1: Let an interval ITVL = {Gi, Gi+1, ..., Gj}. If ITVL is a seed interval, there must be one and only one longest interval that covers ITVL, and this longest interval is an interval starting from Gi (this holds due to Definitions 2.4 and 2.6).
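A brute-force check of Definitions 2.1-2.3 can be written directly from per-granule counts (total transactions, transactions containing the rule body, transactions containing body and head); the sketch below makes that data layout an explicit assumption and ignores the efficiency concerns that motivate the LISeeker algorithm further on.

def interval_valid(counts, i, j, min_supp, min_conf):
    # counts[k] is assumed to hold (trans_num, body_num, rule_num) for granular
    # interval Gk; (AR, {Gi,...,Gj}) is valid if support and confidence over the
    # interval reach min_supp and min_conf (Definition 2.1).
    trans = sum(c[0] for c in counts[i:j + 1])
    body = sum(c[1] for c in counts[i:j + 1])
    rule = sum(c[2] for c in counts[i:j + 1])
    support = rule / trans if trans else 0.0
    confidence = rule / body if body else 0.0
    return support >= min_supp and confidence >= min_conf

def interval_strictly_long(counts, i, j, min_supp, min_conf, min_ilen):
    # Definition 2.3: every sub-interval of length at least min_ilen must be
    # valid; here the interval itself is also required to reach min_ilen.
    if j - i + 1 < min_ilen:
        return False
    return all(interval_valid(counts, p, q, min_supp, min_conf)
               for p in range(i, j + 1)
               for q in range(p + min_ilen - 1, j + 1))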
Proposition 2.1 says that if we can find all the seed intervals, we can extend them to get all the longest intervals. The questions are: how to find a seed interval, and how to extend it to the longest interval. Let's answer the second question first. If ITVL is a seed interval, then the corresponding longest interval can be derived from ITVL as follows:
1) If the last granular interval of ITVL is the last granular interval along the time domain, output ITVL (which is obviously a longest interval) and terminate the search.
[Figure 2.1 Extend a seed interval to a longest interval: (a) every checked sub-interval {Gk, ..., Gj, Gj+1} is valid; (b) some sub-interval {Gp, ..., Gj, Gj+1} is not valid.]
2) Let ITVL = {Gi, Gi+1, ..., Gj}, consider the next granular interval Gj+1 along the time domain and check if {Gi, Gi+1, ..., Gj, Gj+1} is a strictly long interval by successively checking whether each interval {Gk, Gk+1, ..., Gj, Gj+1} (where k = i, i+1, ..., j-min_ilen+2) is valid. If all of them are valid (Figure 2.1(a)), the interval {Gi, Gi+1, ..., Gj, Gj+1} is surely a strictly long one. Gj+1 is added to ITVL and becomes the last granular interval of ITVL. Go back to step 1) to look for a longer interval. Otherwise, once we find an interval {Gp, Gp+1, ..., Gj, Gj+1} (where i ≤ p ≤ j-min_ilen+2) that is not valid (Figure 2.1(b)), we can conclude that there is no longer interval starting from Gi that will be strictly long. Output ITVL and finish the search for this corresponding longest interval.
Now, we consider how to find the seed interval. During the course of the search for all longest intervals, two cases in which we will start to look for the seed interval are: 1) At the beginning of the search, we need to find the seed interval of the first longest interval along the time domain. We can start from the beginning of the time line and check successively each interval with the length of min_ilen until we find that it is valid. This interval will be the seed interval of the first longest interval. 2) At the time when a longest interval has just been found (see Figure 2.2), we need to find the seed interval of the next longest interval. Assume the currently found longest interval is ITVL = {Gi, Gi+1, ..., Gj}. As discussed above, once we find an interval {Gp, Gp+1, ..., Gj, Gj+1} (where i ≤ p ≤ j-min_ilen+2) that is not valid, we terminate the search for this current longest interval. Obviously, it is impossible that any sub-interval of ITVL or any interval {Gk, Gk+1, ..., Gj, Gj+1} (where i ≤ k ≤ p) would be a new seed interval. So, the next seed interval must be the first following strictly long interval that does not end before Gj+1. Figure 2.2 shows two possible cases in which the next seed interval will be found. In Figure 2.2(a), the next seed interval is the first following strictly long interval which ends by Gj+1. The length of this interval may be greater than min_ilen. In the second case, as shown in Figure 2.2(b), the seed is the first following valid interval with the length of min_ilen. Here, Gq is any granular interval after GL.
[Figure 2.2 Finding the next seed interval (cases (a) and (b) described in the text).]
Figure 2.3 shows a one-pass algorithm (LISeeker) for searching for all the longest intervals of a given association rule AR in a database D over a time domain T. For
simplicity and without loss of generality, suppose T = {G1,G2,...,Gn} and D = {D[G1],D[G2],...,D[Gn]}. The search is gradually made by scanning all the partitions of the database, D[G1],D[G2],...,D[Gn]. To monitor the search process and avoid multiple passes over the database, we build a structured queue G_QUEUE, which is an ordered list of granular intervals. This ordered list can be considered as a candidate interval of the next expected seed interval or the next longest interval, depending on its status. In G_QUEUE, each element Gi consists of three fields: trans_num (number of transactions in D[Gi]), body_num (number of transactions containing AR.body in D[Gi]) and rule_num (number of transactions containing AR.body ∪ AR.head in D[Gi]). IN(G_QUEUE, Gi) adds a granular interval Gi with the relevant numbers to the rear of G_QUEUE and OUT(G_QUEUE) removes a granular interval from the front of G_QUEUE. The search starts with the first interval with the length of min_ilen along the time line. In the algorithm, ptr1 and ptr2 always point to the start and end of the current candidate interval and go forward alternately. In the outer iteration of the algorithm (starting at line 3), a seed interval is first looked for (lines 4 to 17) and the corresponding longest interval is then derived from it (lines 18 to 39).

(1)  for ( i = 1; i ≤ min_ilen; i++ ) { SCAN(D[Gi]); IN(G_QUEUE, Gi); }
(2)  ptr1 = 1; ptr2 = min_ilen;
(3)  for ( ptr2 ≤ n ) do {
(4)    for ( ptr2 ≤ n ) do {                 /* looking for the next strictly long interval */
(5)      i = ptr1; j = ptr2;
(6)      for ( i ≤ j - min_ilen + 1 ) do {
(7)        if ( NotValid({Gi, ..., Gj}) ) break;
(8)        i++;
(9)      }
(10)     if ( i > j - min_ilen + 1 ) break;               /* found a seed {Gptr1, ..., Gptr2} */
(11)     else if ( i = j - min_ilen + 1 && j = n ) exit;  /* no more seeds */
(12)     else {
(13)       for ( k = ptr1; k ≤ i; k++ ) do OUT(G_QUEUE);
(14)       if ( i = j - min_ilen + 1 )
(15)         { j++; SCAN(D[Gj]); IN(G_QUEUE, Gj); }
(16)       ptr1 = i + 1; ptr2 = j;
(17)   }}
(18)   for ( ptr2 ≤ n ) do {                 /* looking for the next longest interval */
(19)     if ( ptr2 = n ) {
(20)       OUTPUT({Gptr1, ..., Gptr2});      /* found a longest interval */
(21)       exit;
(22)     }
(23)     i = ptr1; j = ptr2 + 1;
(24)     for ( i ≤ j - min_ilen + 1 ) do {
(25)       if ( NotValid({Gi, ..., Gj}) ) break;
(26)       i++;
(27)     }
(28)     if ( i ≤ j - min_ilen + 1 ) {
(29)       OUTPUT({Gptr1, ..., Gptr2});      /* found a longest interval */
(30)       for ( k = ptr1; k ≤ i; k++ ) do OUT(G_QUEUE);
(31)       if ( i = j - min_ilen + 1 )
(32)         { j++; SCAN(D[Gj]); IN(G_QUEUE, Gj); }
(33)       ptr1 = i + 1; ptr2 = j;
(34)       break;
(35)     }
(36)     else {                              /* extending {Gptr1, ..., Gptr2+1} with Gj */
(37)       SCAN(D[Gj]); IN(G_QUEUE, Gj);
(38)       ptr2 = j;
(39)  } }}
Figure 2.3 Search Algorithm for Longest Intervals (LISeeker)
Function SCAN passes over all the transactions in D[Gi] counting the number of those transactions, the number of the transactions containing the body of AR, and the number of transactions containing both the body and head of AR. Function NotValid
checks if the interval {Gi,...,Gj} is valid in terms of the given minimum support and confidence. Since the relevant counts (trans_num, body_num, rule_num) in each data partition D[Gk] (i ≤ k ≤ j) have been recorded in G_QUEUE, the support and confidence of AR in D[{Gi,...,Gj}] can be computed from the sums of those relevant counts. The function OUTPUT converts the longest interval that was found, in the form of {Gptr1, ..., Gptr2}, into a time period described by an understandable representation.
3. Discovery of Longest Periodicities
Given a time-stamped database and a known association, another temporal feature is a set of regular intervals in cycles, during each of which this association exists. A periodic time can be represented as a triplet <Cycle, Granule, Range>. Cycle is the length (given by a calendar) of a cycle, Granule is the duration (given by a calendar) of a granular interval, and Range is a pair of numbers which give the position of the regular intervals in the cycles. Given a periodic time PT = <CY, GR, (x, y)>, its interpretation Φ(PT) = {P1, P2, ..., Pj, ...} is regarded as a set of intervals consisting of the x-th to y-th granular intervals of GR, in all the cycles of CY. If we partition the time domain T by CY and express it as {C1, C2, ..., Cj, ...} (where Cj is an interval of CY), we have Pj ⊆ Cj (for any j > 0). For example, let PT = <Year, Quarter, (4, 4)>; then T can be expressed as {year1, year2, ..., yearj, ...} and Φ(PT) as {Q1, Q2, ..., Qj, ...} (where Qj is the last quarter of year j).
Definition 3.1: Given an association AR, a periodic time PT = <CY, GR, RR> is valid with respect to AR if not less than min_freq% of the intervals in Φ(PT) are strictly long with respect to AR.
Definition 3.2: Given an association AR, a periodic time PT = <CY, GR, RR> is longest with respect to AR if PT is valid with respect to AR and there is no PT' = <CY, GR, RR'> such that RR' ⊃ RR and PT' is strictly long with respect to AR.
Definition 3.3: Given a set of time-stamped transactions (D) over a time domain (T), a minimum support (min_supp), a minimum confidence (min_conf), a minimum frequency (min_freq), a minimum interval length (min_ilen), as well as the cycle of interest (CY) and the granularity (GR), the problem of mining the periodicities of a known association (AR) is to find all possible periodic times <CY, GR, RR> which are longest with respect to the association AR. Here, RR is expected to be discovered.
According to the above, the cyclicity (CY) and granularity (GR) of the periodic times that are of interest are given. So, we can suppose time domain T = {C1, C2, ..., Cm}, where Ci (1 ≤ i ≤ m) is a cycle, so that the data set D can be partitioned into {D[C1], D[C2], ..., D[Cm]}. The search can be decomposed into two sub-problems: 1) search for all the longest intervals over each Ci from D[Ci]; 2) derive the possible periodicities from all longest intervals found in each cycle Ci. The algorithm in section 2 can be used for the search for the longest intervals over each Ci from dataset D[Ci]. We therefore only focus on the second sub-problem: how to derive the periodic times from all longest intervals that are found in each cycle Ci. We use Ci.ITVLSET to denote the set of all longest intervals found in each cycle Ci. The algorithm (PIDeriver) used for the derivation is based on the following steps: 1) Scanning each Ci.ITVLSET and adding all longest intervals that are found into an ordered list A_LIST, which is ordered by the starting point and the ending point of the interval. Intervals in A_LIST are called essential intervals. 2) Looking for all candidate intervals by splitting essential intervals in A_LIST.
If any two intervals in A_LIST intersect and the intersection is long enough, then the intersection is added into the candidate interval set C_LIST.
3) For each candidate interval in C_LIST, counting the number of cycles in which there exists a longest interval covering this interval; computing the frequency for this candidate interval; and removing it from C_LIST if it does not satisfy the minimum frequency (min_freq%). Finally, for each interval ITVLi in C_LIST, removing it from C_LIST if there is another interval ITVLj in C_LIST such that ITVLi ⊆ ITVLj.
4. Implementation and Experimental Results
The algorithms described have been implemented in a prototype mining system [3]. The kernel of the system is a temporal mining language [3], which has been integrated with SQL on the basis of ORACLE. For testing the performance of the algorithms, we generated three datasets that mimic the transactions within one year in a retailing application. Each transaction is stamped with the time instant at which it occurs. We ran the algorithm LISeeker to look for the longest intervals of a given association of items with a fixed interval granularity, minimum support and minimum confidence, but different minimum interval lengths. The results show that no matter what the given minimum interval length is, the elapsed CPU times are only slightly different. The expense of the search is mostly spent on the scanning of the database, and the database is scanned only once regardless of the minimum interval length. Therefore, the search time depends almost exclusively on the size of the dataset. The elapsed CPU time rises almost linearly with the sizes of the datasets. Since the search for longest periodicities is based on algorithm LISeeker, and the cost of running PIDeriver can almost be neglected compared with the cost of running LISeeker, its performance is very similar to that of the search for longest intervals.
5. Conclusions and Future Work
This paper concentrated on the identification of interesting temporal features (valid period and periodicity) of association rules. Based on the concepts of long intervals and longest periodicities, the mining problems were defined and the search techniques were discussed with the corresponding algorithms. We believe that the identification of similar temporal features of other types of patterns can occur naturally within the same framework. Work is now concentrating on the development of algorithms for the identification of similar temporal features for the different types of patterns. An interactive temporal data mining system for supporting the described tasks has been developed with an appropriate SQL-based language [3]. It is currently being extended to support other mining tasks.
References
1. Agrawal, R., Imielinski, T., and Swami, A., Mining Associations between Sets of Items in Massive Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington D.C., May 1993.
2. Chen, X., Petrounias, I., and Heathfield, H., Discovering Temporal Association Rules in Temporal Databases, Proceedings of the International Workshop on Issues and Applications of Database Technology (IADT'98), Berlin, Germany, July 1998.
3. Chen, X. and Petrounias, I., A Framework for Temporal Data Mining, Proceedings of the 9th International Conference on Database and Expert Systems Applications (DEXA'98), Vienna, 1998.
4. Ramaswamy, S., Mahajan, S., and Silberschatz, A., On the Discovery of Interesting Patterns in Association Rules, Proceedings of the 24th VLDB Conference, New York, pp. 368-379, 1998.
The Improvement of Response Modeling: Combining Rule-Induction and Case-Based Reasoning

F. Coenen, G. Swinnen, K. Vanhoof and G. Wets

Limburg University Centre, Department of Applied Economics, B-3590 Diepenbeek, Belgium
{filip.coenen;gilbert.swinnen;koen.vanhoof;geert.wets}@luc.ac.be
Abstract. Direct mail is a typical application in which response modeling is used. In order to decide which people will receive the mailing, the potential customers are divided into two groups or classes (buyers and non-buyers) and a response model is created. Since the improvement of response modeling is the purpose of this paper, we suggest a combined approach of rule-induction and case-based reasoning. The initial classification of buyers and non-buyers is done by means of the C5-algorithm. To improve the ranking of the classified cases, we introduce rule-predicted typicality. The synergy of these two approaches is tested by elaborating a direct mail example.
1 Introduction
One of the most typical examples where response modeling comes into play is direct mail. This marketing application goes further than just sending product information to randomly chosen people, as mass marketing does. A key characteristic of direct mail is that a specific market or geographic location is targeted, while selecting receptors by age, buying habits, interests, income, etc. In order to decide which people will receive the mailing, the potential customers are divided into two groups or classes: buyers and non-buyers. This division, which is based upon the above-mentioned socio-demographic and/or economic information of the potential customers, is called response modeling and can be realized by means of artificial intelligence. A learning algorithm is then applied to predict the class of unseen cases or records, i.e. possible customers. As known from the literature, the accuracy of the prediction never reaches 100%, as there are always cases attributed to the wrong class. Applied to the direct mail example, this means that, at a given mailing depth, there are always people receiving mail concerning products that appear uninteresting to them while buyers are left out of the mailing. As a consequence, avoidable costs are incurred, e.g. costs that could be saved by creating a better response model. One way to arrive at a better response model would be to choose a better classifier [8]. In this paper, however, we suggest another approach: the combination of multiple classification methods. The remainder of this paper is organized as follows. In the next section, the theoretical background of the suggested approach will be discussed, and the following
section deals with the empirical evaluation of the suggested approach. To illustrate this, a direct mail example is further elaborated. The last section will be reserved for conclusions and topics for future research.
2 Suggested Approach
2.1 Classifiers
C5. One of the possible classifiers that can be used in the response modeling of a data set is the C5-algorithm, the more recent version of C4.5 [10]. Our preference for this algorithm is based upon previous research by Van den Poel and Wets [14]. They used the same data set as we did to provide a comparison between a number of classification techniques. They selected techniques from the fields of statistics, machine learning and neural networks, and compared them by means of the overall accuracy on the data set. We preferred to use the C5-algorithm to do the initial classification, since this algorithm attained the highest accuracy on the test set (see also section 3.2). The goal of response modeling is to rank the cases by probability of response. Since each case is classified with a certain confidence by C5, the most trivial way to rank the cases would be by the confidence figure of the applied rule. This means that when the assigned class label is the non-responding class, the complement of the confidence should be taken before sorting the whole data set on this confidence. However, as mentioned before, we propose in this paper another method to improve response modeling, as will be explained.

Case-Based Reasoning. Case-based reasoning methods are based on similarity and try to use the total information of a given unknown case. In our research, we used typicality as the similarity measure. To determine the typicality of each case in the context of this research, the following approach was used.
a) Firstly, for each case i a distance measure dist(i, j) is determined as follows: the attribute values of i are compared with the attribute values of a case j <> i. For each attribute on which the values of the considered cases differ, dist(i, j) is increased by one (independent of the size of the difference).
b) After determining dist(i, j) according to the above-mentioned method, the class values of cases i and j are compared. If i belongs to the same class as j, a measure intra(i) is increased by (1 - (dist(i, j) / number of attributes)). If, on the other hand, both cases belong to different classes, a measure inter(i) is increased by (1 - (dist(i, j) / number of attributes)). The above calculations are made for all cases j <> i. This is where the global character of our approach comes into play, since all other cases j <> i are taken into account in the calculation of the typicality of just one case i.
c) In a next step the measure intra(i) is divided by the number of cases that belong to the same class as i, and inter(i) is divided by the number of cases that belong to the other class.
d) Finally, the typicality of case i is determined by dividing the normalized intra(i) by the normalized inter(i). The typicality was calculated for each case, allowing the cases to be ranked by this measure. The above steps lead to the following definition:

    Typicality(i) = (intra(i) / p) / (inter(i) / n)                              (1)

with p the number of cases that belong to the same class as case i, and n the number of cases that belong to the other class. The cases with typicality higher than 1 are considered typical cases for the class they belong to. Used as a classifier, this method looks at the similarity between the considered case and the different classes and assigns the label of the most similar class to the case. In response modeling, however, it is sufficient to look at the similarity between the considered case and the responding class, in order to use this similarity as a ranking criterion.

A Combined Approach. As is known from previous research, the accuracy of such a model almost never reaches 100% for real-world cases. Also in our direct mail example, the classification of buyers and non-buyers was not completely correct; some non-buyers were classified in the class of the buyers, and vice versa. Since the accuracy of our model attained 76.32%, a percentage of 23.68% of the cases were misclassified. The fact that errors are made implies that there is room left for improvement if we choose not to write to all persons in the data set, as is often the case in direct mailing. This will be further explained in section 3.3. In order to upgrade the response model and have more control over these mislabeled instances, we decided to rank the classified cases. Empirical results taught us that C5 is a better classifier than typicality on the one hand, and also better than the other considered classifiers on the other hand. This is why we opted for this algorithm to do the initial classification. By applying C5, a case obtains a response probability from just one rule, i.e. the rule with the highest confidence that meets the case. The other rules or cases are not taken into account. The classification by C5 can thus be considered a local approach; only part of the information carried by the case and the rule base is used. In contrast to this, a case-based reasoning method displays a global character; a case obtains a response probability by looking at the total data set. Empirical results (see section 3.2) showed us that typicality outperforms confidence in ranking the cases. That is why we opted for this method to improve the ranking of the classified cases. By combining the strengths of both methods, i.e. C5 as the best classifier and typicality as the best ranker, we could investigate the effects of the combination of a global and a local approach. The new response modeling method suggested in this paper can then be described as follows. An unknown case obtains the class label from the C5 classifier and obtains as response measure the typicality for the given class label. As a consequence, the calculation of the assigned typicality is based on the predicted class label of the case. This typicality will further be called rule-predicted typicality. Thus, the cases are ranked firstly by class label and secondly by rule-predicted typicality.
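A minimal sketch of this typicality computation is given below. It is our own illustration (not the authors' implementation), assuming the cases are given as a NumPy array of categorical attribute values with a parallel vector of class labels; the function name is invented here.

    import numpy as np

    def typicality(X, y, i):
        # X: (N, A) array of categorical attribute values, y: (N,) class labels.
        # Follows steps a)-d) and equation (1) for a single case i.
        X, y = np.asarray(X), np.asarray(y)
        N, A = X.shape
        dist = (X != X[i]).sum(axis=1)        # a) number of differing attributes
        sim = 1.0 - dist / A                  #    contribution (1 - dist(i,j)/A)
        same = (y == y[i])
        same[i] = False                       # exclude case i itself
        other = (y != y[i])
        intra = sim[same].sum()               # b) similarity towards the own class
        inter = sim[other].sum()              #    similarity towards the other class
        p, n = same.sum(), other.sum()        # c) class sizes used for normalization
        return (intra / p) / (inter / n)      # d) equation (1)

Ranking the cases of a predicted class by this value reproduces the sorting that is evaluated in section 3.2.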
2.2 Evaluation

To compare the ranking by typicality on the one hand with the ranking by confidence and the original situation on the other hand, we selected the Coefficient of Concordance (CoC) [6] and the cumulative response rate as objective measures, and graphs as a visualization tool. The CoC takes into account the ranking of the cases and gives a percentage as outcome. The higher the percentage, the better the sorting. The main reason for choosing this measure is that it looks at the distribution of the cases in the predicted class as a whole. Therefore, the distribution is calibrated on a 10-class rating scale. This means that the distribution is split up into 10 intervals, each with a score higher than the previous interval. The CoC is defined as follows:

    CoC = (1 / (n_g * n_b)) * ( sum_{i=min score}^{max score} nb_i * ng'_i
                                + 0.5 * sum_{i=min score}^{max score} nb_i * ng_i )      (2)

with nb_i, respectively ng_i, the number of badly, respectively well, classified cases with a score equal to i, ng'_i the number of well classified cases with a score better than i, and n_g and n_b the total numbers of well and badly classified cases. With a given mailing depth, we know how many cases will be mailed, and the different methods can be evaluated with the cumulative response rate. Further, the graphs can help us to discover in which range a certain method is superior.
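As an illustration only (not the authors' code), equation (2) could be computed as follows, assuming that a higher score indicates a case that is more likely to be correctly classified; function and variable names are ours.

    import numpy as np

    def coefficient_of_concordance(scores, correct, bins=10):
        # scores: ranking scores of the cases in one predicted class;
        # correct: 1 if the case was classified correctly, else 0.
        scores, correct = np.asarray(scores, float), np.asarray(correct, int)
        # calibrate the distribution on a 10-class rating scale
        edges = np.linspace(scores.min(), scores.max(), bins + 1)
        binned = np.digitize(scores, edges[1:-1])          # bin index 0 .. bins-1
        ng = np.array([np.sum((binned == i) & (correct == 1)) for i in range(bins)])
        nb = np.array([np.sum((binned == i) & (correct == 0)) for i in range(bins)])
        # ng'_i: well classified cases with a score better than i
        ng_better = np.array([ng[i + 1:].sum() for i in range(bins)])
        n_g, n_b = ng.sum(), nb.sum()
        return (np.sum(nb * ng_better) + 0.5 * np.sum(nb * ng)) / (n_g * n_b)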
3 Empirical Validation
3.1 The Data Set
The data set that was used for empirical validation was collected from an anonymous mail-order company and consists of 6800 records or cases, each record described by 15 attributes. These records are equally divided between the classes 0 (non-buyers) and 1 (buyers). The information that was available concerns transactional data, as well as socio-demographic information of the customers. All variables were categorized after careful consideration with the mail-order company. They provided the data to us at the level of the individual customer. The specific model that we have built is based on all available data, and predicts whether a person is a possible buyer or not. The outcome is a binary response variable (0/1) representing buying or not buying. Before the induction of the C5 classifier, a training set was composed by randomly selecting approximately 2/3 of the cases from the original data set. The remaining part was used for testing purposes.
3.2 Results
Evaluation of the Classification. As mentioned in the section concerning the suggested approach, the C5-algorithm was used to classify the cases in a first step. By
applying C5, an accuracy of 76.32% was obtained on the test set, which consisted of 2052 cases (1018 buyers and 1034 non-buyers). Since the accuracy didn't reach 100%, and the cases were randomly divided into a training and a test set, there are a number of incorrectly classified cases randomly divided among the predicted ones. In order to implement our approach, we separated the cases that were predicted to belong to class 0 (1120 cases) from the cases that were predicted to belong to class 1 (932 cases). This means that our model considered 1120 out of 2052 persons as non-buyers, and 932 persons as buyers. An overview of the situation in the test set after classifying by C5 is shown in table 3.

Table 3. Confusion Matrix of the Test Set

                 Real 0    Real 1    Total
  Predicted 0       834       286     1120
  Predicted 1       200       732      932
  Total            1034      1018     2052
In order to deduce a better response model, we improved the ranking of the cases by sorting them by typicality within the predicted class, under the assumption that the cases that were misclassified would have a lower typicality. This means that after sorting by typicality, the misclassified cases would appear lower in the ranking than the correctly classified ones.

Evaluation of the Ranking. To compare the outcome of our experiments with the initially unsorted situation on the one hand and the sorting by confidence and rule-predicted typicality on the other hand, we used the Coefficient of Concordance. As mentioned in section 2.2, the distribution has to be ranked on a 10-class rating scale to be evaluated. To evaluate the ranking by rule-predicted typicality in the context of this research, we decided to use the rule-predicted typicality of the cases as score. This means that if the highest rule-predicted typicality of a case in the set attains 1.5, and the lowest rule-predicted typicality equals 0.5, the cases with rule-predicted typicality between 0.5 and 0.6 will be considered as belonging to the same group, and thus have the same score. This implies that the score in the definition (see section 2.2) is replaced by rule-predicted typicality. The same method is used to evaluate the sorting by confidence. The exact results of these calculations can be found in table 4.

Table 4. The Coefficient of Concordance

                                          Predicted Class 0    Predicted Class 1
  Sorted by Confidence                         55.2%                65.9%
  Sorted by Rule-Predicted Typicality          62.9%                65.0%
Table 4 shows us that the ranking of the test cases that were predicted to belong to class 0, as well as the test cases that were predicted to belong to class 1, becomes better after sorting by rule-predicted typicality or by confidence. In both cases the coefficient of concordance is higher than 50%, i.e. the percentage that can be
expected by a random division of the misclassified cases among the correctly classified cases. If the sorting by rule-predicted typicality is compared with the sorting by confidence, a difference between the predicted class 0 and the predicted class 1 can be noticed. For the predicted class 0, the rule-predicted typicality produces a better result, since the coefficient of concordance equals 62.9% against 55.2% after sorting by confidence. This observation is further illustrated by figure 1; the rule-predicted typicality curve is less steep over a larger distance than the confidence curve. The X-axis shows the number of cases in the predicted class 0, whereas the Y-axis shows the number of misclassifications as they appear gradually among the considered cases.
Fig. 1. The appearance of the errors among the cases that are predicted to belong to class 0. The gray colored graph represents the occurrence of errors among the cases that were predicted to belong to class 0 for the unsorted situation. The black and the bold black graph describe the same after sorting by confidence, respectively rule-predicted typicality.
3.3 Application on the Direct Mail Example
In normal circumstances, a mail-order company will try to cut off between 10% and 40% of the unattractive part of its mailing list. This means that between 60% and 90% of all the persons in the data set will receive the mailing. Often, a mailing depth of 75% is used [14]. The reason for this is that the profit generated by converting a non-buyer into a buyer is considered higher than the cost of sending a letter to a person who is not interested in the products that are the subject of the mail. In our further calculations, we will also consider a mailing depth of 75% of the test set (0.75 * 2052 = 1539 persons). To reach these people, we will direct a letter to all the persons that are considered buyers by our system, i.e. 932 persons, of which 732 are classified correctly and thus are buyers in reality. 1539 - 932 = 607 persons from the predicted class 0 will complete this number, so that a total of 1539 persons is reached. Applied to this direct mail example, the sorting by confidence and rule-predicted typicality produced the following results.
Sorting by Confidence. The predicted non-buyers were sorted by increasing confidence. As the non-buyers with low confidence are more likely to be misclassified than the ones that were predicted to be non-buyers with high confidence, the 607 non-buyers with the lowest confidence are included in the mailing list. Among these 607 persons there were 169 buyers. This means that by mailing 1539 persons, we would reach 732 + 169 = 901 buyers out of the 1018 buyers (88.5%) that are present in the test set.

Sorting by Rule-Predicted Typicality. Analogously to the sorting by confidence, we sorted the cases that were predicted to belong to class 0 by increasing rule-predicted typicality and included the 607 persons with the lowest rule-predicted typicality in the mailing list. Among these 607 persons there were 194 buyers, so that we would reach 732 + 194 = 926 buyers out of 1018 (91%) by mailing 1539 persons.

Unsorted Situation. To illustrate the improvement that is made by sorting the cases of the predicted class, we finally give an overview of the situation as it would be without any sorting. Among the 607 persons of the predicted class 0, there would be approximately 0.26 * 607 = 158 buyers, since 286 out of 1120 (about 26%) cases were misclassified and the errors are randomly divided in the predicted class. This means that we would reach 732 + 158 = 890 buyers out of 1018, or 87.4%. An overview of the results can be found in table 5.

Table 5. The number of reached buyers

  Unsorted    Sorted by Confidence    Sorted by Rule-Predicted Typicality
   87.40%           88.50%                        91.00%
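The construction of the 1539-person mailing list underlying these figures can be sketched as follows; this is our own illustration with invented field names, not code from the paper.

    def build_mailing_list(cases, depth=0.75):
        # Select all predicted buyers, then fill up to the mailing depth with the
        # predicted non-buyers that rank lowest by the chosen measure (confidence
        # complement or rule-predicted typicality). Each case is a dict with
        # 'pred' (0/1) and 'rank_score'.
        n_mail = int(round(depth * len(cases)))            # e.g. 0.75 * 2052 = 1539
        buyers = [c for c in cases if c['pred'] == 1]      # 932 predicted buyers
        non_buyers = [c for c in cases if c['pred'] == 0]
        # predicted non-buyers with the lowest score are most likely misclassified
        non_buyers.sort(key=lambda c: c['rank_score'])
        return buyers + non_buyers[:n_mail - len(buyers)]  # 932 + 607 = 1539 letters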
The fact that the improvement after ranking by confidence is limited to 1.10% shows that sorting the classified cases is a difficult problem. Our approach proved to be a useful one, since its improvement (2.50%) is more than twice as high as the 1.10% improvement obtained by sorting by confidence.
4 Conclusions
This article describes a method for improving response modeling by using a combined approach of rule-induction and case-based reasoning. The proposed approach consists of classifying the cases by means of the C5-algorithm in a first step, and ranking the classified cases by a typicality measure in a second step. In this way, we could test whether the combination of local and global information yields a synergy. Based on empirical results we decided that the C5-algorithm was the best classifier to do the initial classification. This algorithm provides the local aspect of our approach, since it classifies each case by just one rule, i.e. the rule with the highest confidence that meets the case. The other rules or cases are not taken into account. In contrast to this, a case-based reasoning approach displays a global character, since a case obtains a response probability by looking at the total data set. Empirical results showed us that sorting by typicality was the best method to improve the ranking of the
classified cases. To do so, we introduced the concept of rule-predicted typicality, as the calculation of the typicality of a test case is based on the predicted class value of the considered case. Finally, the application of our approach to a direct mail example has shown the method to be a promising one. It yields an improvement of 2.50%, compared with the improvement of 1.10% generated by ranking the classified cases by the existing confidence figures. This implies that we were able to reach 91% of the buyers in our test set, at a mailing depth of 75%. Although this is only a small improvement in absolute terms, the total success of a direct mailing can depend on it. Since we have not yet been able to test this approach on more than one data set, this is an opportunity for future work.
References
1. Aijun, A., Cercone, N.: Multimodal Reasoning with Rule Induction and Case-Based Reasoning, in Multimodal Reasoning, AAAI Press (1998)
2. Bayer, J.: Automated Response Modeling System for Targeted Marketing (1998)
3. Brodley, C.E. and Friedl, M.A.: Identifying and eliminating mislabeled training instances, in Proceedings of the Thirteenth Nat. Conference on Artificial Intelligence, AAAI Press (1996)
4. Domingos, P.: Knowledge Discovery via Multiple Methods, IDA, Elsevier Science (1997)
5. Domingos, P.: Multimodal Inductive Reasoning: Combining Rule-Based and Case-Based Learning, in Multimodal Reasoning, AAAI Press (1998)
6. Goonatilake, S., Treleaven, P.: Intelligent Systems for Finance and Business, Wiley (1995) 42-45
7. Holte, R., Acker, L.E., Porter, B.W.: Concept learning and the problem of small disjuncts, in Proceedings of the Eleventh Int. Joint Conference on AI, Morgan Kaufmann (1989) 813-818
8. Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms, Thirteenth National Conference on Artificial Intelligence (1996)
9. Ling, C.X., Li, C.: Data Mining for Direct Marketing: Problems and Solutions, in Proceedings of the Fourth Int. Conference on Knowledge Discovery and Data Mining, AAAI Press (1998) 73-79
10. Quinlan, J.R.: C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo (1993)
11. Sabater, J., Arcos, J.L. and Lopez de Mantaras, R.: Using Rules to Support Case-Based Reasoning for Harmonizing Melodies, in Multimodal Reasoning, AAAI Press (1998)
12. Surma, J. and Vanhoof, K.: Integrating Rules and Cases for the Classification Task, in Case-Based Reasoning, Research and Development, First International Case-Based Reasoning Conference - ICCBR'95, Springer Verlag (1995) 325-334
13. Surma, J., Vanhoof, K. and Limere, A.: Integrating Rules and Cases for Data Mining in Financial Databases, in Proceedings of the Ninth Int. Conference on AI Applications - EXPERSYS'97, IIIT-International (1997)
14. Van den Poel, D., Wets, G.: Data Mining for Database Marketing: a mail-order company application, in Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, RSFD'96 (1996) 383-389
15. Zhang, J.: Selecting Typical Instances in Instance-Based Learning, in Proceedings of the Ninth Int. Conference on Machine Learning, Morgan Kaufmann (1992) 470-479
Analyzing an Email Collection Using Formal Concept Analysis

Richard Cole and Peter Eklund

School of Information Technology, Griffith University
PMB 50 GOLD COAST MC, QLD 9217, Australia
[email protected], [email protected]
Abstract. We demonstrate the use of a data analysis technique called formal concept analysis (FCA) to explore information stored in a set of email documents. The user extends a pre-defined taxonomy of classifiers, designed to extract information from email documents with her own specialized classifiers. The classifiers extract information both from (i) the email headers providing structured information such as the date received, from:, to: and cc: lists, (ii) the email body containing free English text, and (iii) conjunctions of the two sources.
1 Formal Concept Analysis
Formal Concept Analysis (FCA) [8,3] is a mathematical framework for performing data analysis that has as its fundamental intuition the idea that a concept is described by its intent and its extent. FCA models the world as being composed of objects and attributes. The choice of what is an object and what is an attribute depends on the domain in which FCA is applied. Information about a domain is captured in a formal context, which is a triple K = (G, M, I) in which G is a set of objects, M is a set of attributes, and I ⊆ G × M is a relation saying which objects possess which attributes. A formal concept is a pair (A, B) where A is a set of objects called the extent, and B is a set of attributes called the intent. A must be the largest set of objects for which each object in the set possesses all the attributes in B; the reverse must also be true of B. More precisely, a formal concept of the context (G, M, I) is a pair (A, B), with A ⊆ G, B ⊆ M, A = {a ∈ G | ∀ b ∈ B: (a, b) ∈ I} and B = {b ∈ M | ∀ a ∈ A: (a, b) ∈ I}. The fundamental theorem of FCA states that the set of formal concepts of a formal context forms a complete lattice. This complete lattice is called a concept lattice, and is usually denoted B(K). A complete lattice is a special type of partial order in which the greatest lower bound and least upper bound of any subset of the elements of the lattice must exist. A lattice may be drawn via a line diagram (see Figures 1 and 4) [7]. For each attribute m there is a maximal concept that has m in its intent. We shall use the function γ : M → B(K) to denote the mapping from attributes to their maximal concepts.
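To make the definitions concrete, the two derivation operators and the concept condition can be written down directly; the following sketch is our own illustration with a toy context, not part of the authors' system.

    def extent(context, B):
        # objects possessing every attribute in B (the derivation B')
        return {g for g, attrs in context.items() if B <= attrs}

    def intent(context, A):
        # attributes shared by every object in A (the derivation A')
        attrs = set.union(*context.values())
        return {m for m in attrs if all(m in context[g] for g in A)}

    def is_formal_concept(context, A, B):
        # (A, B) is a formal concept iff A = B' and B = A'
        return extent(context, B) == A and intent(context, A) == B

    # toy context: email id -> set of attributes assigned by the classifiers
    K = {'e1': {'From Melfyn', 'DSTC'}, 'e2': {'DSTC'}, 'e3': {'From Melfyn'}}
    A = extent(K, {'From Melfyn'})                 # {'e1', 'e3'}
    print(is_formal_concept(K, A, intent(K, A)))   # True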
Fig. 1. (a) A concept lattice with the implication m1 ∧ m2 → m3. (b) A concept lattice with the partial implication Pr(m1 | m2) = 15/44.
A concept lattice, generally denoted B(K), is representationally equivalent to the attribute logic that exists over the attributes of the context. For example (see Fig. 1(a)), the proposition ∀g ∈ G: p1(g) ∧ p2(g) → p3(g), where pi(g) means that object g has attribute mi, will be true if and only if the greatest lower bound of γ(m1) and γ(m2) is greater than or equal to γ(m3) in the concept lattice. Since this information is represented diagrammatically, it is more accessible than a list of conditional probabilities. Labels on the lattice (see Fig. 1(b)) attached above each concept show the introduced intent¹, while the labels attached below show the size of the extent of the concept. It is possible to determine the intent of a concept by collecting all intent labels on an upward path from the concept. For example, the concept labeled m3 in Fig. 1(b) also has m2 and m1 in its intent. The concept lattice can also represent partial implications or conditional probabilities [5]. For example, if we wanted to know the probability that an object from G has m1 given that it has m2, denoted Pr(m1 | m2), this would be given by |Ext(γ(m1) ∧ γ(m2))| / |Ext(γ(m2))| = 15/44. The two numbers in this ratio are both present in the diagram (see Fig. 1(b)).
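The partial implication Pr(m1 | m2) can be read off a context in exactly this way; the snippet below is again only an illustration with invented names.

    def partial_implication(context, m1, m2):
        # Pr(m1 | m2) = |Ext(gamma(m1) ^ gamma(m2))| / |Ext(gamma(m2))|,
        # i.e. the fraction of objects having m2 that also have m1
        ext_m2 = {g for g, attrs in context.items() if m2 in attrs}
        ext_both = {g for g in ext_m2 if m1 in context[g]}
        return len(ext_both) / len(ext_m2)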
2 Background
Previous applications of FCA may be divided into two categories: those that generate a large concept lattice of all terms and documents — the number of concepts is roughly the square of the number of documents — and those that employ conceptual scaling. Godin et al. [4] proposed navigating through a set of text documents via a large concept lattice. Each concept visited was presented in a window listing its intent, and the user moved to subsequent nodes via selection of additional terms. Carpineto and Romano [1] proposed navigation in a concept lattice of all terms using a fish-eye viewer.
¹ The introduced intent is that part of the intent that is not found in the intent of any more general concepts.
Wille and Rock [6] implemented a system for a library catalogue, using a software system called TOSCANA, in which a large number of sub-lattices were designed by a subject librarian. A visitor to the library could choose a “theme” previously defined by the librarian. This was seen as an advantage in the library environment since the users of the system are generally unfamiliar with reading lattice diagrams. In a sense these approaches represent different extremes. In the first instance the user has a maximum number of choices when navigating, while in the second the user is presented with carefully constructed views of the data. The approach outlined in this paper attempts to strike a middle road allowing the user to construct and modify scales in response to learning information about the data. It is novel in that it allows the user to define a hierarchy over the search terms and presents the user with a dynamic environment for the creation and modification of scales by the user.
3 Hierarchy of Classifiers
The task of extracting information from text documents is a difficult one. The language used in email documents is often informal, makes extensive use of abbreviations, and is highly contextualized. For this reason we do not attempt to do any deep extraction of information from email texts but rather recognize key terms. We experimented with classifiers that recognize regular expressions, either from email headers, or within the body of the email itself.
    (a) Classifier file:
     1   CLASSIFICATION "Email-Analysis"
     2     ...
     3     "Year 1994 - Sep:Nov"
     4       Date: "^[A-Z][a-z]+, [0-9]+ (Sep|Nov) 1994" ;
     5     ...
     6     "Melfyn mentions Barbagello"
     7       From: melfyn Body: David ;
     8     "Mentions Melfyn"
     9       Body: [mM]elfyn ;
    10     ...
    11   END CLASSIFICATION

    (b) Hierarchy file:
     1   BEGIN ORDER "Email-Analysis"
     2     ...
     3     "Melfyn mentions Barbagello" < "Mentions Barbagello" ;
     4     "From Melfyn" < "From DSTC" ;
     5     "From DSTC" < "DSTC" ;
     6     ...
     7   END ORDER

Fig. 2. (a) Classifiers: a file expressing classifiers for the terms of interest via regular expressions. (b) Hierarchy: a file expressing the hierarchical ordering of classifiers.
The example in Figure 2(a) shows a portion of the classifier file, generated by our taxonomy editor. Lines 3 and 4 show the definition of a classifier. The classifier recognizes the attribute whose name is “Year 1994 - Sep:Nov”. It matches the date field of an email message with a regular expression that recognizes
dates between September and November of 1994. Lines 6 and 7 show a classifier that detects emails sent by "Melfyn" in which "David" is mentioned within the text of the email. The classifiers associate attributes with emails and the result is stored in an inverted file index. A hierarchy is defined by a set of subsumption rules defined by the user. For instance, a portion of the example file produced by our taxonomy editor is presented in Fig. 2(b). Line 3 introduces the implication that an email with "Melfyn mentions Barbagello" implies that the email "Mentions Barbagello". Associated with each attribute is a primary name, e.g. "Melfyn mentions Barbagello", and a set of descriptions; for example, "Melfyn mentions Barbagello" might also be recognized by "Melfyn Lloyd mentions David Barbagello". These extra descriptions are used in the next section, in which the data is explored.
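The following sketch is our own illustration (the classifier table, the single-parent hierarchy and all function names are simplifications invented here) of how such regular-expression classifiers and the subsumption rules could be applied to one email represented as a dictionary of header and body fields.

    import re

    # illustrative classifiers: attribute name -> (email field, regular expression)
    CLASSIFIERS = {
        'Year 1994 - Sep:Nov': ('Date', r'^[A-Z][a-z]+, [0-9]+ (Sep|Nov) 1994'),
        'From Melfyn':         ('From', r'melfyn'),
        'Mentions Melfyn':     ('Body', r'[mM]elfyn'),
    }
    # subsumption rules: attribute -> a more general attribute it implies
    HIERARCHY = {'Melfyn mentions Barbagello': 'Mentions Barbagello',
                 'From Melfyn': 'From DSTC', 'From DSTC': 'DSTC'}

    def classify(email):
        # email: dict of fields ('Date', 'From', 'Body', ...) -> set of attributes,
        # closed upward under the subsumption hierarchy
        attrs = {name for name, (field, rx) in CLASSIFIERS.items()
                 if re.search(rx, email.get(field, ''))}
        frontier = list(attrs)
        while frontier:                      # add every implied, more general attribute
            general = HIERARCHY.get(frontier.pop())
            if general and general not in attrs:
                attrs.add(general)
                frontier.append(general)
        return attrs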
4 Conceptual Scaling
After defining a taxonomy of attributes which are associated with email documents by classifiers defined in the previous section, it is necessary for the user to choose a small number of attributes, usually less than 10 [2]. The user does this with a specific question in mind and with the aid of the program depicted in Fig. 3.
Fig. 3. Conceptual Scale Creation Tool
The user is interested in the ways in which the attributes combine in the email collection. For example she might be interested in emails "from Melfyn", "about the DSTC", and "mentioning Barbagello". She searches for appropriate attributes based either (i) on their description or (ii) on their location in the hierarchy.
To locate an attribute via its description, the user may enter a text search. For instance "Melfyn" would match all attributes having a description containing a word with the prefix "Melfyn". Alternatively, the attribute "From Melfyn" might be located as an element immediately below "From DSTC Personnel". In either case, the results of the search operation are displayed in the lower right hand panel of the tool shown in Fig. 3, and clicking on an attribute adds it to the diagram. The conceptual scale is a subset of the attributes selected by the user, and is displayed in the left hand panel of the tool shown in Fig. 3. It is desirable for the user to gain an impression of how the attributes are related with respect to the taxonomical ordering defined in the previous section. We represent the attributes using a Hasse diagram (see Figure 1(a)). Using a Hasse diagram to represent the relative ordering of a subset of elements from an ordered set raises a number of questions: should we preserve (i) the covering relation, (ii) the ordering relation, and (iii) meets and joins where they exist? Preserving the covering relation, while straightforward, is cumbersome since it produces long chains in the diagram and introduces a large number of extra elements. Preserving the ordering relation by itself would, for many queries, produce a single anti-chain². We preserve the ordering relation, which we then close under join. This induces a new covering relation computed in response to updates of the diagram. The diagram is automatically drawn using a force directed placement heuristic that attempts to minimize change to the diagram. All changes to the diagram are animated to help the user preserve their mental map of the diagram. The user can remove join irreducible elements — those not required by the join closure requirement — from the diagram. An attempt to remove a join reducible element results in user feedback showing the elements preventing its removal. After the user has selected attributes to the required level of specificity, the scale may be used to construct a concept lattice showing the concepts generated by the emails and the attributes selected by the user.
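The join closure of a selection can be sketched as follows; this is our own illustration, assuming the taxonomy is given as a mapping from each attribute to its immediately more general attributes, with invented function names.

    def upset(order, a):
        # all attributes greater than or equal to a (its ancestors, including a)
        seen, stack = {a}, [a]
        while stack:
            for b in order.get(stack.pop(), ()):
                if b not in seen:
                    seen.add(b); stack.append(b)
        return seen

    def join(order, a, b):
        # least common upper bound of a and b in the taxonomy, if it is unique
        common = upset(order, a) & upset(order, b)
        minimal = [c for c in common
                   if not any(d != c and c in upset(order, d) for d in common)]
        return minimal[0] if len(minimal) == 1 else None

    def close_under_join(order, selection):
        # add the join of every pair of selected attributes until nothing changes
        closed, changed = set(selection), True
        while changed:
            changed = False
            for a in list(closed):
                for b in list(closed):
                    j = join(order, a, b)
                    if j is not None and j not in closed:
                        closed.add(j); changed = True
        return closed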
5 Analyzing the Lattice
Fig. 4 shows the concept lattice resulting from the terms identified in the scaling process shown in Fig. 3. The diagram is expressed as a lattice product. Each of the small black circles represents a concept; all the concepts of the bottom large oval have "Philippe" in their intent. In other words, these concepts refer to emails associated with Philippe via one or more of the classifiers defined in Fig. 2(a). Compare the numeric labels of the top two concepts in each of the ovals. "793" is the number of emails associated with the term "Philippe", while "5022" is the total number of emails in the test set. We infer from this that 793/5022, or 15.7%, of all emails are associated with "Philippe". Moving down from the top numeric label "5022" to its child "4108" (also labeled "Groups") reveals that 81% of all emails in the test set are associated with the term "Groups".

² An anti-chain is a set of elements that have no relative ordering.
Fig. 4. Analysis of the relationships between the terms "Mention Stephen", "KVO", "DSTC", "DSTO", "Melfyn" and the target "Philippe".
Similarly, if we move down from the top label "793" in the lower oval to its child "382" we see that only 48% of emails associated with "Philippe" concern "Groups" — a possible reason being that much of the email traffic within the email set associated with "Philippe" is not group related. "Groups" are further divided into group categories: these can be read from the top oval identified with the labels "KVO", "DSTC", and "DSTO". Note that "KVO" concerns the majority of group email traffic with 3589/4108 = 87.3%. Emails associated with the group labels "DSTO" and "DSTC" are read from the label "44" in the top oval. Finding the corresponding circle in the lower oval we notice that it is grey. This represents an implication. We move from the grey circle down through the lattice to the label "24". This point includes the extra attribute "KVO". The inference is that emails associated with "DSTC", "DSTO" and "Philippe" are all associated with the term "KVO". The diagram can be viewed as a three-dimensional space containing thematic planes. Consider the plane defined by the labels 109, 17, 12, 11, 14, 57 in the upper oval of Fig. 4. This plane represents the impact of the term "Mention Stephen" on the other named terms "KVO", "DSTC", "DSTO", "Melfyn". The plane is parallel to two other planes (those above it, to the right) and by considering corresponding points in each of these planes we can measure the influence of the term "Mention Stephen" on the way emails in the test set are partitioned by that term. A more specific inference concerns the points labeled "11" and "12" in the upper oval. "12" is the number of emails associated with the combination of "Mention Stephen" and "Mention Melfyn". Therefore, 11/12 of these emails
also mention the term "KVO". The inference is the high correspondence of the use of "KVO" in the context of emails involving 'Melfyn' and 'Stephen'. Moving to the bottom circle in the lower oval labeled "10", we infer that 10/11 of these emails also mention "Philippe". In summary, less than half of the email associated with "Philippe" is group related (382/793). When "Philippe" is mentioned in the context of a group it mostly concerns the "KVO" group (344/382 = 90%). Emails associated with both the "DSTC" and "DSTO" groups that are "Philippe" related always mentioned the "KVO" group (24/24 — inferred via the grey circle). Finally, "Philippe" is mentioned in 24/44 (55%) of emails involving both the "DSTC" and "DSTO" but is mentioned in 83% of correspondence mentioning "Stephen" and "Melfyn". We can draw the inference from the analysis of this email data that "Philippe" is the important factor of common interest between "Stephen" and "Melfyn". It is also clear from the email analysis that "Philippe" is a more important topic of discussion to the "KVO" group (344/382) than to the "DSTC" (53/382) and "DSTO" (106/382) generally.
6 Conclusions
This paper has described the use of a suite of tools designed to allow an investigation of data retrieved from email. The data is retrieved from the emails with the aid of a hierarchy of classifiers that extract useful terms and encode known implications. Further implications, both complete and partial, are then investigated by means of a nested line diagram.
References
1. C. Carpineto and G. Romano. A lattice conceptual clustering system and its application to browsing retrieval. Machine Learning, 24:95-122, 1996.
2. R. Cole and P. Eklund. Scalability of formal concept analysis. Computational Intelligence, 15(1):11-27, 1999.
3. B. Ganter and R. Wille. Formal Concept Analysis: Logical Foundations. Springer Verlag, 1999.
4. R. Godin, J. Gecsei, and C. Pichet. Design of a browsing interface for information retrieval. SIG-IR, pages 246-267, 1987.
5. Michael Luxenburger. Implications, dependencies and Galois drawings. Technical report, TH Darmstadt, 1993.
6. T. Rock and R. Wille. Ein TOSCANA-Erkundungssystem zur Literatursuche. In G. Stumme and R. Wille, editors, Begriffliche Wissensverarbeitung: Methoden und Anwendungen. Springer-Verlag, Berlin-Heidelberg, 1997.
7. Frank Vogt and Rudolf Wille. TOSCANA - a graphical tool for analyzing and exploring data. In Graph Drawing '94, LNAI 894, pages 226-233. Springer Verlag, 1995.
8. Rudolf Wille. Concept lattices and conceptual knowledge systems. In Semantic Networks in Artificial Intelligence. Pergamon Press, Oxford, 1992. Also appeared in Computers & Mathematics with Applications, 23(2-9), 1992, pp. 493-515.
Business Focused Evaluation Methods: A Case Study

Piew Datta

GTE Laboratories Incorporated, 40 Sylvan Rd., Waltham, Massachusetts USA 02451
[email protected]
Abstract. Classification accuracy or other similar metrics have long been the measures used by researchers in machine learning and data mining research to compare methods and show the usefulness of such methods. Although these metrics are essential to show the predictability of the methods, they are not sufficient. In a business setting other business processes must be taken into consideration. This paper describes additional evaluations we provided potential users of our churn prediction prototype, CHAMP, to better define the characteristics of its predictions.
1. Introduction

As data mining and machine learning techniques are moving from research algorithms to business applications, it is becoming obvious that the acceptance of data mining systems into practical business problems relies heavily on their integration into business processes. One critical aspect of building a practical and useful system is showing that the techniques can tackle the business problem. Traditionally, the machine learning and data mining research areas have used classification accuracy in some form to show that the techniques can predict better than chance. The evaluation methods need to more closely resemble how the system will work once in place. This paper focuses on the experimental evaluations we performed on a prototype called CHAMP, Churn Analysis Modeling and Prediction, developed for GTE Wireless (GTEW). CHAMP is a data mining tool used to predict which of GTEW customers will churn within the following two months. Although we were able to show that CHAMP was considerably more accurate at identifying churners than existing processes at GTEW, we needed to provide additional evaluations to persuade potential users of its benefits. In the next section of this paper we provide some background about GTEW. Section 3 describes briefly each of CHAMP's components. Section 4 discusses several criteria we used to describe CHAMP's benefits to the GTEW marketing department. Many of these experiments are non-traditional methods used to evaluate CHAMP.
2. GTEW and Its Data Warehouse

GTEW provides cellular service to customers from various geographically diverse markets within the United States of America. GTEW currently has about 5 million customers in about 100 markets and is growing annually. As in all businesses, customers will sometimes terminate their service or switch providers for a variety of reasons. In the telecommunications industry this is referred to as churn. Although industry wide churn rates are only about 2% to 3% per month, this results in a considerable number of subscribers discontinuing service. Currently GTEW accumulates information for its cellular customers in a relational data warehouse that collects data from many regional database sources. The warehouse contains over 200 fields consisting of billing and service data for each customer on a monthly basis and stores historical data going back for two years. Each month CHAMP analyzes this information to predict the possibility that any particular customer will churn based on historical data. Knowledge discovered by analyzing the characteristics of churners is used to guide marketing retention campaigns.
3. CHAMP: A Brief Overview
Members of the Knowledge Discovery in Databases project at GTE Laboratories developed CHAMP to help GTEW reduce customer churn. Since GTEW’s data warehouse is updated monthly and since they have a diverse set of markets, we decided to build models monthly for each of GTEW’s top 20 markets. This decision constrains CHAMP to be fully automated. Another goal was to ensure that CHAMP is scalable regarding customer records and fields. The last goal was to allow the models to be valid for a period of 60 days to allow the marketing department time to develop any desired campaigns. We developed CHAMP’s overall design with these and other goals at the forefront. Readers interested in details should refer to Datta et. al. (1999)[1]. There are essentially two phases for applying data mining methods to identify churners: building models and applying models. For model building, initially the date and the market are provided to the prototype which retrieves relevant historical data from the remote data warehouse to create a local extract. The input to the model building component uses customer billing and usage information from three months previous and the dependent binary variable denoting whether the customer has churned in the previous 2 months. CHAMP’s modeling method employs a hybrid of machine learning techniques. Initially we use a decision tree method (Quinlan, 1993 [2]) to rank fields according to their prediction capability and then use a cascade neural network (Fahlmann & Lebiere, 1988 [3]; Puskorius et al. 1991 [4]; Rumelhart, Hinton, & Williams, 1986 [5]) with the 30 highest ranked fields. The neural network uses a genetic algorithm (Koza, 1993 [6]) to find transformations and groupings of fields for increased model accuracy. Once the model is built, the model is applied to current data. This data only contains customer billing and usage data and does not have customer churn information since we do not know if a customer will churn until the end of the month. The churn score generator uses the learned model and current data to produce a churn score for each customer, predicting if the customer will churn in the next 60 days.
The churn score ranging from 0 to 100 describes an individual customer’s propensity to churn. Customers with a higher churn score have a higher propensity to churn.
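The two-stage idea (rank the fields with a tree learner, then train a model on the highest-ranked fields and rescale its output to a 0-100 churn score) can be sketched as follows. This is not GTE's implementation: it substitutes standard scikit-learn components for the C5-style ranking, the cascade neural network and the genetic transformations described above, and all parameter values are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    def train_churn_model(X, y, n_fields=30):
        # X: NumPy array of customer billing/usage fields, y: 1 if churned, else 0
        tree = DecisionTreeClassifier(max_depth=8).fit(X, y)
        top = np.argsort(tree.feature_importances_)[::-1][:n_fields]  # rank the fields
        net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500).fit(X[:, top], y)
        return top, net

    def churn_scores(net, top, X_current):
        # map the predicted churn probability (class 1) to a 0-100 churn score
        return np.round(100 * net.predict_proba(X_current[:, top])[:, 1]).astype(int)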
4. Empirical Evaluations

In this section we describe the empirical evaluations we applied to better understand the characteristics of CHAMP on several differing markets. Some of these methods are applied traditionally, such as computing the lift and payoff of the learned models. Marketing professionals at GTEW suggested business-oriented experiments aimed at taking some of the marketing processes currently in place and seeing how CHAMP will operate with regard to these constraints. Generally, the data is prepared by randomly separating the entire dataset into two distinct sets: training and testing. The testing dataset is roughly 50% of the entire dataset. All experimental results are shown on the held-aside testing set. We use five markets which vary considerably in size and geographic location, showing the generality of our results. We use these markets to demonstrate the performance of CHAMP across six different types of evaluation methods.
4.1 Traditional Evaluation Methods
We have validated models using both the lift¹ and payoff metrics (Datta et al., 1999 [1]; Masand et al., 1999 [7]; Masand & Piatetsky-Shapiro, 1996 [8]). An example of the lifts for different percentages of the sorted list is shown in Figure 1 for Markets 1, 2, and 3. The largest gain in lift for all three markets occurs for the first 5% to 10%, as shown by the slope of the curve at these points. The first (top) decile is the first 10% of the sorted list. A lift of 1 means that the model predicts churn equal to chance, and the lift eventually becomes 1 as the entire sorted list is used. These results show that CHAMP can predict churn behavior more accurately than chance. Figure 2 shows the cumulative payoff² as incremental percentages of the sorted scores list are used for Markets 1, 2, and 3. The highest point in the curve, where payoff is maximized, varies dramatically for each market. If the customers falling to the left of the highest point are contacted, this results in the highest payoff for the market. The highest payoff has a large range from $40,000 to $85,000 per month depending on the market.
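A minimal sketch of the two metrics is given below; it is our own illustration, and the revenue figure value_per_month is an assumption we introduce for the example (the footnote only fixes the $7 contact cost, the 50% retention probability and the 6-month horizon).

    import numpy as np

    def lift_curve(scores, churned, percents=range(5, 100, 10)):
        # lift at each cut-off of the score-sorted list, relative to the base churn rate
        order = np.argsort(scores)[::-1]                   # highest churn score first
        churned = np.asarray(churned)[order]
        base_rate = churned.mean()
        return [(p, churned[: int(len(churned) * p / 100)].mean() / base_rate)
                for p in percents]

    def cumulative_payoff(scores, churned, value_per_month=30.0,
                          months=6, contact_cost=7.0, p_retain=0.5):
        # each contacted true churner is retained with probability p_retain and then
        # contributes `months` months of revenue; every contact costs contact_cost
        order = np.argsort(scores)[::-1]
        churned = np.asarray(churned)[order]
        gain = churned * p_retain * months * value_per_month - contact_cost
        return np.cumsum(gain)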
4.2 Business Oriented Evaluation Experiments
In this section, we describe evaluations of CHAMP behavior of interest to marketing professionals. We typically run experiments on the first decile, i.e. the top 10% of the sorted churn scores. As shown in Figure 1, this is where CHAMP has the largest lift.
¹ The prediction module produces a score for each customer and sorts customers according to score. The lift metric computes the gain in predictiveness for subsets of the sorted list over the base churn rate (i.e. churn as it is currently occurring in the market).
² We used a probability of 50% that a customer will continue service after being contacted, that contacting the customer costs $7, and that the customer will remain for 6 months. These numbers used to calculate the payoff are for illustrative purposes only and do not necessarily reflect actual numbers used in the business process.
Fig. 1. Lift for Markets 1, 2, and 3. The highest gain in lift is between the top 1-10% of the sorted scores list.
Fig. 2. Simulated cumulative payoff for Markets 1, 2, and 3. For these markets the highest payoff occurs at less than 50% of the sorted scores list.
Percentage of Churners Identified

From a marketing point of view it is important to know the percentage of the actual churners that are being captured in the highest decile, decile 1. In addition, it is important to know how much lead time the marketers will have before a customer with a higher propensity to churn actually churns. We went back to our historical data and chose to look at the churners at a point in time, namely February 1998. We calculated the number of churners that appeared in decile 1 during the previous month, January 1998, and also looked at the number of customers that appeared at least once in decile 1 during the three months previous to February. Table 1 shows the results for 3 markets. CHAMP identified a fairly large percentage (28% to 36%) of churners (i.e. the customer appeared in the top decile at least once in the previous three months). These results also indicate that CHAMP can pick up clear signs of churning soon before customers actually churn. A smaller percentage of decile 1 customers churned in the following month (first column), although it is a larger percentage than uniform for markets 2 and 3.
Table 1. Percent of churners identified by CHAMP scores for January 1998.

  Market Name   Percent of churners in decile 1   Percent of churners in decile 1 at
                in previous month                 least once in previous three months
  Market 1               17%                                    28%
  Market 2               19.5%                                  31%
  Market 3               28%                                    36%
Estimated Overlap among High Propensity Churn Customers

Another aspect marketing professionals were interested in was the number of contacts they would have to make if they used CHAMP scores. There is some indication that some percentage of those that appeared in decile 1 in one month would also appear in the next month. GTEW has policies restricting the number of times a customer can be contacted in a specified period of time. We conducted the following experiment. We identified the unique customers from decile 1 for two consecutive months; that is, if a customer appeared in decile 1 for both months, the customer was only counted once. We also conducted the same experiment for three consecutive months. Table 2 shows the results. The percentages were computed by dividing the number of unique customers for the period by the number of total customers in decile 1 for the period. For example, if we identify 50 unique customers for two months and decile 1 has a size of 30 customers a month, then we would divide 50 by (30*2) and get 83%. These results show that a sizeable number of unique customers appear in decile 1 for consecutive months, showing the stability of the model. Depending on the marketing department's policy for contacting customers, they have some idea of the number of contacts they will need to make monthly.

Table 2. Percent of unique customers in decile 1 for consecutive months.

  Market Name   Unique customers in 2 months   Unique customers in 3 months
  Market 1                 80%                            69%
  Market 2                 75.5%                          69%
  Market 3                 72%                            59%
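The percentage reported in Table 2 can be computed as in the sketch below (our own illustration, reproducing the 50-out-of-60 example from the text).

    def unique_customer_percentage(decile1_by_month):
        # decile1_by_month: list of sets of customer ids appearing in decile 1,
        # one set per consecutive month; result = unique customers divided by
        # the total number of decile-1 slots over the period
        unique = set().union(*decile1_by_month)
        total_slots = sum(len(month) for month in decile1_by_month)
        return len(unique) / total_slots

    # worked example: 50 unique customers over two months of 30 customers each
    print(unique_customer_percentage([set(range(30)), set(range(20, 50))]))  # ~0.83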
Aging Experiments

With dramatic changes in the cellular industry, an important issue related to modeling behavior is understanding the lifetime of the learned models and the decline in their predictive capability. In addition, it is also important to understand how long individual customer scores are valid. In this section we discuss two experiments focused on evaluating the lifetime of the models and scores.
We ran the first experiment, model aging, with models learned monthly, starting at March 1997. We created 4 different models, one for each successive month. We evaluated the models on customer data dated in June of 1997. Figure 3 shows the lift of the first decile for three markets. As can be seen on the graph, although the lift decreases slightly, there is no statistical difference when the older models are run on more recent customer data. Market 1 did have a significant drop in lift for the 3 and 4 month models. One possibility is that the data representing that time of the year did not include any major seasonal changes or new competitive offers for Markets 4 and 5, but some external factors such as new competition could have made the older models less accurate in Market 1. An experiment looking at longer delay periods between when the model is built and applied may better reflect seasonal trends.
Fig. 3. Lift slowly decreases for Markets 1, 4, and 5 as the model ages.

In the second experiment we considered the lifetime of the generated scores, that is, when customers in decile 1 with a high propensity to churn actually churn. To conduct this experiment, we followed a group of customers scored by CHAMP from June 1997 until January 1998 to see whether they churned during the period and, if so, during which month. The results for Market 1 are illustrated in Figure 4. Decile 1 has a larger number of customers churning over the time period compared to the other deciles. In addition, in decile 1 the customers that churn tend to do so within the first few months. The lift for decile 1 in June 1997 is 4.07, which means that churners concentrated in decile 1 are about four times as likely to churn when compared to the background churn rate. The lifts for July and August are 2.35 and 1.93 respectively. Decile 10 should contain the customers least likely to churn. This is confirmed by the lifts for decile 10, which are 0.22, 0.52, and 0.61 for June, July and August respectively. Although only about 40% of the customers in decile 1 have churned in a 6 month period, this prediction accuracy is still much higher than the percentage of background churn over the same period, about 16%-24% (assuming the industry average of about 2%-3% per month). The remaining markets have similar lift characteristics but are not shown for space considerations.
5. Summary and Discussion

The traditional lift experiments we conducted on CHAMP indicated that the learned models could predict churners in the upcoming months more effectively than current methods used by GTEW. We conducted additional experiments described in section 4.2 that focus on these
questions. These experiments illustrated benefits of using CHAMP that were not obvious to us initially. For example, Figure 4 shows that the effectiveness of CHAMP customer scores extends beyond the time period that we initially built the models for, 60 days, and Figure 3 shows that models decline only slowly in predictive capability over several months. These experiments not only helped explain CHAMP characteristics to users, but also helped CHAMP developers and researchers. We expect end users of any data mining prototype or system to have a wide variety of questions regarding performance and applicability. This paper takes a first step in describing some of the questions not addressed by simple accuracy measurements.
Fig. 4. Score aging results for Market 1. Those in decile 1 tend to churn at a higher rate not only for the next 2 months, but for the next 6 months. Note that the tops of the decile bars have been cut off for space considerations. The bars reach 100%.
References
1. Datta, P., Masand, B., Mani, D. R. & Li, B.: Automated Cellular Modeling and Prediction on a Large Scale. Artificial Intelligence Review: Special Issue on Data Mining Applications. Kluwer Academic Publishers. To appear Oct. 1999.
2. Quinlan, J. R.: C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann (1993).
3. Fahlmann, S. E., & Lebiere, C.: The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann (1988).
4. Puskorius, G., Feldkamp, L.: Decoupled Extended Kalman Filter Training of Feedforward Layered Networks. Proceedings of the International Joint Conference on Neural Networks, IEEE (1996).
5. Rumelhart, D. E., Hinton, G. E. & Williams, R. J.: Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume I: Foundations. Cambridge, MA: MIT Press/Bradford Books (1986) pp. 318-362.
6. Koza, J.: Genetic Programming. MIT Press (1993).
7. Masand, B., Datta, P., Mani, D. R. & Li, B.: CHAMP: A Prototype for Automated Cellular Churn Prediction. Data Mining and Knowledge Discovery. Kluwer Academic Publishers (1999).
8. Masand, B. & Piatetsky-Shapiro, G.: A comparison of approaches for maximizing business payoff of prediction models. Proceedings of the Second International Conference on Knowledge Discovery & Data Mining, Seattle, WA (1996) pp. 195-201.
Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions

Wolfgang Ertel and Manfred Schramm

Fachhochschule Ravensburg-Weingarten, Postfach 1261, 88241 Weingarten, Germany
<ertel|schramma>@fh-weingarten.de www.fh-weingarten.de/~ertel http://ti-voyager.fbe.fh-weingarten.de/schramma
Abstract. We present a project for probabilistic reasoning based on the concept of maximum entropy and the induction of probabilistic knowledge from data. The basic knowledge source is a database of 15000 patient records which we use to compute probabilistic rules. These rules are combined with explicit probabilistic rules from medical experts which cover cases not represented in the database. Based on this set of rules, the inference engine Pit (Probability Induction Tool), which uses the well-known principle of Maximum Entropy [5], provides a unique probability model while keeping the necessary additional assumptions as minimal and clear as possible. Pit is used in the medical diagnosis project Lexmed [4] for the identification of acute appendicitis. Based on the probability distribution computed by Pit, the expert system proposes treatments with minimal average cost. First clinical performance results are very encouraging.
1 Introduction

Probabilities deliver a well-researched method of reasoning with uncertain knowledge. They form a unified language to express knowledge inductively generated from data as well as expert knowledge. To build a system for reasoning with probabilities based on data and expert knowledge, we have to solve different problems:
– To infer a set of rules (probabilistic constraints) from data, where the number of rules has to be small enough to avoid over-fitting and large enough to avoid under-fitting. For this task we use an algorithm which generates a probabilistic network.
– To find probabilistic rules for groups of patients not present in our database. In cooperation with our medical experts, we collect rules describing patients with acute abdominal pain who were not taken to theatre under the diagnosis of acute appendicitis and are therefore not present in our database.
– To construct a unique probability model from all our constraints. If we express our knowledge in a set of rules, this set usually does not allow us to generate a unique probability model, which is necessary to get a definite answer for every probabilistic query (of our domain). We solve this task by the use of MaxEnt (see Sec. 4), which delivers a precise semantics for completing probability distributions.
– A typical problem for classification (e.g. in medicine) is that different classification errors cause different costs. We solve this task by asking the experts to define the costs of wrong decisions, where these costs are not meant to be local to the hospital, but global in the sense of including all consequences the patient has to suffer. The decisions of our system are found by minimizing these costs under a given probability distribution.
2 Automatic Generation of Rules from Data

From a database of 15000 patient records from all hospitals in Baden-Württemberg in 1995¹, the following procedure generates a set of probabilistic rules. To facilitate the description we simplify our application to the two-class problem of deciding between appendicitis (App = true) and not appendicitis (App = false), assuming the basic condition of acute abdominal pain to be true in all our rules. In order to be independent of the environment of the particular clinic, the rules are conditioned on the diagnosis variable App, i.e. rules will have the form P(A = a_i | App = true, B = b_j, ...) = x, where a_i, b_j are values of binary symptom variables A and B (binary for simplification) and x is a real number or a real interval². In order to abstract the data into probabilities, we use the concept of (conditional) independence³, as widely known and accepted. We therefore draw an independence map ([6]) of the variables, i.e. an undirected graph G where the nodes represent variables and the edges and paths represent dependencies between variables. A missing edge between two variables denotes a conditional independence of the two variables, given the union of all other variables (see e.g. Sec. 4.5 in [9]). For most real-world applications, however, the number of elementary events induced by the union of variables is larger than the available data (in our application, the variables span 10⁹ events where 'only' 15000 patient records are available). In order to avoid over-fitting, we have to use a 'local' approach for building an independence map, i.e. an approach which works on a small set of variables rather than the union. Our procedure works as follows:

¹ We are grateful to the ARGE - Qualitätssicherung der Landesärztekammer Baden-Württemberg for providing this database.
² In case of an interval, P(x) = [a, b] expresses the uncertainty that P(x) can be any value in [a, b] ⊆ [0, 1].
³ Variables A and B are conditionally independent given C iff (knowing the value of C) the knowledge of the value of B has no influence on deciding about the value of A. In technical terms: P(A = a_i | B = b_j, C = c_k) = P(A = a_i | C = c_k) for all i, j, k.
1. For variables A and B and a vector of variables S (with a vector of values s), with A, B ∉ S, let D_{A,S,B} denote the degree of dependence between the variables A and B given S, which we calculate by the 'distance'⁴ between X and Y, where Y_{i,j,s} := P(A = a_i | S = s) · P(B = b_j | S = s) and X_{i,j,s} := P(A = a_i, B = b_j | S = s). Let D_{A,∅,B} denote the degree of dependence between A and B, which we calculate by the distance between X and Y, where Y_{i,j} := P(A = a_i) · P(B = b_j) and X_{i,j} := P(A = a_i, B = b_j).
2. We build an undirected graph by the following rules: Draw a node for every variable A, including the special diagnosis variable App (with values 'true' and 'false'). Draw an edge (A, App) iff D_{A,∅,App} is above a heuristically determined threshold t (see below). For the pair of variables A and B with the largest value of D_{A,App,B} we add an edge (A, B) to the graph if, for the minimal separating set S for the nodes A and B⁵, the distance D_{A,S,B} is above t.
3. If the procedure is completed, a graph G has been generated. As already mentioned, medical knowledge is typically conditioned on illnesses, expressing the assumption that this type of rule is more context independent than others (see footnote 8 in Sec. 3). We therefore adapt the graph to this type of rule. For this goal we direct the edges (App, A) towards A and calculate rules of the form P(A = a_i | App = true) = x. Directions for the remaining edges are selected arbitrarily, with the result of defining a Bayesian network⁶ of rules like P(A = a_i | App = true, B = b_j, ...) = x, where App, B and possibly other variables are 'inputs' to the variable A.

Remember that the number of edges is limited by the size of the threshold t: if the number of variables in a rule is too large in relation to the available data, t has to be increased (to avoid over-fitting); if the density of edges is too small (if the inductive power of the probabilistic rules is too weak), t has to be decreased (to avoid under-fitting). This set of rules is incomplete (i.e. it does not specify a unique probability model) because we do not construct rules for the class (App = false) from our database (see Sec. 3). Additional rules are specified by our experts. But as the resulting set of rules is still incomplete (e.g. because of using intervals in our rules), we need the method of Maximum Entropy (see Sec. 4) to complete the probability model.

⁴ We use the cross-entropy function for this task, which is similar, but not equivalent, to the correlation coefficient; it is defined as CR(x, y) = Σ_i x_i log(x_i / y_i).
⁵ A separating set S for the nodes A and B disconnects A and B, i.e. there are no paths between A and B if the variables in S and their edges are removed from the graph. A minimal separating set is minimal in the number of variables it contains. If there is more than one minimal separating set S, we take the set S with the lowest distance D_{A,S,B}. Remark: By construction, the minimal separating set will always contain App.
⁶ The missing distribution of App is given by our experts.
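Footnote 4's cross-entropy 'distance' between the joint distribution and the product of the marginals can be sketched in a few lines of Python; the epsilon smoothing, the 2×2 layout for binary variables and the function names are our own choices, not the authors' implementation.

import numpy as np

def cross_entropy_distance(x, y, eps=1e-12):
    # CR(x, y) = sum_i x_i * log(x_i / y_i) between two discrete distributions.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * np.log((x + eps) / (y + eps))))

def dependence_degree(joint):
    # Degree of dependence D_{A,∅,B} between two binary variables.
    # `joint` is a 2x2 array of empirical probabilities P(A=a_i, B=b_j);
    # X is the joint distribution, Y the product of its marginals.
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1)           # marginal P(A)
    p_b = joint.sum(axis=0)           # marginal P(B)
    product = np.outer(p_a, p_b)      # Y_{i,j} = P(A=a_i) * P(B=b_j)
    return cross_entropy_distance(joint.ravel(), product.ravel())

# Example: a strongly dependent pair versus an independent pair.
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])
print(dependence_degree(dependent))    # clearly above a small threshold t
print(dependence_degree(independent))  # approximately 0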
3 Expert-Rules

All patients in our database have been operated on under the diagnosis 'acute appendicitis', suffering from 'acute abdominal pain'. Thus our database cannot provide a model of patients who have been sent home (with the diagnosis 'non-specific abdominal pain') or who have been forwarded to other departments (assuming other causes for their pain). In order to get a model of these classes of patients, we use the explicit knowledge of our medical experts⁷ and the literature (see e.g. [1]) to obtain rules like:⁸

P(A = a_i | App = false, ...) = [x, y].
4 Generating a Unique Probability Distribution from Rules by the Method of Maximum Entropy

In order to support interesting decisions in cases of incomplete knowledge, we have to add more constraints. In order to add no (false) ad hoc knowledge, the constraints have to be selected such that they maximize the ability to decide and minimize the probability of an error. The method of Maximum Entropy, which chooses the probability model with maximal entropy H, H(v) := −Σ_i v_i log(v_i), is known to solve these problems:
– it maximizes the ability to decide, because it is known to choose a single (unique) probability model in the case of linear constraints.
– it minimizes the probability of an error, because the distribution of models is known to be concentrated around the MaxEnt model ([3]).

Computing the MaxEnt model is not a new idea, but it is very expensive in the worst case. The main problem is that the number of interpretations (elementary events) grows exponentially with the number of variables. To avoid this effect in the average case, the principles of independence and indifference are used to reduce the complexity of the computations. These two principles are both used in our system Pit (Probability Induction Tool) for a more efficient calculation of the MaxEnt model ([8]).
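Pit computes the MaxEnt model with specialised techniques based on independence and indifference; purely to illustrate the underlying optimisation problem (not the authors' algorithm), it can be posed as a generic constrained maximisation, here with an assumed event ordering and SciPy's SLSQP solver:

import numpy as np
from scipy.optimize import minimize

def maxent_model(n_events, constraints):
    # Distribution over n_events elementary events with maximal entropy
    # H(v) = -sum_i v_i log v_i, subject to linear equality constraints
    # given as (coefficient_vector, target) pairs, i.e. a @ v == b.
    def neg_entropy(v):
        v = np.clip(v, 1e-12, 1.0)
        return float(np.sum(v * np.log(v)))

    cons = [{"type": "eq", "fun": lambda v: np.sum(v) - 1.0}]   # normalisation
    for a, b in constraints:
        cons.append({"type": "eq", "fun": lambda v, a=a, b=b: a @ v - b})

    v0 = np.full(n_events, 1.0 / n_events)                      # uniform start
    res = minimize(neg_entropy, v0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_events, constraints=cons)
    return res.x

# Toy example with two binary variables App and A; the 4 elementary events are
# ordered (App=t,A=t), (App=t,A=f), (App=f,A=t), (App=f,A=f).
# Constraints: P(App=true) = 0.3 and P(A=true | App=true) = 0.8,
# encoded as P(A=t,App=t) - 0.8 * P(App=t) = 0.
p_app = (np.array([1.0, 1.0, 0.0, 0.0]), 0.3)
rule  = (np.array([0.2, -0.8, 0.0, 0.0]), 0.0)
print(maxent_model(4, [p_app, rule]).round(3))   # ~ [0.24, 0.06, 0.35, 0.35]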
5 Generating Decisions from Probabilities

Once the rule base is constructed, a run of Pit computes the MaxEnt model, and any query can be answered by standard probabilistic computations. However, the expected result of reasoning in Lexmed is not a probability but a decision (diagnosis). How are probabilities related to decisions?

⁷ We are grateful to our medical experts Dr. W. Rampf and Dr. B. Hontschik for their support in the knowledge acquisition and their patience in answering our questions.
⁸ This type of knowledge surely depends on the particular application scenario. For example, in a pediatric clinic in Germany there are other typical causes for abdominal pain than in a hospital in a tropical country or in a military hospital.
In our application (as in many others), misclassifications have very different consequences. The diagnosis 'perforated appendicitis', where 'no appendicitis' would be correct, is very different from the diagnosis 'no appendicitis' where 'perforated appendicitis' would be correct. The latter case is of course a much bigger mistake, or in other words, much more expensive. Therefore we are interested in a diagnosis which causes minimum overall cost. Including such a cost calculation in the diagnosis process is very simple (cf. Figure 1). Let C_ij be the additional cost if the real diagnosis is class i but the physician decides for class j. Given a matrix C_ij of such misclassification costs and the probability p_i for each real diagnosis i, the query evaluation of Lexmed computes the average misclassification cost C_j as

C_j = Σ_{i=1}^{n} C_ij p_i

and then selects the class j = argmin { C_j | j = 1, ..., n } with minimum average cost.
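A minimal sketch of this decision step; the probabilities and cost values in the toy example are invented for illustration and are not Lexmed's.

import numpy as np

def min_cost_decision(probs, costs):
    # probs[i]    : probability that the real diagnosis is class i
    # costs[i][j] : cost of deciding class j when the real diagnosis is class i
    # Returns (index of the minimum-cost class, expected cost per class).
    probs = np.asarray(probs, dtype=float)
    costs = np.asarray(costs, dtype=float)
    expected = probs @ costs           # C_j = sum_i C_ij * p_i
    return int(np.argmin(expected)), expected

# Toy example, classes: 0 = no appendicitis, 1 = appendicitis.
probs = [0.7, 0.3]
costs = [[0, 500],      # real: no appendicitis -> unnecessary operation
         [4000, 0]]     # real: appendicitis    -> missed diagnosis (far worse)
best, expected = min_cost_decision(probs, costs)
print(best, expected)   # chooses class 1: expected costs are [1200, 350]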
6 Diagnosis of Acute Appendicitis

During the last twenty years the diagnosis of acute appendicitis has been improved with respect to the misclassification rate [2,1]. However, depending on the particular way of sampling and on the hospital, the rate of misclassification among surgeons still ranges between 15 and 30%, which is not satisfactory [2]. A number of expert systems for this task have been realized, some with high accuracy [1], but still there is no breakthrough in clinical applications of such systems.

Lexmed is a learning expert system for medical diagnosis based on the MaxEnt method. Viewed as a black box, Lexmed maps a vector of clinical symptoms (discrete variable-values) to the probability for different diagnoses. The central component inside Lexmed is the rule base containing a set of probabilistic rules, as shown in Figure 1.

[Figure 1 — Overview of the Lexmed architecture: the user interface sends a query together with misclassification costs; the inference engine PIT answers the query from the rule base, which is filled by rule induction from the database and by knowledge modelling with the experts and the literature.]

The acquisition of rules is performed by the inductive part (see Sec. 2) and by the acquisition of explicit knowledge (see Sec. 3). The integration of knowledge from two different sources in one rule base may cause severe problems, at least if the formal knowledge representation of the two sources is different, for example if the inductive component is a neural net and the explicit knowledge is represented in first-order logic. In our system, however, the language of probabilities provides
a uniform and powerful knowledge representation mechanism, and MaxEnt is an inference engine which does not require a complete⁹ set of rules.
7 Results

Running times of the system for query evaluation on the appendicitis application with about 400 probabilistic rules are about 1-2 seconds. The average cost of the decisions was measured for Lexmed without expert rules (as described in Section 2) and for the decision tree induction system C5.0 [7] with 10-fold cross-validation on the database; the results are shown in Table 1. For completeness we also performed runs of both systems without cost information and computed the classification error.

                        Lexmed     C5.0
Average cost            1196 DM    1292 DM
Classification error    22.6 %     23.5 %

Table 1. Average cost and error of C5.0 and Lexmed without expert rules.
The figures show that on the database the purely inductive part of Lexmed and C5.0 have similar performance. However, in a real test in the hospital the expert rules, in addition to the inductively generated rules, will be very important for good performance, because the database is not representative of the patients in the hospital, as mentioned in Section 3. Thus, in the real application we expect Lexmed to perform much better than a purely inductive learner like C5.0. Apart from the numeric performance results, the first presentation of the system in the municipal hospital of Weingarten was very encouraging. Since June 1999 the doctors in the hospital have used Lexmed via the internet [4].
References
1. De Dombal. Diagnosis of Acute Abdominal Pain. Churchill Livingstone, 1991.
2. B. Hontschik. Theorie und Praxis der Appendektomie. Mabuse Verlag, 1994.
3. E.T. Jaynes. Concentration of distributions at entropy maxima. In Rosenkrantz, editor, Papers on Probability, Statistics and Statistical Physics. D. Reidel Publishing Company, 1982.
4. Homepage of Lexmed. http://lexmed.fh-weingarten.de, 1999.
5. J.B. Paris and A. Vencovska. A Note on the Inevitability of Maximum Entropy. International Journal of Approximate Reasoning, 3:183-223, 1990.
6. J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
7. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. C5.0 available online at www.rulequest.com.
8. M. Schramm and W. Ertel. Reasoning with Probabilities and Maximum Entropy: The System PIT and its Application in LEXMED. In: Symposium on Operations Research 1999 (accepted), 1999.
9. J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley, 1990.
⁹ 'complete' means that the rules are sufficient to induce a unique probability distribution.
Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation?

Ad Feelders

Tilburg University, CentER for Economic Research, PO Box 90153, 5000 LE Tilburg, The Netherlands
e-mail: [email protected]
Abstract. In many applications of data mining a - sometimes considerable - part of the data values is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, which is based on filling-in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, allowing for different data mining algorithms to be applied to the completed data sets. We compare the use of imputation to surrogate splits, such as used in CART, to handle missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results as it profits from variance reduction in much the same way as procedures such as bagging.
1 Introduction
The quality of knowledge extracted with data mining algorithms is evidently largely determined by the quality of the underlying data. One important aspect of data quality is the proportion of missing data values. In many applications of data mining a - sometimes considerable - part of the data values is missing. This may occur because they were simply never entered into the operational systems, or because, for example, simple domain checks indicate that entered values are incorrect. Another common cause of missing data is the joining of not entirely matching data sets, which tends to give rise to monotone missing data patterns. Despite the frequent occurrence, many data mining algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem.

In this paper we focus on the well-known tree-based algorithm CART [3], which handles missing data by so-called surrogate splits¹. As an alternative we investigate more principled simulation-based approaches to handle missing data, based on filling in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, allowing different data mining algorithms to be applied to the completed data sets.

¹ In fact we used the S program RPART, which reimplements many of the ideas of CART, in particular the way it handles missing data.
2 Multiple Imputation
Multiple imputation [5,4] is a simulation-based approach where a number of complete data sets are created by filling in alternative values for the missing data. The completed data sets may subsequently be analyzed using standard complete-data methods, after which the results of the individual analyses are combined in the appropriate way. The advantage, compared to using missing-data procedures tailored to a particular algorithm, is that one set of imputations can be used for many different analyses. The hard part of this exercise is to generate the imputations, which may require computationally intensive algorithms such as data augmentation and Gibbs sampling [5,7]. In our experiments we used software for data augmentation written in S-plus by J.L. Schafer² to generate the imputations. Since the examples we consider in this section contain both categorical and continuous variables, imputations are based on the general location model (see [5], chapter 9). The Bayesian nature of multiple imputation requires the specification of a prior distribution for the parameters of the imputation model. We used a non-informative prior, i.e. a prior corresponding to a state of prior ignorance about the model parameters.

One of the critical parts of using multiple imputation is to assess the convergence of data augmentation. In our experiments we used a rule of thumb suggested by Schafer [6]. Experience shows that data augmentation nearly always converges in fewer iterations than EM. Therefore we first computed the EM-estimates of the parameters and recorded the number of iterations, say k, required. Then we performed a single run of the data augmentation algorithm of length 2Mk, using the EM-estimates as starting values, where M is the number of imputations required. Just to be on the "safe side", we used the completed data sets from iterations 2k, 4k, . . . , 2Mk.

² This software is available at http://stat.psu.edu/~jls/misoftwa.html
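The authors generated their imputations with Schafer's S-plus data augmentation software; as a rough modern stand-in (not the original setup), the overall workflow of drawing M imputations and averaging the resulting trees can be sketched with scikit-learn, using IterativeImputer with sample_posterior=True as the imputation model:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

def multiple_imputation_trees(X_train, y_train, X_test, m=5, seed=0):
    # Impute the training data m times, fit one tree per completed data set,
    # and average the predicted class probabilities over the m trees.
    votes = []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed + i)
        X_completed = imputer.fit_transform(X_train)
        tree = DecisionTreeClassifier(random_state=seed + i)
        tree.fit(X_completed, y_train)
        votes.append(tree.predict_proba(imputer.transform(X_test)))
    return np.mean(votes, axis=0).argmax(axis=1)   # averaged prediction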
3 Waveform Recognition Data
To compare the performance of imputation with surrogate splits, we first consider the waveform recognition data used extensively in [3]. The only categorical variable is the class label (with 3 possible values), and all 21 covariates are continuous, so imputation is based on the well-known linear discriminant model. Note that the assumptions of the linear discriminant model are not correct here, because the distribution of the covariates within each class is not multivariate normal and furthermore the covariance structure differs between the classes. Still, the model may be "good enough" to generate the imputations.

In the experiments, we generated 300 observations (100 from each class) to be used as a training set, with different percentages of missing data in the covariates. Then we built trees as follows:
1. On the incomplete training set, using surrogate splits.
2. On one or more completed data sets using (multiple) imputation.
In both cases the trees were built using 10-fold cross-validation to determine the optimal value for the complexity parameter (the amount of pruning), using the program RPART³. The error rate of the trees was estimated on an independent test set containing 3000 complete observations (1000 from each class). To estimate the error rate at each percentage of missing data, the above procedure was repeated 10 times and the error rates were averaged over these 10 trials.

In a first experiment, each individual data item had a fixed probability of being missing. Table 1 summarizes the comparison of surrogate splits and single imputation at different fractions of missing data. Single imputations are drawn from the predictive distribution of the missing data given the observed data and the EM-estimates for the model parameters. Looking at the difference between the error rates one can see that imputation gains an advantage when the level of missing data becomes higher. However, at a moderate level of missing data (say 10% or less) it doesn't seem worth the extra effort of generating imputations. This same trend is also clear from rows four (p+_imp) and five (p-_imp) of the table. p+_imp (p-_imp) indicates the number of times out of the ten trials that the error rate of imputation was higher (lower) and the difference was significant at the 5% level. So, for example, at 30% missing data the difference was significant at the 5% level four out of ten times, and in all four cases the error rate of imputation was lower.

% Missing         10       20       30       40       45
e_sur             29.8%    30.9%    32.2%    32.4%    34.3%
e_imp             29.8%    29.2%    30.6%    30.0%    30.4%
e_sur - e_imp     0%       1.7%     1.6%     2.4%     3.9%
p+_imp            1        0        0        0        0
p-_imp            1        4        4        6        7

Table 1. Estimated error rate of surrogate splits and single imputation at different fractions of missing data (estimates are averages of 10 trials)
In a second experiment we used multiple imputation with M = 5, and averaged the predictions of the 5 resulting trees. The results are given in Table 2. The performance of multiple imputation is clearly better than both single imputation and surrogate splits. Presumably, this gain comes from the variance reduction resulting from averaging a number of trees, as is done in bagging [2].

³ RPART is written by T. Therneau and E. Atkinson in the S language. The S-plus version for Windows is available from http://www.stats.ox.ac.uk/pub/Swin.

% Missing         10       20       30       40       45
e_sur             28.9%    30.1%    30.0%    33.3%    35.6%
e_imp             26.0%    26.1%    25.5%    25.7%*   26.0%*
e_sur - e_imp     2.9%     4.0%     4.5%     7.6%     9.6%
p+_imp            0        0        0        0        0
p-_imp            9        8        9        10       10

Table 2. Estimated error rate of surrogate splits and multiple imputation at different fractions of missing data. *: here we ran into problems with data augmentation and used EM-estimates only to generate the imputations
4 Pima Indians Database
In this section we perform a comparison of surrogate splits and imputation on a real-life data set that has been used quite extensively in the machine learning literature. It is known as the Pima Indians Diabetes Database, and is available at the UCI machine learning repository [1]. The class label indicates whether the patient shows signs of diabetes according to WHO criteria. Although the description of the dataset says there are no missing values, there are quite a number of observations with "zero" values that most likely indicate a missing value. In Table 3 we summarize the content of the dataset, where we have replaced zeroes by missing values for x3, . . . , x7. The dataset contains a total of 768 observations, of which 500 are of class 0 and 268 of class 1.

Variable   Description                     Missing values
y          Class label (0 or 1)            0
x1         Number of times pregnant        0
x2         Age (in years)                  0
x3         Plasma glucose concentration    5
x4         Diastolic blood pressure        35
x5         Triceps skin fold thickness     227
x6         2-hour serum insulin            374
x7         Body mass index                 11
x8         Diabetes pedigree function      0

Table 3. Overview of missing values in the Pima Indians database
In our experiment the test set consists of the 392 complete observations, and the training set consists of the remaining 376 observations with one or more values missing. Of these 376 records, 374 have a missing value for x6 (serum insulin), so we removed this variable. Furthermore, we changed x1 (number of times pregnant) into a binary variable indicating whether or not the person had ever been pregnant (the entire dataset consists of females at least 21 years old, so this variable is always applicable). This leaves us with a dataset containing two binary variables (y and x1) and six numeric variables (x2, . . . , x5, x7 and x8), with 278/2632 ≈ 10% missing values in the covariates. Although x2 and x8 are clearly skewed to the right, we did not transform them to make them appear more normal, in order to get an impression of the robustness of imputation under the general location model.

The first experiment compares the use of surrogate splits to imputation of a single value based on the EM-estimates. Of course the tree obtained after single imputation depends on the values imputed. Therefore we performed ten independent draws, to get an estimate of the average performance of single imputation. The results are summarized in Table 4.

Draw      1      2      3      4      5      6      7      8      9      10
e_imp     22.7%  30.6%  25.3%  26.0%  30.0%  24.5%  26.8%  24.7%  27.8%  29.3%
p-value   .0002  1      .0075  .0114  .7493  .0097  .0237  .0038  .2074  .6908

Table 4. Estimated error rates of ten single imputation-trees and the corresponding p-values of H0: e_imp = e_sur, with e_sur = 30.6%
For each single imputation-tree, we compared the performance on the test set with that of the tree built using surrogate splits, which had an error rate of 120/392 ≈ 30.6%. Tests of H0 : esur = eimp against a two-sided alternative, using an exact binomial test, yield the p-values listed in the second row of table 4. On average the single imputation-tree has an error rate of 26.8% which compares favourably to the error rate of 30.6% of the tree based on the use of surrogate splits. In a second experiment we used multiple imputation (M = 5) and averaged the predictions of the 5 trees so obtained. Table 5 summarizes the results of 10 independent trials. The average error rate of the multiple imputation-trees over these 10 trials is approximately 25.2%. This compares favourably to both the single tree based on surrogate splits, and the tree based on single imputation.
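The paper does not spell out the exact construction of the binomial test; one standard choice for comparing two classifiers on the same test set is an exact McNemar-style binomial test on the examples where the two trees disagree, sketched here with scipy.stats.binomtest (SciPy >= 1.7); the function and variable names are ours.

from scipy.stats import binomtest

def compare_trees(y_true, pred_a, pred_b):
    # Exact binomial test on the discordant test examples: under H0 (equal
    # error rates), an example misclassified by exactly one of the two trees
    # is equally likely to be misclassified by either.
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    n = a_only + b_only
    if n == 0:
        return 1.0
    return binomtest(a_only, n, p=0.5, alternative="two-sided").pvalue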
Trial     1      2      3      4      5      6      7      8      9      10
e_imp     27.3%  24.5%  25.8%  26.8%  23.7%  24.2%  24.0%  25.5%  24.7%  25.5%
p-value   .1048  .0015  .0295  .0357  .0003  .0026  .0022  .0105  .0027  .0119

Table 5. Estimated error rates of 10 multiple imputation-trees (M = 5), and the corresponding p-values of H0: e_imp = e_sur, with e_sur = 30.6%

5 Discussion and Conclusions

The use of statistical imputation to handle missing data in data mining has a number of attractive properties. First of all, the imputation phase and analysis phase are separated. Once the imputations have been generated, the completed data sets may be analysed with any appropriate data mining algorithm. The imputation model does not have to be the "true" model (otherwise why not stick to that model for the complete analysis?) but should merely be good enough for generating the imputations. We have not performed systematic robustness studies, but in both data sets analysed the assumptions of the general location model were violated to some extent. Nevertheless, the results obtained with imputation were nearly always better than those with surrogate splits. Despite these theoretical advantages, one should still consider whether they outweigh the additional effort of specifying an appropriate imputation model and generating the imputations.

From the experiments we performed, some tentative conclusions may be drawn. For the waveform data, single imputation tends to outperform surrogate splits as the amount of missing data increases. At moderate amounts of missing data (say 10% or less) one can avoid generating imputations and just use surrogate splits. For the Pima Indians data, with about 10% missing data in the training set, single imputation already shows a somewhat better predictive performance. Multiple imputation shows a consistently superior performance, as it profits from the variance reduction achieved by averaging the resulting trees. For high-variance models such as trees and neural networks, multiple imputation may therefore yield a substantial performance improvement.
References
1. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1999. http://www.ics.uci.edu/~mlearn/MLRepository.html.
2. L. Breiman. Bagging predictors. Machine Learning, 26(2):123-140, 1996.
3. L. Breiman, J.H. Friedman, R.A. Olshen, and C.T. Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.
4. D.B. Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91:473-489, 1996.
5. J.L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall, London, 1997.
6. J.L. Schafer and M.K. Olsen. Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33(4):545-571, 1998.
7. M.A. Tanner. Tools for Statistical Inference (third edition). Springer, New York, 1996.
Rough Dependencies as a Particular Case of Correlation: Application to the Calculation of Approximative Reducts*

María C. Fernandez-Baizán¹, Ernestina Menasalvas Ruiz¹, José M. Peña Sánchez¹, Socorro Millán², Eloina Mesa²
¹ Departamento de Lenguajes y Sistemas Informáticos e Ingeniería del Software, Facultad de Informática, U.P.M., Campus de Montegancedo, Madrid
² Universidad del Valle, Cali, Colombia
{cfbaizan, emenasalvas}@fi.upm.es, [email protected], [email protected], [email protected]
Abstract. Rough Sets Theory provides a sound basis for the extraction of qualitative knowledge (dependencies) from very large relational databases. Dependencies may be expressed by means of formulas (implications) in the following way: {x1, . . . , xn} ⇒_ρ {y}, where {x1, . . . , xn} are attributes that induce partitions into equivalence classes on the underlying population. The coefficient ρ is the dependency degree; it establishes the percentage of objects that can be correctly assigned to classes of y, taking into account the classification induced by {x1, . . . , xn}. Dealing with decision tables, it is important to determine ρ and to eliminate redundant attributes from {x1, . . . , xn}, to obtain minimal reducts having the same classification power as the original set. The problem of reduct extraction is NP-hard; thus, approximative reducts are often determined. Reducts have the same classification power as the original set of attributes but quite often contain redundant attributes. The main idea developed in this paper is that attributes considered as random variables related by means of a dependency are also correlated (the opposite, in general, is not true). From this fact we try to find, making use of well-established and widely used statistical methods, only the most significant variables, that is to say, the variables that contribute the most (in a quantitative sense) to determining y. The set of attributes (in general a subset of {x1, x2, . . . , xn}) obtained by means of well-founded, sound statistical methods can be considered a good approximation of a reduct.

Keywords: Rough Sets, Rough Dependencies, Multivariate Analysis, Multiple Regression.
* This work is supported by the Spanish Ministry of Education under project PB950301.
1 Rough Dependencies Reducts
Let U = {1, 2, . . . , n} be a non-empty set of objects that will be called the universe. Objects of the universe are described by means of a set of attributes: T = {x1, x2, . . . , xk}. If we assume all these attributes to be mono-valued functions of the elements of U, then they can be seen as equivalence relations on U, the corresponding quotient sets being:

U/x_j = { [i]_{x_j} : i ∈ U }    (1)

where [i]_{x_j} stands for the equivalence class (with respect to x_j) including the element i. Let P ⊆ T be a subset of T. The indiscernability relation with respect to P, IND(P), is defined as follows:

U/IND(P) = ∩_{x_j ∈ P} [i]_{x_j}    (2)

The indiscernability relation is an equivalence relation. Let us now consider the following sets: P ⊆ T and Q ⊆ T. We say that Q depends on P, P ⇒ Q, if and only if IND(P) ⊆ IND(Q) (every class of IND(P) is included in a class of IND(Q)). In general, due both to the random nature of data and to the inherent imprecision of the measures, from a table of observations we cannot infer exact dependencies. All that can be obtained are expressions of the form P ⇒_ρ Q, ρ being the dependency degree, 0 ≤ ρ ≤ 1, where 1 corresponds to total dependency and 0 to total independency of Q with respect to P.

POS_P(Q) = ∪_{X ∈ U/Q} IND(P)X    (3)

IND(P)X = ∪ { Y ∈ U/IND(P) : Y ⊆ X }    (4)

We can now define ρ as (card POS_P(Q) / card U) × 100. The meaning of the dependency P ⇒_ρ Q is that ρ% of the elements of U can be correctly assigned to classes of Q, given the classification P. If, deleting x_j ∈ P, the equality POS_{P−{x_j}}(Q) = POS_P(Q) holds, then we say that x_j is Q-redundant in P and it may be suppressed while preserving the classification power of the set. If P' ⊂ P is such that POS_P(Q) = POS_{P'}(Q) and P' does not contain Q-redundant elements, then we say that P' is a Q-reduct of P. Dealing with decision tables, with C = {x1, x2, . . . , xk} (condition attributes) and D = {y} (decision attribute), the dependency C ⇒_ρ D holds, and we must: (a) determine ρ and (b) minimise C (eliminating redundancies by means of extracting reducts from it).
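A minimal sketch of the positive region and the dependency degree ρ for a small decision table; the table encoding as tuples and the function names are our own assumptions.

from collections import defaultdict

def partition(rows, attrs):
    # Group row indices into equivalence classes of IND(attrs).
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def dependency_degree(rows, condition_attrs, decision_attr):
    # rho = card POS_P(Q) / card U: fraction of objects whose IND(P)-class
    # falls entirely inside one decision class.
    decision_classes = partition(rows, [decision_attr])
    positive = set()
    for block in partition(rows, condition_attrs):
        if any(block <= d for d in decision_classes):
            positive |= block
    return len(positive) / len(rows)

# Toy decision table: columns 0..1 are conditions, column 2 is the decision.
table = [
    (0, 1, "yes"),
    (0, 1, "yes"),
    (1, 0, "no"),
    (1, 1, "no"),
    (1, 1, "yes"),   # same condition values as the previous row, other class
]
print(dependency_degree(table, [0, 1], 2))   # 3/5 = 0.6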
2 Techniques for the Multidimensional Analysis of Data: Correlation and Multiple Regression
The choice of a statistical technique for the multidimensional analysis of data depends on the nature of the data as well as on the desired objective: description or prediction. Dealing with decision tables, the problem can be seen as the prediction of the decision attribute making use of the condition attributes. We can distinguish two different cases to which the regression technique is applicable:
– When the predictive variables (in our case the condition attributes) are quantitative and the predicted variable (in our case the decision attribute) is also quantitative.
– When the predictive variables are quantitative and the predicted variable is qualitative but can be expressed by means of a numerical value with a logical order.
3 Multiple Regression
In simple correlation there is only one predictive variable and one predicted variable. The n available examples constitute a cloud of dots in the two-dimensional plane (X, Y) through which the least-squares straight line is drawn. In multiple regression this procedure is generalised. Having k predictive variables we have to calculate k coefficients A1, A2, . . . , Ak as well as a constant term y0 that allow us to form the equation

y = y0 + A1 x1 + A2 x2 + . . . + Ak xk    (5)

of the regression hyperplane that best approximates the n examples. Assuming n >> k, the k coefficients determine a vector A = (A1, . . . , Ak)' and the values of x1, . . . , xk constitute an n × k matrix X, while the n values of Y form a column vector Y = (Y1, . . . , Yn)'    (6)

and we get

A = (X'X)^(-1) X'Y    (7)

(X' being the transpose of X). In order to calculate these coefficients the method of centred variables is applied. To evaluate the quality of the approximation, the difference between the observed and the predicted values is calculated. Let s be the sum of the squares of these differences. We then define

σ² = s / (n − k − 1)    (8)
When n >> k this value is approximately the sample variance of the n examples. The correlation coefficient r is then

r = √(1 − s / Σ(y_i − ȳ)²)    (9)

with −1 ≤ r ≤ 1. We consider values |r| > 0.8.
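Equations (7)-(9) can be reproduced with a few lines of NumPy; the sketch below adds an explicit intercept column rather than using centred variables, which is an implementation choice of ours.

import numpy as np

def multiple_regression(X, y):
    # Return intercept y0, coefficients A and correlation coefficient r for
    # the least-squares hyperplane y = y0 + A1*x1 + ... + Ak*xk.
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])          # prepend intercept column
    coef = np.linalg.solve(X1.T @ X1, X1.T @ y)    # (X'X)^-1 X'y, eq. (7)
    residuals = y - X1 @ coef
    s = float(residuals @ residuals)               # sum of squared errors
    r = np.sqrt(1.0 - s / float(np.sum((y - y.mean()) ** 2)))   # eq. (9)
    return coef[0], coef[1:], r

# Synthetic example: y depends linearly on x1 and x2 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 50 + 1.5 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(scale=0.5, size=1000)
y0, A, r = multiple_regression(X, y)
print(round(y0, 2), A.round(2), round(r, 3))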
4 Stepwise Regression
We are interested only in the most significant variables "explaining" or "predicting" Y. To eliminate the less significant ones, we follow an iterative process of stepwise regression. The steps are the following:
– Carry out the simple regression process with every variable under consideration. Then retain the one giving the maximal value of r (or the minimal value of s).
– Carry out the double regression process with the selected variable and any other one. Retain the pair giving the minimal value of s.
– We continue in this way (triple regression, ...). In each step there is a decrement δ of s. We calculate

F = δ / σ²    (10)

We compare this result with the value given by a Fisher table for (n − k − 1) and 1 degrees of freedom. We finish when the result of this test is negative (F_calculated < F_table). The set of condition variables selected in this way is an approximative reduct.
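A sketch of this forward stepwise procedure with the F-test stopping rule of equation (10); the residual-sum-of-squares bookkeeping and the reading of the degrees of freedom as F(1, n−k−1) are our interpretation of the text.

import numpy as np
from scipy.stats import f as f_dist

def residual_ss(X, y):
    # Residual sum of squares of the least-squares fit with an intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    res = y - X1 @ coef
    return float(res @ res)

def stepwise_select(X, y, alpha=0.05):
    # Forward selection: at each step add the variable giving the smallest
    # residual sum of squares s, and stop when F = delta / sigma^2 is no
    # longer significant against the Fisher distribution with (1, n-k-1) df.
    n, k_total = X.shape
    selected = []
    s_prev = float(np.sum((y - y.mean()) ** 2))   # s of the empty model
    while len(selected) < k_total:
        candidates = [j for j in range(k_total) if j not in selected]
        scores = {j: residual_ss(X[:, selected + [j]], y) for j in candidates}
        best_j = min(scores, key=scores.get)
        s_best = scores[best_j]
        k = len(selected) + 1
        sigma2 = s_best / (n - k - 1)
        F = (s_prev - s_best) / sigma2
        if F < f_dist.ppf(1 - alpha, 1, n - k - 1):
            break                                  # decrement no longer significant
        selected.append(best_j)
        s_prev = s_best
    return selected   # indices of the retained condition attributes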
5 Correlation vs Rough Dependencies
Correlation does not mean causality: if two independent variables depend on a third one, they will be strongly correlated. Correlation does not imply dependency. But if y depends on {x1, x2, ..., xk}, then {y} will be correlated with x1, x2, ..., xk.
6 Stepwise Regression as a Foundation for the Calculation of Approximative Reducts
When dependencies such as {x1, x2, ..., xk} ⇒ {y} are simplified, we consider possible dependencies existing between subsets of {x1, x2, ..., xk}, thus eliminating redundancy. The statistical approach is similar: if there is a set of strongly correlated variables in the implicant, the less significant ones are redundant and may be eliminated by means of stepwise regression. The subset of condition variables obtained in this way is an approximative reduct.
[Figure 1: two parallel pipelines starting from the decision table (C, D), with C = {x1, x2, ..., xk} and D = {y}. One path applies a discretiser and the lower-approximation computation to obtain ρ = card POS_C(D) / card U; the other first applies stepwise regression to obtain an approximative reduct C' ⊆ C, then a discretiser and the lower-approximation computation to obtain ρ' = card POS_C'(D) / card U.]
Fig. 1. Approximative reducts by means of stepwise regression.
7 Calculating Approximative Reducts by Means of Stepwise Linear Regression
In a decision table, when (i) the number of cases (rows) is much greater than the number of attributes (columns), (ii) the condition attributes are quantitative, and (iii) the decision attribute is either quantitative, or qualitative but expressible as a numerical value with a logical order, the dependency between condition and decision may be analytically approached by means of a linear regression model¹:

y = y0 + A1 x1 + ... + Ak xk    (11)
The stepwise regression process provides for the elimination of the less significant condition attributes, thus obtaining an approximative reduct, whose quality may be tested by comparing its dependency degree with the percentage of objects classified using the whole set of conditions.
8 Application to Randomly Generated Data
A table containing 10,000 tuples has been generated in a random way. The table corresponds to a decision table composed of 4 condition attributes x1, x2, x3, x4 and one decision attribute y.

[Correlation matrix (12) of x1, x2, x3, x4, y — diagonal entries 1; off-diagonal entries (cell positions not recoverable here): 0.731, 0.816, 0.229, −0.535, −0.824, −0.139, −0.821, −0.245, −0.973, −0.029.]
¹ If |r| < 0.8 the linear model is not adequate and other approaches should be used.
The selection of variables (in the order x4, x1, x2, x3) yields:

y = 117.57 − 0.738 x4                          (r = −0.821; δ = 0.674)        (13)
y = 103.1 − 0.614 x4 + 1.44 x1                 (r = 0.986; δ = 0.297)         (14)
y = 71.65 − 0.237 x4 + 1.452 x1 + 0.416 x2     (r = 0.991; δ: non-sensitive)  (15)
The correlation matrix indicates that there is a high correlation between x2 and x4. The minimal distance between the corresponding coefficient and its standard deviation pointed out the need to eliminate x4. Thus, the following result is obtained as a linear model of y:

y = 52.58 + 1.468 x1 + 0.662 x2    (r = 0.989)    (16)
Considering the possible existence of a dependency between {x1, x2, x3, x4} and {y}, we obtain {x1, x2} as an approximative reduct. This method has the advantage, from the point of view of minimising the error, of calculating approximative reducts from raw (non-discretised) data. However, if we apply a discretising method and then calculate the dependency degree making use of rough sets, we obtain the following result:

{x1, x2, x3, x4} ⇒_ρ1 {y}    ρ1 = 88.27%    (17)
{x1, x2} ⇒_ρ2 {y}            ρ2 = 86.74%    (18)
From this result we can conclude that the power of classification remains almost unaltered.
A Fuzzy Beam-Search Rule Induction Algorithm

Cristina S. Fertig¹, Alex A. Freitas², Lucia V. R. Arruda¹, and Celso Kaestner¹,²

¹ CEFET-PR, CPGEI. Av. Sete de Setembro, 3165. Curitiba – PR. 80230-901. Brazil.
[email protected], [email protected], [email protected]
² PUC-PR, PPGIA-CCET. Rua Imaculada Conceição, 1155. Curitiba – PR. 80215-901. Brazil.
{alex, kaestner}@ppgia.pucpr.br
http://www.ppgia.pucpr.br/~alex
Abstract. This paper proposes a fuzzy beam search rule induction algorithm for the classification task. The use of fuzzy logic and fuzzy sets not only provides us with a powerful, flexible approach to cope with uncertainty, but also allows us to express the discovered rules in a representation more intuitive and comprehensible for the user, by using linguistic terms (such as low, medium, high) rather than continuous, numeric values in rule conditions. The proposed algorithm is evaluated in two public domain data sets.
1 Introduction

This paper addresses the classification task. In this task the goal is to discover a relationship between a goal attribute, whose value is to be predicted, and a set of predicting attributes. The system discovers this relationship by using known-class examples, and the discovered relationship is then used to predict the goal-attribute value (or the class) of unknown-class examples.

There are numerous rule induction algorithms for the classification task. However, the vast majority of them work within the framework of classic logic. In contrast, this paper proposes a fuzzy rule induction algorithm for the classification task. The motivation for this work is twofold. First, fuzzy logic is a powerful, flexible method to cope with uncertainty. Second, fuzzy rules are a natural way to express rules involving continuous attributes. Actually, rule induction algorithms implicitly perform a kind of 'hard' (rather than soft) discretization when they cope with continuous attributes. For instance, within the framework of classic logic a rule induction algorithm discovers rules containing conditions such as 'age < 25', which has the obvious disadvantage of not coping well with the borderline ages of 24 and 25. In contrast, a fuzzy rule can contain a condition such as 'age = young', where young is a fuzzy attribute describing persons around 25 years old. This concept is more natural and more comprehensible for the user.

In principle any rule induction method can be 'fuzzyfied'. Indeed, some fuzzy decision tree algorithms have been proposed in the past [3], [2]. In this paper we have chosen, as the underlying data mining algorithm to be fuzzyfied, a variant of the beam-search rule induction algorithm described in [8]. Our motivation for this choice
is as follows. Despite their popularity, most decision tree algorithms perform a kind of hill-climbing search for rules, by exploring just one alternative at a time, which makes them very sensitive to the problem of local maxima. Beam search algorithms have the advantage of being less sensitive to this problem, since they explore w alternatives at a time, where w is the beam width [13]. To the best of our knowledge, this paper is the first work to propose a fuzzy beam search-based rule induction algorithm.
2 Beam Search-Based Rule Induction

The basic idea of a beam search-based rule induction algorithm [8] is described in Figure 1. The algorithm receives two arguments as input, namely max_depth (the maximum depth of the search tree) and w, the beam width (the number of rules or tree paths being explored by the algorithm at a given time). Both parameters are small integer numbers. In Figure 1, Ri denotes the i-th rule, i=1,...,w, Aj denotes the j-th predicting attribute, vjk denotes the k-th value of the j-th predicting attribute, and Rijk denotes the new rule created by adding condition Aj = vjk to the rule Ri.

This is a top-down algorithm, which starts with an 'empty rule' with no condition, and iteratively adds one condition at a time to all the current rules, so specializing the rules. Each condition added to a rule is of the form 'Attribute = Value', where Value is a value belonging to the domain of the attribute. (The operator could be '>' or '<' in the case of continuous attributes. However, for simplicity we consider here only the '=' operator, which is the only one necessary for our proposed fuzzy version of the algorithm.) In each iteration the algorithm tries to add all possible conditions to each rule (as long as the corresponding attribute does not occur yet in the rule), evaluates the just-created rules according to a rule-quality measure, and chooses the best w rules to proceed with the search. This process is repeated until either no just-created rule is better than any rule of the previous iteration or the maximum depth has been reached. Note that the algorithm expands a search tree whose root node is a rule with no conditions, each path from the root to a current leaf node corresponds to a current rule, and the depth of the path corresponds to the number of conditions in the rule. Therefore, the parameter max_depth effectively specifies the maximum number of conditions in a rule.

To compute the quality of a rule we used a variant of the well-known confidence factor (CF) measure. Let A => C denote a rule, where A is the rule antecedent (a conjunction of conditions) and C is the rule consequent (the value predicted for the goal attribute). The CF measure is simply |A & C| / |A|, where |x| denotes the cardinality of set x. In other words, CF is the ratio of the number of examples that both satisfy the conditions in the rule antecedent and have the goal-attribute value predicted by the rule consequent over the number of examples satisfying the conditions in the rule antecedent. We borrowed from [10] the idea of using a variant of this measure defined as: (|A & C| − ½) / |A|. The motivation for subtracting ½ from the numerator is to favor the discovery of more general rules, by avoiding the overfitting of the rules to the data. For instance, consider two rules R1 and R2, where R1 has |A & C| = |A| = 1 and R2 has |A & C| = |A| = 100. Without the ½ correction, both rules have a CF = 100%. However, rule R1 is probably overfitting the data, and
rule R2 is more likely to be accurate on unseen data. With the ½ correction we achieve the more realistic CF measures of 50% and 99.5% for rules R1 and R2, respectively.

Input: max_depth, w;
depth := 0;
RuleSet = a single rule with no conditions;
REPEAT
  FOR EACH rule Ri in RuleSet
    FOR EACH attribute Aj not used yet in rule Ri
      Specialize Ri by adding a condition Aj = vjk to the rule,
        and call the new rule Rijk;
      Compute a rule-quality measure for Rijk;
    END FOR
  END FOR
  RuleSet = the best w rules among all current rules;
  depth := depth + 1;
UNTIL (no rule created in this iteration is better than any rule
       of previous iteration) OR (depth = max_depth)

Fig. 1. Overview of a Beam Search-based Rule Induction Algorithm
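A compact Python sketch of the beam search of Figure 1 together with the (|A & C| − ½)/|A| quality measure from above; the data representation (one dict per example with a 'class' key) and the simplified stopping test are our own choices, not the authors' implementation.

def rule_quality(rule, examples, target_class):
    # (|A & C| - 0.5) / |A| for a crisp rule given as {attribute: value} tests.
    covered = [ex for ex in examples if all(ex[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    correct = sum(1 for ex in covered if ex["class"] == target_class)
    return (correct - 0.5) / len(covered)

def beam_search_rules(examples, attributes, target_class, max_depth=2, w=10):
    # Top-down beam search: start from the empty rule and repeatedly add one
    # 'attribute = value' condition, keeping the best w rules at each level.
    beam = [({}, 0.0)]                     # (rule, quality) pairs
    for _ in range(max_depth):
        candidates = {tuple(sorted(r.items())): q for r, q in beam}
        for rule, _ in beam:
            for attr in attributes:
                if attr in rule:
                    continue
                for value in {ex[attr] for ex in examples}:
                    new_rule = dict(rule, **{attr: value})
                    q = rule_quality(new_rule, examples, target_class)
                    candidates[tuple(sorted(new_rule.items()))] = q
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:w]
        new_beam = [(dict(items), q) for items, q in best]
        # Simplified stopping test: stop if the best quality did not improve.
        if max(q for _, q in new_beam) <= max(q for _, q in beam):
            break
        beam = new_beam
    return beam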
Finally, note that the algorithm of Figure 1 is searching for conditions to add to the rule antecedent only. It does not specify how to form the rule consequent. This part of the rule contains the value (class) predicted for the goal attribute. The choice of the class predicted by a rule can be made in several ways. One approach would be to let the algorithm automatically choose the best class to form the rule consequent, by picking the class to which the majority of the examples satisfying the rule antecedent belong. Another approach would be to run the algorithm k times for a kclass problem, where in each run the algorithm searches only for rules predicting the k-th class. Yet another approach, assuming a two-class problem, is to run the algorithm to discover only rules predicting one class. In this case, if an example satisfies any of the discovered rules it is assigned the corresponding class; otherwise it is assigned the other class (the ‘default’ class). We have opted for this latter approach. In our work the algorithm searches only for rules predicting the minority class (i.e. the class with smaller relative frequency in the data being mined), whereas the majority class is the default class. We have chosen this approach for two reasons. First, its simplicity. Second, focusing on the discovery of rules predicting the minority class has the advantage that these rules tend to be more interesting to the user than rules predicting the majority class [5]. For instance, rules predicting a rare disease tend to be more interesting than rules predicting that the patient does not have that disease. (Of course, one of the reasons why minority-class rules tend to be more interesting is that it is more difficult to discover them, in comparison with majority class rules.)
3 Fuzzyfying a Beam Search-Based Rule Induction Algorithm

In this section we describe how we have fuzzyfied the algorithm described in the previous section. First of all, as a pre-processing step, for each continuous attribute in the data being mined we must associate: (1) a set of fuzzy linguistic terms (such as high, medium, low), which are essentially labels for fuzzy sets; and (2) a fuzzy membership function, which specifies the degree to which each attribute's original value belongs to the attribute's linguistic terms. Loosely speaking, this can be regarded as a kind of 'discretization', since the originally-continuous attribute can now take on just a few linguistic terms as its value. However, this is a very flexible discretization, since each of the now 'discrete' values of the attribute is actually a flexible linguistic term corresponding to a fuzzy set. The continuous attributes were fuzzyfied by using trapezoidal membership functions [1], since this kind of function often leads to a data modeling closer to reality. Note that it is necessary to fuzzyfy only continuous attributes, and not categorical ones.

In addition to the above-described fuzzyfication of continuous attributes, we also need to fuzzyfy the computation of the rule-quality measure. In our case this corresponds to fuzzyfying the formula (|A & C| − ½) / |A|, defined in the previous section. In our first attempt to define a fuzzy version of |A|, this term was the summation, over all training examples, of the degree to which the example satisfies the rule antecedent. For each example, this degree is computed by a fuzzy AND of the degrees to which the example satisfies each condition. We have used the conventional definition of the fuzzy AND as the minimum operator. For instance, suppose a rule antecedent contains the following three conditions 'age = young and salary = low and sex = male', where age and salary are fuzzyfied (originally continuous) attributes and sex is a categorical attribute. Suppose that a given training example satisfies the condition 'age = young' to a degree of 0.8, the condition 'salary = low' to a degree of 0.6 and the condition 'sex = male' to a degree of 1. In this case the example satisfies the rule antecedent to a degree of 0.6 (the minimum value among 0.8, 0.6 and 1). Note that if the example had 'sex = female' it would satisfy the above rule antecedent to a degree of 0. (Conditions involving categorical attributes are satisfied to a degree of either 0 or 1.)

However, the above fuzzyfication of |A|, although theoretically sound, has a disadvantage. In practice, many training examples can satisfy a rule antecedent to a small degree, and the summation of all these small degrees of membership can undermine the reliability of the CF measure. To avoid this, our summation of the degree to which an example satisfies the rule antecedent includes only the examples with a degree of membership greater than or equal to a predefined threshold (set to 0.5 in our experiments), rather than all training examples. This operation is based on the alpha-cut technique used in fuzzy arithmetic [9]. In our fuzzy version, |A & C| is the summation, over all examples that have the class predicted by the rule, of the degree to which the example satisfies the rule antecedent. Analogously to the computation of |A|, this summation considers only examples which satisfy the rule antecedent to a degree greater than or equal to the above-mentioned membership threshold.
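The pieces described above — a trapezoidal membership function, the min-based antecedent degree, and the alpha-cut fuzzy confidence factor — can be sketched as follows; the (a, b, c, d) parameterisation and the example membership shape for 'young' are assumptions of ours.

def trapezoid(x, a, b, c, d):
    # Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def antecedent_degree(example, conditions):
    # Fuzzy AND (minimum) over the degrees of the individual conditions.
    # `conditions` maps an attribute to a membership function of its raw value;
    # crisp categorical tests simply return 0.0 or 1.0.
    return min(mf(example[attr]) for attr, mf in conditions.items())

def fuzzy_confidence(examples, conditions, target_class, alpha=0.5):
    # (|A & C| - 0.5) / |A| where |.| sums membership degrees, counting only
    # examples whose degree reaches the alpha-cut threshold.
    degrees = [(antecedent_degree(ex, conditions), ex["class"]) for ex in examples]
    degrees = [(d, c) for d, c in degrees if d >= alpha]
    a = sum(d for d, _ in degrees)
    a_and_c = sum(d for d, c in degrees if c == target_class)
    return (a_and_c - 0.5) / a if a > 0 else 0.0

# Example conditions: 'age = young' and 'sex = male' (membership shape assumed).
young = lambda x: trapezoid(x, 18, 20, 28, 35)
conditions = {"age": young, "sex": lambda s: 1.0 if s == "male" else 0.0}
ex = {"age": 24, "sex": "male", "class": "churn"}
print(antecedent_degree(ex, conditions))   # 1.0: age 24 is fully 'young' here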
4 Computational Results

The above-described fuzzy rule induction algorithm was evaluated on two public domain data sets, available from http://www.ics.uci.edu/~mlearn/MLRepository.html. The first data set used in our experiments was the Pima Indians Diabetes Database [11], which contains 9 continuous attributes and 768 examples. The second data set was the Boston Housing Data [7], which contains 13 continuous attributes and 1 binary attribute. The latter was removed from the data set for the purposes of our experiments, since only continuous attributes are fuzzyfied by our algorithm. This data set contains 506 examples. The goal attribute for this data set (median value of owner-occupied homes in $1000's) was originally continuous. To adapt this data set to the classification task, the goal attribute was discretized, so that it can take on only two classes (cheap and expensive).

Each data set was divided into 5 partitions, and a cross-validation procedure [6] was then performed. For each data set, this corresponds to running the algorithm 5 times, where each time a different partition is used as the test set and all the remaining four partitions are used as the training set. In all our experiments the maximum depth of the tree search was set to 2 and the beam width w was set to 10. Almost all continuous attributes were fuzzyfied into 3 linguistic values, each represented by a trapezoidal membership function. The only exceptions were the attributes Rad and Tax of the housing data set, which were fuzzyfied into two linguistic values.

The results of our experiments are reported in Table 1. Each cell of the table refers to the average results over the 5 runs of the algorithm associated with the cross-validation procedure. The first row indicates the baseline accuracy of each data set, that is the relative frequency of the majority class. A basic requirement for the predictive accuracy of a classifier is that it be greater than the baseline accuracy, since it would be trivial to achieve such accuracy by always predicting the majority class. The second row indicates the overall predictive accuracy achieved by our fuzzy algorithm. This was computed as the ratio of the number of correctly classified test examples over the total number of test examples. Note that for both data sets the algorithm achieved an accuracy rate significantly better than the baseline accuracy.

Note that the above overall accuracy rate only takes into account whether the classification of an example was correct or wrong, regardless of which kind of rule (a discovered rule or the default-class rule) was used to classify the example. To analyze in more detail the quality of the discovered rules, we also measured separately the accuracy rate of the discovered rules and the accuracy rate of the default-class rule. Recall that all discovered rules predict the same class, namely the minority class, whereas the default-class rule, which is applied whenever the test example does not satisfy any discovered rule, simply predicts the majority class. Recall also that an example is considered to satisfy a discovered rule only if the rule antecedent belongs to the alpha-cut membership function describing the linguistic terms (with alpha = 0.5 in our experiments), as explained in Section 3. The third row of Table 1 indicates the accuracy rate of the discovered rules, computed as the ratio of the number of examples correctly classified by the discovered rules over the number of examples satisfying any of the discovered rules.
(Since all discovered rules predict the same class, it is irrelevant which rule is actually classifying the example.)
The fourth row indicates the accuracy rate of the default-class rule, computed as the ratio of the number of examples correctly classified by this rule over the number of examples classified by this rule (i.e. the number of examples that do not satisfy any of the discovered rules). As can be seen in the table, in both data sets the default rule has a predictive accuracy significantly better than the discovered rules. This was somewhat expected, given the difficulty of predicting the minority class in both data sets. Actually, if we were to compute the 'baseline accuracy of the minority class', we would find the values 0.349 (1 - 0.651) and 0.245 (1 - 0.755) for the Diabetes and Housing data sets, respectively. From this point of view, the discovered rules can still be considered as good-quality ones with respect to predictive accuracy, since their accuracy is 0.711 and 0.838 for the Diabetes and Housing data sets, respectively.

Table 1. Computational Results

                                 Diabetes data set   Housing data set
  Baseline accuracy                    0.651              0.755
  Overall accuracy                     0.711              0.838
  Accuracy of discovered rules         0.641              0.725
  Accuracy of default rule             0.730              0.863
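To make the evaluation protocol above concrete, here is a minimal Python sketch (not the authors' implementation) of the alpha-cut rule matching and of the three accuracy figures reported in Table 1; the trapezoid helper, the rule format and all names are assumptions introduced only for illustration.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def satisfies(example, rule, alpha=0.5):
    """A rule antecedent is a list of (attribute, trapezoid parameters) conditions."""
    return all(trapezoid(example[attr], *params) >= alpha for attr, params in rule)

def evaluate(test_set, rules, minority, majority, alpha=0.5):
    """Overall, discovered-rule and default-rule accuracy, as defined in the text."""
    covered = covered_ok = default = default_ok = 0
    for example, true_class in test_set:
        if any(satisfies(example, r, alpha) for r in rules):   # a discovered rule fires
            covered += 1
            covered_ok += (true_class == minority)
        else:                                                  # the default-class rule applies
            default += 1
            default_ok += (true_class == majority)
    return {"overall": (covered_ok + default_ok) / len(test_set),
            "discovered": covered_ok / covered if covered else None,
            "default": default_ok / default if default else None}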
Another issue to be considered is the comprehensibility of the discovered rules. Although this is a subjective criterion, it is common in the literature to evaluate comprehensibility in terms of the syntactical simplicity of the discovered rules. In this case, the smaller the number of rules and the smaller the number of conditions per rule, the more comprehensible the discovered rule set is. With this definition of comprehensibility we can say that the rules discovered by our algorithm are comprehensible, almost by definition of the algorithm and a suitable choice of its parameters. Firstly, the algorithm discovers only a small set of rules, whose number is the user-specified beam width. Secondly, the algorithm discovers only relatively short rules, where the number of conditions of the discovered rules is at most the user-specified maximum depth of the search tree. In addition, and more important, the use of linguistic terms associated with our fuzzy algorithm can be considered as a form of improving rule comprehensibility, in comparison with continuous, numeric values, as argued in the introduction. It should be noted, however, that high accuracy and comprehensibility do not necessarily imply interestingness. To consider a classical example, in a hospital database we can easily mine a rule such as ‘if (patient is pregnant) then (patient sex is female)’. Although this rule is highly accurate and comprehensible, it is obviously uninteresting, since it states the obvious. Our algorithm discovers only rules predicting the minority class, which, as argued in section 3, tend to be more interesting rules for the user. However, a more detailed analysis of the degree of interestingness of the discovered rules is beyond the scope of this paper. Although the literature proposes several methods to evaluate the degree of interestingness of the discovered rules – see e.g. [4], [12] – it does not seem trivial to adapt these methods to the context of fuzzy rules.
5 Conclusion

We have proposed a fuzzy version of a beam search-based rule induction algorithm. We have also evaluated the algorithm on two data sets. Overall, the results are good, not only with respect to predictive accuracy, but also (and more importantly, in data mining) with respect to the comprehensibility of the discovered rules, which is intuitively improved by the use of linguistic terms associated with fuzzy rules. However, the computational results reported in this paper are still somewhat preliminary, since the algorithm has been applied to only two data sets. Future research will include a more extensive evaluation of the algorithm on other data sets.

A disadvantage of our fuzzy approach is that it requires a preprocessing phase to specify the fuzzy membership functions for each continuous attribute. To solve this problem two approaches can be used: (1) the specification is done with the help of the user (an expert in the meaning of the data), which requires valuable human time; (2) a fuzzy clustering algorithm can be used to obtain the membership functions [9]. Hopefully, fuzzy databases will become more commonplace in the future, so that the fuzzy values and fuzzy membership functions that our algorithm requires might already be available in the underlying fuzzy database, thus avoiding the need for the above preprocessing phase.
References

1. Bojadziev, G., Bojadziev, M.: Fuzzy Sets, Fuzzy Logic, Applications. World Scientific (1995)
2. Chi, Z., Yan, H.: ID3-derived fuzzy rules and optimized defuzzification for handwritten numeral recognition. IEEE Trans. on Fuzzy Systems 4(1), Feb. (1996) 24-31
3. Cios, K.J., Sztandera, L.M.: Continuous ID3 algorithm with fuzzy entropy measures. Proc. IEEE Int. Conf. Fuzzy Systems, San Diego (1992) 469-476
4. Freitas, A.A.: On objective measures of rule surprisingness. Principles of Data Mining and Knowledge Discovery (Proc. 2nd European Symp., PKDD-98). Lecture Notes in Artificial Intelligence 1510. Springer-Verlag (1998) 1-9
5. Freitas, A.A.: On rule interestingness measures. To appear in Knowledge-Based Systems journal (1999)
6. Hand, D.: Construction and Assessment of Classification Rules. John Wiley & Sons (1997)
7. Harrison, D., Rubinfeld, D.L.: UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Dept. of Information and Computer Science (1993)
8. Holsheimer, M., Kersten, M., Siebes, A.: Data surveyor: searching the nuggets in parallel. In: U.M. Fayyad et al. (Eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press (1996) 447-467
9. Pedrycz, W., Gomide, F.: An Introduction to Fuzzy Sets: Analysis and Design. MIT Press (1998)
10. Quinlan, J.R.: Generating production rules from decision trees. Proc. Int. Joint Conf. AI (IJCAI-87) (1987) 304-307
11. Sigillito, V.: UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: Univ. of California, Dept. of Information and Computer Science (1990)
12. Suzuki, E., Kodratoff, Y.: Discovery of surprising exception rules based on intensity of implication. Principles of Data Mining and Knowledge Discovery (Proc. 2nd European Symp., PKDD-98). Lecture Notes in Artif. Intelligence 1510. Springer-Verlag (1998) 10-18
13. Winston, P.H.: Artificial Intelligence. 3rd Ed. Addison-Wesley (1992)
An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining

Zhiwei Fu

Robert H. Smith School of Business, University of Maryland, College Park, 20742, USA
[email protected]
Abstract. A variety of techniques have been developed to scale decision tree classifiers in data mining and extract valuable knowledge. However, these approaches either cause a loss of accuracy or cannot effectively uncover the data structure. We explore a more promising GA-based decision tree classifier, OOGASC4.5, which integrates the strengths of decision tree algorithms with statistical sampling and a genetic algorithm. The proposed program not only enhances classification accuracy but also offers the potential advantage of extracting valuable rules. Computational results are provided along with analysis and conclusions.
1 Introduction
Data mining has become an attractive discipline within the last few years [4][7]. Its goal is to extract pieces of previously unknown, but valuable knowledge or patterns from large data sets. Data mining over large data sets is important due to its commercial potential. Numerous algorithms have been developed for handling large data sets, such as distributed algorithms, restricted search, parallel algorithms and data reduction algorithms. However, the computational costs, the available storage and the retrieval of large data sets are still serious concerns for large-scale data mining. While certain data mining algorithms show consistent performance for some data sets, this is not necessarily true across all problem domains. For example, the performance of the "ideal" scheme extracted for induction-based classification/decision problems varies greatly with the characteristics of the data sets to which the algorithms are applied. On the other hand, techniques such as discretization can cause a loss of accuracy when scaling up to large data sets [1][8]. In this paper we explore a more promising algorithm using a genetic algorithm (GA) and SampleC4.5 for classification problems. The paper begins by introducing the problem and reviewing the decision tree algorithm and GA, followed by the design of the Object Oriented Genetic Algorithm with SampleC4.5 (OOGASC4.5). We then show the computational results. After comparison and analysis, we end with conclusions in the final section.
2 The OOGASC4.5 Program
2.1 C4.5 and SampleC4.5

Decision tree algorithms have long been recognized as a powerful tool in data mining for representing schemes in the studied data sets according to the values of variables. Among them, C4.5 [9] has been widely implemented and tested. It uses the top-down induction approach to build the decision tree, which is fitted to training samples by recursively partitioning the data into increasingly homogeneous subsets, based on the values of one variable at a time. It starts at the root of the tree and moves through it until a leaf is encountered, or no more improvement can be made. However, C4.5 pursues a local greedy search which can converge quickly, but at the expense of a higher possibility of getting trapped in local optima.

To uncover the data set efficiently, we develop SampleC4.5 by instilling statistical sampling methods into C4.5. The SampleC4.5 algorithm starts with the full original set of variables and a training set of a certain starting percentage (p=p0) extracted from the raw data. The remaining data (1-p0)N form the test set, where N is the size of the original data set. In the case of large data sets, we would form a separate test set, and an "untouched" validation set beforehand, rather than the dynamic (1-p0)N test sets. SampleC4.5 is run on the training set for a number of iterations (n=n0). The classification accuracy on the test sets over all n iterations is averaged, and the standard error is calculated, after which the algorithm increments p by some step percentage s0. The process repeats on a new training set of (p0+s0)N, and a new test set of (1-(p0+s0))N or the pre-specified test/validation set for large data sets. Two statistical sampling approaches, a simple random sampler and a stratified random sampler, are provided. The pseudo-code for the algorithm begins by initializing n=1 and p=p0 and loops while 0 < p < 1; a sketch in this spirit is given below.
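The following is a hedged Python reconstruction of that loop, based only on the prose description above; run_c45 is a stand-in for invoking C4.5 on a given training/test split, only the simple random sampler is shown, and all names are assumptions rather than the actual implementation.

import random
import statistics

def sample_c45(data, run_c45, p0=0.1, s0=0.1, n0=10):
    """For each sampling percentage p, run C4.5 n0 times on random training
    samples of size p*N and record the mean test accuracy and its standard error."""
    results = []
    p = p0
    while 0 < p < 1:
        accuracies = []
        for _ in range(n0):                                   # n0 iterations per sampling level
            idx = set(random.sample(range(len(data)), int(p * len(data))))
            train = [data[i] for i in idx]
            test = [data[i] for i in range(len(data)) if i not in idx]
            accuracies.append(run_c45(train, test))           # accuracy of the induced tree
        mean = statistics.mean(accuracies)
        stderr = (statistics.stdev(accuracies) / n0 ** 0.5) if n0 > 1 else 0.0
        results.append((p, mean, stderr))
        p += s0                                               # increment the sampling percentage
    return results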
2.2 Genetic Algorithm

The genetic algorithm (GA) [6] has been used in a wide variety of applications, from combinatorial optimization to data mining. GA is a domain-independent global search technique in that it exploits accumulating information about an initially unknown search space in order to bias subsequent search towards more promising subspaces. GA proceeds by iterative search from one population of candidate solutions to another, each consisting of m candidate solutions. The global search performed by GA is randomized but structured, since during each iteration step, called a generation, the current population is evaluated by the fitness function and, based on the evaluation and evolving strategies, a new population of candidate solutions is formed. GA thus searches through the parameter space in many directions simultaneously and thereby improves the probability of convergence to the global optimum [3][5].

2.3 OOGASC4.5

Taking advantage of GA and statistical sampling, we develop the OOGASC4.5 program to build better decision tree(s) for classification problems. In OOGASC4.5, there are three principal issues involved in the GA.
1. The initial population. An appropriate representation of the problem space is a key issue for GA efficiency. Usually the initial population is chosen randomly. We create genetic tree chromosomes for the GA rather than map the high-dimensional decision tree onto a linear genetic chromosome.
2. Genetic operations. There are two types of crossover operations in our program, subtree-to-subtree and subtree-to-leaf crossover, which randomly exchange a subtree with another subtree, or a subtree with a leaf, between two parent trees (a sketch of this crossover is given below). The mutation operation is implemented as an asexual crossover which randomly exchanges a subtree or leaf with another subtree or leaf within a single tree.
3. Evaluation. Appropriately defining the evaluation/objective function is the main issue in successfully applying GA. In the program, we employ the accuracy metric to evaluate the performance of the generated decision trees. This is the percentage of correctly classified observations in the test data sets.
The OOGASC4.5 program is shown in Fig. 1. The object-oriented design was conducted in Rose98 from Rational Software Corp. The program was implemented in Microsoft Visual C++ 5.0 on Windows 95 with 128.0 MB RAM and a 400 MHz processor. In the program, we first build diversified tree models from different (not necessarily mutually exclusive) subsets of the original data set by SampleC4.5, then take the output trees as inputs to the GA. In the process of evolving, the feasibility of the generated trees is checked, since the trees generated after genetic operations may not be feasible and only feasible trees are of interest. After the feasibility check, the fitness values of the generated trees are calculated over the test set.
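As an illustration of the tree-chromosome crossover just described, here is a hedged Python sketch; the nested-dictionary node layout and the helper names are assumptions, not the actual object-oriented design, and the feasibility check mentioned above is omitted.

import copy
import random

def collect_nodes(tree, nodes=None):
    """Gather references to every node (subtrees and leaves) of a tree chromosome."""
    nodes = [] if nodes is None else nodes
    nodes.append(tree)
    for child in tree.get("children", []):
        collect_nodes(child, nodes)
    return nodes

def subtree_crossover(parent_a, parent_b):
    """Swap a randomly chosen subtree (or leaf) between deep copies of two parents."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a = random.choice(collect_nodes(child_a))
    node_b = random.choice(collect_nodes(child_b))
    node_a_backup = copy.deepcopy(node_a)
    node_a.clear(); node_a.update(copy.deepcopy(node_b))   # graft b's subtree into child_a
    node_b.clear(); node_b.update(node_a_backup)           # graft a's subtree into child_b
    return child_a, child_b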
Fig. 1. The OOGASC4.5 program (block diagram: Object-Oriented SampleC4.5 samples the raw data, generates randomized C4.5 trees, and initializes the population and the fitness function; Object-Oriented Genetic Algorithms then validate the trees (feasibility check) and evolve them (selection, crossover and mutation), followed by simplification, testing and performance evaluation, which produce the results)
3 Experimental Results
3.1 Parameter Settings

In the experiments, we set the probability of crossover at 1.0, mutation rate 0.01, and used 50 generations as the stopping criterion. The sampling percentages vary from 10% to 60%, and the population size from 10 to 100.
3.2 SocioOlympic Data

We implement OOGASC4.5 on SocioOlympic data collected from the 1996 Atlanta Summer Olympic Games, in which 11,000 athletes from 197 nations participated in 271 events from 26 sports. While more countries than ever appeared in the final medal count (78 countries received at least one medal), some countries regularly win more than others do. Many attempts have been made in the literature to explain these differences. Population is cited most frequently as a factor affecting Olympic success. However, we only need to compare recent performances of nations such as Cuba and India to see that other factors are involved (in 1996, Cuba won 25 medals while India won 1). The total score for each country is calculated by assigning points to the top eight placing countries for each of the 271 Olympic events. Socioeconomic information (e.g., population and GNP) is used to classify the countries into categories based on the total scores attained and predicted [2].
Table 1. Computational results for SampleC4.5 and OOGASC4.5

Best of the initial population (SampleC4.5 trees)
Sampling     Population size
percentage      10      50      60      70      80      90     100
 5%          0.6462  0.6462  0.6462  0.6462  0.6462  0.6462  0.6462
10%          0.8462  0.8462  0.8462  0.8462  0.8462  0.8462  0.8462
20%          0.8359  0.8462  0.8462  0.8462  0.8462  0.8462  0.8462
30%          0.8462  0.8769  0.8769  0.8769  0.8769  0.8769  0.8769
40%          0.8769  0.8872  0.8872  0.8974  0.8974  0.8974  0.8974
50%          0.8410  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128
60%          0.8974  0.9026  0.9026  0.9026  0.9128  0.9128  0.9128

Best output GASampleC4.5 tree
Sampling     Population size
percentage      10      50      60      70      80      90     100
 5%          0.7385  0.7385  0.7385  0.7385  0.7385  0.7385  0.7385
10%          0.8667  0.8769  0.8872  0.8872  0.8974  0.8769  0.8974
20%          0.8462  0.8923  0.8976  0.8769  0.8769  0.8923  0.8976
30%          0.8667  0.8821  0.9128  0.9179  0.8872  0.9128  0.9179
40%          0.8974  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128
50%          0.8821  0.9128  0.9231  0.9282  0.9128  0.9179  0.9231
60%          0.9077  0.9077  0.9179  0.9179  0.9231  0.9179  0.9333

Convergence (no. of generations)
Sampling     Population size
percentage      10      50      60      70      80      90     100
 5%              1       1       1       1       1       1       1
10%              9       8       7      15       6      11      12
20%              4       9      16       9       8      14      13
30%              9      11      17      24      13      26      18
40%             14      16       8       7      20       3      30
50%             40       1      26      33       1       3       6
60%             19       7      12      11      22       4      25

Computing time for GAOOSC4.5 (seconds)
Sampling     Population size
percentage      10      50      60      70      80      90     100
 5%           14.8    70.6    86.8    98.4   112.4   130.3   144.7
10%           15.0    72.9    87.8   101.9   116.8   131.1   146.3
20%           15.1    73.2    88.7   102.3   117.4   131.5   146.9
30%           15.1    73.6    89.8   102.8   118.3   132.8   147.9
40%           15.2    74.6    89.5   103.7   119.0   133.4   148.5
50%           15.3    74.7    90.1   104.9   119.8   135.2   149.6
60%           15.5    75.3    90.8   105.8   121.5   138     150.5

3.3 Results
Table 1 shows experimental results for SampleC4.5 and OOGASC4.5 from several perspectives. For any combination of the population size and the sampling percentage, GASampleC4.5 output trees perform better than SampleC4.5 trees. The improvement mainly varies from 2% to 9%, which is indicative of the robustness of OOGASC4.5. OOGASC4.5 can effectively achieve at least the same level of performance as SampleC4.5 at a much lower sampling percentage, which will be beneficial for large-scale data mining. The sampling percentages should be set in a problem-specific way, considering the manageable size of the sampled subsets. In the case of large data sets, we can reasonably set the sampling percentage as low as 0.1%, or even 0.01%, to build up the initial tree models. The population size is more dominant in computing time than the sampling percentage, since as the number of candidate decision trees increases, more genetic operations and evaluations will be done by the GA. On the other hand, the fact that the sampling percentage does not significantly affect total computing time is also due to the fact that the number of observations extracted by sampling from the original data is not significantly large. OOGASC4.5 has some limitations. First, the embedded GA search entails many evaluations of the fitness function and, consequently, much computing time. The second limitation is that convergence of the GA does not necessarily occur at the global optimum, but close to it. OOGASC4.5 converges relatively fast, as illustrated in Table 1, but does not exhibit any specific trend.
4 Conclusion
The OOGASC4.5 program demonstrates that better decision trees can be built using a well-designed GA. The decision trees built by OOGASC4.5 can achieve better performance than those built by traditional approaches. The generated trees usually differ from those produced by SampleC4.5, and the difference often extends to the variable at the root node. This could be due to the different nature of the search methods. Decision tree algorithms like C4.5 pursue a local greedy search through the data space, selecting the most significant variable to partition one level down at a time. However, the variables that most significantly partition the data sets at each step are not necessarily the combination that leads to the best decision trees. While OOGASC4.5 is designed for large data sets, we use the SocioOlympic data to test the efficacy of various sampling levels and parameter settings, and to validate the accuracy of our approach. Using what was learned with this small data set, we intend to run OOGASC4.5 on very large data sets. Part of the future work will be extracting, aggregating and enhancing the valuable rules discovered by the program. Exploration of more effective genetic crossover and mutation operations is also part of the future work.
References

1. Chattratichat, J. et al. Large Scale Data Mining: Challenges and Responses. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, 1997.
2. Condon, E. Predicting the Success of Nations in the Summer Olympics Using Neural Networks, Master's Thesis. University of Maryland, College Park, MD, 1997.
3. DeJong, K. Adaptive system design: a genetic approach. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-10, No. 9, pp. 566-574, Sept. 1980.
4. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
5. Goldberg, D. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1988.
6. Holland, J. H. Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
7. Piatetsky-Shapiro, G., Frawley, W. Knowledge Discovery in Databases, MIT Press, 1991.
8. Provost, F. Scaling up Inductive Algorithms: An Overview. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 239-242. AAAI Press, Menlo Park, CA, 1997.
9. Quinlan, J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Extension to C-means Algorithm for the Use of Similarity Functions

Javier Raymundo García-Serrano1 and José Francisco Martínez-Trinidad2

1 Centro Nacional de Investigación y Desarrollo Tecnológico, Cuernavaca, Morelos, México. e-mail: [email protected]
2 Centro de Investigación en Computación, Instituto Politécnico Nacional, México, D.F. e-mail: [email protected]
Abstract. The C-means algorithm has motivated many extensions since its first publications. The extensions proposed so far mainly concern the following aspects: the selection of initial seeds (centers); the determination of the optimal number of clusters; and the use of different functionals for generating the clusters. In this paper an extension to the C-means algorithm is proposed which considers descriptions of the objects (data) with quantitative and qualitative features, and in addition handles missing data. These types of descriptions are very frequent in soft sciences such as Medicine, Geology, Sociology, Marketing, etc., so the application scope of the proposed algorithm is very wide. The proposed algorithm uses similarity functions that may be expressed in terms of partial similarity functions, and consequently allows comparing objects by analyzing subdescriptions of them. Results using standard public databases [2] are shown. In addition, a comparison with the classical C-means algorithm [7] is provided.
1 Introduction

The restricted unsupervised classification (RUC) problem has been studied intensely in Statistical Pattern Recognition [1,6], based on metric distances in an n-dimensional metric space [1,6]. Algorithms such as C-means have shown their effectiveness in the unsupervised classification process. This algorithm starts with an initial partition, then iteratively tries all possible moves or swaps of data from one group to another in order to optimize the objective measurement function. The objects must be described in terms of features such that a metric can be applied to evaluate the distance. Nevertheless, these are not the conditions in soft sciences such as Medicine, Geology, Sociology, Marketing, etc. [4]. In these sciences the objects are described in terms of quantitative and qualitative features. For example, if we look at geological data, features such as age, porosity, and permeability are quantitative, but other features such as rock type, crystalline structure, and facies structure are qualitative. Likewise, missing data is common in this type of problem. In these circumstances only the similarity degree between the objects can be determined. At present few algorithms exist for solving RUC in a context such as the one previously mentioned. The conceptual C-means algorithm is the most representative [3]. This algorithm proposes a distance function for handling quantitative and qualitative features. The distance between two objects is computed by evaluating the distance between quantitative
features (with a Euclidean distance) plus the distance between qualitative features (using the chi-square distance). To achieve this, each value of a qualitative feature is coded as a binary feature. Finally, it is assumed that the distance defined in this way has an interpretation in the original n-dimensional space, where, for example, centroids can be computed; this last assumption is wrong. Another motivation for our proposed algorithm is the need of many specialists who work in soft sciences to group data into a specific number of clusters (a RUC problem), such that objects that are more similar tend to fall into the same group and objects that are relatively distinct tend to separate into different groups.
2 Algorithm Description

Let us consider a set of m objects {O1,O2,...,Om} which must be grouped into c clusters. Each object is described by a set R={x1,x2,...,xn} of features. The features take values in sets of admissible values, xi(Oj) ∈ Mi, i=1,...,n. We assume that Mi contains a symbol "∗" to denote missing data. The features can thus be of any nature (qualitative: Boolean, multi-valued, etc., or quantitative: integer, real) and incomplete descriptions of the objects can be considered. For each feature a comparison criterion Ci: Mi × Mi → Li, i=1,…,n, is defined, where Li is a totally ordered set. In addition, let Γ: (M1 × ... × Mn)^2 → [0,1] be a similarity function. In some cases the similarity function Γ depends on, or is a linear combination of, partial similarity functions Γ': (M_{i1} × ... × M_{is})^2 → L'_i, with L'_i the same as Li. Such functions allow us to compare descriptions of objects restricted to subdescriptions of s ≤ n features. Γ(Oj, Ok) denotes the
similarity between the objects Oj and Ok. The value Γ(Oj, Ok) satisfies the following three conditions:
1. Γ(Oj, Ok) ∈ [0,1] for 1 ≤ j ≤ m and 1 ≤ k ≤ m;
2. Γ(Oj, Oj) = 1 for 1 ≤ j ≤ m;
3. Γ(Oj, Ok) = Γ(Ok, Oj) for 1 ≤ j ≤ m and 1 ≤ k ≤ m.
Let u_{ik} be the degree of membership of the object Ok in the cluster Ci, and let R^{c×m} be the set of all real c×m matrices. Any c-partition of the data set is represented by a matrix U = [u_{ik}] ∈ R^{c×m}, which satisfies:
1. u_{ik} ∈ {0,1} for 1 ≤ i ≤ c and 1 ≤ k ≤ m;
2. \sum_{i=1}^{c} u_{ik} = 1 for 1 ≤ k ≤ m;
3. \sum_{k=1}^{m} u_{ik} > 0 for 1 ≤ i ≤ c.
Then the partition matrix U is determined from the maximization of the objective function given by

J(U) = \sum_{i=1}^{c} \sum_{k=1}^{m} u_{ik} \, \Gamma(O_i^r, O_k),

where \Gamma(O_i^r, O_k) is the similarity between the most representative object O_i^r ("the center") of the cluster Ci and the object Ok. Note that in our case the "center" is an object of the sample instead of a fictitious element as in the classical C-means algorithm. To determine the most representative object in a cluster Ci for a given U, we introduce the function
r_{C_i}(O_j) = \frac{\beta_{C_i}(O_j)}{\alpha_{C_i}(O_j) + (1 - \beta_{C_i}(O_j))} + \eta_{C_i}(O_j)        (1)

where the function \beta_{C_i}(O_j) is the average similarity (mean) of the object Oj with the other objects in the same cluster Ci, computed as follows:

\beta_{C_i}(O_j) = \frac{1}{|C_i| - 1} \sum_{O_q \in C_i,\; O_q \neq O_j} \Gamma(O_j, O_q)        (2)

To increase the informational value of (2) we introduce the function \alpha_{C_i}(O_j):

\alpha_{C_i}(O_j) = \frac{1}{|C_i| - 1} \sum_{O_q \in C_i,\; O_q \neq O_j} \left| \beta_{C_i}(O_j) - \Gamma(O_j, O_q) \right|        (3)

This function evaluates the variance between the mean (2) and the similarity between the object Oj and the other objects in Ci, so when this variance decreases, the value of (1) increases. The expression (1 - \beta_{C_i}(O_j)) is added to (3) in the denominator of (1). This value allows us to compute (1) in the case that (3) is zero and, besides, represents the average dissimilarity of Oj with the other objects in Ci.

\eta_{C_i}(O_j) = \sum_{q=1,\; q \neq i}^{c} \left( 1 - \Gamma(O_j, O_q^r) \right)        (4)

Finally, the function (4) is used to reduce the cases where two objects have the same value in (1). The function (4) represents the dissimilarity of the object Oj with the representative objects of the other clusters. It is quite reasonable that the representative object for the cluster Ci be defined as the object Or which yields the maximum of the function r_{C_i}(O_j):

r_{C_i}(O_r) = \max_{O_p \in C_i} \{ r_{C_i}(O_p) \}        (5)

Therefore, the most representative object is the one that is, on average, the most similar to the other objects in the cluster, with this average being the greatest and having the least variance. In addition, the representative object is the most dissimilar compared with the representative objects in the other clusters. If the cluster centers are given, the functional J(U) is maximized when u_{ik} is determined as:

u_{ik} = \begin{cases} 1 & \text{if } \Gamma(O_i^r, O_k) = \max_{1 \le q \le c} \{ \Gamma(O_q^r, O_k) \} \\ 0 & \text{otherwise} \end{cases}        (6)
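The following Python sketch renders the representativeness computation of equations (1)-(5) as reconstructed above; it is an illustration only, not the authors' code. Here sim stands for any similarity function Γ with values in [0,1], a cluster is a list of objects (with at least two members), and other_reps are the current representatives of the remaining clusters.

def beta(obj, cluster, sim):
    """Eq. (2): mean similarity of obj with the other objects of its cluster."""
    others = [o for o in cluster if o is not obj]
    return sum(sim(obj, o) for o in others) / len(others)

def alpha(obj, cluster, sim):
    """Eq. (3): mean absolute deviation between that mean and the individual similarities."""
    others = [o for o in cluster if o is not obj]
    b = beta(obj, cluster, sim)
    return sum(abs(b - sim(obj, o)) for o in others) / len(others)

def eta(obj, other_reps, sim):
    """Eq. (4): dissimilarity of obj to the representatives of the other clusters."""
    return sum(1.0 - sim(obj, rep) for rep in other_reps)

def representativeness(obj, cluster, other_reps, sim):
    """Eq. (1)."""
    b = beta(obj, cluster, sim)
    return b / (alpha(obj, cluster, sim) + (1.0 - b)) + eta(obj, other_reps, sim)

def representative(cluster, other_reps, sim):
    """Eq. (5): the object maximizing the representativeness within the cluster."""
    return max(cluster, key=lambda o: representativeness(o, cluster, other_reps, sim))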
That is to say, an object Ok will be assigned to the cluster whose representative object is the most similar to Ok.

2.1 Algorithm

Step 1. Select c objects in the data as initial seeds. Fix the number of iterations ni', and set ni=0.
Step 2. Calculate the partition matrix U=U(ni) using (6) and the initial seeds selected in step 1.
Step 3. Determine the representative objects of the clusters for the matrix U(ni), using (1) and (5).
Step 4. Calculate the partition matrix U(ni+1) using (6) and the representative objects computed in step 3.
Step 5. If the set of representative objects is the same as in the previous iteration, stop. Otherwise increase ni=ni+1.
Step 6. If ni>ni' stop. Otherwise, go to step 3.
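A hedged Python sketch of the iteration in steps 1-6 is given below, reusing the representative function sketched after equations (1)-(5); the data layout and parameter names are assumptions made for illustration.

def extended_c_means(objects, seeds, sim, max_iter=100):
    """Steps 1-6: assign by eq. (6), update representatives by eqs. (1) and (5)."""
    reps = list(seeds)                                        # step 1: c initial seeds
    clusters = []
    for _ in range(max_iter):                                 # step 6: iteration limit ni'
        clusters = [[] for _ in reps]
        for obj in objects:                                   # steps 2 and 4: eq. (6) assignment
            i = max(range(len(reps)), key=lambda q: sim(reps[q], obj))
            clusters[i].append(obj)
        new_reps = []
        for i, cluster in enumerate(clusters):                # step 3: eqs. (1) and (5) update
            others = [r for q, r in enumerate(reps) if q != i]
            new_reps.append(representative(cluster, others, sim)
                            if len(cluster) > 1 else reps[i])
        if new_reps == reps:                                  # step 5: representatives unchanged
            break
        reps = new_reps
    return clusters, reps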
3 Experimental Results

Initially we perform a comparison between our extension and the classical C-means algorithm; for this purpose we consider the Iris data [2]. We apply the classical C-means algorithm using the Euclidean distance, and our extension using equation (7) as the comparison criterion between feature values and equation (8) as the similarity function between object descriptions. The results are shown in Table 1.
C_s(x_s(O_i), x_s(O_j)) = 1 - \left( 1 - \frac{1}{|x_s(O_i) - x_s(O_j)| + 1} \right)        (7)

s(O_i, O_j) = \frac{1}{|T|} \sum_{x_p \in T} C_p(x_p(O_i), x_p(O_j))        (8)
Table 1. Results using classical C-means and the proposed extension in 10 realized tests

                                                    % Effectiveness per cluster        % Effectiveness
  Type of initial seeds    Used algorithm           Cluster 1   Cluster 2   Cluster 3   in average
  Random                   Classic C-means            62%          8%         100%        56.7%
  Random                   Extension proposed         99%         75%          93%        89.16%
  Representative(1)        Classic C-means           100%         70%          96%        88.6%
  Representative           Extension proposed         99%         78%          92%        89.8%
Thus our extension obtains a better average classification percentage than the classical C-means algorithm; this percentage could be improved further by a better modeling of (7) and (8).
(1) An initial representative seed is an object selected from the cluster to be formed.
Finally, we test the algorithm by selecting three databases from [2]. For these databases the true arrangement is known, which makes it possible to evaluate the percentage of correct classification. For both the Iris and Wine databases the comparison criterion used to compare the feature values was (7) and the similarity function used to compare the objects was (8). In the case of the Mushroom database the comparison criterion (9) was used for all the features, and the similarity function was again (8).
C(x_s(O_i), x_s(O_j)) = \begin{cases} 1 & \text{if } x_s(O_i) = x_s(O_j) \\ 0 & \text{otherwise} \end{cases}        (9)
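Assuming the reconstructions of equations (7)-(9) above, the following Python sketch shows how the per-feature comparison criteria and the averaged similarity could be combined on a mixed description; missing-value handling is omitted and the feature names and data are invented for the example.

def c_numeric(a, b):                       # eq. (7), for quantitative features
    return 1.0 - (1.0 - 1.0 / (abs(a - b) + 1.0))

def c_qualitative(a, b):                   # eq. (9), for qualitative features
    return 1.0 if a == b else 0.0

def similarity(obj_i, obj_j, criteria):    # eq. (8): average over the feature set T
    return sum(crit(obj_i[f], obj_j[f]) for f, crit in criteria.items()) / len(criteria)

# Example with one quantitative and one qualitative feature (hypothetical data).
criteria = {"age": c_numeric, "rock_type": c_qualitative}
print(similarity({"age": 30, "rock_type": "basalt"},
                 {"age": 33, "rock_type": "basalt"}, criteria))   # prints 0.625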
Table 2 presents the results obtained by applying the extended C-means algorithm using random initial seeds.

Table 2. Results testing the extended C-means algorithm

  Database    # Objects   # Quantitative   # Qualitative   # Clusters   # Tests   Missing    % Effectiveness
                          features         features                               values
  Iris            150           4                0              3          10          0        89.48%
  Wine            178           8                5              3          10          0        86.3%
  Mushroom       8124          22                0              2          10       2480        89.2%
Table 3 shows the main differences between the classical C-means algorithm and our proposal.

Table 3. Differences between the classical C-means algorithm and the proposed extension

  Classical C-means                                        | Extension proposed to C-means
  It is metric                                             | It is not metric
  It is based on the Euclidean distance                    | It uses comparison criteria and a similarity function
  It works only with quantitative numerical descriptions   | It works with mixed descriptions (quantitative and qualitative features)
  It does not consider missing data                        | It considers missing data
  It does not consider comparing subdescriptions           | It considers comparing subdescriptions based on a support set
4 Conclusions

In this work, an extension of the C-means algorithm for the use of similarity functions in the crisp case is proposed to solve RUC problems. The algorithm considers descriptions of the objects with mixed data, i.e. quantitative and qualitative features. In addition, missing data is supported by the algorithm. These characteristics make the algorithm potentially useful in many problems of Data Mining and Knowledge Discovery. In comparison with the classical C-means algorithm, our proposal achieves on average a better classification than classical C-means on the Iris data. Besides, it allows the analysis of objects described with qualitative and quantitative features and missing data. Therefore, the proposed algorithm can be applied in fields such as Medicine, Marketing, Geology, and Sociology, in general in the so-called soft sciences, where the specialists face this type of description.
The use of comparison criteria per feature and their integration in a similarity function allows a problem to be modeled more precisely. In this way the expert's knowledge in soft sciences can be put into computer systems to solve data analysis and classification problems. If the similarity function is a linear combination of partial similarity functions and support sets are used, then the function allows a better discrimination between the clusters, because the comparison between the objects is carried out considering subsets of features, emphasizing the similarities and differences between the objects and allowing the algorithm to determine a better solution.
5 Future Work

The C-means algorithm is an iterative algorithm which bases its operation on the initial seeds, so as future work we will propose a method for selecting candidate initial seeds. We are also developing an optimal algorithm that can be applied to solve problems with very large volumes of data. Finally, an extension of our algorithm to the fuzzy case will be proposed in the future.

Acknowledgement. This work was partially financed by Dirección de Estudios de Posgrado e Investigación IPN and the CONACyT Project REDII99 Mexico.
References

1. Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. USA, John Wiley & Sons, Inc., 1973.
2. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/
3. H. Ralambondrainy. "A conceptual version of the K-means algorithm". In: Pattern Recognition Letters 16, p. 1147-1157.
4. José Ruiz Shulcloper, et al. Introducción al reconocimiento de patrones (Enfoque Lógico Combinatorio). Serie Verde No. 51, México, Depto. de Ingeniería Eléctrica, Sec. Computación CINVESTAV-IPN, 1995.
5. Ruspini, E. R. "A new approach to clustering". In: Information and Control, No. 15, 1969, p. 22-32.
6. Robert J. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. USA, John Wiley & Sons, Inc., 1992.
7. G. Ball and D. Hall. A clustering technique for summarizing multivariate data. Behav. Sci., vol. 12, pp. 153-155, 1967.
Predicting Chemical Carcinogenesis Using Structural Information Only*

Claire J. Kennedy1, Christophe Giraud-Carrier1, and Douglas W. Bristol2

1 Department of Computer Science, Merchant Venturers Building, University of Bristol, Bristol BS8 1UB, U.K. {kennedy,cgc}@cs.bris.ac.uk
2 National Institute of Environmental Health Sciences, Box 12233, RTP, NC 27709 [email protected]

* This work is funded by EPSRC grant GR/L21884
Abstract. This paper reports on the application of the Strongly Typed Evolutionary Programming System (STEPS) to the PTE2 challenge, which consists of predicting the carcinogenic activity of chemical compounds from their molecular structure and the outcomes of a number of laboratory analyses. Most contestants so far have relied heavily on the results of short term toxicity (STT) assays. Using both types of information made available, most models incorporate attributes that make them strongly dependent on STT results. Although such models may prove to be accurate and informative, the use of toxicological information requires time and cost and, in some cases, substantial utilisation of laboratory animals. If toxicological information only makes explicit properties that are implicit in the molecular structure of chemicals, then, provided a sufficiently expressive representation language is available, accurate solutions may be obtained from the structural information only. Such solutions may offer more tangible insight into the mechanistic paths and features that govern chemical toxicity, as well as prediction based on virtual chemistry for the universe of compounds.
1 Introduction
This paper reports on the application of the Strongly Typed Evolutionary Programming System (STEPS) [4] to the IJCAI Predictive Toxicology Evaluation (PTE) challenge [8]. A second round of the challenge (PTE2) consists of predicting the outcome of 30 chemical bioassays for carcinogenesis being conducted by the National Institute of Environmental Health Sciences in the USA. The data provided includes both structural and non-structural information. The non-structural information consists of the outcomes of a number of laboratory analyses (e.g., Ashby alerts, Ames test results). The structural information is simply a graphical representation of the molecules in terms of atoms and bond connectives. Most contestants so far have relied heavily on the results of short term toxicity (STT) assays. It appears that, for some learning tasks and systems, the addition of this type of information improves the predictive performance of the
induced theories [10]. On the other hand, for other tasks and systems, the opposite seems to be true, i.e., propositional information has a negative effect on generalisation [3]. We strongly argue that, for the PTE2 task, if toxicological information only makes explicit properties that are implicit in the molecular structure of chemicals, the non-structural information is actually superfluous. In addition, obtaining such properties often requires time and cost and in some cases substantial utilisation of laboratory animals [1,6]. Provided a sufficiently expressive representation language, good solutions may be obtained from the structural information only. Hence, prediction can potentially be made faster and more economically. Experiments reported here with STEPS support this claim. The inherent structure of the graphical representation of molecules is captured naturally by the individuals-as-terms representation of STEPS. The rules obtained with structural information only are better in terms of accuracy than those obtained using both structural and non-structural information. In addition, our approach is more likely to provide insight into the mechanistic paths and features that govern chemical toxicity, since the solutions produced are readily interpretable as chemical structures. STEPS ranks joint 2nd of 10 in the current league table of the PTE2 challenge. The paper is organised as follows. Section 2 describes the PTE2 challenge. Section 3 reports the results of applying STEPS to PTE2. Finally, section 4 concludes the paper.
2 The PTE2 Challenge
The National Institute of Environmental Health Sciences (NIEHS) in the USA provides access to a large database on the carcinogenicity (or non-carcinogenicity) of chemical compounds through the National Toxicology Program (NTP). The information has been obtained by carrying out long term bioassays that have classified over 300 substances to date. The Predictive Toxicology Evaluation (PTE) challenge was organised by the NTP to gain insight into the features that govern chemical carcinogenicity [2]. The first International Joint Conference on Artificial Intelligence (IJCAI) PTE challenge involved the prediction of 39 chemical compounds that were, at the time, undergoing testing by the NTP. The training set consisted of the remaining compounds in the NTP database. The participants consisted of both experts in the area of chemical toxicology and machine learning systems. Symbolic machine learning, and in particular Inductive Logic Programming, has been applied with great success to bio-molecular problems in the past [12,5]. Symbolic machine learning techniques are particularly suitable for problems of this type since it is not only the prediction that is interesting, but also the induced theory which provides an explanation for the predictions. The learning system Progol, for example, was entered into the PTE challenge and obtained results that were competitive with those obtained by the expert chemists [9].
Following on the success of the first challenge, a second round of the PTE challenge (PTE2) [7] was presented to the AI community at IJCAI in 1997 [11]. The PTE2 challenge involves the prediction of 30 new bioassays for carcinogenesis being conducted by the NTP. The training set consists of the remaining 337 bioassays in the NTP database. At the time of writing the results for 7 of the chemical compounds in the test set are still unknown. Ten machine learning entries have been made so far in reaction to the PTE2 challenge, and their performance has been calculated on the 23 chemical compounds whose results are known [8]. In addition to predictive accuracy, entries have been evaluated according to whether or not they exhibit explanatory power, where the explanatory power of a theory exists “... if some or all of it can be represented diagrammatically as chemical structures.” [11]. The PTE challenges provide the machine learning/data mining communities with an independent forum in which intelligent data analysis programs and expert chemists can work together on a difficult scientific knowledge discovery problem.
3 Experiments
This section reports on experiments using STEPS on the PTE2 dataset. STEPS is a strongly-typed evolutionary system, which evolves program trees using constructs from the Escher programming language (see [4] for details).

3.1 Data
In order to tackle the PTE2 problem using STEPS, the original Prolog representation [8], consisting of the 337 training cases, was translated into the Escher closed term representation. Each chemical molecule is represented by a highly structured term consisting of properties of the molecule and the atoms and bonds that form the structure of the molecule. The properties of the molecule resulting from laboratory analyses consist of the Ames test result (i.e., whether or not the compound is mutagenic; mutagenicity is an indication of carcinogenicity), two sets of genetic toxicology test results, one for positive and one for negative results, and a set of Ashby alerts and their counts (properties of the molecule that are likely to indicate carcinogenicity, discovered by a toxicology expert). The atom and bond structure that makes up the molecule is represented as a graph, i.e., a set of atoms and a set of bonds connecting pairs of atoms. An atom consists of a label (which is used to reference that atom in a bond description), an element, one of 233 types represented by integers, and a partial charge. A bond is a tuple consisting of a pair of labels for the atoms that are connected by the bond and the type of the bond (e.g., single, double).
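As an informal illustration of this closed-term representation, the structure of one molecule could be laid out as nested Python data roughly as follows; the field names and all values are invented for the example and do not come from the NTP data.

molecule = {
    "ames_mutagenic": True,                     # Ames test outcome
    "genotox_positive": {"salmonella"},         # genetic toxicology tests with positive results
    "genotox_negative": {"micronucleus"},       # genetic toxicology tests with negative results
    "ashby_alerts": {"halide": 2},              # Ashby alert -> count
    "atoms": {                                  # label -> (element, type code 1..233, partial charge)
        "a1": ("C", 22, -0.12),
        "a2": ("O", 40, -0.38),
    },
    "bonds": {                                  # (atom label, atom label) -> bond type
        ("a1", "a2"): 2,                        # e.g. a double bond
    },
}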
3.2 Method
The aim of PTE2 is to generate a concept description that can distinguish between Active, or carcinogenic, compounds and InActive, or non-carcinogenic, compounds. The concepts induced here are restricted to the following form: IF Cond THEN C1 ELSE C2. Since either Active or Inactive can be used for C1 (leading to potentially different induced theories) and the experiments are intended to compare learning from structural-only information with learning from all available information, there are four settings to compare. As STEPS is a stochastic algorithm, the experiments are repeated ten times for each particular setting. The best performing theories, as measured on the training data, are output at the end of a run. The theory with the highest accuracy on the test set is then chosen as the best theory for that particular run. The best theory from the set of ten experiments is selected as the theory for a particular setting of data and description format. The method of fitness evaluation used here is the Stepwise Adaptation of Weights (SAW) method [13]. The SAW fitness function essentially implements a weighted predictive accuracy measure, which is based on the perceived difficulty of the examples to be classified. During evolution, only training examples are used. The SAW fitness function rewards an individual for the correct classification of a difficult example by associating a weight with each example. An example is considered difficult if the current best theory of the generation cannot classify it correctly, in which case its associated weight is incremented by an amount delta weight. The weights are adjusted every weight gen generations. The fitness of a particular individual therefore becomes a weighted sum of the number of training examples that it can correctly classify. The parameters for STEPS and SAW, for all experiments, are as follows: delta weight = 0.1, weight gen = 5, population size = 100, maximum no. generations = 150, minimum depth = 3, maximum depth = 20, and selection = tournament.
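A minimal Python sketch of the SAW mechanism just described (an illustration under the stated parameters, not the STEPS implementation; classify and the example layout are assumptions):

def saw_fitness(theory, examples, weights, classify):
    """Weighted count of training examples the theory classifies correctly."""
    return sum(w for (x, y), w in zip(examples, weights) if classify(theory, x) == y)

def saw_update(best_theory, examples, weights, classify, delta_weight=0.1):
    """Applied every weight_gen generations: raise the weight of every example
    the current best theory misclassifies."""
    return [w + delta_weight if classify(best_theory, x) != y else w
            for (x, y), w in zip(examples, weights)]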
3.3 Results
The following table gives the results for the ten runs for each of the four configurations. The Best Accuracy is the accuracy of the best performing theory out of all ten runs for a particular configuration.

  Configuration (C1-Info)   Best Accuracy
  Active-All                    65%
  Inactive-All                  74%
  Active-Struc                  70%
  Inactive-Struc                78%
on the test data and was obtained with the Inactive-Struc Configuration. This definition, which is currently undergoing more thorough analysis, is joint second in the PTE2 league table (see [8]). It is given here in Escher program format in Figure 1 and in its English equivalent in Figure 2. carcinogenic(v1) = if ((((card (filter (\v3 -> ((proj2 v3) == O)) (proj5 v1))) < 5) && ((card (filter (\v5 -> ((proj2 v5) == 7)) (proj6 v1))) > 19)) || exists \v4 -> ((elem v4 (proj6 v1)) && ((proj2 v4) == 3))) || (exists \v2 -> ((elem v2 (proj5 v1)) && ((((((proj3 v2) == 42) || ((proj3 v2) == 8)) || ((proj2 v2) == I)) || ((proj2 v2) == F)) || ((((proj4 v2) within (-0.812,-0.248)) && ((proj4 v2) > -0.316)) || (((proj3 v2) == 51) || (((proj3 v2) == 93) && ((proj4 v2) < -0.316)))))) && ((card (filter (\v5 -> ((proj2 v5) == 7))(proj6 v1))) < 15)) then Inactive else Active; Fig. 1. The best definition produced by STEPS as an Escher program
4 Conclusion
This paper reports on the application of STEPS to the PTE2 challenge. The rules obtained by STEPS using structural information only are comparable in terms of accuracy to those obtained using both structural and non-structural information by all PTE2 participants. In addition, this approach may produce insights into the underlying chemistry of carcinogenicity, one of the principal aims of the PTE2 challenge. Furthermore, as the theory produced by STEPS relies only on structural information, carcinogenic activity for a new chemical can be predicted without the need to obtain the non-structural information from laboratory bioassays. Hence, results may be expected in a more economical and timely fashion, while also reducing reliance on the use of laboratory animals.
A molecule is Inactive if
  it contains less than 5 oxygen atoms and has more than 19 aromatic bonds, or
  if it contains a triple bond, or
  if it contains an atom that
    is of type 42 or 8 or 51, or
    is an iodine or a fluorine atom, or
    has a partial charge between -0.812 and -0.316, or
    is of type 93 with a partial charge less than -0.316,
  and contains less than 15 aromatic bonds.
Otherwise the molecule is active.

Fig. 2. The best definition produced by STEPS in English

References

1. D.R. Bahler and D.W. Bristol. The induction of rules for predicting chemical carcinogenesis in rodents. In Intelligent Systems for Molecular Biology, pages 29-37. AAAI/MIT Press, 1993.
2. D.W. Bristol, J.T. Wachsman, and A. Greenwell. The NIEHS predictive toxicology evaluation project. Environmental Health Perspectives, pages 1001-1010, 1996. Supplement 3.
3. P. Flach and N. Lachiche. 1BC: a first-order bayesian classifier. In Proceedings of the Ninth International Conference on Inductive Logic Programming (ILP'99). LNCS, Springer, 1999.
4. C.J. Kennedy and C. Giraud-Carrier. An evolutionary approach to concept learning with structured data. In Proceedings of the Fourth International Conference on Artificial Neural Networks and Genetic Algorithms. Springer Verlag, 1999.
5. R. King, S. Muggleton, A. Srinivasan, and M. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity in inductive logic programming. Proceedings of the National Academy of Sciences, 93:438-442, 1996.
6. Y. Lee, B.G. Buchanan, and H.R. Rosenkranz. Carcinogenicity predictions for a group of 30 chemicals undergoing rodent cancer bioassays based on rules derived from subchronic organ toxicities. Environ Health Perspect, 104(Suppl 5):1059-1064, 1996.
7. http://dir.niehs.nih.gov/dirlecm/pte2.htm
8. http://www.comlab.ox.ac.uk/oucl/groups/machlearn/PTE/
9. A. Srinivasan and R. King. Carcinogenesis predictions using ILP. In Proceedings of the Seventh Inductive Logic Programming Workshop. LNAI, Springer Verlag, 1997.
10. A. Srinivasan, R. King, and S. Muggleton. The role of background knowledge: Using a problem from chemistry to examine the performance of an ILP program. In Intelligent Data Analysis in Medicine and Pharmacology. Kluwer Academic Press, 1996.
11. A. Srinivasan, R.D. King, S.H. Muggleton, and M. Sternberg. The predictive toxicology evaluation challenge. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97). Morgan-Kaufmann, 1997.
12. A. Srinivasan, S. Muggleton, R. King, and M. Sternberg. Mutagenesis: ILP experiments in a non-determinate biological domain. In Proceedings of the Fourth Inductive Logic Programming Workshop. Gesellschaft für Mathematik und Datenverarbeitung MBH, 1994.
13. J.I. van Hemert and A.E. Eiben. Comparison of the SAW-ing evolutionary algorithm and the grouping genetic algorithm for graph coloring. Technical Report TR-97-14, Leiden University, 1997.
LA - A Clustering Algorithm with an Automated Selection of Attributes, which is Invariant to Functional Transformations of Coordinates

Mikhail V. Kiselev1, Sergei M. Ananyan2, and Sergey B. Arseniev1

1 Megaputer Intelligence Ltd., 38 B.Tatarskaya, Moscow 113184, Russia {M.Kiselev, S.Arseniev}@megaputer.com http://www.megaputer.com
2 IUCF, Indiana University, 2401 Sampson Lane, Bloomington, IN 47405, USA [email protected]
Abstract. A clustering algorithm called LA is described. The algorithm is based on comparison of the n-dimensional density of the data points in various regions of the space of attributes, p(x1,...,xn), with an expected homogeneous density obtained as a simple product of the corresponding one-dimensional densities pi(xi). The regions with a high value of the ratio

p(x_1, \ldots, x_n) / (p_1(x_1) \cdots p_n(x_n))

are considered to contain clusters. A set of attributes which provides the most contrast clustering is selected automatically. The results obtained with the help of the LA algorithm are invariant to any reparametrization of the clustering space coordinates, i.e. to one-dimensional monotone functional transformations x' = f(x). Another valuable property of the algorithm is the weak dependence of the computational time on the number of data points.
1 Introduction

Clustering is one of the typical problems solved by data mining methods [5]. It is the process of grouping cases or database records into subsets such that the degree of similarity between cases in one group is significantly higher than between members of different groups. The exact definition of the similarity between cases, as well as other details, varies between clustering methods. The most often used algorithms can be roughly grouped as follows.
1. Joining methods. In these methods smaller clusters are successively merged into larger clusters.
2. K-means methods [2]. These methods find an a priori specified number of clusters such that the variation of attribute values inside clusters is significantly less than the variation between clusters. In order to increase the clustering significance (to decrease the respective p-value), data points are exchanged between clusters.
3. Seeding algorithms [10]. In these methods a certain number of initially selected data points serve as the seeds for growing clusters.
4. Density-based algorithms. The space of attribute values is broken into a set of regions. The regions which have a significantly higher point density are considered as containing clusters of data points.
5. Algorithms based on neural networks [1, 4, 8, 11].
Yet, despite the variety of the approaches and methods, practical data mining problems require a further improvement of clustering algorithms. In our opinion, many modern clustering algorithms have the following weak sides:
1. High computational complexity. The computational time of many clustering algorithms depends on the number of records at least as O(N^2) (see parallel realizations of clustering algorithms in [9]).
2. Insufficient performance with multi-dimensional data. In databases where every record contains a large number of numerical, boolean and categorical fields, the right choice of attributes for the clustering procedure often determines the quality of the result obtained. An automated selection of the several attributes most crucial for clustering out of, say, hundreds of fields present in the database would be a very desirable feature for clustering algorithms implemented in a data mining system. Yet only a few existing algorithms offer such a possibility.
3. Sensitivity to functional transformations of attributes. Suppose we would like to find clusters in a database describing the customers of some retailer. Every customer is described by her or his age and monthly income. These variables are measured in different units. Since many clustering algorithms use a Euclidean metric, which in our case can be written as

dist(R_1, R_2) = \sqrt{A\,(age_1 - age_2)^2 + (income_1 - income_2)^2},

a different choice of the constant A would give us a different set of clusters. Besides, it is evident that clustering performed in terms of (age, log(income)) instead of (age, income) leads in general to completely different results.
4. Lack of effective significance control. The clustering procedures implemented in many existing data mining systems and statistical packages find clusters even in data consisting of artificially generated random numbers with a uniform distribution. It would be highly desirable that clusters found by these systems express objective and statistically significant properties of the data, not simply statistical fluctuations [3].
In the present paper we describe a clustering algorithm called LA (the abbreviation stands for Localization of Anomalies; point density anomalies are implied), which is free of the drawbacks listed above.
2 Automated Clustering of Database Records Including Multiple Fields

Prior to discussing our algorithm we say a few words about our understanding of the term "cluster". In many approaches a set of clusters found by the corresponding algorithm should be considered as a property of the concrete dataset which was explored. An individual cluster is characterized completely by the set of datapoints that belong to it. We consider a cluster as a region in the space of attribute values which has a significantly higher concentration of datapoints than other regions. Thus, it is described mainly by boundaries of this region and it is assumed that other sufficiently representative datasets from the universum of data belonging to the same
application domain will also have a higher density of points in this region. Therefore the discovered set of clusters may not include all the records in the database. Besides that, the problem of the determination of the statistical significance of clustering becomes very important. In our approach each cluster is represented as a union of multi-dimensional rectangular regions described by a set of inequalities x < a or x ≥ a for numerical fields x and by a set of equalities c = A for categorical fields c. Our algorithm is applied to a database DB which can be logically represented as a rectangular table with N rows and M columns. This set of attributes (columns) will be denoted as A. We consider databases with numerical fields only; the extension of the method to categorical variables is quite evident. Thus, database DB can be represented as a finite set of points in the M-dimensional space ℜM. Coordinates in ℜM will be denoted as xi, i = 1, ..., M. The LA algorithm consists of two logical components. The purpose of the first component is the selection of the best combination of attributes xi, which gives the most significant and most contrasting clustering. The second component finds clusters in the space of a fixed set of attributes xi. We begin our consideration with the second part. Suppose that we fix m attributes out of the M attributes present in the database DB. Our approach is based on breaking the space of attribute values ℜm into a certain set of regions {Ei} and comparing the density of points in each region Ei. Namely, we cut ℜm by hyperplanes xi = const and take the rectangular regions formed by these hyperplanes as the Ei. We call such a set of regions the grid {Ei}. The hyperplanes forming the grid may be chosen by various methods; however, it is important that the data points be distributed among the cells Ei as evenly as possible. Consider one cell Ei. Let n be the number of data points in this cell. The cell Ei can be considered as a direct product of segments of the attribute axes: Ei = S1 × … × Sm. Let us denote the number of points with the value of the j-th attribute falling into the segment Sj as Mj. If the points do not form clusters in the space of the attributes xi, which are considered as independent, then the relative density of points in Ei is approximately equal to the product of the one-dimensional relative densities of points in the segments Sj:
n/N ≈ pi = M1 ⋯ Mm / N^m     (1)

A significantly higher value of n/N would mean that Ei should be considered as (a part of) a cluster. In the case of m = 1 the approximate equality (1) is trivially exact. Thus the minimum dimension of the clustering space m is 2. To find clusters consisting of rectangular regions with anomalous point density we use the following procedure. For each cell Ei with the number of points greater than N·pi = M1 ⋯ Mm / N^(m−1) we calculate the probability that the high density of points in this cell is a result of a statistical fluctuation. Namely, we determine for each cell Ei the value of

si = b(n, N, M1 ⋯ Mm / N^m) = b(n, N, pi),

where b(k, K, p) is the tail area probability of the binomial distribution with the number of trials K and the event probability p. A list of all Ei ordered by ascending values of si is created. Denote the ordered sequence of the cells as {E′j}. For each cell E′j we know the number of points lying in the cell, nj, and the value of pj. For each j we calculate the value

sCUMj = b(n1 + … + nj, N, p1 + … + pj).

Let us denote the value of j for which sCUMj is minimal as jBEST; this minimum value of sCUMj will be denoted as sBEST. This value corresponds to the most contrasting, most significant division of all cells Ei into "dense" and "sparse" ones. Let us consider the cells E′j with j ≤ jBEST. In this set of cells we search for subsets of cells Ck such that
all of them satisfy the following conditions: 1) either the subset Ck contains only one cell or for each cell E belonging to the subset Ck there exists another cell in Ck which has a common vertex or border with cell E; 2) if two cells belong to different subsets they have no common vertices or borders. We call these subsets clusters. Thus, for each subset a of attributes, a ⊂ A, |a| = m, satisfying the condition (1) we can determine a set of clusters C(a), the clustering significance sBEST(a), and the total number of points in all clusters K(a). Now let us discuss the procedure which selects the best combination of attributes for clustering. The purpose of this procedure is to find a subset of attributes which has the maximum value of some criterion. In most cases it is natural to choose 1 − sBEST as such a criterion. Other possible variants are the number of points in clusters or the number of clusters. It is often required that the clustering procedure should elicit at least two clusters and also that 1 − sBEST should be greater than a certain threshold confidence level. It is obvious that in order to satisfy the first requirement each coordinate should be divided into at least three sections. Depending on the actual conditions of the data exploration carried out (possible time limitation) various modifications of the procedure can be utilized. We consider two extreme cases.
a. Full search. All combinations of m attributes (1 < m ≤ ½·log₃ N) are tried. The best combination is selected.
b. Linear incremental search.
Step 1. All combinations of two attributes are tried. The best pair is included in the list of selected attributes SEL. The respective value of the criterion will be denoted as R(SEL).
Step 2. If |SEL| > ½·log₃ N or SEL includes all attributes, the process stops and SEL is the result.
Step 3. All combinations of attributes consisting of all the attributes from SEL plus one attribute not included in SEL are tried. Let the best combination be
SEL′ = SEL ∪ {a}. If R(SEL′) ≤ R(SEL), the process stops and SEL is selected as the final set of attributes.
Step 4. Set SEL = SEL′ and go to Step 2.
An abundance of intermediate variants of this procedure can be constructed.
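As an illustration of the procedure just described, the following Python fragment (our own sketch, not the PolyAnalyst implementation) computes the dense cells and sBEST for a fixed attribute subset. It assumes SciPy for the binomial tail probability b(k, K, p), builds the grid from quantiles so that the marginal counts are as even as possible, and omits the final merging of adjacent dense cells into clusters; the function name la_dense_cells and all parameter choices are ours.

    import numpy as np
    from scipy.stats import binom

    def la_dense_cells(X, bins=3):
        # X: an N x m array holding the values of the chosen attribute subset.
        N, m = X.shape
        # grid boundaries: interior quantiles of each attribute (bins cells per axis)
        edges = [np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1]) for j in range(m)]
        idx = np.column_stack([np.searchsorted(edges[j], X[:, j]) for j in range(m)])
        cells = {}
        for row in map(tuple, idx):
            cells[row] = cells.get(row, 0) + 1
        # marginal counts M_j for each segment of each attribute
        marg = [np.bincount(idx[:, j], minlength=bins) for j in range(m)]
        scored = []
        for cell, n in cells.items():
            p = np.prod([marg[j][cell[j]] / N for j in range(m)])  # pi of equation (1)
            if n > N * p:                                          # denser than expected
                s = binom.sf(n - 1, N, p)                          # tail probability b(n, N, pi)
                scored.append((s, cell, n, p))
        scored.sort()                                              # ascending si
        # cumulative criterion sCUMj; its minimum marks the dense/sparse split
        best_j, s_best, n_cum, p_cum = -1, 1.0, 0, 0.0
        for j, (s, cell, n, p) in enumerate(scored):
            n_cum, p_cum = n_cum + n, p_cum + p
            s_cum = binom.sf(n_cum - 1, N, min(p_cum, 1.0))
            if s_cum < s_best:
                best_j, s_best = j, s_cum
        dense = [cell for _, cell, _, _ in scored[:best_j + 1]]
        return dense, s_best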
3 Properties of the LA Algorithm
It can be easily proven that the LA algorithm has the following properties:
1. If we replace a numerical attribute x by f(x), where f is a monotone function, and use f(x) instead of x, this will not change the clustering results. The algorithm will detect the same number of clusters and the same sets of records will enter the same clusters.
2. The computational time depends on the number of records N only weakly. The measurements show that the most time consuming operation is the sorting of the values of attributes when the grid {Ei} is constructed. This operation requires O(mN log N) time. The exact computational time of the LA algorithm depends on the version of the procedure used for selecting the best attributes. One can see that for the fast linear search the computational time is O(M³N log N).
3. The LA algorithm works best in the case of a great number of records. The fewer records are explored, the less fine a cluster structure can be recognized. In the worst case, when a cluster of size approximately equal to one cell is intersected by a hyperplane, it may not be detected by the algorithm.
4. The LA algorithm is noise tolerant. Indeed, the algorithm is based not on the distances or other characteristics of single points but on the properties of substantial subsets of data. Thus an addition of a relatively small subpopulation of points with different statistical properties (“noise”) cannot substantially influence the results obtained by the algorithm.
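Since the grid in the sketch given at the end of the previous section depends only on the order of the attribute values, property 1 can be checked directly on synthetic data (again an illustration under the assumptions of that sketch; the data and the choice of transformation are ours).

    import numpy as np

    rng = np.random.default_rng(0)
    # two blobs plus uniform background noise in two attributes
    X = np.vstack([rng.normal([2, 2], 0.2, (300, 2)),
                   rng.normal([6, 6], 0.2, (300, 2)),
                   rng.uniform(0, 8, (200, 2))])

    dense_a, s_a = la_dense_cells(X, bins=4)
    # monotone transformation of both attributes: exp() preserves the order of values
    dense_b, s_b = la_dense_cells(np.exp(X / 4.0), bins=4)
    assert set(dense_a) == set(dense_b)   # the same grid cells are flagged as dense
    print(s_a, s_b)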
4 Conclusion
We have described a new algorithm for finding clusters in data called LA. At present the LA algorithm is implemented as a data exploration engine in the PolyAnalyst data mining system [6, 7]. Our algorithm can automatically select an optimal subset of the database fields for clustering. The algorithm is invariant under monotone functional transformations of numerical attributes and its computational time depends only weakly on the number of records in the database. The algorithm is based on the comparison of the multi-dimensional density of the data points in various regions of the space of attributes with an expected homogeneous density obtained as a simple product of the corresponding one-dimensional densities. As a part of the PolyAnalyst system it has been used in practice in the fields of database marketing and sociological studies.
References
1. Carpenter, G. and Grossberg, S. A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine, Computer Vision, Graphics, and Image Processing, 37:54-115, 1987.
2. Cheng, Y. Mean shift, mode seeking, and clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 17:790-799, 1995.
3. Dave, R.N., Krishnapuram, R. Robust clustering methods: a unified view, IEEE Trans. on Fuzzy Systems, 5:270-293, 1997.
4. Hecht-Nielsen, R. Neurocomputing, Reading, MA: Addison-Wesley, 1990.
5. Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
6. Kiselev, M.V. PolyAnalyst 2.0: Combination of Statistical Data Preprocessing and Symbolic KDD Technique, In: Proceedings of ECML-95 Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, Heraklion, Greece, pp. 187-192, 1995.
7. Kiselev, M.V., Ananyan, S.M., and Arseniev, S.B. Regression-Based Classification Methods and Their Comparison with Decision Tree Algorithms, In: Proceedings of 1st European Symposium on Principles of Data Mining and Knowledge Discovery, Trondheim, Norway, Springer, pp. 134-144, 1997.
8. Kohonen, T. Self-Organizing Maps, Berlin: Springer-Verlag, 1995.
9. McKinley, P.K. and Jain, A.K. Large-Scale Parallel Data Clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20:871-876, 1998.
10. Milligan, G.W. An estimation of the effect of six types of error perturbation on fifteen clustering algorithms, Psychometrika, vol. 45, pp. 325-342, 1980.
11. Williamson, J.R. Gaussian ARTMAP: A Neural Network for Fast Incremental Learning of Noisy Multidimensional Maps. Technical Report CAS/CNS-95-003, Boston University, Center of Adaptive Systems and Department of Cognitive and Neural Systems, 1995.
Association Rule Selection in a Data Mining Environment
Mika Klemettinen 1, Heikki Mannila 2, and A. Inkeri Verkamo 1
1 University of Helsinki, Department of Computer Science, P.O. Box 26, FIN–00014 University of Helsinki, Finland, {mklemett,verkamo}@cs.helsinki.fi
2 Microsoft Research, One Microsoft Way, Redmond, WA 98052-6399, USA, [email protected]
Abstract. Data mining methods easily produce large collections of rules, so that the usability of the methods is hampered by the sheer size of the rule set. One way of limiting the size of the result set is to provide the user with tools to help in finding the truly interesting rules. We use this approach in a case study where we search for association rules in NCHS health care data, and select interesting subsets of the result by using a simple query language implemented in the KESO data mining system. Our results emphasize the importance of the explorative approach supported by efficient selection tools.
1
Introduction
Association rules were introduced in [1] and soon after that efficient algorithms were developed for the task of finding such rules [2]. The strength of the association rule framework is its capability to search for all rules that have at least a given frequency and confidence. This property of the association rule discovery algorithms is, somewhat paradoxically, also their main weakness. Namely, the association rule algorithms can easily produce such large sets of rules that it is highly questionable whether the user can find anything useful from them. There are at least two ways of coping with this problem. The first is to use some formal measures of rule interestingness directly in the rule search algorithms so that the output would be smaller and contain only in some sense interesting rules [4]. It is, however, difficult to know which of the discovered rules really interest the user. This motivates the second approach [3]: provide the user with good tools for selecting interesting rules during the postprocessing phase. This paper examines the applicability of the postprocessing approach and shows its strength in a case study with publicly available NCHS health care data [5]. We demonstrate that strict constraint-based discovery would not be as useful in exploring a new data set in a formerly unknown domain. We also give some real-life examples to support our claims by finding interesting rules using a simple template-like query language implemented in the KESO data mining system [7]. Our results emphasize the importance of an explorative approach and the need for novel efficient database platforms to support the discovery process.
Table 1. Attributes used in the experiments (D = discretized, R = regrouped); each attribute was assigned to one or more of the themes Work, Family, and Health. Attribute (nr of values): sex (2), age [D] (8), race [R] (3), marital status [R] (4), education [R] (7), family income [D] (9), poverty (2), family size (16), parent [R] (2), major activity [D] (4), health (5), body mass index [D] (8), employment status [R] (3), time of residence [R] (5), region (4).

2
Finding Interesting Patterns
An association rule [1] of the form X ⇒ Y states that in the rows of the database where the (binary) attributes in X have value true, also the (binary) attributes in Y have value true with high probability. While originally introduced for binary attributes, association rules can easily be generalized for attributes with larger domains. A straightforward generalization is to replace binary attributes with pairs (A, v) where A is a (multivalued) attribute and v is a value, an interval, or some other expression defining a set of values in the domain of A. Typically, a data mining task can be seen as an iterative process where the user first wants to get a big picture of the entire set of rules, and later focuses on various views on the result set, pruning out uninteresting or redundant results, and concentrating on one subset of the results at a time. To support this kind of scenario, we propose a KDD process consisting of two central phases: 1. In the pattern discovery phase, use loose criteria to find all potentially interesting patterns, comprising all attributes that may turn out interesting, and using low threshold values for rule confidence and frequency. 2. In the presentation phase, provide flexible methods for iteratively and interactively creating different views of the discovered patterns. What is interesting often depends on the situation at hand, and also on the user’s personal aims and perspective; see discussion about the subject and several criteria for interestingness in, e.g., [4, 6, 8]. Therefore it is essential to provide the user with proper tools to filter (prune, order, cluster, etc.) the rule collection.
3
The Test Environment
In our experiments we used a prototype data mining environment developed in the ESPRIT project KESO (Knowledge Extraction for Statistical Offices) [7], and a publicly available data set of the National Center for Health Statistics [5]. After preprocessing, our data set consisted of 109194 rows of data with 56 attributes, most of them nominal or discretized. To make ourselves acquainted with the data set, we chose three subject themes, “Work”, “Family”, and “Health”,
Table 2. Frequent sets and association rules with themes “Work” (a), “Family” (b), and “Health” (c). Labeling: # = level number, acc. = accepted, init. = initial, r/c = rejected on confidence, and r/p = rejected on predecessor.

(a) “Work”
#    Sets acc.   Rules init.   acc.   r/c    r/p
1    30          0             0      0      0
2    236         472           86     367    19
3    611         1833          322    1087   424
4    647         2588          303    1126   1159
5    336         1680          114    559    1007
6    90          540           23     154    363
7    11          77            0      20     57
8    0           0             0      0      0
Σ    1961        7190          848    3313   3029

(b) “Family”
#    Sets acc.   Rules init.   acc.   r/c    r/p
1    33          0             0      0      0
2    288         576           36     517    23
3    483         1449          190    1043   216
4    275         1100          165    611    324
5    44          220           36     83     101
6    0           0             0      0      0
Σ    1123        3345          427    2254   664

(c) “Health”
#    Sets acc.   Rules init.   acc.   r/c    r/p
1    28          0             0      0      0
2    224         448           60     353    35
3    578         1734          213    1159   362
4    582         2328          267    1241   820
5    309         1545          134    695    716
6    84          504           19     187    298
7    8           56            1      16     39
8    0           0             0      0      0
Σ    1813        6615          694    3651   2270
Table 3. Selection process for theme “Health”.

#    Selection criteria                                                              Rules
0    (none)                                                                          694
1    rhs (sex=“male”)                                                                77
2    rhs (sex=“male”) && conf > 0.65                                                 21
3    rhs (sex=“male”) && conf > 0.90                                                 0
4    rhs (sex=“male”) && lhssize <= 2                                                31
5    rhs (sex=“female”)                                                              66
6    rhs (sex=“female”) && conf > 0.65                                               41
7    rhs (sex=“female”) && conf > 0.90                                               0
8    rhs (sex=“female”) && lhssize <= 2                                              37
9    rhs (poverty=“poor”)                                                            0
10   lhs (poverty=“poor”)                                                            13
11   rhs (poverty=“not poor”)                                                        186
12   rhs (poverty=“not poor”) && freq > 0.10                                         73
13   ! (lhs (poverty=“not poor”)) && ! (rhs (poverty=“not poor”)) && freq > 0.10     47
14   rhs (poverty=“not poor”) && freq > 0.30                                         6
15   rhs (poverty=“not poor”) && conf > 0.90                                         89
16   rhs (poverty=“not poor”) && conf > 0.90 && lhssize <= 3                         35
17   lhs (health=“fair”) || lhs (health=“poor”)                                      16
18   (lhs (health=“fair”) || lhs (health=“poor”)) && conf > 0.80                     4
and selected a set of potentially interesting attributes for each theme (see Table 1). We then generated all association rules for these attributes using fairly loose criteria (rule confidence threshold 50%, frequency threshold 1000 rows, or 0.9%).
The experiments with the KESO system were performed using a Sun UltraSPARC Enterprise 450 server with SunOS 5.6 and 512 MB of main memory. An overall view of the result sets is presented in Table 2. Various selections were then performed on the result sets to find interesting subsets of the rule collection. Some examples of our selection criteria are presented in Table 3.
Table 4. Grammar for the selection language.

start -> Expression
Expression -> Expression LOGICALOPERATOR Expression
Expression -> '(' Expression ')'
Expression -> NEGATION '(' Expression ')'
Expression -> Term
Term -> ConfidencePart
Term -> FrequencyPart
Term -> LhsPart
Term -> RhsPart
Term -> LhsSize
Term -> RhsSize
ConfidencePart -> CONF OPERATOR FLOAT
FrequencyPart -> FREQ OPERATOR FLOAT
LhsPart -> LHS '(' AttributeList ')'
RhsPart -> RHS '(' AttributeList ')'
LhsSize -> LHSSIZE OPERATOR INTEGER
RhsSize -> RHSSIZE OPERATOR INTEGER
AttributeList -> Attribute
AttributeList -> AttributeList ',' Attribute
Attribute -> ATTRIBUTE ASSIGNOPERATOR VALUE

LOGICALOPERATOR -> '&&'
LOGICALOPERATOR -> '||'
NEGATION -> '!'
OPERATOR -> '=='
OPERATOR -> '!='
OPERATOR -> '>='
OPERATOR -> '<='
OPERATOR -> '>'
OPERATOR -> '<'
ASSIGNOPERATOR -> '='
CONF -> 'conf'
FREQ -> 'freq'
LHS -> 'lhs'
RHS -> 'rhs'
LHSSIZE -> 'lhssize'
RHSSIZE -> 'rhssize'

4
Selection Criteria for Interesting Rules
The grammar of our language for association rule selection is presented in Table 4. Rules can be selected based on rule confidence, rule frequency, the sizes of the left-hand side and the right-hand side, and the attributes on each side of the rule. Templates are pattern expressions that describe the form of rules that are to be selected or rejected [3]. With templates, the user can explicitly specify both what is interesting and what is not, by using selective or unselective templates. In the present implementation of KESO, only simple templates are included where the constraints are equality conditions A = v, where A is a (multivalued) attribute and v is a value in the domain of A. As an example, in our experiments with the “Family” subgroup, we found a large group of uninteresting rules having the consequent race=white; to prune out all such rules and to further select only strong rules (confidence exceeding 90 per cent), we used the selection expression
! (rhs(race=white)) && conf > 0.90
Confidence and Frequency. Rules having a very high value of confidence or frequency often turn out to be uninteresting, e.g., because they are trivial. On the other hand, the thresholds in the discovery phase should not be too high, if we are interested in small subgroups with strong rules, or subgroups where all rules are fairly weak. In our experiments with the “Health” subgroup, we found 77 rules with the consequent sex=male (see line 1 of Table 3); we then further refined the selection using tighter confidence requirements (see lines 2 and 3 of Table 3). Similarly, for rules with the consequent poverty=not poor, we ran a series of refinements with increasing frequency requirements to find subgroups that are not insignificantly small (see lines 11, 12, 14 of Table 3).
Sizes of the Left-hand Side and the Right-hand Side. Of two equally strong rules with the same right-hand side, the shorter one is usually preferable. On the other hand, short rules are often weak, whereas long rules give more exact descriptions of the data. Selection using the size of the rule allows the user to focus, e.g., on long rules or short rules. We used this to select short but strong rules (see line 16 of Table 3). Similarly, the pair rhs(sex=male) && lhssize <= 2 and rhs(sex=female) && lhssize <= 2 selected, amongst the rules with the given consequent, only rules that have at most two attributes on the left-hand side (see lines 4 and 8 of Table 3).
Attributes on the Left-hand Side and on the Right-hand Side. Selecting rules according to the attributes occurring in the rule allows the user to be more detailed in defining patterns of interesting rules. We used this kind of selection heavily in our experiments. For example, we searched for rules involving the health characteristics of various age groups (e.g., all age groups over 65, all age groups under 18), for rules on a high level of education (college graduate or post-college education), and for rules on a high level of family income. As an example, the selection expression
lhs(health=fair) || lhs(health=poor)
selects only rules where the health status of the person belongs to the two lowest categories (line 17 in Table 3). Of these rules, we further chose those with a fairly strong confidence (see line 18 in Table 3).
Built-In Pruning. In addition to the quality measures and template-based selection strategy described above, there are some built-in pruning mechanisms in the KESO system. These give a preference to simple rules over complex ones, unless the longer rule is significantly stronger than its subrules, or predecessors. The effect of the built-in pruning on the size of the result set is obvious when we look at the figures in Table 2: in all cases, the fraction of rules rejected on predecessor is large, and it increases as the length of the rule increases. However, the choice of using such built-in pruning should be left to the user, as well as defining what should be considered “significantly stronger” in this context.
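To illustrate how little machinery such postprocessing needs, the following Python sketch (our own illustration, not the KESO selection language implementation) represents a rule by its two sides and its quality measures, and evaluates a selection corresponding to rhs(sex=male) && conf > 0.65 && lhssize <= 2; the example rules and their numbers are invented.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Rule:
        lhs: frozenset        # set of (attribute, value) pairs
        rhs: frozenset
        conf: float
        freq: float

    def rhs_has(rule, attr, value):
        return (attr, value) in rule.rhs

    def lhs_has(rule, attr, value):
        return (attr, value) in rule.lhs

    rules = [
        Rule(frozenset({("health", "good"), ("age", "25-34")}),
             frozenset({("sex", "male")}), conf=0.71, freq=0.04),
        Rule(frozenset({("poverty", "not poor")}),
             frozenset({("sex", "male")}), conf=0.52, freq=0.31),
    ]

    # Selection corresponding to: rhs (sex="male") && conf > 0.65 && lhssize <= 2
    selected = [r for r in rules
                if rhs_has(r, "sex", "male") and r.conf > 0.65 and len(r.lhs) <= 2]
    print(len(selected))   # -> 1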
5
Discussion and Improvements
The experiments with the NCHS data and the KESO system confirmed our earlier experience concerning the importance of the whole knowledge discovery process: data mining is not just the simple-minded application of some algorithm. The iterative and interactive nature of the knowledge discovery process is quite clear, e.g., from Table 3. The result of one question affects the following ones, by suggesting some refinements or an alternative “path” to follow. In many cases, constraining the mining process already in the discovery phase would result in a smaller set of rules, but without supporting an explorative approach. Meaningful thresholds may vary even within different data sets of the same domain, and the background knowledge gives only hints for tentative threshold values, which must then be refined by iteration.
The selection language of the KESO system provides a limited choice of selections. Our experience with the simple selection criteria suggests the following improvements to the selection language:
Generalization of the Attribute Expressions. If the domain of the attribute is large, it is not feasible for the user to point out each selected (or rejected) value. Instead, the user might want to describe the selected set of values using inequalities or comparisons (e.g., Ai ≤ vi), or ranges of values (e.g., Ai ∈ [v1, v2]). Using this kind of expression requires an ordered domain of values. Likewise, if the number of attributes is large, with each attribute having the same domain, the user might need a simple way of describing a set of attributes having a given value, e.g. [Ai, Aj] = v, where the interpretation of the expression is Ai = v, Ai+1 = v, . . . , Aj = v and there is an ordering of the attributes Ai. The user may also wish to express constraints that involve several attributes at the same time, such as comparisons of two or more attributes (e.g., Ai ≤ Aj).
Attribute Hierarchies. In a complex data set, the attributes often form a hierarchy that can be defined as a class structure. In the more general form of a template, attributes may be replaced by classes of attributes [3]. Hence, in the expression A1, . . . , Ak ⇒ B1, . . . , Bl, each Ai or Bj could also be a class name, or even an expression C+ or C∗, where C is a class name. Here C+ and C∗ correspond to one or more, resp. zero or more instances of the class C. A rule X ⇒ Y now matches the template if it can be considered an instance of the pattern.
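The two proposed extensions can be prototyped on top of the same rule representation. The sketch below reuses the Rule class from the previous sketch; the attribute ordering, the class dictionary and all names are invented for illustration only.

    # Value ranges over an ordered domain, and attribute classes (hierarchies)
    # that may replace single attributes in a template.

    AGE_ORDER = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65-74", ">=75"]

    def in_range(value, low, high, order=AGE_ORDER):
        # Ai ∈ [v1, v2] over an ordered domain
        return order.index(low) <= order.index(value) <= order.index(high)

    ATTRIBUTE_CLASSES = {
        "SocioEconomic": {"family income", "poverty", "education"},
    }

    def matches_class(condition_attr, class_name):
        # an attribute matches a class name if it is one of the class instances
        return condition_attr in ATTRIBUTE_CLASSES.get(class_name, set())

    # e.g. select rules whose left-hand side mentions some SocioEconomic attribute
    def lhs_mentions_class(rule, class_name):
        return any(matches_class(attr, class_name) for attr, _ in rule.lhs)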
6
Conclusion
We have shown in this paper that relatively simple tools for rule postprocessing make it possible to cope with large sets of rules produced by association rule algorithms. The postprocessing approach has the advantage that the user need not specify the criteria for interestingness in advance. The task of finding interesting rules from large rule sets is analogous to several information retrieval problems. In both cases the problem is to make it easy for the user to find the objects (rules, resp. documents) that are truly interesting. In both areas the user also has difficulties in directly expressing what the interestingness criteria actually are. An interesting area for future work is the use of techniques from IR such as relevance feedback, to obtain improved methods for finding interesting rules.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD’93, pages 207–216, May 1993. ACM.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.
3. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. In CIKM’94, pages 401–407, November 1994. ACM.
4. W. Kloesgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249–271. AAAI Press, 1996.
5. NCHS National Health Interview Survey (NHIS) Data. National Center for Health Statistics (NCHS). http://www.cdc.gov/nchswww/about/major/nhis/nhis.htm.
6. G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, pages 229–248. AAAI Press, 1991.
7. A. Siebes. Data mining and the KESO project. In Theory and Practice of Informatics (SOFSEM’96), LNCS vol. 1175, pages 161–177. Springer-Verlag, 1996.
8. A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970–974, December 1996.
Multi-relational Decision Tree Induction
Arno J. Knobbe 1,2, Arno Siebes 2, Daniël van der Wallen 1
1 Syllogic B.V., Hoefseweg 1, 3821 AE, Amersfoort, The Netherlands, {a.knobbe, d.van.der.wallen}@syllogic.com
2 CWI, P.O. Box 94079, 1090 GB, Amsterdam, The Netherlands, [email protected]
Abstract. Discovering decision trees is an important set of techniques in KDD, both because of their simple interpretation and the efficiency of their discovery. One disadvantage is that they do not take the structure of the data into account. By going from the standard single-relation approach to the multi-relational approach as in ILP this disadvantage is removed. However, the straightforward generalisation loses the efficiency. In this paper we present a framework that allows for efficient discovery of multi-relational decision trees through exploitation of domain knowledge encoded in the data model of the database.
1
Introduction
The induction of decision trees has been getting a lot of attention in the field of Knowledge Discovery in Databases over the past few years. This popularity has been largely due to the efficiency with which decision trees can be induced from large datasets, as well as to the elegant and intuitive representation of the knowledge that is discovered. However, traditional decision tree approaches have one major drawback. Because of their propositional nature, they cannot be employed to analyse relational databases containing multiple tables. Such databases can be used to describe objects with some internal structure, which may differ from one object to another. For example, when analysing chemical compounds, we would like to make statements about the occurrence of particular subgroups with certain features. The means to describe groups of such objects in terms of the occurrence of a certain substructure are simply not available in propositional (attribute-value) decision trees. In this paper, we present an alternative approach that does provide the means to induce decision trees from structural information. We call such decision trees multi-relational decision trees, in line with a previously proposed multi-relational data mining framework [4, 5]. In order to be able to induce decision trees from a large relational database efficiently, we need a framework with the following characteristics:
1. Both attribute-value and structural information are included in the analysis.
2. The search space is drastically pruned by using the integrity constraints that are available in the data model. This means that we are considering only the structural information that is intended by the design of the database, and we are not wasting time on potentially large numbers of conceptually invalid patterns.
3. The concepts of negation and complementary sets of objects are representable.
Decision trees recursively divide the data set into complementary sets of objects. It is necessary that both the positive split and its complement can effectively be represented.
4. Efficiency is achieved by a collection of data mining primitives that can be used to summarise both attribute-value and structural information.
5. The framework can be implemented by a dedicated client/server architecture. This requires that a clear separation between search process and data processing can be made. This enables the data processing part (usually the main computational bottleneck) to be implemented on a scalable server.
Only two of these requirements are met by existing algorithms for inducing first order logical decision trees, as described in [1, 2]. Specifically, items 1 and 3 are addressed by this approach, but little attention has been given to efficient implementations. The issues raised in item 3 are partially solved by representing the whole decision tree as a decision list in Prolog that depends heavily on the order of clauses and the use of cuts. By doing so, the problem of representing individual patterns associated with internal nodes or leaves of the tree is circumvented. Decision trees typically divide some set of objects into two complementary subsets, in a recursive manner [8]. In propositional trees, this division is made by applying a simple propositional condition, and its complement, to the current set of objects. In an attribute-value environment complementary patterns are produced by simply negating the additional condition. In a multi-relational environment, producing such a complement is less trivial. Our previous work on a multi-relational data mining framework, described in [4, 5], does cover four of the five characteristics that were mentioned, but does not address the problem of handling negation and complementary sets of objects (item 3). An extended framework that does allow the induction of multi-relational decision trees will be considered in more detail in this paper. We introduce a graphical language of selection graphs with associated refinement operators, which provides the necessary representational power. Selection graphs are graphical representations of the subsets of objects that are associated with individual nodes and leaves of the tree.
2
Multi-relational Data Mining
We will assume that the data to be analysed is stored in a relational database. A relational database consists of a set of tables and a set of associations (i.e. constraints) between pairs of tables describing how records in one table relate to records in another table. An association between two tables describes the relationships between records in both tables. The nature of this relationship is characterised by the multiplicity of the association. The multiplicity of an association determines whether several records in one table relate to single or multiple records in the second table. Also, the multiplicity determines whether every record in one table needs to have at least one corresponding record in the second table. Example 1 The figure below shows an example of a data model that describes parents, children and toys, as well as how each of these relate to each other. The data model shows that parents may have zero or more children, children may have zero or more toys, and parents may have bought zero or more toys. Note that toys owned by a
particular child may not necessarily have been bought by their parents. They can be presents from other parents. Also note that children have one parent (for simplicity). Even though the data model consists of multiple tables, there is still only a single kind of object that is central to the analysis. You can choose the kind of objects you want to analyse, by selecting one of the tables as the target table. Each record in the target table, which we will refer to as t0, will now correspond to a single object in the database. Any information pertaining to the object that is stored in other tables can be looked up by following the associations in the data model. If the data mining algorithm requires a particular feature of an object to be used as a dependent attribute for classification or regression, we can define a particular target attribute within the target table. The idea of mining from multiple tables is not a new one. It is being studied extensively in the field of Inductive Logic Programming (ILP) [3]. Conceptually, there are many parallels between ILP and multi-relational data mining. However, ILP approaches are mostly based on data stored as Prolog programs, and little attention is given to data stored in relational databases and to how knowledge of the data model can help to guide the search process [7, 10]. Nor has a lot of attention been given to efficiency and scalability issues. Multi-relational data mining differs from ILP in three aspects. Firstly, it is restricted to the discovery of non-recursive patterns. Secondly, the semantic information in the database is exploited explicitly. Thirdly, the emphasis on database primitives ensures efficiency. In our search for knowledge in relational databases we want to consider not only attribute-value descriptions, as is common in traditional algorithms, but also the structural information which is available through the associations between tables. We will refer to descriptions of certain features of multi-relational objects as multi-relational patterns. We can look at multi-relational patterns as small pieces of substructure which we wish to encounter in the structure of the objects we are considering. As was explained in [4, 5], we view multi-relational data mining as the search for interesting multi-relational patterns. The multi-relational data mining framework allows many different top-down search algorithms, each of which is a multi-relational generalisation of a well-known attribute-value search algorithm. Each of these top-down approaches shares the idea of a refinement operator. Whenever a promising pattern is discovered, a list of refinements will be examined. When we speak about refinement of a multi-relational pattern, we are referring to an extension of the actual description of the pattern, which results in a new selection of objects which is a subset of the selection associated with the original multi-relational pattern. Recursively applying such refinement operators to promising patterns results in a top-down algorithm which zooms in on interesting subsets of the database. Taking into account the above discussion of multi-relational data mining and top-down approaches, we can formulate the following requirements for a multi-relational pattern language. In the following section we will define a language which satisfies these requirements. Descriptions of multi-relational patterns should
• reflect the structure of the relational model. This allows for easier understanding of, and enforcing referential constraints on, the pattern.
• be intuitive, especially where complementary expressions are considered.
• support atomic, local refinements.
• allow refinements which are complementary. If there is a refinement to a multi-relational pattern which produces a certain subset, there should also be a complementary refinement which produces the complementary subset.

3
Selection Graphs
In order to describe the constraints related to a multi-relational pattern, we introduce the concept of selection graphs:
Definition (selection graph). A selection graph G is a directed graph (N, E), where N is a set of triples (t, C, s): t is a table in the data model, C is a, possibly empty, set of conditions on attributes in t of type t.a operator c (the operator is one of the usual selection operators, =, >, etc.), and s is a flag with possible values open and closed. E is a set of tuples (p, q, a, e) called selection edges, where p and q are selection nodes and a is an association between p.t and q.t in the data model; e is a flag with possible values present and absent. The selection graph contains at least one node n0 that corresponds to the target table t0.
Selection graphs can be represented graphically as labelled directed graphs. The value of s is indicated by the absence or presence of a cross in the node, representing the values open and closed respectively. The value of e is indicated by the absence or presence of a cross on the arrow, representing the values present and absent respectively. A present edge combined with a list of conditions selects groups of records for which there is at least one record that respects the list of conditions. An absent edge combined with a list of conditions selects only those groups for which there is not a single record that respects the list of conditions. The selection associated with any subgraph is the combined result of all such individual constraints within the subgraph on groups of records. This means that any subgraph that is pointed to by an absent edge should be considered as a joint set of negative conditions. The flag s associated with nodes of the selection graph has no effect on the selection. Rather, it is used to indicate whether a particular node in a selection graph is a candidate for refinement. Selection graphs can easily be translated to SQL [5].
Example 2. The following selection graph and derived SQL statement select those parents that have at least one child with a toy, but for whom none of such children are male.
[Figure: a selection graph with target node Parent, a present branch Parent → Child → Toy, and an absent branch from Parent to a Child node with condition Gender = ’M’ and its Toy node.]
select distinct T0.Name
from Parent T0, Child T1, Toy T2
where T0.Name = T1.ParentName
  and T1.Name = T2.OwnerName
  and T0.Name not in
      (select T3.ParentName
       from Child T3, Toy T4
       where T3.Name = T4.OwnerName
         and T3.Gender = ’M’)
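As an illustration only (not the authors' implementation), the definition above can be captured in a few lines of Python; the class and field names are ours, and the table, attribute and association names are taken from the example data model.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        table: str
        conditions: List[str] = field(default_factory=list)  # e.g. "Gender = 'M'"
        open: bool = True            # open nodes are candidates for refinement

    @dataclass
    class Edge:
        parent: Node
        child: Node
        association: str             # association in the data model
        present: bool = True         # False encodes an "absent" (negated) edge

    @dataclass
    class SelectionGraph:
        target: Node
        nodes: List[Node] = field(default_factory=list)
        edges: List[Edge] = field(default_factory=list)

    # The example graph: parents with at least one child that owns a toy,
    # but with no male child that owns a toy.
    parent = Node("Parent")
    child1, toy1 = Node("Child"), Node("Toy")
    child2 = Node("Child", ["Gender = 'M'"], open=False)
    toy2 = Node("Toy", open=False)
    g = SelectionGraph(parent, [parent, child1, toy1, child2, toy2],
                       [Edge(parent, child1, "Parent-Child"),
                        Edge(child1, toy1, "Child-Toy"),
                        Edge(parent, child2, "Parent-Child", present=False),
                        Edge(child2, toy2, "Child-Toy")])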
Refinements. As was described earlier, the selection graphs will be used to represent sets of objects belonging to nodes or leaves in a decision tree. Whenever a new split is introduced in the decision tree, we are in fact refining the current selection graph in two ways. We will be using the following refinement operators of a selection graph G as potential splits in the multi-relational decision tree. The refinements are introduced in pairs of complementary operations (a small sketch of these operators follows after this list):
• add positive condition. This refinement will simply add a condition to a selection node in G without actually changing the structure of G.
• add negative condition. In case the node which is refined does not represent the target table, this refinement will introduce a new absent edge from the parent of the selection node in question. The condition list of the selection node will be copied to the new closed node, and will be extended by the new condition. If the node which is refined does represent the target table, the condition is simply negated and added to the current list of conditions for this node.
• add present edge and open node. This refinement will instantiate an association in the data model as a present edge together with its corresponding table and add these to G.
• add absent edge and closed node. This refinement will instantiate an association in the data model as an absent edge together with its corresponding table and add these to G.
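On the data structure sketched above, the operator pairs could be written roughly as follows (again our own illustration with invented helper names; the copy semantics for the negative condition follow the description in the text):

    import copy

    def _copy_with(graph, node):
        # deep-copy the graph and return the copy together with the copied node
        g = copy.deepcopy(graph)
        i = next(i for i, n in enumerate(graph.nodes) if n is node)
        return g, g.nodes[i]

    def add_positive_condition(graph, node, condition):
        g, n = _copy_with(graph, node)
        n.conditions.append(condition)
        return g

    def add_negative_condition(graph, node, condition):
        g, n = _copy_with(graph, node)
        if n is g.target:                      # target table: negate in place
            n.conditions.append("NOT (" + condition + ")")
            return g
        parent_edge = next(e for e in g.edges if e.child is n)
        closed = Node(n.table, list(n.conditions) + [condition], open=False)
        g.nodes.append(closed)
        g.edges.append(Edge(parent_edge.parent, closed,
                            parent_edge.association, present=False))
        return g

    def add_edge(graph, node, association, table, present=True):
        # "add present edge and open node" / "add absent edge and closed node"
        g, n = _copy_with(graph, node)
        new = Node(table, open=present)
        g.nodes.append(new)
        g.edges.append(Edge(n, new, association, present=present))
        return g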
4
Multi-relational Decision Trees
The induction of decision trees in first order logic has been studied by several researchers [1, 2, 6, 9]. Each of these approaches shares a common Divide and Conquer strategy, but they produce different flavours of decision trees. For example, [6] discusses the induction of regression trees, whereas [1] discusses the induction of decision trees for classification. In [2] an overview is given of potential uses of decision trees within a single framework. However, these papers have largely focused on induction parameters such as the choice of splitting criterion or stopping criterion. None of these papers provides a good solution for the representation of patterns associated with the leaves and internal nodes of the decision tree.

build_tree(T : tree, D : database, P : pattern)
    R := optimal_refinement(P, D)
    if stopping_criterion(R)
        T := leaf(P)
    else
        Pleft := R(P)
        Pright := Rcompl(P)
        build_tree(left, D, Pleft)
        build_tree(right, D, Pright)
        T := node(left, right, R)

In this section we give a generic algorithm for the top-down induction of multi-relational decision trees within the multi-relational data mining framework. It illustrates the use of selection graphs, and specifically the use of complementary selection graphs in the two
branches of a split. Top-down induction of decision trees is basically a Divide and Conquer algorithm. The algorithm starts with a single node at the root of the tree which represents the set of all objects in the relational database. By analysing all possible refinements of the empty selection graph, and examining their quality by applying some interestingness measure, we determine the optimal refinement. This optimal refinement, together with its complement, is used to create the patterns associated with the left and the right branch respectively. Based on the stopping criterion it may turn out that the optimal refinement and its complement do not give cause for further splitting; in that case a leaf node is introduced instead. Whenever the optimal refinement does provide a good split, a left and a right branch are introduced and the procedure is applied to each of these recursively. For an extensive demonstration of how this works in practice, we refer to [5].
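A generic driver for this scheme can be sketched in a few lines of Python (our own illustration of the pseudocode above; the refinement generator, the interestingness measure and the stopping threshold are left abstract and passed in as functions):

    def build_tree(pattern, db, refinements, quality, min_quality=0.0):
        # refinements(pattern, db) yields (refinement, complement) pairs of
        # selection graphs; quality scores a candidate split.
        candidates = list(refinements(pattern, db))
        if not candidates:
            return ("leaf", pattern)
        best, best_compl = max(candidates, key=lambda rc: quality(rc[0], rc[1], db))
        if quality(best, best_compl, db) <= min_quality:   # stopping criterion
            return ("leaf", pattern)
        left = build_tree(best, db, refinements, quality, min_quality)
        right = build_tree(best_compl, db, refinements, quality, min_quality)
        return ("node", best, left, right)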
5
Conclusion
In this paper we have presented a framework that allows the efficient discovery of multi-relational decision trees. The main advantage over the standard decision tree algorithms is the gain in expressiveness. The main advantage over the ILP approach towards decision trees is the gain in efficiency achieved by exploiting the domain knowledge present in the data model of the database. One of the main remaining challenges is to extend this framework such that the selection graphs may contain cycles. Such an extension would, e.g., allow refining into parents that have not bought toys for their own children.
References
1. Blockeel, H., De Raedt, L. Top-down induction of first order logical decision trees, Artificial Intelligence 101 (1-2):285-297, June 1998
2. Blockeel, H., De Raedt, L., Ramon, J. Top-down induction of clustering trees, In Proceedings of ICML’98, 55-63, 1998
3. Dzeroski, S. Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996
4. Knobbe, A.J., Blockeel, H., Siebes, A., Van der Wallen, D.M.G. Multi-Relational Data Mining, technical report, CWI, 1999, http://www.cwi.nl
5. Knobbe, A.J., Siebes, A., Van der Wallen, D.M.G. Multi-Relational Decision Tree Induction, technical report, CWI, 1999, http://www.cwi.nl
6. Kramer, S. Structural regression trees, In Proceedings of AAAI’96, 1996
7. Morik, K., Brockhausen, P. A Multistrategy Approach to Relational Knowledge Discovery in Databases, in Machine Learning 27(3), 287-312, Kluwer, 1997
8. Quinlan, R.J. C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993
9. Watanabe, L., Rendell, L. Learning structural decision trees from examples, In Proceedings of IJCAI’91, 770-776, 1991
10. Wrobel, S. An algorithm for multi-relational discovery of subgroups, In Proceedings of Principles of Data Mining and Knowledge Discovery (PKDD’97), 78-87, 1997
Learning of Simple Conceptual Graphs from Positive and Negative Examples
Sergei O. Kuznetsov
All-Russia Institute for Scientific and Technical Information, Moscow, Russia, [email protected]
Institut für Algebra, Technische Universität Dresden, Dresden, Germany, [email protected]
Abstract. A learning model is considered in terms of formal concept analysis (FCA). This model is generalized for objects represented by sets of graphs with partially ordered labels of vertices and edges (these graphs can be considered as simple conceptual graphs). An algorithm that computes all concepts and the linear (Hasse) diagram of the concept lattice in time linear with respect to the number of concepts is presented. The linear diagram gives the structure of the set of all concepts with respect to the partial order on them and provides a useful tool for browsing or discovery of implications (associations) in data.
1
Introduction
In this paper we propose an efficient algorithmic framework for learning simple conceptual graphs and constructing the diagrammatic representation of the space of these graphs, which may be used for solving knowledge discovery problems. We consider a model of learning from positive and negative examples in terms of formal concept analysis (FCA) [14], [5]. FCA has proved to be a helpful mathematical tool for various branches of knowledge processing, including conceptual clustering, browsing retrieval [1], and generation of association rules in data mining [11]. We show how this model can be extended to data more general than classical contexts. To this end, we give a definition of the closure operation based on an arbitrary semilattice. Classical binary contexts [14] are obtained when the semilattice is taken to be a Boolean lattice. As an example we consider a semilattice induced by a set of graphs with partially ordered labels of vertices. These graphs can be interpreted, for example, as molecular graphs or as conceptual graphs [13], [10] without negation and nestedness, i.e., as simple conceptual graphs. We present some results on the algorithmic complexity of generating the concept lattice arising from the semilattice on sets of these graphs. The problem of computing the number of all concepts is #P-complete; however, an algorithm that constructs the linear (Hasse) diagram of the concept lattice in time linear with respect to the number of concepts can be proposed. This result improves the quadratic worst-case time bounds for algorithms given in [7], [12], [6], and [1].
2
Learning from Examples in Formal Concept Analysis
In general terms, the model proposed in [3] is based on the common paradigm of machine learning: given positive and negative examples of a concept, construct a generalization of the positive examples that would not cover any negative example. First, we present a particular case of the model, which can easily be described in terms of FCA. The following definition recalls some well-known notions from [14], [5].
Definition 1. Let G be a set of objects, M be a set of attributes, and I be a relation defined on G × M: for g ∈ G, m ∈ M, gIm holds iff the object g has the attribute m; the triple K = (G, M, I) is called a context. If A ⊆ G, B ⊆ M are arbitrary subsets, then the Galois connections are given as follows:
A′ := {m ∈ M | gIm for all g ∈ A},   B′ := {g ∈ G | gIm for all m ∈ B}.
The pair (A, B), where A ⊆ G, B ⊆ M, A′ = B, and B′ = A is called a concept (of the context K) with extent A and intent B (in this case we have also A″ = A and B″ = B). The set of attributes B is implied by the set of attributes A, or the implication A → B holds, if all objects from G that have all attributes from the set A also have all attributes from the set B, i.e., A′ ⊆ B′. ♦
Now assume that W is a functional (goal) property of objects from a domain under study. For example, in pharmacological applications [3] W can be a biological activity of chemical compounds (like carcinogenicity or, to the contrary, some useful pharmacological activity like sedativity). Thus, W is opposed to the attributes from M, which correspond to structural properties of objects. For example, in pharmacological applications the structural attributes can correspond to particular subgraphs of the molecular graphs of chemical compounds. Input data for learning can be represented by the sets of positive, negative, and undefined examples. Positive examples are objects that are known to have the property W and negative examples are objects that are known not to have this property. Undefined examples are those that are neither known to have the property nor known not to have the property. The results of learning are supposed to be rules used for the classification of undefined examples (or forecast of the property W). In terms of formal concept analysis, this situation can be described by three contexts: a positive context K+ = (G+, M+, I+), a negative context K− = (G−, M−, I−), and an undefined one Kτ = (Gτ, Mτ, Iτ). Here G+, G−, and Gτ are sets of positive, negative, and undefined examples, respectively; M is a set of structural attributes; Ij ⊆ Gj × M, j ∈ {+, −, τ}, are relations that specify the structural attributes of positive, negative, and undefined examples. Now, a positive hypothesis from [2], [3] (called JSM-hypothesis there) can be defined in the following way.
Definition 2. Consider a positive context K+ = (G+, M+, I+) and a negative context K− = (G−, M−, I−). A pair (e+, i+) is a positive concept if it is a concept of the context K+. If the intent i+ of a positive concept (e+, i+) is not
contained in the intent of any negative concept (i.e., ∀g− ∈ G−, i+ ⊈ {g−}′) and |e+| ≥ 2, then the concept (e+, i+) is called a positive hypothesis with respect to the property W. Negative hypotheses are defined dually. Thus, a hypothesis is an implication with a fixed consequent and an antecedent equal to the intent of a positive concept. Note that if (e+, i+) is a positive hypothesis with respect to the property W, then i+ → {W} is an implication for the context K+− = (G+ ∪ G−, M ∪ {W}, I+ ∪ I− ∪ G+ × {W}). Hypotheses can be used for the classification of undefined examples from Gτ (i.e., for forecasting whether they have the property W or not). If an undefined example gτ ∈ Gτ has all attributes from the intent i+ of a positive hypothesis (e+, i+) (i.e., {gτ}′ ⊇ i+) and does not have all attributes from the intent of any negative hypothesis, then gτ is classified positively. Negative classifications are defined dually. If {gτ}′ does not contain the intent of any negative or positive hypothesis, or includes intents of hypotheses of different signs, then no classification is made.
Example 1. Consider the following sets of positive and negative examples: G+ = {X1, X2, X3, X4}, G− = {Y1, Y2, Y3, Y4}, and the undefined example gτ, where {X1}′ = {A, B, C}, {X2}′ = {A, B, D}, {X3}′ = {A, E, F}, {X4}′ = {A, C, G}; {Y1}′ = {A, F, G}, {Y2}′ = {A, D, F}, {Y3}′ = {B, E, F, G}, {Y4}′ = {B, D, F}; {gτ}′ = {A, B, D, E}. The pairs ({X1, X2}, {A, B}), ({X1, X4}, {A, C}) are positive hypotheses. The pair ({X1, X2, X3, X4}, {A}), which is a positive concept with extent larger than one, is not a positive hypothesis, since {A} ⊂ {Y1}′, {Y2}′. The negative hypotheses are ({Y3, Y4}, {B, F}), ({Y2, Y4}, {D, F}), ({Y1, Y3}, {F, G}). Since {A, B}, the intent of the first positive hypothesis, is contained in {A, B, D, E}, the intent of the undefined example, whereas no negative intent is, the undefined example gτ is classified positively. ♦
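For a context as small as the one in Example 1, the hypotheses of Definition 2 can be enumerated by brute force. The following Python sketch (our own illustration, exponential in the number of objects and meant only to reproduce the example) recomputes the positive and negative hypotheses and the classification of gτ.

    from itertools import combinations

    pos = {"X1": {"A", "B", "C"}, "X2": {"A", "B", "D"},
           "X3": {"A", "E", "F"}, "X4": {"A", "C", "G"}}
    neg = {"Y1": {"A", "F", "G"}, "Y2": {"A", "D", "F"},
           "Y3": {"B", "E", "F", "G"}, "Y4": {"B", "D", "F"}}

    def hypotheses(examples, counter_examples):
        # intents of concepts with extent size >= 2 that are contained in no
        # counter-example description (Definition 2)
        result = set()
        for k in range(2, len(examples) + 1):
            for subset in combinations(examples, k):
                intent = frozenset(set.intersection(*(examples[g] for g in subset)))
                extent = {g for g in examples if intent <= examples[g]}
                if len(extent) >= 2 and not any(intent <= d for d in counter_examples.values()):
                    result.add(intent)
        return result

    pos_hyps = hypotheses(pos, neg)   # {frozenset({'A','B'}), frozenset({'A','C'})}
    neg_hyps = hypotheses(neg, pos)   # {B,F}, {D,F}, {F,G}

    g_tau = {"A", "B", "D", "E"}
    positive = any(h <= g_tau for h in pos_hyps) and not any(h <= g_tau for h in neg_hyps)
    print(positive)                   # True: g_tau is classified positively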
3
Extension of the Learning Model
In this section we use a simple construction from [9], where a partial order on graphs with ordered labels of vertices and edges is completed to a semilattice (actually to a distributive lattice). Let Ωg be a set of graphs with partially ordered labels of vertices and edges. For molecular graphs this can correspond to some natural hierarchy of classes of chemical elements. Suppose that two graphs F = ⟨⟨VF, MF⟩, EF⟩ and H = ⟨⟨VH, MH⟩, EH⟩ from Ωg are given. Here VF, VH are sets of vertices, MF, MH are sets of vertex labels, and EF, EH are sets of edges, respectively, EF ⊆ VF × VF, EH ⊆ VH × VH. The labels mF and mH belong to a set of labels ordered with respect to some ordering ≤. We shall consider only vertex labeling. The case with labeled edges and vertices can be reduced to the case where only vertices are labeled.
Definition 3. F subsumes H, or H ≼ F, iff there exists a one-to-one mapping ϕ from the set VH into the set VF that maps each vertex vH ∈ VH with label mH to a vertex vF ∈ VF with label mF such that mH(vH) ≤ mF(vF). The mapping should not violate the incidence relation, i.e., if (a, b) ∈ EH, then (ϕ(a), ϕ(b)) ∈ EF. ♦
The graphs of the aforementioned form can be interpreted as conceptual graphs [13], and ≼ as the specialization relation [10]. The relation ≼ is a generalization of the “subgraph isomorphism” relation and coincides with it when labels are not ordered.
Definition 4. Let H = {H1, . . . , Hn} and H1, . . . , Hn ∈ Ωg. Then N(H) := {Hi | ∀Hj ∈ H, Hi ≼ Hj → Hj = Hi}. For i ≠ j, {Hi} ⊓ {Hj} := N({H : H ≼ Hi, H ≼ Hj}) (the set {Hi} ⊓ {Hj} consists of all graphs maximal by inclusion among those subsumed by both Hi and Hj). Let H = {H1, . . . , Hn} and F = {F1, . . . , Fm} be sets of graphs from Ωg. Then H ⊓ F = N(∪i,j {Hi} ⊓ {Fj}). ♦
Example 2. Let F1 and F2 be molecular chemical graphs given in Fig. 1.

[Fig. 1: the molecular graphs F1 and F2, with vertices labeled by the chemical elements C, N, and Cl.]
The vertex labels of the graphs are ordered as follows (x denotes “an arbitrary chemical element”): x ≤ C, x ≤ N, x ≤ Cl; the labels C, N, and Cl are incomparable. Then {F1} ⊓ {F2} = {H1, H2, H3}, where H1, H2, and H3 are given in Fig. 2.

[Fig. 2: the graphs H1, H2, and H3, with vertices labeled by C, N, Cl, and x.]
Here, the disconnected graph H1 contains more information about the cyclic structure, whereas H2 and H3 contain more information about the connection of the cycle with the vertex labeled by “Cl”. It is easily seen that the operation ⊓ induces a semilattice ΩgN with the set of generators Ωg (i.e., ⊓ is idempotent, commutative, and associative on ΩgN). Thus, the order relation ⊑ can be defined as usual: X ⊑ Y :⇔ X ⊓ Y = X. Now we define operations ′ and ″ analogous to those from Section 2.
Definition 5. Let H1, . . . , Hk ∈ Ωg, then
{H1, . . . , Hk}′ := {H1} ⊓ . . . ⊓ {Hk},
{H1, . . . , Hk}″ := {H ∈ Ωg : {H} ⊒ {H1} ⊓ . . . ⊓ {Hk}}.
It can be shown that ″ is a closure operation on ΩgN (i.e., it is extensive, idempotent, and monotone). For arbitrary X1, . . . , Xk ∈ Ωg the pair ({X1, . . ., Xk}′, {X1, . . ., Xk}″) is called a concept with extent {X1, . . ., Xk}″ and intent {X1, . . . , Xk}′. The closure operator ″ and the concept from Definition 1 can be obtained from Definition 5 when Ωg is a set of objects G, ΩgN is the set of attributes M, ⊓ is the set-theoretic intersection ∩ defined on subsets of attributes M, the terms H, H1, . . . , Hk stand for objects from G and {H}, {H1}, . . . , {Hk} stand for their intents. Note that the operation ⊓ and the corresponding ″ can be defined along the lines of Definitions 4–5 for arbitrary partial orders (and thus, data types), not only for those given in Definition 3.
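The construction of Definitions 4 and 5 needs nothing beyond the meet operation. As an illustration of the reduction to Definition 1, here is a small Python sketch (the function names are ours) for the special case where descriptions are attribute sets and ⊓ is set intersection; for graph descriptions one would only replace meet by the operation of Definition 4.

    def meet(d1, d2):
        # ⊓ for the special case of attribute-set descriptions: set intersection
        return d1 & d2

    def intent(objs, delta):
        # {H1,...,Hk}' : the meet of the descriptions of the chosen objects
        result = None
        for g in objs:
            result = delta[g] if result is None else meet(result, delta[g])
        return result

    def closure(objs, delta):
        # {H1,...,Hk}'' : all objects whose description subsumes that meet
        common = intent(objs, delta)
        return {g for g in delta if meet(delta[g], common) == common}

    delta = {"X1": {"A", "B", "C"}, "X2": {"A", "B", "D"}, "X3": {"A", "E", "F"}}
    print(intent({"X1", "X2"}, delta))    # {'A', 'B'}
    print(closure({"X1", "X2"}, delta))   # {'X1', 'X2'}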
4
Algorithms and Complexity
A crucial problem here is that of generation of all concepts of a given context. It is difficult not only to generate the set of all concepts, whose size can be exponential in the size of the source context, but also to calculate or even estimate its size, since the problem of computing all concepts is #P -complete [8]. Now we describe an algorithm similar to that from [4], which generates the set of all concepts. We transform it in an algorithm generating Hasse diagram in time linear with respect to the number of all concepts. Let the type of objects and the corresponding closure operator 00 be fixed (they can be either from Definition 1 or from Definition 5). Below, for definiteness sake, we shall speak in terms of Definition 1, however, all the formulations are the same for Ωg and the operations 0 and 00 from Definition 5. We assume that objects from G are numbered, and therefore, a set X ⊆ G can be represented by a correspondingly ordered tuple. The numbering of objects from G induces lexicographical ordering of sets from P(G), the powerset of G. Definition 6. A path is defined inductively as follows: (1) If g ∈ G and {g}00 = {g} ∪ Z, g 6∈Z, Z ⊆ G, then [(∅, {g})Z] is called a path and {g}00 is called the extent of the path or Ext[(∅, {g})Z]. We also say that [(∅, {g})Z] is an inference of {g}00 . The inference [(∅, {g})Z] of {g}00 is canonical (or the path [(∅, {g})Z] is canonical) iff the numbers of all objects from Z (in the sense of numbering of objects from G) are greater than the number of g.
Learning of Simple Conceptual Graphs
389
(2) If Y is a path, h ∈ G, (Ext(Y )∪{h})00 = Ext(Y )∪{h}∪Z, Z ∩Y = ∅, then [(Y, {h})Z] is a path. Ext[(Y, {h})Z] * ) (Ext(Y ) ∪ {h})00 = Ext(Y ) ∪ {h} ∪ Z. We say that [(Y, {h})Z] is an inference of (Ext(Y ) ∪ {h})00 . The inference [(Y, {h})Z] of (Ext(Y ) ∪ {h})00 is called canonical iff Y is a canonical path and the numbers of all objects from Z are greater than the number of h. ♦ The following procedure (we call it Close-by-One or CbO Algorithm) is based upon the depth-first strategy, though other strategies are possible as well. Y denotes the path to the current concept. Algorithm 1. Step 0. There is only one root vertex where all objects are unlabeled, Y := ∅. Step 1. The current vertex corresponds to the concept with the extent Y . The first unlabeled object from G, say Xi , is taken and labeled at Y , (Ext(Y ) ∪ {Xi })0 and (Y ∪ {Xi })00 = (Ext(Y ) ∪ {Xi } ∪ Z are computed. A new vertex that corresponds to (Ext(Y ) ∪ {Xi })00 is generated and connected to the vertex associated with Y . Step 2. If Z contains objects with numbers less than i (i.e., the path [(Y, Xi )Z] is not canonical), then we label all objects from G at the vertex (Ext(Y ) ∪ {Xi })00 (thus, the branch will not be extended). If Z does not contain objects with numbers less than i (i.e., [(Y, Xi )Z] is canonical), then we label all objects from (Ext(Y ) ∪ {Xi })00 ∪ {X1 , . . . , Xi−1 } at the vertex (Ext(Y ) ∪ {Xi })00 . Step 3. If all elements of G are labeled at (Ext(Y ) ∪ {Xi })00 , we go to Step 4. Otherwise, Y : = [(Y, Xi )Z], and we return to Step 1. Step 4. We backtrack the tree upwards to the nearest vertex with unlabeled elements of G. If such a vertex exists and corresponds, say, to the path Z, then Y : = Z and we go to Step 1. If such a vertex does not exist, then this means that all concepts have been generated and the algorithm halts. ♦ Example 3. Consider objects X1 , X2 , X3 , X4 from Example 1. In this case, Algorithm 1 constructs the tree with the following left-most branch: root – [(X1 )X2 ] – [([(X1 )X2 ]X3 )X4 ], which consists of two non-root vertices. Both these vertices are canonical. ♦ Theorem 1. The tree output by Algorithm 1 has O(|G||L|) vertices. The canonical vertices of this tree are in one-to-one correspondence with the concepts. The time complexity of Algorithm 1 is O((α + β|G|)|G||L|) and its space complexity is O((γ|G||L|)), where α is time needed to perform u operation and β is time needed to test v relation and γ is the space needed to store the largest object from ΩgN . When contexts and concepts are given by Definition 1, the time complexity is (|M ||G|2 |L|) and the space complexity is O(|M ||G||L|).♦ In general, the computation of u and v is NP-hard, but in some realistic cases it may be polynomial [10]. The lattice order agrees with that of the tree constructed by Algorithm 1 (called CbO tree), however the corresponding incidence relation (that defines edges of the Hasse diagram) does not agree with the incidence relation in the tree. To construct the Hasse diagram of the lattice, we need to connect each pair of vertices in the tree that correspond to adjacent vertices in the diagram. To this end, we run the following algorithm in the depth-first left-most order.
390
S.O. Kuznetsov
Algorithm 2 Step 0. We are in the root vertex of the CbO tree constructed by Algorithm 1. Y : = ∅, Fr(C): = ∅, and To(C): = ∅ for all concepts C. All vertices are unlabeled. Step 1. The current canonical vertex corresponds to the concept with the extent Y . For each element Xi of G we compute (Y ∪ {Xi })0 and (Y ∪ {Xi })00 . Among sets (Y ∪ {Xi })00 , i = 1, . . . , |G|, we select those minimal by inclusion. These are extents of concepts adjacent to (Y, Y 0 ) from below. We denote the set of these extents by M (Y ). Fr(Y 0 ): = {h(Y, Y 0 ), (Z, Z 0 )i|Z ∈ M (Y )}. Thus, the set of arcs in the Hasse diagram leading from the vertex (Y, Y 0 ) to its children is constructed. We label the vertex Y and number the elements of M (Y ) using the numbering of G). Step 2. For every extent E ∈ M (Y ) we take the corresponding concept (E, E 0 ) and find its canonical inference. This is equivalent to finding the corresponding canonical path in the tree generated by Algorithm 1. We update the set of arcs leading to (E, E 0 ) by letting To(E 0 ): =To(E) ∪ {h(E, E 0 ), (Y, Y 0 )i}. Step 3. If there are unlabeled canonical vertices corresponding to extents in M (Y ), we pass from Y to the first of them (with respect to the numbering on elements of M (Y ) given in Step 2), say Z, Y : = Z and return to Step 1. If there are no such vertices and Y is not the root of the tree, we backtrack to the parent of Y in the tree, denote it by R(Y ), Y : = R(Y ) and return to Step 3. If Y is the root of the tree and there are no unlabeled vertices corresponding to extents in M (Y ), then algorithm halts. ♦ For every concept C Algorithm 2 outputs sets Fr(C) and To(C) of arcs that lead to concepts immediately adjacent to C in the Hasse diagram from above and below, respectively. Thus, the CbO tree, together with sets Fr(C) and To(C) related to each canonical vertex is a representation of the concept lattice. Unlike the incidence matrix of a lattice, whose size is quadratic in the number of concepts, the size of this structure and time needed to construct it are linear. Given also a negative context at the input, one can generate hypotheses by slightly modifying Algorithm 2: at Step 3 it should also be tested whether positive intents are not subsumed by negative examples. The algorithm is also easily extended to include a test for sufficient number of examples supporting a hypothesis, for example, in lines of [11]. Theorem 2. Algorithm 2 constructs the Hasse diagram of a concept lattice in O((α|G| + β|G|2 )|L|) time and O((γ|G||L|) space, where |L| is the number of concepts, α is the time needed to perform u operation, β is the time needed to test v relation, and γ is the space needed to store the largest object from ΩgN . When contexts and concepts are given by Definition 1 the diagram is constructed in O(|M ||G|2 |L|) time and O(|M ||G||L|) space.
5
Conclusion
We presented a learning model in a version of the formal concept analysis that allows processing graph structures. For example, this model can be used for learning implications on simple conceptual graphs [13] or molecular graphs. The
Learning of Simple Conceptual Graphs
391
model can also be extended to arbitrary data structures with partial order. Algorithmic analysis was provided. Though computations on graphs may be hard in general, this does not affect the linear dependence of time and space needed for the computation on the number of resulting concepts (hypotheses, implications).
6
Acknowledgments
This work was supported by the Russian Foundation for Basic Research, project no. 98-06-80191 and the Alexander von Humboldt Foundation.
References 1. Carpineto, C., Romano, G.: A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval. Machine Learning 24 (1996) 95-122 2. Finn, V.K.: On Machine-Oriented Formalization of Plausible Reasoning in the Style of F. Backon–J. S. Mill. Semiotika Informatika 20 (1983) 35-101 [in Russian] 3. Finn, V.K.: Plausible Reasoning in Systems of JSM Type. Itogi Nauki i Tekhniki, ser. Informatika 15 (1991) 54-101 4. Ganter, B: Algorithmen zur Formalen Begriffsanalyse. In: Ganter, B., Wille, R., and Wolff, K. E. (eds.): Beitr¨ age zur Begriffsanalyse. B. I. Wissenschaftsverlag, Mannheim (1987) 241-254 5. Ganter, B., Wille, R.: Formal Concept Analysis. Mathematical Foundations. Springer-Verlag, Berlin Heidelberg New York (1999) 6. Godin, R., Missaoui, R., Alaoui, H.: Incremental Concept Formation Algorithms Based on Galois (Concept) Lattices. Computational Intelligence 11(2) (1995) 246267 7. Gu´enoche, A.: Construction du treillis de Galois d’une relation binaire. Math. Sci. Hum. 95 (1990) 5-18 8. Kuznetsov, S.O.: Interpretation on Graphs and Algorithmic Complexity Characteristics of a Search of Specific Patterns. Autom. Document. Math. Ling. 23(1) (1989) 37-45 9. Kuznetsov, S.O.: JSM-method as a Machine Learning System. Itogi Nauki Tekhn., ser. Informat. 15 (1991) 17-54 [in Russian] 10. Mugnier, M.L.: On Generalization/Specialization for Conceptual Graphs. J. Exp. Theor. Artif. Intel. 7 (1995) 325-344 11. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Pruning Closed Itemset Lattices for Association Rules. In: Proc. 14th BDA Conference on Advanced Databases (BDA’98). Hamammet (1998) 177-196 12. Skorsky, M.: Endliche Verb¨ ande - Diagramme und Eigenschaften. Shaker, Aachen (1992) 13. Sowa, J.F.: Conceptual Structures - Information Processing in Mind and Machine. Addison-Wesley, Reading (1984) 14. Wille, R.: Restructuring Lattice Theory: an Approach Based on Hierarchies of Concepts. In: Rival, I. (ed.): Ordered Sets. Reidel, Dordrecht Boston (1982) 445470
An Evolutionary Algorithm Using Multivariate Discretization for Decision Rule Induction Wojciech Kwedlo and Marek Kr¸etowski Institute of Computer Science, Technical University of Bialystok, Poland {wkwedlo, mkret}@ii.pb.bialystok.pl Abstract. We describe EDRL-MD, an evolutionary algorithm-based system, for learning decision rules from databases. The main novelty of our approach lies in dealing with continuous - valued attributes. Most of decision rule learners use univariate discretization methods, which search for threshold values for one attribute at the same time. In contrast to them, EDRL-MD simultaneously searches for threshold values for all continuous-valued attributes, when inducing decision rules. We call this approach multivariate discretization. Since multivariate discretization is able to capture interdependencies between attributes it may improve the accuracy of obtained rules. The evolutionary algorithm uses problem specific operators and variable-length chromosomes, which allows it to search for complete rulesets rather than single rules. The preliminary results of the experiments on some real-life datasets are presented.
1
Introduction
Discovery of decision rules from databases is one of the most important problems in machine learning and data mining [6]. If a dataset contains some numerical (continuous-valued) attributes a decision rule learner must search for threshold values to create conditions (selectors) concerning these attributes. This process is called discretization and has attracted a lot of attention in the literature. The simplest discretization algorithm called equal interval binning partitions the range of a continuous-valued attribute into several equal sized intervals. Since it does not make use of class labels it belongs to the group of unsupervised methods. Many experimental studies show [4], that supervised methods which use the class membership of examples, perform much better than their unsupervised counterparts. The supervised discretization can be performed globally before the induction of rules by dividing the range of each continuous-valued attribute into intervals independent of the other attributes. An alternative approach, called local supervised discretization consists in searching for attribute thresholds during the generation of inductive hypothesis. However, even the systems, which use this method (e.g. C4.5 [10]) are often unable to search at the same time for threshold values for more than one attribute. Both approaches belong to the group of univariate discretization methods. In the paper we propose a new system called EDRL-MD (EDRL-MD, for Evolutionary Decision Rule Learner with Multivariate Discretization) combining the two steps: the simultaneous search for threshold values for all continuous-valued attributes, which we call multivariate discretization, and the discovery of decision rules. As a search heuristic we use an evolutionary algorithm (EA) [9]. EAs ˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 392–397, 1999. c Springer-Verlag Berlin Heidelberg 1999
An Evolutionary Algorithm for Decision Rule Induction
393
are stochastic techniques, which have been inspired by the process of biological evolution. The success of EAs is attributed to the ability to avoid local optima, which is their main advantage over greedy search methods. Several systems, which employ EAs for learning decision rules (e.g. GABIL [3], GIL [7], EDRL [8]) were proposed. According to our knowledge all of them either work only with nominal attributes or discretize continuous-valued ones prior to induction of rules using univariate methods.
2
The Weakness of the Univariate Discretization
The univariate discretization methods, A2 although computationally effective, are 1 not able to capture interdependencies between attributes. For that reason they run the risk of missing information necessary for correct classification. A following example shows the shortcomings of uni0.5 variate discretization. Consider an artificial dataset shown (a similar idea was presented in [2]) on Fig. 1. Every example is described by A1 two attributes A1 and A2 distributed uni0 1 0.5 formly on the interval [0, 1]. The examples are divided approximately equally Fig. 1 An artificial dataset, for which into two classes denoted by + and 2. The the univariate approach is unlikely to optimal decision rules for this dataset are: find the proper thresholds. (A1 < 0.5) ∧ (A2 < 0.5) → 2 (A1 > 0.5) ∧ (A2 > 0.5) → 2
(A1 < 0.5) ∧ (A2 > 0.5) → + (A1 > 0.5) ∧ (A2 < 0.5) → +
Note that each of the conditions Ai < th or Ai > th, where th ∈ (0, 1) splits the examples into two subsets having roughly the same class distribution as the whole dataset. Hence a single condition related to one attribute A1 or A2 does not improve the separation of the classes. Let H(Ai , th) be the class information entropy of the partition induced by the threshold th, a measure commonly used [4,5] by supervised discretization algorithms. We can say that for each Ai ∈ {A1 , A2 } and th1 , th2 ∈ (0, 1) H(Ai , th1 ) ∼ = H(Ai , th2 ). This property holds for the other functions based on impurity or separation of the classes. Therefore any supervised univariate discretization algorithm will have difficulties with finding the proper thresholds for both attributes. This limitation does not apply to multivariate methods. The above-mentioned example indicates, that in some cases a multivariate discretization is more appropriate and it leads to more accurate rules.
3
Description of the Method
We assume that a learning set E = {e1 , e2 , . . . , eM } consists of M examples. Each example e ∈ E is described by N attributes (features) A1 (e), A2 (e), . . . , AN (e)
394
W. Kwedlo and M. Kr¸etowski
and labeled by a class c(e) ∈ C. The domain of a nominal (discrete-valued) attribute Aj is a finite set V (Aj ), while the domain of a continuous-valued attribute Ai is an interval V (Ai ) = [ai , bi ]. For each class ck ∈ C by E + (ck ) = {e ∈ E : c(e) = ck } we denote the set of positive examples and by E − (ck ) = E − E + (ck ) the set of negative examples. A decision rule R takes the form t1 ∧t2 ∧. . .∧tr → ck , where ck ∈ C and the left-hand side (LHS) is a conjunction of r(r ≤ N ) conditions t1 , t2 , . . . , tr ; each of them concerns one attribute. The right-hand side (RHS) of the rule determines class membership of an example. A ruleset RS ck for a class ck is defined as a disjunction of K(ck ) decision rules ck , provided that all the rules have ck on the RHS. R1ck ∨ R2ck ∨ · · · ∨ RK(c k) In EDRL-MD the EA is called separately for each class ck ∈ C to find the ruleset RS ck . The search criterion, in terminology of EAs called the fitness function prefers rulesets consisting of few conditions, which cover many positive examples and very few negative ones. 3.1
Representation
The EA processes a population of candidate solutions to the search problem called chromosomes. In our case a single chromosome encodes a ruleset RS ck . Since the number of rules in the optimal ruleset for a given class is not known, we use variable-length chromosomes and provide the search operators, which change the number of rules. A chromosome representing the ruleset is a concatenation of strings. Each fixed-length string represents the LHS of one decision rule. Because the EA is called to find a ruleset for the given class ck there is no need for encoding the RHS.
continuous-valued Ai
...
li
ui
...
lower upper threshold threshold
nominal Aj 1
fj
2
k
fj ... fj j ...
...
binary flags
Fig. 2. The string encoding the LHS of a decision rule (kj = |V (Aj )|). The chromosome representing the ruleset is concatenation of strings.
The string is composed (Fig. 2) of N substrings. Each substring encodes a condition related to one attribute. The LHS is the conjunction of these conditions. In case of a continuous-valued attribute Ai the substring encodes the lower li and the upper ui threshold of the condition li < Ai ≤ ui . It is possible that li = −∞ or ui = +∞. Both li and ui are selected from the finite set of all boundary thresholds. A boundary threshold for the attribute Ai is defined (Fig. 3) as a midpoint between such a successive pair of examples in the sequence sorted by the increasing value of Ai , that one of the examples is positive and the other is negative. Fayyad and Irani proved [5], that evaluating only the boundary thresholds is sufficient for finding the maximum of class information entropy. This property also holds for the fitness function (1).
An Evolutionary Algorithm for Decision Rule Induction
395
For a nominal attribute Aj the substring consists of binary flags; each of them corresponds to one value of the attribute. If e.g. the domain of attribute Aj is {low, moderate, high} then the pattern 011 represents condition Aj = (moderate ∨ high), which stands for: ”the value of Aj equals moderate or high”. Note, that it is possible, that a condition related to an attribute is not present on the LHS. For a continuous-valued attribute Ai it can be achieved by setting both li = −∞ and ui = +∞. For a nominal Aj it is necessary to set all the flags |V (Aj )| . fj1 , fj2 , . . . , fj Each chromosome in the population is initialized using a randomly chosen positive example. The initial chromosome represents the ruleset consisting of a single rule, which covers the example. …
thik-1
thik+1 k i
th
… Ai
Fig. 3. An example illustrating the notion of boundary threshold. The boundary thresTi holds th1i , . . . , thki , . . . , thN for the continuous-valued attribute Ai are placed between i groups of negative (•) and positive (2) examples.
3.2 The Fitness Function Consider a ruleset RS ck , which covers pos positive examples and neg negative ones. The fitness function is given by: f (RS ck ) =
pos − neg . log10 (L + α) + β
(1)
where α = 10, β = 10, L is total the number of conditions in the ruleset RS ck . Note, that maximization of the numerator of (1) is equivalent to maximization of the probability of correct classification of an example. The denominator of (1) is a measure of complexity of the ruleset. An increase of the complexity results in a reduction of the fitness and thus prevents overfitting. To avoid overfitting we also limit the number of rules in a chromosome to maxR , where maxR is a user-supplied parameter. The formula for function (1) including the values of the parameters α and β was chosen on the experimental basis. We found it performed well in comparison with other functions we tested. 3.3 Genetic Operators Our system employs six search operators. Four of them: changing condition, positive example insertion, negative example removal , rule drop are applied to a single ruleset RS ck (represented by chromosome). The other two: crossover and rule copy require two arguments RS1ck and RS2ck . A similar approach was proposed by Janikow. However, his GIL [7] system is not able to handle continuous-valued attributes directly, since it represents a con-
396
W. Kwedlo and M. Kr¸etowski
dition as a sequence of binary flags corresponding to the values of an attribute (we use the same representation for nominal attributes). The changing condition is a mutation-like operator, which alters a single condition related to an attribute. For a nominal attribute Aj a flag randomly |V (Aj )| is flipped. For a continuous-valued Ai a threshold chosen from fj1 , fj2 , . . . , fj (li or ui ) is replaced by a random boundary threshold. The positive example insertion operator modifies a single decision rule Rck in the ruleset RS ck to allow it to cover a new random positive example e+ ∈ E + (ck ), currently uncovered by Rck . All conditions in the rule, which conflict with e+ have to be altered. In case of a condition related to a nominal attribute Aj the flag, which corresponds to Aj (e+ ), is set. If a condition li < Ai ≤ ui concerning continuous-valued attribute Ai is not satisfied because ui < Ai (e+ ) the threshold ui is replaced by u ˆi , where u ˆi is the smallest boundary threshold such that u ˆi ≥ Ai (e+ ). The case when Ai (e+ ) ≤ li is handled in a similar way. The negative example removal operator alters a single rule Rck from the ruleset RS ck . It selects at random a negative example e− from the set of all the negative examples covered by Rck . Then it alters a random condition in R in such way, that the modified rule does not cover e− . If the chosen condition concerns a nominal attribute Aj the flag which corresponds to Aj (e− ) is cleared. Otherwise the condition li < Ai ≤ ui concerning continuous-valued Ai is narrowed down either to lˆi < Ai ≤ ui or to li < Ai ≤ uˆi , where lˆi is the smallest boundary threshold such that Ai (e− ) ≤ lˆi and uˆi is the largest one such that uˆi < Ai (e− ). Rule drop and rule copy operators [7] are the only ones capable of changing the number of rules in a ruleset. The single argument rule drop removes a random rule from a ruleset RS ck . The two argument rule copy adds to one of its arguments RS1ck a copy of a rule selected at random from RS2ck , provided that the number of rules in RS1ck is lower than maxR . The crossover operator selects at random two rules R1ck and R2ck from the respective arguments RS1ck and RS2ck . It then applies an uniform crossover [9] to the strings representing R1ck and R2ck .
4
Experiments
In this section some initial experimenDataset C4.5 EDRL-MD tal results are presented. We have tested australian 84.8 ± 0.9 84.5 ± 0.5 EDRL - MD on several datasets from bupa 66.5 ± 2.5 65.6 ± 1.5 UCI repository [1]. Table 1 shows the breast-w 95.2 ± 0.4 95.2 ± 0.3 classification accuracy obtained by our glass 67.5 ± 1.6 70.7 ± 2.9 method and C4.5 (Rel. 8) [10] algorithm. hepatitis 80.6 ± 2.2 83.0 ± 2.4 The accuracy was estimated by running iris 95.3 ± 0.7 95.4 ± 0.7 ten times the complete ten-fold crossvapima 74.2 ± 1.1 74.5 ± 0.6 lidation. The mean of ten runs and the wine 94.2 ± 1.4 93.6 ± 1.2 standard deviation are given. In all the experiments involving C4.5 decision rules Table 1 The results of the experiments. were obtained from decision trees by C4.5rules program.
An Evolutionary Algorithm for Decision Rule Induction
5
397
Conclusions
We have presented EDRL-MD, an EA-based system for decision rule learning, which uses a novel multivariate discretization method. The preliminary experimental results indicate, that both classification accuracy and complexity of discovered rules are comparable with the results obtained by C4.5. Several directions of future research exist. One of them is the design of a better fitness function, which has a critical influence on the performance of the algorithm. The current version was chosen on the basis of very few experiments. Hence the classification results presented in the paper should be viewed as the lower limits of the attainable performance. We believe that the performance can be further improved. It is a well-known fact, that many applications of KDD require the capability of efficient processing large databases. In such cases algorithms, which offer very good classification accuracy at the cost of high computational complexity cannot be applied. Fortunately, EAs are well suited for parallel architectures. We plan to develop a parallel implementation of EDRL-MD, which will be able to extract decision rules from large datasets. Acknowledgments The authors are grateful to Prof. Leon Bobrowski for his support and useful comments. This work was supported by the grant W/II/1/97 from Technical University of Bialystok.
References 1. Blake, C., Keogh, E., Merz, C.J.: UCI repository of machine learning databases, available on-line: http://www.ics.uci.edu/∼mlearn/MLRepository.html (1998). 2. Bobrowski, L.: Piecewise-linear classifiers, formal neurons and separability of the learning sets. Proc. of 13th Int. Conf. on Pattern Recognition ICPR’96. IEEE Computer Society Press (1996) 224-228. 3. De Jong, K.A., Spears, W.M., Gordon, D.F.: Using genetic algorithm for concept learning. Machine Learning 13 (1993) 168-182. 4. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Machine Learning: Proc of 12th Int. Conference. Morgan Kaufmann (1995) 194-202. 5. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of IJCAI’93. Morgan Kaufmann (1993) 1022-1027. 6. Fayyad, U.M, Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining. AAAI Press (1996). 7. Janikow, C.Z.: A knowledge intensive genetic algorithm for supervised learning. Machine Learning 13 (1993) 192-228. 8. Kwedlo, W., Kr¸etowski, M.: Discovery of decision rules from databases: an evolutionary approach. In Principles of Data Mining and Knowledge Discovery. 2nd European Symposium PKDD’98. Springer LNCS 1510 (1998) 370-378. 9. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer (1996). 10. Quinlan, J.R.: Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4 (1996) 77-90.
]lj]dj/ d Qhz Foxvwhulqj Dojrulwkp wr Dqdo|}h Fdwhjrulfdo Yduldeoh Furvv0Fodvvlfdwlrq Wdeohv Vwìskdqh Odoolfk HULF Oderudwru|/ Xqlyhuvlw| ri O|rq 5 h1pdlo = odoolfkCxqly0o|rq51iu Devwudfw1 Wklv sdshu sursrvhv ]lj]dj/ d qhz foxvwhulqj dojrulwkp/ wkdw zrunv rq fdwhjrulfdo yduldeoh furvv0fodvvlfdwlrq wdeohv1 ]lj}dj fuhdwhv vl0 pxowdqhrxvo| wzr sduwlwlrqv ri urz dqg froxpq fdwhjrulhv lq dffrugdqfh zlwk wkh htxlydohqfh uhodwlrq wr kdyh wkh vdph frqglwlrqdo prgh1 Wkhvh wzr sduwlwlrqv duh dvvrfldwhg rqh wr rqh dqg rqwr/ fuhdwlqj e| wkdw zd| urz0froxpq foxvwhuv1 Wkxv/ zh kdyh dq h!flhqw NGG wrro zklfk zh fdq dsso| wr dq| gdwdedvh1 Pruhryhu/ ]lj]dj ylvxdol}hv suhglfwlyh dvvrfld0 wlrq iru qrplqdo gdwd lq wkh vhqvh ri Jxwwpdq/ Jrrgpdq dqg Nuxvndo1 Dffruglqjo|/ wkh suhglfwlrq uxoh ri d qrplqdo yduldeoh \ frqglwlrqdoo| wr dq rwkhu [ frqvlvwv lq fkrrvlqj wkh frqglwlrqdoo| prvw suredeoh fdwh0 jru| ri \ zkhq nqrzlqj [ dqg wkh srzhu ri wklv uxoh lv hydoxdwhg e| wkh phdq sursruwlrqdo uhgxfwlrq lq huuru ghqrwhg e| bt*f 1 Lw zrxog dsshdu wkhq wkdw wkh pdsslqj ixuqlvkhg e| ]lj]dj sod|v iru qrplqdo gdwd wkh vdph uroh dv wkh vfdwwhuhg gldjudp dqg wkh fxuyhv ri frqglwlrqdo phdqv ru wkh vwudljkw uhjuhvvlrq olqh sod|v iru txdqwlwdwlyh gdwd/ wkh uvw lq0 fuhdvhg zlwk wkh ydoxhv ri bt *f dqg bf*t / wkh vhfrqg lqfuhdvhg zlwk wkh fruuhodwlrq udwlr ru wkh U2 1
4
Lqwurgxfwlrq
H{wudfwlqj nqrzohgjh iurp fdwhjrulfdo gdwd furvv0fodvvlfdwlrqv1 Wkh ghyhorsphqw ri gdwdedvhv rhuv wr uhvhdufkhuv dqg sudfwlwlrqhuv d kljk ydulhw| ri gdwd lq ydulrxv hogv olnh vrfldo dqg hfrqrplf vflhqfhv/ exvlqhvv ru elrphg0 lfdo vflhqfhv1 Wkhvh gdwd duh riwhq lvvxhg iurp fdwhjrulfdo yduldeohv dqg pruh vshflfdoo| iurp qrplqdo yduldeohv1 Vwdwlvwlfdo phwkrgv uhihuulqj wr fdwhjrulfdo yduldeohv kdyh pxfk h{whqghg ryhu wkh odvw wkluw| |hduv^4`1 Iru lqvwdqfh/ frq0 fhuqlqj wkh furvv0fodvvlfdwlrq wrslf/ Jrrgpdq dqg Nuxvndo^7` ghyhorshg ydu0 lrxv suhglfwlrq uxohv dqg dvvrfldwlrq frh!flhqwv zklfk duh wkh frxqwhusduw ri wkh uhjuhvvlrq iru txdqwlwdwlyh yduldeohv1 Lq wklv sdshu/ zh suhvhqw ]lj]dj/ dq dojrulwkp wkdw fuhdwhv d sduwlwlrq ri wkh urz fdwhjrulhv dqg froxpq fdwhjrulhv dffruglqj wr wkh orjlf ri suhglfwlyh dvvrfldwlrq dv ghyhorshg e| Jxwwpdq/ Jrrg0 pdq dqg Nuxvndo^7`1 ]lj]dj frqvwlwxwhv d vlpsoh dqg h!flhqw wrro lq rughu wr v|qwkhvl}h dqg ylvxdol}h wkh dvvrfldwlrqv ehwzhhq wzr qrplqdo yduldeohv dqg vr idflolwdwhv wkh h{wudfwlrq ri wkh xvhixo nqrzohgjh uhvxowlqj iurp wkh furvvlqj ri wzr fdwhjrulfdo dwwulexwhv lq gdwdedvhv1 Qrwdwlrqv1 Zh frqvlghu d srsxodwlrq ri vxemhfwv zklfk duh ghvfulehg e| wzr fdwhjrulfdo yduldeohv1 Ohw [ dqg \ ghqrwhv wkhvh wzr yduldeohv/ [ kdylqj s •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 398−405, 1999. Springer−Verlag Berlin Heidelberg 1999
ZigZag
399
fdwhjrulhv dqg \ kdylqj t fdwhjrulhv1 Wkh uhvsrqvhv ri wkh vxemhfwv duh suhvhqwhg lq d uhfwdqjxodu furvv0fodvvlfdwlrq +ru frqwlqjhqf|, wdeoh kdylqj s urzv dqg t froxpqv1 Prvw ri wkh wlph/ wkh srsxodwlrq lv wrr odujh dqg zh duh lq d vlwxdwlrq ri vdpsolqj1 Wkhq/ wkh +lm , dqg +lm , duh xqnqrzq1 Zh fdq rqo| revhuyh wkh +qlm , dqg wkh +slm ,1 D frpprq prgho ri vdpsolqj lv wkh pxowlqrpldo vdpsolqj^4` zkhq wkh revhuydwlrqv uhvxow iurp htxdo suredelolwlhv dqg lqghshqghqw udqgrp gudzv dprqj wkh zkroh srsxodwlrq1 Srsxodwlrq +wkhruhwlfdo, Vdpsoh +hpslulfdo, Qxpehu ri vxmhfwv Mrlqw devroxwh iuht1 ri +{l > |m , Mrlqw uhodwlyh iuht1 ri +{l > |m ,
lm lm @ lm @
q qlm slm @ qlm @q
Uhdglqj fdwhjrulfdo yduldeohv furvv0fodvvlfdwlrqv1 Wr h{wudfw wkh nqrzohgjh lqfoxghg lq d furvv0fodvvlfdwlrq/ zh jhqhudoo| ehjlq e| wdnlqj dq lqwhuhvw lq wkh frqglwlrqdo prgh ri hdfk urz dqg hdfk froxpq1 Li qlm lv pd{l0 pxp lq wkh urz l/ lw phdqv wkdw frqglwlrqdoo| wr [@{l / prvw ri wkh wlph \@|m / zklfk ohdgv xv wr dvvrfldwh |m wr {l 1 Dw wkh vdph wlph/ li qlm lv pd{lpxp lq wkh froxpq m/ lw phdqv wkdw frqglwlrqdoo| wr \@|m / prvw ri wkh wlph [@{l / zklfk ohdgv xv wr dvvrfldwh {l wr |m 1 Wkh dlp ri ]lj]dj lv wr v|vwhpdwl}h dqg dxwrpdwh wklv surfhvv1
5
Pdlq Wrslfv ri wkh Dojrulwkp ]lj]dj
Ohw [ ghqrwhv wkh vhw ri urz fdwhjrulhv/ dqg \ wkh vhw ri froxpq fdwhjrulhv/ zlwk Fdug [@s dqg Fdug \@t1 Zh frqvwuxfw vlpxowdqhrxvo| wzr sduwlwlrqv ri urz fdwhjrulhv +[, dqg froxpq fdwhjrulhv +\, rq wkh edvlv ri wkh pd{lpxp dvvrfldwlrq rq urzv +frqglwlrqdoo| wr [, dqg rq froxpqv +frqglwlrqdoo| wr \,1 Wkhq zh mrlq wkhvh wzr sduwlwlrqv rqh fodvv wr rqh fodvv lq rughu wr rewdlq urz0 froxpq foxvwhuv1 Ehvw froxpq fulwhulrq1 Iru hdfk urz fdwhjru| {l / l@4/ 5/ 111s/ zh dvvrfldwh wkh fdwhjru| |m +l, / zkhuh m+l,5~4/5/111/t/ zklfk uhsuhvhqwv wkh prgh ri wkh urz1 Wkhq/ wkh ydoxh ql>m +l, lv wkh pd{lpxp ydoxh ri wkh lwk urz ri +qlm ,1 Vr zh ghqh dq dssolfdwlrq f iurp [ wr \ zklfk lv qhfhvvdulo| qrw rqwr li s?t/ ru qhfhvvdulo| qrw rqh wr rqh li sAt1 Wkh judsk ri wklv dssolfdwlrq f frqvwlwxwhv d elsduwlwh judsk ghqrwhg e| Jf 1 Ehvw urz fulwhulrq1 Iru hdfk froxpq fdwhjru| |m / m@4/ 5/ 111/ t/ zh dvvrfldwh wkh urz fdwhjru| {l+m , / zkhuh l+m,5~4/ 5/111/ s/ zklfk uhsuhvhqwv wkh prgh ri wkh froxpq1 Wkhq/ wkh ydoxh ql+m ,>m lv wkh pd{lpxp ydoxh ri wkh mwk froxpq ri +qlm ,1 Vr zh ghqh dq dssolfdwlrq u iurp \ wr [/ zklfk lv qhfhvvdulo| qrw rqh wr rqh li s?t/ ru qhfhvvdulo| qrw rqwr li sAt1 Wkh judsk ri wklv dssolfdwlrq u frqvwlwxwhv d elsduwlwh judsk ghqrwhg e| Ju kdylqj wkh vdph qrghv dv Jf 1 Vwurqj sdwwhuq1 Li zh phujh wkh judskv Jf dqg Ju / zkloh glvwlqjxlvklqj hdfk w|sh ri hgjh +iru h{dpsoh zlwk d vrolg olqh iru f dqg d grwwhg olqh iru u,/ zh rewdlq d elsduwlwh judsk J kdylqj s.t hgjhv1 D sdlu ri qrghv +l/ m, lv uholhg
400
S. Lallich
e| wzr hgjhv dw prvw1 Li wkhuh duh wzr hgjhv/ lw lv qhfhvvdu| wkdw wkh| eh ri d glhuhqw qdwxuh/ rqh zlwk d vrolg olqh/ wkh rwkhu zlwk d grwwhg olqh1 Zh zloo frqvlghu vxfk d frxsoh urz0froxpq dv d vwurqj sdwwhuq ri wkh elsduwlwh judsk= hdfk phpehu lv wkh lpdjh ri wkh rwkhu rqh wkurxjk wkh uhodwlrqvkls qhduhvw froxpq 0 qhduhvw urz h{suhvvhg e| wkh judsk1 Wkh fruuhvsrqglqj mrlqw devroxwh iuhtxhqf| qlm lv wkh pd{lpxp iru urz l dqg froxpq m1 Zkroh sduwlwlrq uhvxowlqj iurp vwurqj sdwwhuq1 Zkhq wkh sdlu frq0 vlvwlqj lq d fdwhjru| ri rqh yduldeoh dqg wkh dvvrfldwhg fdwhjru| ri wkh rwkhu yduldeoh lv qrw d vwurqj sdwwhuq/ d fkdlq ri qhduhvw qhljkeruv dsshduv/ olnh l4 / m4 / l5 / m5 / 111/ zkhuh mk lv wkh qhduhvw froxpq ri lk / zkloh lk.4 lv wkh qhduhvw urz ri mk 1 Qhfhvvdulo|/ hdfk fkdlq hqgv xs e| d vwurqj sdwwhuq/ zklfk frqvlvwv lq uhflsurfdo qhduhvw qhljkeruv1 Wkhq/ zh vhhn iru hdfk urz l/ l @ 4/ 5/ 111/ s/ dqg iru hdfk froxpq m/ m @ 4/ 5/ 111/ t/ wr zklfk fkdlq lw ehorqjv dqg e| zklfk vwurqj sdwwhuq lwv fkdlq hqgv xs1 Wkh uhodwlrq U wr eh dvvrfldwhg zlwk wkh vdph vwurqj sdwwhuq ghqhv dq htxlydohqfh uhodwlrq rq wkh qrghv ri J dv zhoo dv rq wkh wudfhv Jb[ hw Jb\1 Wkh htxlydohqfh fodvvhv prgxod U duh wkh frqqhfwhg frpsrqhqwv ri wkh judsk J1 Wkhlu lqwhuvhfwlrqv zlwk [ dqg \ frqvwlwxwh wzr sduwlwlrqv ri [ dqg \ mrlqhg rqh fodvv wr rqh fodvv1 ]lj]dj dojrulwkp1 ]lj]dj kdv ehhq lpsohphqwhg lq Ghoskl^8` dqg lv qrz dydlodeoh rq wkh Zhe1 Wkh dojrulwkp lv dssolhg lq wzr vwdjhv1 Iluvwo|/ zh fuh0 dwh wkh wdeoh lqglfdwlqj wkh qhduhvw froxpq fdwhjru| ri hdfk urz fdwhjru| dqg wkh qhduhvw urz fdwhjru| ri hdfk froxpq fdwhjru|1 Wkhq/ xvlqj wklv wdeoh/ wkh glhuhqw fkdlqv ri qhduhvw qhljkeruv duh exlow dqg wkh fruuhvsrqglqj judsk lv uhsuhvhqwhg1 Wkh fdwhjrulhv zklfk ehorqj wr fkdlqv hqglqj xs e| wkh vdph uhfls0 urfdo qhduhvw qhljkeru*v sdlu frqvwlwxwh d frqqhfwhg frpsrqhqw ri wkh judsk dqg ghqh d urz0froxpq foxvwhu1Wkh qxpehu ri foxvwhuv lv htxdo wr wkh qxpehu ri vwurqj sdwwhuqv/ zklfk lv frpsulvhg ehwzhhq 4 +lq fdvh ri lqghshqghqfh, dqg plq ~s/ t/ +lq fdvh ri ixqfwlrqqdo ghshqghqfh,1
6
Dssolfdwlrq wr Sdwhqwv Gdwd
Zh kdyh fuhdwhg ]lj]dj rq wkh rffdvlrq ri d zrun ghdolqj zlwk Iuhqfk upv sdwhqwlqj lq wkh XV ryhu wkh shulrg 4<;80<3/ xvlqj wkh VSUX0OHVD gdwdedvh^5`= hdfk sdwhqw judqwhg e| wkh XV Sdwhqw R!fh wr Iuhqfk edvhg upv dqg lqvwl0 wxwlrqv lv ghvfulehg e| wkh lqgxvwuldo vhfwru lq zklfk lw lv surgxfhg dqg wkh whfkqrorjlfdo hog ri wkh sdwhqw dssolfdwlrq1 Wr ylvxdol}h wklv wdeoh/ zh uvw xvhg wkh surjudp DPDGR^6` wr shupxwh wkh urzv dqg wkh froxpqv ri gdwd pdwul{ lq rughu wr uhyhdo wkh xqghuo|lqj vwuxfwxuh ri wkh pdwul{ +Ilj1 4,1 Zh dovr frq0 vlghuhg xvlqj fodvvlfdwlrq ri wkh urzv ru froxpqv ri wkh wdeoh/ exw zh wkrxjkw wkdw lw ghdow zlwk wkh urzv dqg froxpqv lq dq dvv|phwulf pdqqhu1 Lq idfw/ zh qhhghg wr ghqh uhdo whfkqr0lqgxvwuldo foxvwhuv jdwkhulqj erwk lqgxvwuldo vhfwruv dqg whfkqrorjlfdo hogv zklfk duh vwurqjo| uhodwhg1 Dv d uhvxow/ zh frqfhlyhg ]lj]dj/ ri zklfk wkh uhvxowlqj pdsslqj lv jlyhq ehorz +Ilj1 5,1 Lw dsshduv iurp wklv pdsslqj wkdw wkhuh duh whq whfkqr0lqgxvwuldo foxvwhuv froohfwlqj 74 ( ri wkh wrwdo qxpehu ri sdwhqwv1 Dprqj wkhvh foxvwhuv wkuhh duh
ZigZag
401
Ilj1 41 Ehuwlq*v pdwul{ iru sdwhqw*v gdwd1
frpsoh{ rqhv/ Fkhplfdov/ Hohfwurqlfdo dqg Phfkdqlfdo Wudqvsruwv Htxls0 phqw dqg uhsuhvhqw 66 ( ri wkh wrwdo sdwhqwv1
7
Ylvxdol}dwlrq ri Suhglfwlyh Dvvrfldwlrq
]lj]dj ylvxdol}hv suhglfwlyh dvvrfldwlrq iru qrplqdo gdwd lq wkh vhqvh ri wkh frh!flhqw sursrvhg e| Jxwwpdq/ Jrrgpdq dqg Nuxvndo^7`1 714
Suhglfwlyh Dvvrfldwlrq iru Fdwhjrulfdo Gdwd
Zh uhfdoo wkdw lw lv txlwh hdv| wr suhglfw \ nqrzlqj [ zkhq [ dqg \ duh txdqwl0 wdwlyh yduldeohv1 Zh uvw xvh d vwdwlvwlfdo vriwzduh wr sorw wkh vfdwwhuhg gldjudp dqg wkh vwudljkw uhjuhvvlrq olqh ru wkh fxuyh ri frqglwlrqdo phdqv/ lqfuhdvhg zlwk wkh frh!flhqw ri fruuhodwlrq ru wkh fruuhodwlrq udwlr1 Vwduwlqj iurp wklv gldjudp/ lw lv srvvleoh wr fkrrvh wkh prvw dssursuldwh ixqfwlrq ri [ lq rughu wr h{sodlq \ dqg wr fdofxodwh wkh suhglfwlrq ri \ iru d jlyhq ydoxh ri [1 Zkhq [ dqg \ duh qrplqdo yduldeohv/ rqh jhqhudoo| xvhv wkh vwudwhj| ri S1U1H +Sursruwlrqdo Uhgxfwlrq lq Huuru, frh!flhqwv ri dvvrfldwlrq sursrvhg e| Jrrgpdq dqg Nuxvndo^7`1 Iluvwo|/ zh kdyh wr ghqh d suhglfwlrq uxoh zklfk fdq eh dssolhg d sulrul/ zlwkrxw dq| lqirupdwlrq derxw [/ dqg d srvwhulrul/ zkhq nqrzlqj [1 Wkhq/ zh fdofxodwh wkh ulvn ri huuru dvvrfldwhg zlwk wkh d sulrul uxoh dqg wkh phdq huuru ulvn dvvrfldwhg zlwk wkh d srvwhulrul uxoh1
402
S. Lallich
Wkh S1U1H frh!flhqwv phdvxuh wkh sursruwlrqdo phdq uhgxfwlrq lq huuru ri wkh suhglfwlrq uxoh1 Wkh prvw frpprq frh!flhqwv duh wkh ri Jxwwpdq/ Jrrgpdq dqg Nuxvndo/ wkh X ri Vkdqqrq dqg wkh ri Jrrgpdq hw Nuxvndo1 Zh fdq suhvhqw wkhp dv sduwlfxodu fdvhv ri sursruwlrqdo phdq uhgxfwlrq lq glyhuvlw| ri wkh \ glvwulexwlrq/ dv lq Odoolfk^9`1 Wkh hdvlhvw ri wkhvh wkuhh frh!flhqwv lv wkh frh!flhqw1 Wkh fruuhvsrqglqj suhglfwlrq uxoh lv wkh rswlpdo rqh wr suhglfw d xqltxh fdwhjru| ri wkh ghshqghqw yduldeoh= dozd|v jxhvv wkh prgdo fodvv revhuyhg lq wkh gdwd1 Wkh d sulrul uxoh suhglfwv wkh prgdo fdwhjru| ri wkh pdujlqdo glvwulexwlrq ri wkh ghshqghqw yduldeoh/ zkhuhdv wkh d srvwhulrul uxoh suhglfwv wkh prgdo fdwhjru| ri wkh frqglwlrqdo glvwulexwlrq1 Li \ lv wkh ghshqghqw yduldeoh/ wkh huuru ulvn ri wkh d sulrul suhglfwlrq uxoh lv u3 @ 4 pd{i.m > m @ 4> 5> ===tj=Wkh phdq huuru ulvn ri wkh d srvwhulrul suhglfwlrq uxoh lv= u4 @ 4
s [ l@4
l. pd{im@l > m @ 4> 5> ===tj
Vr/ wkh h{suhvvlrq ri \@[ lv= u3 u4 @ \@[ @ u3
Ss
l@4 l. pd{im@l > m @ 4> 5> ===tj pd{i.m > m @ 4> 5> ===tj
4 pd{i.m > m @ 4> 5> ===tj
Wkh h{wuhph ydoxh \@[ @3 rqo| lv d qhfhvvdu| frqglwlrq ri lqghshqghqfh1 Rq wkh frqwudu|/ \@[ @4 lv d qhfhvvdu| dqg vx!flhqw frqglwlrq ri ixqfwlrqdo ghshqghqfh= nqrzlqj [ |rx nqrz \1 Lq wkh vdph pdqqhu/ zkhq frqvlghulqj [ dv wkh ghshqghqw yduldeoh/ zh fdofxodwh [@\ / wkh rwkhu dv|pphwulfdo frh!flhqw1 Ixuwkhupruh/ rqh fdq fdofxodwh wkh v|pphwulfdo frh!flhqw [\ / ghqhg dv wkh phdq ri wkh dv|pphwulfdo frh!flhqwv zhljkwhg e| wkhlu d sulrul huuru ulvnv1 3 Plqi\@[ > [@\ j [\ Pd{i\@[ > [@\ j 4 Li [\ @3/ wkhq \ @[ @[@\ @3/ exw wkdw lv rqo| d qhfhvvdu| frqglwlrq ri lqghshqghqfh1 Rq wkh frqwudu|/[\ @4 lpsolhv \@[ @[@\ @4/ zklfk lv d qhf0 hvvdu| dqg vx!flhqw frqglwlrq ri grxeoh ixqfwlrqdo ghshqghqfh/ uhtxlulqj s@t1 Lq wkh fdvh ri pxowlqrpldo vdpsolqj/ wkh +slm , duh pd{lpxp olnholkrrg hv0 wlpdwruv ri wkh +lm , dqg wkh| duh dv|pswrwlfdoo| qrupdo1 Wkxv wkh vdpsoh dv0 vrfldwlrq frh!flhqw/ ghqrwhg e| O\ @[ / lv wkh pd{lpxp olnholkrrg hvwlpdwru ri \ @[ dqg lv dv|pswrwlfdoo| qrupdo1 Vr/ lw lv srvvleoh wr hvwlpdwh wkh dv|pswrwlf yduldqfh ri O\ @[ e| dsso|lqj wkh ghowd phwkrg^7`1 Lq eulhi/ wkh pdsslqj ixuqlvkhg e| ]lj]dj sod|v iru qrplqdo gdwd wkh vdph uroh dv wkh rqh sod|hg e| wkh vfdwwhuhg gldjudp zlwk wkh fxuyhv ri frqglwlrqdo phdqv ru wkh vwudljkw uhjuhvvlrq olqh iru txdqwlwdwlyh gdwd/ wkh uvw lqfuhdvhg zlwk wkh ydoxhv ri O\ @[ dqg O[@\ / wkh vhfrqg lqfuhdvhg zlwk wkh fruuhodwlrq udwlr ru wkh U5 1 Lq wkh iroorzlqj/ zh suhvhqw rqh h{dpsoh ri suhglfwlrq iru hdfk w|sh ri gdwd1
ZigZag
715
403
Xvxdo Dssurdfk wr Suhglfw Txdqwlwdwlyh Yduldeohv
Wr looxvwudwh wkh xvxdo dssurdfk wr suhglfw txdqwlwdwlyh yduldeohv/ zh xvh wkh duwlfldo gdwd wkdw duh phqwlrqhg ehorz/ zkhuh [ lv d glvfuhwh yduldeoh dqg \ d frqwlqxrxv rqh1 [ 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 \ 61< 613 51; 517 514 417 41; 515 61: 915 81: :16 814 816 813 71: 914 :18 :1: <1; [ 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 \ :18 :17 81< :16 916 ;1: <14 4313 ;15 ;14 ;17 4313 <15 ;18 4513 4413 <18 4313 <1; 4413 Frpprqo| xvhg vwdwlvwlfdo vriwzduhv uvw gudz wkh vfdwwhuhg gldjudp dqg wkh uhjuhvvlrq fxuyh zklfk vxjjhvw wkh qdwxuh ri wkh uhodwlrq ehwzhhq [ dqg \ +orjdulwkplf uhodwlrq lq rxu h{dpsoh/ fi1 Ilj1 6,1 Wkh| wkhq doorz wr frp0 sxwh wkh uhjuhvvlrq htxdwlrq dqg wkh fruuhvsrqglqj ghwhuplqdwlrq frh!flhqw +\ @ 7185 Oq[ . 51;5 dqg U5 @ 31;9,1 716
Frxqwhusduw Dssurdfk Sursrvhg wr Suhglfw Txdolwdwlyh Yduldeohv
Wr looxvwudwh wkh frxqwhusduw dssurdfk wkdw zh sursrvh wr suhglfw txdolwdwlyh yduldeohv/ zh xvh d furvv0fodvvlfdwlrq wdeoh dgdswdwhg iurp Ehq}hful1 Lq wklv fodvvlfdwlrq/ urz fdwhjrulhv duh fljduhwwh*v eudqgv/ zkloh froxpq fdwhjrulhv duh txdolwlhv wkdw dq|erg| fdq dvvrfldwh zlwk hdfk eudqg qdph1 ]lj]dj fuhdwhv d wzr zd|v suhglfwlyh fodvvlfdwlrq wkdw lqglfdwhv xv zklfk lv wkh prvw suredeoh wrs ri plqg txdolw| nqrzlqj wkh qdph ri wkh eudqg/ dqg zklfk lv wkh prvw suredeoh qdph ri eudqg dvvrfldwhg zlwk hdfk txdolw| +Ilj1 7,1 Wkh glhuhqw frh!flhqwv ydoxhv phdvxuh wkh suhglfwlrq txdolw|1 Fljduhwwh . Txdolwì Ruo| Doh}dq Fruvdluh Gluhfwrluh Gxfdw Irqwhqr| Lfduh ]rgldtxh Sdyrlv Frfnhu Hvfdoh K÷whvvh
8
4 4 5 47 6; 4; 43 < 8 < 7 3 4
5 53 < 4 44 43 < 4 4 53 < : 45
6 < 56 4 48 : 44 9 5 : 45 6 4:
7 4 6 48 48 9 8 45 4; 7 58 5 5
8 7 66 : ; 6 9 9 7 8 48 6 6
9 6 < 4 : : 8 45 < 9 < 9 46
: 44 < 4 4: 7 54 9 4 8 7 8 5:
; 7 7 65 5 9 3 < : 6 43 45 :
< < 45 56 7 : 46 8 8 43 8 46 <
43 < 6 < ; 7 5 9 ; 4 9 56 66
44 : 8 5 : 44 5 9 44 < 57 43 8
Frqfoxvlrq
Lq wklv sdshu/ d qhz NGG dojrulwkp kdv ehhq lqwurgxfhg zklfk doorzv xv wr h{soruh vljqlfdqw fdwhjrulfdo gdwd furvv0fodvvlfdwlrqv wkurxjk judsklfdo glv0 sod|1 ]lj]dj fuhdwhv foxvwhuv zklfk fodvvli| urz dqg froxpq fdwhjrulhv dffruglqj
404
S. Lallich
Ilj1 51 ]lj]dj pdsslqj iru sdwhqw*v gdwd
Ilj1 61 Vfdwwhuhg gldjudp iru txdqwlwdwlyh yduldeoh suhglfwlrq1
ZigZag
405
Ilj1 71 ]lj]dj pdsslqj iru fljduhwwh*v eudqg gdwd1
wr d yhu| vlpsoh fulwhulrq= d fdwhjru| lv olqnhg wr dqrwkhu li lw lv wkh prvw suredeoh frqglwlrqdoo| wr wkh odwwhu1 Wkxv/ zh rewdlq d ylvxdol}dwlrq ri wkh lq0 irupdwlrq frqwdlqhg lq wkh wdeoh lq whupv ri suhglfwlyh dvvrfldwlrq/ lq wkh vhqvh ri Jxwpdqq/ Jrrgpdq dqg Nuxvndo1 Wkh txdolw| ri wklv lqirupdwlrq lv hydoxdwhg wkurxjk O\@[ dqg O[@\ / wkh wzr suhglfwlyh dvvrfldwlrq frh!flhqwv fuhdwhg e| wkhvh dxwkruv1 Dv wkrxvdqgv ri wdeohv fdq eh frpsxwhg lq pdq| gdwdedvhv/ zh sursrvh wr ehjlq e| xvlqj O\@[ ru O[@\ lq rughu wr vhohfw wkh prvw vljqlfdqw wdeohv1 Wkhq/ ]lj]dj lv d yhu| h!flhqw wrro hqdeolqj wr v|qwkhvl}h dqg vxppdul}h wkh nqrzohgjh lqfoxghg lq wkhvh wdeohv1
Uhihuhqfhv 41 Djuhvwl D1 +4<<3,/ Fdwhjrulfdo Gdwd Dqdo|vlv/ Mrkq Zloh|/ Qhz0\run1 51 Ehujhurq V1/ Odoolfk V1/ Oh Edv F1/ +4<<;,/ Orfdwlrq ri Lqrydwlyh Dfwlylwlhv dqg Whfkqrorjlfdo Vwuxfwxuh lq wkh Iuhqfk Hfrqrp|/ 4<;80<3= Vrph Hylghqfhv iurp X1V sdwhqwlqj/ Uhvhdufk Srolf|/ 59/ ss1 :660:841 61 Fkdxfkdw M1K1/ Ulvvrq D1 +4<<;,/ Ehuwlq*v Judsklfv dqg Pxowlglphqvlrqdo Gdwd Dqdo|vlv/ lq Eodvlxv M1/ Juhhqdfuh P1 +4<<;,/ Ylvxdol}dwlrq ri Fdwhjrulfdo Gdwd/ Dfdghplf Suhvv/ 71 Jrrgpdq O1D1/ Nuxvndo Z1K1+4<87,/ Phdvxuhv ri Dvvrfldwlrq iru Furvv0 Fodvvlfdwlrqv L/ MDVD/ 7 ss1 :650:971 81 Jxloolhq I1 +4<<;,/ Plvh hq rhxyuh gh ]lj]dj vrxv Ghoskl= Gdwdpl{/ Pìprluh gh Pdñwulvh Vflhqfhv Hfrqrpltxhv/ Xqlyhuvlwì Oxplëuh O|rq 51 91 Odoolfk V1 +4<<<,/ Frqfhsw gh glyhuvlwì hw dvvrfldwlrq suìglfwlyh/ VIGV < Juhqreoh1 :1 Udnrwrpdodod U1/ Odoolfk V1 +4<<;,/ Kdqgolqj Qrlvh zlwk Jhqhudol}hg Hqwurs| ri W|sh Eíwd lq Lqgxfwlrq Judskv Dojrulwkpv/ MFLV *<;/ Gxnh/ XVD1
Efficient Mining of High Confidence Association Rules without Support Thresholds Jinyan Li1 , Xiuzhen Zhang1 , Guozhu Dong , Kotagiri Ramamohanarao1 , and Qun Sun1 2
1
2
Department of CSSE, The University of Melbourne, Parkville, Vic. 3052, Australia. {jyli, xzhang, rao, qun}@cs.mu.oz.au Dept. of CSE, Wright State University, Dayton OH 45435, USA. [email protected] Abstract. Association rules describe the degree of dependence between items in transactional datasets by their confidences. In this paper, we first introduce the problem of mining top rules, namely those association rules with 100% confidence. Traditional approaches to this problem need a minimum support (minsup) threshold and then can discover the top rules with supports ≥ minsup; such approaches, however, rely on minsup to help avoid examining too many candidates and they miss those top rules whose supports are below minsup. The low support top rules (e.g. some unusual combinations of some factors that have always caused some disease) may be very interesting. Fundamentally different from previous work, our proposed method uses a dataset partitioning technique and two border-based algorithms to efficiently discover all top rules with a given consequent, without the constraint of support threshold. Importantly, we use borders to concisely represent all top rules, instead of enumerating them individually. We also discuss how to discover all zero-confidence rules and some very high (say 90%) confidence rules using approaches similar to mining top rules. Experimental results using the Mushroom, the Cleveland heart disease, and the Boston housing datasets are reported to evaluate the efficiency of the proposed approach.
1
Introduction
Association rules [1] were proposed to capture significant dependence between items in transactional datasets. For example, the association rule {tea, cof f ee} → {sugar} says it is highly likely that a customer purchasing tea and coffee also purchases sugar; its likelihood is measured by its confidence (the percentage of transactions containing tea and coffee which also contain sugar). In this work, we are mainly interested in the efficient mining of association rules with 100% confidence, which we call the top rules. Observe that if X1 → X2 is a top rule, then any transaction containing X1 must also contain X2 . The following example shows the usefulness of top rules. Example 1. For the Cleveland heart disease dataset (taken from UCI ML repository), we have found many top rules. Two typical top rules are: {having ST-T wave abnormality, exercise induced angina} → P resence and {left ventricular hypertrophy, downsloping of the peak exercise ST segment, thal: fixed defect} → {CP = 1, f bs = 1}. The first rule means that if the two symptoms on the left-hand side of the rule appear then the patients definitely suffer from a heart disease of some degree (either slight or serious). The second rule means that no matter male or female and no matter suffering from heart _ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 406–411, 1999. c Springer-Verlag Berlin Heidelberg 1999
Efficient Mining of High Confidence Association Rules
407
disease or not, once holding the left-hand side symptoms the patients must suffer from a typical angina and their fasting blood sugar > 120 mg/dl. Knowing all rules of this type can be of great help to medical practitioners. Traditional approaches to discovering association rules usually need two steps: (i) find all itemsets whose supports are ≥ minsup; (ii) from the result of step (i), find all association rules whose confidences are ≥ minconf and whose support are ≥ minsup. Observe, however, using this procedure, those top rules with supports less than minsup are missed. Although the method proposed in [2] can effectively extract high confidence rules, a minsup threshold is still imposed on the discovered rules. Another disadvantage of these approaches is that they need to explicitly enumerate all discovered rules and to explicitly check all candidate rules; this would require a long processing time and I/O time if the number of top rules or the number of candidates is huge. Fundamentally different from previous work, we propose in this paper an emerging pattern (EP) [4] based approach to efficiently discover all top rules, without minsup limitation, when given the consequent of the expected top rules. Given a dataset D, the proposed approach first divides D into two sub-datasets according to the consequent of the expected top rules and then uses the border-based algorithms to mine a special kind of itemsets whose supports in one sub-dataset are zero but non-zero in the other sub-dataset. This special kind of itemsets are called jumping EPs [5]. All desired top rules can then be readily built using jumping EPs. As the border-based algorithms are very efficient, the proposed approach is also efficient. Furthermore, we do not enumerate all top rules; we use borders [4] to succinctly represent them instead. The significance of this representation is highlighted in the experimental results of Mushroom dataset, where there exist a huge number of top rules. In addition to top rules, we also address the problems of mining zero-confidence rules and mining very high (say ≥ 90%) confidence rules with similar approaches to mining top rules. Organization: §2 formally introduces the problem of mining top rules. §3 discusses our approach to this mining problem. §4 discusses how to discover high confidence rules (and zero-confidence rules). The experimental results are shown in §5 to evaluate the performance of our approaches. §6 discusses how to use the border mechanism to efficiently find top rules with support or length constraints. §7 concludes this paper.
2
Problem Definition
Given a set I = {i1 , i2 , · · · , iN } of items, a transaction is a subset T of I, and a dataset is a set D of transactions. The support of an itemset X in a dataset D, denoted as suppD (X), D (X) is count , where countD (X) is the number of transactions in D containing X. An |D| association rule is an implication of the form X → Y , where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅; X is called the left-hand side (or antecedent) of the rule and Y the righthand side (or consequent). A rule X → Y is associated with two measures: its support D (XY ) which is defined as suppD (XY ) and its confidence which is defined as count countD (X) (equivalently,
suppD (XY ) suppD (X) ).
Definition 1. A top rule is an association rule whose confidence is exactly 100%. In many applications, for example when examining the edibility of mushroom or making decisions on heart disease, the goal is to discover properties of a specific class of
408
J. Li et al.
interest. We capture such situations by requiring the mined rules to have a given Target as consequent 1 and intend to find all such rules. Definition 2. Given a set of transactions D and an itemset Target, the problem of mining top rules is to discover all top rules with Target as the consequent. In Section 4 we also consider discovering zero-confidence rules and µ-level confidence rules based on the approach to mining top rules.
3
Discovering Top Rules
We solve the problem of mining top rules in two steps: 1. Dataset Partitioning and Transformation. Divide D into sub-datasets D1 and D2 : D1 consists of the transactions containing Target; D2 consists of the transactions which do not contain Target. Then, we remove all items of Target from D1 and D2 . As a result, D1 and D2 are transformed into D10 and D20 respectively and the set of items becomes I 0 = I − T arget. All these can be done in one pass over D. 2. Discovery of Top Rules. Find all those itemsets X which occur in D10 but do not occur in D20 at all. Then, for every such X, the rule X → T arget is a top rule (with confidence of 100%) in D. Correctness. By definition, the confidence of the rule X → T arget is the percentage of the transactions in D containing X which also contain Target. As the discovered itemsets X only occur in D10 but not in D20 , those transactions in D which contain X must also contain Target. This means that X → T arget has confidence of 100% in D. In contrast, if some itemset Y occurs in both D10 and D20 , then it is not true that those transactions in D which contain Y must also contain Target, by the constructions of D10 and D20 . We call those itemsets X, which only occur in D10 but not in D20 , the jumping emerging patterns (jumping EPs) from D20 to D10 [5]. Observe that their supports in D10 are non-zero but zero in D20 . It is now obvious that the key step in mining top rules in D is to discover the jumping EPs between two relevant datasets because the discovered jumping EPs are the antecedents of top rules. In the work of [5], the problem of mining jumping EPs is formally defined and well solved. The high efficiency of those algorithms is a consequence of their novel use of borders [4] - an efficient representation mechanism. For the problem of mining jumping EPs, two border-based algorithms have been proposed in [5]. The first one is called Horizon-Miner and the second is based on MBD-LLborder of [4]. Horizon-Miner is used to discover a special border, called horizontal border in [5], from a dataset such that this special border represents precisely 1
As an aside, we note that it is very easy to generate such top rules as T arget → X. First, create a new dataset which consists of all those transactions in D containing Target. Second, remove all items of Target from this new dataset to form D 0 . Then, all itemsets X with 100% support in D0 can be used to build the top rules T arget → X.
Efficient Mining of High Confidence Association Rules
409
all itemsets with non-zero supports in this dataset. When taking two horizontal borders derived from D10 and D20 by Horizon-Miner as input, the MBD-LLborder algorithm can produce all itemsets whose supports in D10 are non-zero but zero in D20 . Therefore, this output of MBD-LLborder is just our desired jumping EPs from D20 to D10 . More details of Horizon-Miner and MBD-LLbordercan be found in [4,5].
4
Discovering Zero-Confidence Rules and µ-Level Confidence Rules
As shown above, a jumping EP X from D20 to D10 corresponds to a top rule X → T arget and vice versa in D. On the other hand, any jumping EP Z from D10 to D20 corresponds to the rule Z → T arget with zero-confidence in D. This is because itemset Z only occurs in D20 , and so all transactions in D which contain Z must not contain Target. Observe that zero-confidence rules reveal absolutely (100%) negative correlation between two events. Procedurally, there is only slight difference between the problems of mining top rules and mining zero-confidence rules. In this paper, we are also interested in the problem of mining µ-level confidence rules: We refer µ-level confidence rules as those association rules whose confidences are ≥ 1−µ. We will show that the parameter µ strongly depends on the ratio of two supports in D10 and D20 of itemsets Y . Let suppi (Y ) denote the support of Y in Di0 , i = 1, 2. By definition, the confidences of all µ-level confidence rules Y → T arget in D supp (Y )∗|D 0 | satisfy: supp1 (Y )∗|D10 |+supp21 (Y )∗|D0 | ≥ 1 − µ, where |Di0 | is the number of transactions 1
2
|D20 | 1−µ supp1 (Y ) supp2 (Y ) ≥ |D10 | ∗ µ . This means that, for any itemset Y , if its support ratio, |D20 | supp1 (Y ) 1−µ supp2 (Y ) , is ≥ |D10 | ∗ µ , then the rule Y → T arget has a confidence ≥ 1 − µ in D.
in Di0 . So,
Therefore, the problem of mining µ-level confidence rules is transformed to discovering |D 0 | all itemsets Y whose support ratio from D20 to D10 is ≥ |D20 | ∗ 1−µ µ . Obviously, this 1 is equivalent to the problem of mining ρ-emerging patterns (ρ-EPs) [4] from D20 to |D 0 | 0 0 D10 , where ρ = |D20 | ∗ 1−µ µ . In [4], ρ-EPs from D2 to D1 is defined as those itemsets 1 whose support ratio (or growth rate) from D20 to D10 is ≥ ρ. Typically, it is difficult to find all ρ-EPs over two large datasets though some border-based algorithms [4] have been proposed to find some ρ-EPs (including some long EPs which cannot be efficiently discovered by naive algorithms). While this reduction allows us to find some µ-level confidence rules, it is still a problem needing further investigation to find all of them.
5
Experimental Results
We selected three datasets from UCI Machine Learning repository [8]: the Mushroom dataset, the Cleveland heart disease dataset, and the Boston housing dataset, to evaluate our ideas and algorithms. In this work, discretization of numeric attributes is performed using the techniques discussed in [7,6]. As discussed in Section 3, each jumping EP from D20 to D10 corresponds to a desired top rule. Therefore, we mainly present the results about jumping EPs. We use a certain number of borders in the form of to represent the discovered jumping EPs, where L may contain many itemsets but R is a singleton set. For the Mushroom dataset, setting Target as {Poisonous} or {Edible} or {P opulation = several, Habitat = leaves}, we considered these questions.
– How many top rules are in this dataset approximately?
– What are the shortest and longest lengths of the discovered antecedents?
– Among the discovered top rules, what are the biggest and smallest supports?

The answers to these questions may help us classify new mushroom instances and help us recognize multi-feature characteristics of mushrooms. The smallest support among all discovered top rules is used to show the advantages of our proposed algorithms: our approach can efficiently find some top rules which cannot be found by other approaches. Interestingly, there are no jumping EPs from D′₂ to D′₁ when Target is set as {Population = scattered, Habitat = grasses}; this knowledge is useful as it reveals that this Target does not have 100% dependency on any itemsets except Target itself. We also address similar questions for the Cleveland heart disease dataset and the Boston housing dataset. Regarding the first question, we set Target as {Presence}, {Absence}, and {CP = 1, fbs = 1} for the Cleveland dataset, and HiCrime, LowCrime, and GoodHouse for the Boston housing dataset. We believe the discovered top rules would be useful for domain experts to gain a better understanding of the symptom dependency in heart disease patients or to obtain deeper demographic information about the suburbs of Boston. The experiments were done on a DEC Alpha Server 8400 machine with a CPU speed of 300 MHz and 8 GB of memory. This machine works in a network environment with many users (e.g. 74) and a high load average. The purpose of the experiments is to show the efficiency of the algorithms. We summarize the experimental results in the following table.

Target        #borders  L lengths (shortest, longest)  R avg. length  #rules        support (%) (largest, smallest)  time
Edible        4208      1, 5                           21             ≈ 4.33 × 10⁸  33.09, 0.01                      6690.1 s
Poisonous     3916      1, 6                           21             ≈ 3.75 × 10⁸  27.42, 0.01                      6686.4 s
PopHab*       48        2, 6                           20             ≈ 2.72 × 10⁷  0.59, 0.01                       485.8 s
Absence       143       1, 6                           13             12636         23.91, 0.34                      7.32 s
Presence      123       2, 6                           13             195552        9.76, 0.34                       8.65 s
TypicalSymp*  3         3, 7                           12             2000          0.67, 0.34                       0.24 s
HiCrime*      40        2, 7                           13             56232         2.57, 0.2                        3.41 s
LowCrime*     165       1, 7                           13             328750        9.29, 0.2                        11.80 s
GoodHouse*    3         3, 4                           11             704           0.2, 0.2                         0.37 s
In the first column of the table, the meanings of the targets marked with “*” are as follows. PopHab: Population = several & Habitat = leaves; TypicalSymp: Chest Pain Type = typical angina & Fasting Blood Sugar > 120 mg/dl; HiCrime: per capita crime rate ≥ 10%; LowCrime: per capita crime rate ≤ 0.1%; GoodHouse: pupil-teacher ratio ≥ 16% & lower status of the population ≤ 10% & $10,000 ≤ median value of owner-occupied homes < $20,000. The second column shows the number of borders representing the discovered jumping EPs. Column 3 shows the lengths of the shortest and longest itemsets in the left-hand bounds of the borders. Column 4 shows the average length of all itemsets in the right-hand bounds. Column 5 shows the approximate or exact numbers of the top rules. Column 6 shows the largest and the smallest supports among all top rules.
6 Extracting Top Rules with Support or Length Constraints
Observe that the number of top rules is far larger than the number of borders. Importantly, this confirms the effectiveness of the border representation mechanism. In our borderbased algorithms, we do not need to enumerate all jumping EPs (or, equivalently, top rules). However, if we are interested in some top rules, for example the first 100 (largest support) top rules, the border representations allow us to easily generate them. Indeed, because those top rules whose antecedents are the itemsets in the left-hand bounds of the discovered borders have the largest supports among all top rules, we can start the search with the itemsets in the left-hand bounds and then their immediate superset itemsets, and so on, covered by the borders. Furthermore, if we are interested in those top rules whose left-hand sides contain, for example, < 8 items, the discovered borders also allow us to find them quickly. Other kinds of interesting top rules, such as the ones in terms of the neighborhood-based unexpectedness, can be found by the techniques discussed in [3].
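The on-demand expansion of a border can be illustrated as follows (our own sketch, with a simplified border consisting of one left-bound itemset and one right-bound itemset). Enumerating the covered itemsets level by level from the left bound yields the antecedents with the largest supports first, which is exactly the order needed to extract, for example, the first 100 top rules.

from itertools import combinations

def covered_itemsets(left, right):
    """Enumerate the itemsets Z with left <= Z <= right, by increasing size;
    smaller antecedents correspond to top rules with larger support."""
    left, right = set(left), set(right)
    free = sorted(right - left)
    for k in range(len(free) + 1):
        for extra in combinations(free, k):
            yield left | set(extra)

# a hypothetical border element: left bound {a}, right bound {a, b, c}
for itemset in covered_itemsets({"a"}, {"a", "b", "c"}):
    print(sorted(itemset))

A length constraint such as "fewer than 8 items in the antecedent" simply stops the enumeration at the corresponding level.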
7 Conclusion
In this paper we have introduced the problem of mining high confidence association rules, and considered the efficient mining of top rules, of zero-confidence rules, and of µ-level confidence rules. Fundamentally different from the traditional approaches to discovering high-confidence rules, we have used a novel dataset partitioning technique and two border-based algorithms to discover the desired jumping EPs of the two relevant sub-datasets. Then, the discovered jumping EPs are used to construct the top rules. The advantages of our approach include: (i) the algorithms can find the top rules without the constraint of support threshold; (ii) the discovered top rules are succinctly represented by borders. The use of borders help us avoid exponential enumeration of huge collections of itemsets. This approach effectively and efficiently discovered top confidence rules from real high dimensional datasets.
References 1. Agrawal R., Imielinski T., Swami A.: Mining association rules between sets of items in large databases. Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data. Washington, D.C., (1993) 207–216 2. Bayardo, R.J.: Brute-force mining of high-confidence classification rules. Proc. of the Third Int’l Conf. on Knowledge Discovery and Data Mining. (1997) 123–126. 3. Dong, G., Li, J.: Interestingness of discovered association rules in terms of neighborhoodbased unexpectedness. Proceedings of Pacific Asia Conference on Knowledge Discovery in Databases (PAKDD’98), Melbourne. (1998) 72–86 4. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences. SIGKDD’99, San Diego. (to appear) 5. Dong, G., Li, J., Zhang, X.: Discovering jumping emerging patterns and experiments on real datasets. Proc. the 9th International Database Conference (IDC’99), Hong Kong. (to appear) 6. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI-93. (1993) 1022 – 1027 7. Kohavi, R., John, G., Long, R., Manley, D., Pfleger, K.: MLC++: a machine learning library in C++. Tools with artificial intelligence. (1994) 740 – 743 8. Murphy, P.M., Aha, D.W.: UCI Repository of machine learning database, [http://www.cs.uci.edu/ mlearn/mlrepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1994)
A Logical Approach to Fuzzy Data Analysis Churn-Jung Liau1 and Duen-Ren Liu2 1
Institute of Information Science Academia Sinica, Taipei, Taiwan [email protected] 2 Institute of Information Management National Chiao-Tung University, Hsinchu, Taiwan [email protected]
Abstract. In this paper, we investigate the extraction of fuzzy rules from data tables based on possibility theory. A possibilistic decision language is used to represent the extracted fuzzy rules. The algorithm for rule extraction is presented and a complexity analysis is carried out. Since the results of the rule induction process strongly depend on the representation language, we also discuss some approaches for dynamically adjusting the language based on the data.
1 Introduction
The results reported here are originally motivated by the quantization problem in rough set-based data mining methods [5,4]. In those methods, the induced rules are represented by the so-called decision language (DL). The basic building blocks of DL are descriptors of the form (a, v), where a is an attribute and v is a value. If the domain of attribute values is continuous, a quantization process is usually necessary to replace the value v by an interval containing it. However, to make the induced rules more robust, replacing a crisp interval by fuzzy linguistic terms may be an interesting alternative. In this paper, we propose a possibilistic decision logic which facilitates the representation of fuzzy rules and so solves the quantization problem to some extent. The possibilistic decision logic is based on possibility theory, which was developed by Zadeh from fuzzy set theory [8]. Given a universe W, a possibility distribution on W is a function π : W → [0, 1]. Obviously, π is the characteristic function of a fuzzy subset of W. Let F(W) denote the class of all fuzzy subsets of W; then for X, Y ∈ F(W), two measures may be defined:

  CON(X, Y) = sup_{w∈W} µ_X(w) ⊗ µ_Y(w),
  INC(Y, X) = inf_{w∈W} µ_Y(w) →_⊗ µ_X(w),

where ⊗ : [0, 1] × [0, 1] → [0, 1] is a t-norm (a binary operation ⊗ is a t-norm iff it is associative, commutative, increasing in both places, and 1 ⊗ a = a and 0 ⊗ a = 0 for all a ∈ [0, 1]) and →_⊗ is the implication function defined as a →_⊗ b = 1 − (a ⊗ (1 − b)) for all a, b ∈ [0, 1]. Hence, CON(X, Y)
denotes the degree of intersection between X and Y, whereas INC(Y, X) is the degree of inclusion of Y in X.
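For finite universes these two measures are easy to compute. The following sketch (ours, not the authors'; the product t-norm is only one admissible choice) evaluates CON and INC for fuzzy subsets given as dictionaries of membership degrees.

def t_norm(a, b):
    return a * b                      # product t-norm; min(a, b) would also do

def implication(a, b):
    return 1 - t_norm(a, 1 - b)       # a ->_x b = 1 - (a x (1 - b))

def CON(X, Y, W):
    """Degree of intersection: sup_w mu_X(w) x mu_Y(w)."""
    return max(t_norm(X.get(w, 0.0), Y.get(w, 0.0)) for w in W)

def INC(Y, X, W):
    """Degree of inclusion of Y in X: inf_w mu_Y(w) ->_x mu_X(w)."""
    return min(implication(Y.get(w, 0.0), X.get(w, 0.0)) for w in W)

W = ["w1", "w2", "w3"]
X = {"w1": 1.0, "w2": 0.4}
Y = {"w1": 0.7, "w2": 0.4, "w3": 0.1}
print(CON(X, Y, W), INC(Y, X, W))     # 0.7 and 0.76 with the product t-norm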
2 Possibilistic Decision Logic
To represent the rules extracted from fuzzy data tables, we propose a possibilistic decision logic (PDCL) here. The linguistic terms used in the logic are fixed in advance and their meaning is given by a context. Once the context is determined, the semantics of wffs of the logic can be defined via possibility theory.

2.1 Syntax and Context
Let A be a set of attributes, L be a set of linguistic terms such that a function type : L → A assigning each linguistic term with its type, and H be a set of linguistic hedges, then the atomic formulas of PDCL are in one of the forms, (a, πl), (a, νl), (a, τ πl), and (a, τ νl), where a ∈ A, l ∈ L, τ ∈ H and type(l) = a. The set of well-formed formulas of PDCL is the smallest set containing atomic ones and closed under Boolean connectives. For example, (t, νhigh), (t, πhigh), (t, veryνhigh), and (t, veryπhigh) denote respectively “the temperature is certainly high”, “the temperature is possibly high”, “it is very certain the temperature is high”, and “it is very possible the temperature is high”. Here, the term ”very” is the so-called linguistic hedge. It is well-known that many natural language terms are highly context-dependent. For example, the term “tall” may have quite different meanings between “a tall basketball player” and “a tall child”. To model the context-dependency, we associate a context with each PDCL. The context determines the domain of values of each attributes and assigns appropriate meaning to each linguistic term and hedge. Formally, a context associated with a PDCL is a triple ({Va }a∈A , m1 , m2 ), where Va is a domain of values for each a ∈ A, m1 is a function on L such that m1 (l) ∈ F(Va ) if type(l) = a, and m2 : H → ([0, 1] → [0, 1]) is a function mapping each hedge to a function from [0, 1] to [0, 1]. While the domains Va and m1 are totally determined by the users or linguistic experts to reflect the intended meaning of these attributes and linguistic terms, there exist some common definitions for the linguistic hedges in the literature[1]. 2.2
Semantics
Given a PDCL with set of attributes A, set of linguistic terms L, set of linguistic hedges H, and a context ({Va }a∈A , m1 , m2 ), a fuzzy data table (FDT) is a pair S = (U, F (A)), where U is a finite set of objects and F (A) = {fa : U → F(Va ) | a ∈ A}. Intuitively, fa (x) denotes the uncertain value of attribute a for object x. Thus fa (x) = Va when the value is missing and fa (x) is a singleton when the value is precise. This means FDT’s can represent both precise and imprecise data in a uniform framework. Let L denote the set of wffs of the PDCL, then for an FDT S = (U, F (A)), we can define the truth valuation function ES : U × L → [0, 1] as follows:
1. E_S(x, (a, πl)) = sup_{v∈V_a} µ_{m₁(l)}(v) ⊗ µ_{f_a(x)}(v)
2. E_S(x, (a, νl)) = inf_{v∈V_a} µ_{f_a(x)}(v) →_⊗ µ_{m₁(l)}(v)
3. E_S(x, (a, τπl)) = m₂(τ)(E_S(x, (a, πl)))
4. E_S(x, (a, τνl)) = m₂(τ)(E_S(x, (a, νl)))
5. E_S(x, ¬ϕ) = 1 − E_S(x, ϕ)
6. E_S(x, ϕ ∧ ψ) = E_S(x, ϕ) ⊗ E_S(x, ψ)
7. E_S(x, ϕ ∨ ψ) = E_S(x, ϕ) ⊕ E_S(x, ψ), where ⊕ is a t-conorm defined by a ⊕ b = 1 − (1 − a) ⊗ (1 − b)
8. E_S(x, ϕ → ψ) = E_S(x, ϕ) →_⊗ E_S(x, ψ)
9. E_S(x, ϕ ≡ ψ) = E_S(x, ϕ → ψ) ⊗ E_S(x, ψ → ϕ)

We will define [|ϕ|]_S = ⊗_{x∈U} E_S(x, ϕ) as the truth degree of ϕ with respect to an FDT S. Let p_a = (a, πl), (a, νl), (a, τπl), or (a, τνl) be an atomic formula, called an a-basic formula; then a CD-decision rule for C, D ⊆ A is a wff of the form

  ∧_{a∈C} p_a → ∧_{a∈D} p_a.

When ϕ is a CD-decision rule, [|ϕ|]_S will be the strength of the rule according to our semantics. In the next section, we will present the approach to discover this kind of rule from an FDT.
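The recursive definition of E_S can be transcribed directly for finite attribute domains. The sketch below is our own illustration (wffs are encoded as nested tuples, the product t-norm is assumed, and the ≡ clause is omitted for brevity); it is not the authors' implementation.

def t_norm(a, b): return a * b
def t_conorm(a, b): return 1 - (1 - a) * (1 - b)
def impl(a, b): return 1 - t_norm(a, 1 - b)

def E(x, phi, fa, m1, m2, V):
    """Truth value E_S(x, phi). fa[a][x] and m1[l] are fuzzy sets over V[a],
    given as dicts of membership degrees; m2 maps each hedge to a function
    from [0, 1] to [0, 1]."""
    op = phi[0]
    if op == "pi":                    # (a, pi l)
        _, a, l = phi
        return max(t_norm(m1[l].get(v, 0), fa[a][x].get(v, 0)) for v in V[a])
    if op == "nu":                    # (a, nu l)
        _, a, l = phi
        return min(impl(fa[a][x].get(v, 0), m1[l].get(v, 0)) for v in V[a])
    if op == "hedge":                 # (a, tau pi l) or (a, tau nu l)
        _, tau, sub = phi
        return m2[tau](E(x, sub, fa, m1, m2, V))
    if op == "not":
        return 1 - E(x, phi[1], fa, m1, m2, V)
    if op == "and":
        return t_norm(E(x, phi[1], fa, m1, m2, V), E(x, phi[2], fa, m1, m2, V))
    if op == "or":
        return t_conorm(E(x, phi[1], fa, m1, m2, V), E(x, phi[2], fa, m1, m2, V))
    if op == "imp":
        return impl(E(x, phi[1], fa, m1, m2, V), E(x, phi[2], fa, m1, m2, V))
    raise ValueError(op)

def strength(phi, U, fa, m1, m2, V):
    """[|phi|]_S: aggregate the truth values over all objects with the t-norm."""
    result = 1.0
    for x in U:
        result = t_norm(result, E(x, phi, fa, m1, m2, V))
    return result

In this encoding a CD-decision rule is an ("imp", antecedent, consequent) formula whose two sides are conjunctions ("and") of basic formulas, and strength() returns its strength.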
3 Rule Induction Process
In the traditional rough set based approach to data analysis, for a descriptor (a, v), v appears somewhere in the table, so we do not have to fix the atomic formulas of the decision logic in advance. However, in an FDT, it is possible that for some numerical attribute a, f_a(x) has a precise value, and to avoid the quantization problem, we would still like to use some linguistic terms to represent the induced rules. For example, we may have a data table with precise temperature values and want to discover rules of the form “If temperature is high, then . . .”. Thus it is necessary to fix the set of linguistic terms L of our PDCL in advance. On the other hand, if the linguistic terms and the context are given independently of the FDT, it is possible that the data values are not completely covered by these terms. To resolve the dilemma, we will first describe the rule induction algorithm by assuming that a fixed set of linguistic terms and its associated context are given by the domain experts, and then consider the process of setting up or adjusting the language and context. For simplicity, we temporarily assume H = ∅ and omit the m₂ component of a context. Without loss of generality, we can also assume the set of decision attributes is a singleton. The algorithm is described in Figure 1. In the first step of the procedure, we test whether x is a support of the a-basic formulas (a, πl) and (a, νl) for each x ∈ U, a ∈ C, and l ∈ L_a. If the degree of consistency between f_a(x) and the linguistic term l is equal to 1, then x supports the statement (a, πl), and if f_a(x) implies l to degree 1, then x supports the statement (a, νl). Since P_x is the Cartesian product of P_x^a for all
Procedure Rule Induction
Input: A PDCL L(A, L), a context ({V_a}_{a∈A}, m₁), and an FDT S = (U, F(A)). Assume L = ∪_{a∈A} L_a and A = C ∪ {d}, where L_a = {l ∈ L | type(l) = a}, C is the set of condition attributes, and d is the decision attribute.
Output: A set of C{d}-decision rules with strength.
Steps:
1. Let P_x^a = {πl | l ∈ L_a, CON(m₁(l), f_a(x)) = 1} ∪ {νl | l ∈ L_a, INC(f_a(x), m₁(l)) = 1}.
2. Let P_x = ×_{a∈C} P_x^a and P = ∪_{x∈U} P_x.
3. For each tuple t ∈ P and l ∈ L_d, let
     ϕ_π(t, l) = ∧_{a∈C} (a, t(a)) → (d, πl),
     ϕ_ν(t, l) = ∧_{a∈C} (a, t(a)) → (d, νl).
4. Return the set {(ϕ_π(t, l), [|ϕ_π(t, l)|]_S) | t ∈ P, l ∈ L_d} ∪ {(ϕ_ν(t, l), [|ϕ_ν(t, l)|]_S) | t ∈ P, l ∈ L_d}.
End

Fig. 1. The procedure to discover fuzzy rules
a ∈ C, for any tuple t ∈ P_x, x will be the support of the wff ∧_{a∈C} (a, t(a)). Thus each tuple in P corresponds to a conjunctive wff which has at least one support in the FDT S, and our extracted rules are those with at least one support. The number of supports for a tuple t in P (and its corresponding rules) is equal to the number of x's such that t ∈ P_x. For further refinement, we can eliminate rules whose number of supports is less than some predefined threshold value. Analogously, we return rules with arbitrary strength at the last step; however, we can also set a threshold and drop any rules with strength less than the threshold. To carry out the complexity analysis of the rule induction process, let us define |U| = m, |L_a| = n_a for all a ∈ A, and |L| = n = ∑_{a∈A} n_a. Then step 1 of the procedure will need O(mn) time for all x ∈ U and a ∈ A, since it takes at most 2m(n − n_d) computations of the measures CON and INC. In step 2, the cardinality of P is at most ∏_{a∈C} 2n_a, so step 3 will need O(∏_{a∈A} 2n_a) time, and finally step 4 will take O(m · ∏_{a∈A} 2n_a) time since the computation of [|ϕ|]_S goes through each element of U for any ϕ. If there exists a constant N such that n_a ≤ N for all a ∈ A, then the overall time complexity of the procedure is O(m(n + (2N)^|A|)). Thus the time complexity of the procedure is linear in the number of training cases, though it is exponential in the number of attributes. In other words, the procedure is efficient for problems with small numbers of attributes.
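The procedure of Figure 1 can be prototyped in a few lines for finite domains. The sketch below is our own reading of the figure (H = ∅, one decision attribute d, product t-norm); the data structures and names are invented for the example, so it is an illustration rather than the authors' implementation.

from itertools import product

def t_norm(a, b): return a * b
def impl(a, b): return 1 - a * (1 - b)
def CON(X, Y): return max(X.get(v, 0) * Y.get(v, 0) for v in set(X) | set(Y))
def INC(Y, X): return min(impl(Y.get(v, 0), X.get(v, 0)) for v in set(X) | set(Y))

def induce_rules(U, C, d, fa, m1, L):
    """Steps 1-4 of Figure 1. C is a list of condition attributes; fa[a][x] is the
    fuzzy value of attribute a for object x; m1[l] is the meaning of term l;
    L[a] lists the terms of type a."""
    # Step 1: basic formulas supported by each object and condition attribute
    P = {x: {a: [("pi", l) for l in L[a] if CON(m1[l], fa[a][x]) == 1.0] +
                [("nu", l) for l in L[a] if INC(fa[a][x], m1[l]) == 1.0]
             for a in C}
         for x in U}
    # Step 2: tuples of basic formulas having at least one supporting object
    tuples = set()
    for x in U:
        tuples |= set(product(*(P[x][a] for a in C)))
    # Steps 3-4: build the decision rules together with their strengths
    rules = []
    for t in tuples:
        for mode, l in [("pi", l) for l in L[d]] + [("nu", l) for l in L[d]]:
            strength = 1.0
            for x in U:
                ante = 1.0
                for a, (m, la) in zip(C, t):
                    val = CON(m1[la], fa[a][x]) if m == "pi" else INC(fa[a][x], m1[la])
                    ante = t_norm(ante, val)
                cons = CON(m1[l], fa[d][x]) if mode == "pi" else INC(fa[d][x], m1[l])
                strength = t_norm(strength, impl(ante, cons))
            rules.append((t, (mode, l), strength))
    return rules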
3.1 Context Construction and Adjustment
In the procedure given above, we assume the set of linguistic terms and their meaning is fixed in advance. For example, for the temperature attribute, we can assume the available linguistic terms are only “HIGH”, “MEDIUM”, and “LOW” and the associated fuzzy sets are given by domain experts. This assumption has the implication that the possible candidates for the induced rules are limited. However, it also has the defect that the linguistic terms may not adequately describe the data. In other words, the set P_x^a may be empty for all x ∈ U. To resolve the problem, we may adjust the context and the set of linguistic terms dynamically according to the FDT. According to the types of the attributes, two cases are considered. First, if the attribute a is nominal, then we can simply let L_a = V_a and, for each v ∈ L_a, its meaning is the singleton set {v}. In this case, the semantics of the atomic formulas (a, πv) and (a, νv) collapse into the same one, i.e., the one for (a, v) in the original decision logic, when the data f_a(x) is precise for all x ∈ U. Second, when the attribute a is numerical, we assume a metric δ : V_a × V_a → [0, ∞) exists. Then we define V₁ = {f_a(x) | x ∈ U, f_a(x) is a crisp set}, V₂ = {f_a(x) | x ∈ U} − V₁ − {V_a}, and V = ∪V₁. Note that all these three sets are finite and V ⊂ V_a. Now, some classical clustering techniques can be applied to partition V into crisp clusters V₁, V₂, . . . , V_k [2,3]. For example, by a linkage method for hierarchical clustering [3], starting from the partition into singletons, we merge the two singleton sets with shortest distance into one. The newly created set is used to replace its two original components, and so we have a coarser partition. The process continues by merging the two closest subsets each time until a predefined limit k is achieved. Here, the distance between two crisp subsets X and Y of V is defined as δ(X, Y) = max_{x∈X, y∈Y} δ(x, y). After the clustering process, we can find the center v_i^* of each V_i as

  v_i^* = arg min_{v∈V_i} δ({v}, V_i − {v}).

Let d_i^* = δ({v_i^*}, V_i − {v_i^*}); then we can define a fuzzy set X_i ∈ F(V_a) with the membership function µ_{X_i}(x) = max(0, 1 − δ(v_i^*, x)/d_i^*). Let V₃ = V₂ ∪ {X₁, . . . , X_k}; then V₃ is the set of candidates for the meanings of our linguistic terms. If the number of fuzzy sets in V₃ is still too large, then we can apply clustering techniques to V₃ again, but using a similarity measure between fuzzy sets to determine the distance between them. Then the center of each cluster is collected into a set V. Finally, we can associate each element in V with an appropriate linguistic label; the set of linguistic labels is our L_a and their meanings are naturally defined as the corresponding elements in V.
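For the numerical case, the construction amounts to a complete-linkage clustering of the observed crisp values followed by a linearly decreasing membership function around each cluster centre. The sketch below is our own simplified rendering for scalar values with δ(x, y) = |x − y| (the width used for singleton clusters is an arbitrary choice of the example), not the authors' implementation.

def complete_linkage(values, k):
    """Merge singleton clusters until k clusters remain, always joining the
    two clusters with the smallest complete-linkage distance."""
    clusters = [[v] for v in sorted(set(values))]
    dist = lambda A, B: max(abs(a - b) for a in A for b in B)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

def fuzzy_terms(values, k):
    """One fuzzy set per cluster: the centre v* minimises the distance to the
    rest of the cluster, and membership decreases linearly, reaching 0 at d*."""
    terms = []
    for cluster in complete_linkage(values, k):
        if len(cluster) == 1:
            centre, d_star = cluster[0], 1.0      # arbitrary width for singletons
        else:
            centre = min(cluster, key=lambda v: max(abs(v - w) for w in cluster if w != v))
            d_star = max(abs(centre - w) for w in cluster if w != centre)
        terms.append((centre, lambda x, c=centre, d=d_star: max(0.0, 1 - abs(x - c) / d)))
    return terms

# hypothetical temperature readings
for centre, mu in fuzzy_terms([12, 14, 15, 30, 31, 33, 45], k=3):
    print(centre, [round(mu(x), 2) for x in (10, 15, 30, 45)])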
4 Conclusion
In a recent article, Pawlak, the founder of rough set theory, points out that discretization of quantitative attribute values is badly needed for rough set-based data analysis [6]. In this regard, many discretization methods have been
explored [4]. On the other hand, the management of uncertainty has been a long-standing requirement in intelligent data analysis. In this paper, we present a uniform logical framework for handling both uncertain and quantitative data. In the framework, uncertain attribute values are represented as fuzzy subsets of the domain, and quantitative values may belong to some fuzzy sets to different degrees. Linguistic terms corresponding to the fuzzy subsets are taken as the basic building blocks of a PDCL. Then, for each item of data, the information contained in it decides the degree of truth of wffs of the language. A rule in our framework is an implication formula of the language, and the aggregated degree of truth of the formula on all data items is taken as the strength of the rule. The formulas of decision logic are called information pre-granules in [7], so our wffs of PDCL can be analogously called fuzzy information pre-granules. Therefore, our logical approach to fuzzy data analysis can be seen as a formal instance of fuzzy granular information processing.
References 1. R. L´ opez de M´ antaras and L. Godo. “From fuzzy logic to fuzzy truth-valued logic for expert systems: A survey”. In Proceedings of the 2nd IEEE International Conference on Fuzzy Systems, pages 75–755, San Francisco, CA, 1993. IEEE. 2. A. Kandel. Fuzzy Mathematical Techniques with Applications. Addison-Wesley Publishing Co., 1986. 3. S. Miyamoto. Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer Academic Publishers, 1990. 4. H.S. Nguyen and S.H. Nguyen. “Discretization methods in data mining”. In L. Polkowski and A. Skowron, editors, Rough Sets in Knowledge Discovery, pages 451–482. Physica-Verlag, 1998. 5. Z. Pawlak. Rough Sets–Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991. 6. Z. Pawlak. “Rough sets: Present state and perspectives”. In Proceedings of the 6th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems(IPMU), pages 1137–1145, 1996. 7. L. Polkowski and A. Skowron. “Towards Adaptive Calculus of Granules”. In Proceedings of the 7th IEEE International Conference on IFuzzy Systems, pages 111–116, 1998. 8. L.A. Zadeh. “Fuzzy sets as a basis for a theory of possibility”. Fuzzy Sets and Systems, 1(1):3–28, 1978.
AST: Support for Algorithm Selection with a CBR Approach Guido Lindner1 and Rudi Studer2 DaimlerChrysler AG, Research & Technology FT3/KL, PO: DaimlerChrysler AG, T-402, D-70456 Stuttgart, Germany 1
2
[email protected]
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany [email protected]
Abstract. Providing user support for the application of Data Mining
algorithms in the field of Knowledge Discovery in Databases (KDD) is an important issue. Based on ideas from the fields of statistics, machine learning and knowledge engineering we provided a general framework for defining user support. The general framework contains a combined top-down and bottom-up strategy to tackle this problem. In the current paper we describe the Algorithm Selection Tool (AST) that is one component in our framework. AST is designed to support algorithm selection in the knowledge discovery process with a case-based reasoning approach. We discuss the architecture of AST and explain the basic components. We present the evaluation of our approach in a systematic analysis of the case retrieval behaviour and thus of the selection support offered by our system.
1 Introduction

It is well known that there is no best algorithm for all problems [8]. However, what exactly is defined as best strongly depends on application specific goals and the characteristics of the available data. Where application specific goals should be requested from the user, meta data on the data can be calculated automatically. An approach integrating this user interaction and the calculation of domain characteristics as a top-down and bottom-up strategy is described in [1]. It is our firm opinion that user interaction with the goal of getting the user's restrictions on the functionality of a data mining application has to form an integral part of every approach to algorithm selection. Both Consultant [4] and Statlog [7] have different disadvantages when considering the application of these approaches in real-life scenarios. Consultant uses a static rule set which discriminates between a set of possibly applicable algorithms [9]. Such an approach is very difficult to maintain: each time a new algorithm has to be included one has to recompute all the rules. The Statlog
project tried to describe data sets for a meta learning step to generate rules that specify in which case which algorithm is (possibly) applicable. The generated rules use hard boundaries within their condition part. However, instead of hard boundaries one would like to have more fuzzy conditions. A CBR approach enables a smooth similarity calculation for similar application problems. The main idea is to recommend to the user an algorithm or a set of algorithms based on the most similar cases that are found in the case base. Such a case is defined by application restrictions, a description of the data and experience gained in former applications. The basic architecture of our AST (Algorithm Selection Tool) system is described in section 2. The description of the data, called data characteristics, is outlined in section 3. Another advantage of CBR is the possibility to extend the model with a characterization of algorithms. In this case also queries about similar algorithms are possible. Examples of algorithm descriptions are presented in section 4. Finally, we discuss first evaluations of our system AST in section 5 and give an outlook on future work in the last section 6.

Fig. 1. Architecture of AST (figure: dataset, DCT data characteristic, application restriction, UGM task analysis, CBR tool, similar case, and the case base storing dataset, algorithm, error rate, training time and classification time)
2 Architecture of AST

The top-level architecture of our AST system is shown in Figure 1. As outlined in the introduction, the problem of algorithm selection is a decision based on three aspects: application restrictions, given data and existing experience. Embedded in our UGM (User Guidance Module) approach [1], the application restrictions are analyzed by the task analysis component. In addition, the user can also feed his/her restrictions directly into the system. Application restrictions address aspects like the interpretability of the generated model or the amount of training time that may be used (see section 4 for more details). From the given data, the data characterization tool (DCT)¹ computes data characteristics with a focus on our algorithm selection scenario (see section 3 for more details).
¹ DCT was developed by Guido Lindner and Robert Engels in collaboration with the master theses of U. Zintz and C. Theusinger. A first presentation can be found in [2], which focuses on supporting data mining pre-processing.
Fig. 2. Case structure in AST (figure: Case, linked to Method (algorithm, model type, train class, test class), to the experience (Error, TrainTime, TestTime), and to DataChar (DataSet, DC_gen, DC_num, DC_sym))

The existing experience contains knowledge about the application of a specific algorithm to a given dataset, e.g. the error rate or the used training time. From the three aspects just discussed we derived the structure of the cases in our case base (see figure 2). In general, a CBR approach distinguishes between a problem description and a solution description. The problem description contains all information that is known about the current problem. In our algorithm selection scenario the problem description is defined by the data characteristics and the application restrictions, i.e. the algorithm description. These descriptions may be partial. The solution description is the completion of the problem description. In our scenario, the solution description consists of the experience part of the case. The general workflow is that the user specifies his application requirements and that DCT computes the data characteristics for the given dataset. These two aspects constitute the problem description. In our AST system we compute the most similar cases by comparing this problem description with the problem descriptions that are found in the cases of our case base. Today, the case base contains more than 1600 cases as the result of 21 classification algorithms and more than 80 datasets. At the moment, our system is realized for classification tasks, which are an important task type in machine learning and KDD applications. The collected datasets are taken from the UCI repository [6] and from real world applications from DaimlerChrysler.
3 Data Characteristics with DCT for AST

The data characterization tool DCT computes various meta data about a given data set. Subsequently, we just briefly characterize the relevant data characteristics. The data characteristics can be separated into three different parts:
1. simple measurements or general data characteristics;
2. measurements of discriminant analysis and other measurements, which can only be computed on numerical attributes;
3. information theoretical measurements and other measurements, which can only be computed on symbolic attributes.
The first group contains measurements which can be simply calculated for the whole dataset, like the number of attributes or the default error rate. The other groups can only be computed for a subset of attributes in the dataset. The measurements of discriminant analysis are calculated only for numerical attributes, whereas the information theoretical measurements are calculated for symbolic ones. All these measurements are calculated by our data characteristic tool (DCT). The used measurements are described in [5] and are available from the web site www.aifb.uni-karlsruhe.de/publications.
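As an illustration of the first group of measurements, the following sketch (ours; the characteristic names are chosen for the example and are not DCT's) computes a few simple data characteristics of a table held as a list of records.

from collections import Counter

def simple_characteristics(records, class_attr):
    """General data characteristics: number of records and attributes,
    number of classes, and the default error rate (1 minus the relative
    frequency of the majority class)."""
    n = len(records)
    attributes = [a for a in records[0] if a != class_attr]
    numeric = [a for a in attributes if isinstance(records[0][a], (int, float))]
    class_counts = Counter(r[class_attr] for r in records)
    majority = class_counts.most_common(1)[0][1]
    return {
        "n_records": n,
        "n_attributes": len(attributes),
        "n_numeric": len(numeric),
        "n_symbolic": len(attributes) - len(numeric),
        "n_classes": len(class_counts),
        "default_error_rate": 1.0 - majority / n,
    }

records = [{"colour": "red", "size": 3.2, "class": "a"},
           {"colour": "blue", "size": 1.1, "class": "a"},
           {"colour": "red", "size": 2.4, "class": "b"}]
print(simple_characteristics(records, "class"))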
4 Algorithm Characteristics

Normally, the user can define some characteristics regarding the algorithms that should be used for his/her data mining application. For AST we started with a set of simple and easily understandable characteristics. This set of characteristics for algorithms is not complete, but it can be specified by every user, independent of his or her skill in data mining or machine learning. The following characteristics, which have to be provided by the user of the AST system, are used in our approach today (compare figure 2): algorithm or algorithm class, interpretability of the model (model type), training time and testing time. The algorithms which the system handles are modeled in a taxonomy. Such a taxonomy makes it possible to assign algorithms to algorithm classes. Currently, we use the following algorithm classes: rule learners, decision trees, neural nets, Bayes nets and instance based learners. To characterize the model that is generated by an algorithm from an application point of view, we only use the interpretability of the model and the specific value no for algorithms which compute no operational model. For the moment, we do not consider the different kinds of learning result representations. Training and testing time contain symbolic values like fast or slow. These values describe properties of the algorithms, i.e. here we make only statements about the algorithm in general and not about its behaviour in a specific application. Furthermore we have to add the selected parameter values to the algorithm descriptions. Today, all algorithms of the case base are tested with their default parameter values. One special property which is currently not supported is the cost of misclassification. This aspect will be added in the near future.
5 Experiments on Recommendation Quality

At first we have to define applicability for algorithms. In [3] three different methods are presented to define applicability. We use method 1 of that proposal: based on the error rate (ER) of the best algorithm and the number of records (NT) we compute an error margin (EM):

  EM = √( ER · (100 − ER) / NT )    (1)

An algorithm is applicable to a dataset if its error rate is smaller than ER + k · EM (k ∈ ℕ). In our evaluation we use k = 4. This definition of applicability is equal to the definition used in the Statlog project; however, we use a small constant k for all datasets to get small ranges of intervals, which define the set of applicable algorithms for a dataset.
Table 1. Applicability of the recommendation

case       best algorithm of 1st similar ∈ applicable algo.
mixed      85.71%
numeric    86.21%
symbolic   67.74%
all        79.01%
In the following we describe the procedure of our evaluation:
1. For the selected dataset each associated case is extracted from the case base. This means that 21 cases are removed from the case base (currently, we handle 21 algorithms in our case base).
2. For the selected dataset we compute the most similar dataset by comparing the data characteristics².
3. If the best algorithm of the most similar dataset is applicable to the selected dataset, we count this test as a positive recommendation.
This comparison was done for all datasets. Table 1 shows the results of this evaluation. Over all datasets the best algorithm of the most similar dataset is applicable in 79% of the cases. For applications with only numeric attributes or with numeric and symbolic (mixed) ones the rate is higher than 85%. These are rather good results. It can be seen that the result for datasets with only symbolic attributes is not so good. This is an indicator that the data characteristics for the symbolic attributes are still insufficient and that some additional measurements are needed.
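The applicability test of equation (1) and the leave-one-dataset-out evaluation can be summarised as follows (our own sketch; the data structures and the similarity function are invented placeholders, not AST's internals).

import math

def error_margin(best_error_rate, n_records):
    """EM = sqrt(ER * (100 - ER) / NT), with error rates given in percent."""
    return math.sqrt(best_error_rate * (100.0 - best_error_rate) / n_records)

def applicable_algorithms(results, n_records, k=4):
    """results: {algorithm: error rate in percent} for one dataset.
    An algorithm is applicable if its error rate is smaller than ER + k * EM."""
    best = min(results.values())
    bound = best + k * error_margin(best, n_records)
    return {alg for alg, er in results.items() if er < bound}

def evaluate(case_base, similarity):
    """Leave-one-dataset-out: how often is the best algorithm of the most
    similar dataset applicable to the held-out dataset?
    case_base: {dataset: (characteristics, n_records, results)}."""
    hits = 0
    for ds, (chars, n, results) in case_base.items():
        others = {d: v for d, v in case_base.items() if d != ds}
        nearest = max(others, key=lambda d: similarity(chars, others[d][0]))
        best_of_nearest = min(others[nearest][2], key=others[nearest][2].get)
        hits += best_of_nearest in applicable_algorithms(results, n)
    return hits / len(case_base)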
6 Conclusion and Future Work

In this paper, we introduced our CBR approach for algorithm selection and described first evaluations of our prototype system AST. Our approach has several advantages for algorithm selection. The user does not only get a recommendation of which algorithm should be applied; he/she also gets an explanation for the recommendation in the form of past experiences available in the case base. Another strong point is the maintenance of such a system. In contrast to other approaches, a new algorithm can be added to the case base without having to test this algorithm on all datasets that have been considered so far. Furthermore, with an extension of the algorithm description it will also be possible to determine similar algorithms and to compare their model generation results on similar datasets. Finally, with a CBR approach we can use similarity operators instead of the strong, hard-coded rules which are used in approaches like Statlog [7]. In future work we have to refine our case description of algorithms and datasets. A main point is to include the parameter settings of the algorithms into the case structure. We also plan to integrate our approach into an internet service for algorithm selection. In such an internet service we will offer an algorithm recommendation for a specific application problem.

² Since the UCI repository does not provide application restrictions, the problem descriptions of the cases are reduced to the data characteristics of the datasets.
References
1. R. Engels, G. Lindner, and R. Studer. A guided tour through the data mining jungle. In D. Heckerman, H. Mannila, and D. Pregibon, editors, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, August 14-17, 1997. AAAI Press, Menlo Park, CA.
2. R. Engels and Ch. Theusinger. Using a data metric for offering preprocessing advice in data mining applications. In Henri Prade, editor, Proceedings of the 13th biennial European Conference on Artificial Intelligence (ECAI-98), pages 430-434. John Wiley & Sons, 1998.
3. Joao Gama and Pavel Brazdil. Characterization of classification algorithms. In N. Mamede and C. Ferreira, editors, Advances on Artificial Intelligence - EPIA95. Springer Verlag, 1995.
4. Y. Kodratoff, D. Sleeman, M. Uszynski, K. Causse, and S. Craw. Enhancing the Knowledge Engineering Process, chapter Building a Machine Learning Toolbox. Elsevier Science Publishers, North Holland, 1992.
5. G. Lindner and R. Studer. AST: Support for Algorithm Selection with a CBR Approach. In Recent Advances in Meta Learning and Future Work, Workshop Proceedings of the ICML 1999, Bled, Slovenia, June 1999.
6. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html], 1996. Irvine, CA: University of California, Department of Information and Computer Science.
7. D. Michie, C. Taylor, and D. Spiegelhalter. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
8. Cullen Schaffer. A conservation law for generalization performance. In Proc. Eleventh Intern. Conf. on Machine Learning, pages 259-265, Palo Alto, CA, 1994. Morgan Kaufmann.
9. D. Sleeman, M. Rissakis, S. Craw, and N. Graner. Consultant-2: Pre- and Post-processing of Machine Learning Applications. International Journal of Human Computer Studies, (43):43-63, 1995.
Efficient Shared Near Neighbours Clustering of Large Metric Data Sets Stefano Lodi, Luisella Reami, and Claudio Sartori University of Bologna, Department of Electronics, Computer Science and Systems, CSITE-CNR, viale Risorgimento 2, 40136 Bologna, ITALY {slodi,csartori}@deis.unibo.it
Abstract. Very few clustering methods are capable of clustering data without assuming the availability of operations which are defined only in strongly structured spaces, such as vector spaces. We propose an efficient data clustering method based on the shared near neighbours approach, which requires only a distance definition and is capable of discovering clusters of any shape. Using efficient data structures for querying metric data and a scheme for partitioning and sampling the data, the method can cluster effectively and efficiently data sets whose size exceeds the internal memory size.
1 Introduction
An important component of a data mining system is the data clustering method, whose purpose is to solve the following problem: Given a similarity measure between pairs of points of a multidimensional space and a data set on that space, find a partition of the data set such that similar points belong to the same member of the partition and dissimilar points belong to distinct members, without utilizing a priori knowledge about the data, and with the following additional constraints: (a) The use of computing resources must be minimized and (b) the data set is large, i.e. the size of the data set exceeds the internal memory size. Note that (a) and (b) imply that I/O cost has to be minimized. In the literature, a collection of desirable additional features is usually considered, which can be summarized as follows. The method should have the ability to discover clusters of arbitrary density, shape and relative position, or weakly separated, and to detect outliers. The method should need a minimal number of input parameters and be moderately sensitive to their value. Many existing methods satisfy the above requirements. However, assumptions are often made about the mathematical structure of the space, e.g., the space is a vector space over the reals. Such assumptions usually allow for improvements in the effectiveness and efficiency of the method but result in a loss of generality. In particular, methods requiring the ability to compute operations on numbers, or a total order relation, cannot be applied easily to categorical data. We present here a revised shared near neighbours clustering method which is applicable to large metric data sets. The only requirement is that the dissimilarity measure is a distance function. The method requires little additional ˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 424–429, 1999. c Springer-Verlag Berlin Heidelberg 1999
external memory and uses a metric access method, the multi-vantage point tree, to organize data points in the available internal memory and efficiently retrieve neighbourhood information. Large data sets are dealt with by performing partial clustering steps, sampling the results and finally clustering the union of all samples.
2 Clustering Large Metric Data Sets
The shared near neighbours (SNN) cluster analysis method was introduced in [1]. First we formally define clusters according to the SNN approach, then we explain our new modified algorithm based on the approach. Let X be a data set of N multidimensional points, (Y, <) a totally ordered set, and d : X × X → Y a dissimilarity function. For any point x ∈ X, let (x_i) be an enumeration of X − {x} that orders points by their dissimilarity from x, namely such that d(x, x_p) < d(x, x_q) only if p < q. Let NN(x, k) be the set of the k first neighbours of x in the enumeration, i.e. {x_1, . . . , x_k}. For k, t ∈ ℕ, k ≥ t, define a binary relation R_{k,t} on X as follows: R_{k,t}(x, y) iff x ∈ NN(y, k), y ∈ NN(x, k), and |NN(x, k) ∩ NN(y, k)| ≥ t. Recall that a set A is closed w.r.t. a binary relation R (or R-closed) if a ∈ A and R(a, b) imply b ∈ A. A cluster in X w.r.t. k, t is a least R_{k,t}-closed subset of X. The SNN method fulfils many traditional requirements well. Clusters of different shapes, densities and relative positions are recognized correctly, and the method has the ability to separate regions when there is a sharp gradient between them. Different orderings of the data do not affect the result in practice. The globularity of the clusters increases with k, whereas increasing t tends to split large clusters into subclusters. Both parameters influence the method's behaviour smoothly and predictably. Nevertheless, the method has two main drawbacks when employed in a database environment. Firstly, computing the neighbourhood table by brute force costs O(N²) time, where N is the number of points in the data set. Secondly, the table is sized O(N). Thus, if the data set is substantially larger than the size of internal memory, then we should expect that it cannot be clustered. The first problem is addressed in [2], where points are stored in a multidimensional access structure, the KD-tree. The cost of creating a KD-tree is O(N log N) and a single k nearest neighbour query costs O(log N), resulting in dramatic improvements in the cost of filling the neighbourhood table. However, the KD-tree performs well only when dimensionality is low [3]. Moreover, it can index only numerical data, thus restricting the generality of the approach. In contrast, high dimensionality and categorical properties are not uncommon in clustering applications. To address the first problem, we propose to use the multi-vantage point tree (mvp-tree) [4] as an access method for the SNN approach. The mvp-tree is designed to minimize distance computations, therefore it is expected to perform well in high dimensional spaces. Moreover, it only requires a distance, thus it does not rule out the possibility of clustering categorical data.
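For a small data set the relation R_{k,t} and the resulting clusters can be computed directly from the definition. The sketch below (ours; brute-force neighbour lists instead of an access method, and a union-find in place of the linked-list bookkeeping used later) is only meant to make the definition operational: the clusters are the connected components of R_{k,t}.

def snn_clusters(points, dist, k, t):
    """Shared-near-neighbours clustering: merge x and y whenever each is
    among the k nearest neighbours of the other and their neighbour lists
    share at least t points (the relation R_{k,t})."""
    n = len(points)
    # k nearest neighbours of every point (brute force, O(n^2 log n))
    nn = [set(sorted(range(n), key=lambda j: dist(points[i], points[j]))[1:k + 1])
          for i in range(n)]
    label = list(range(n))                   # union-find over point indices

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if i in nn[j] and j in nn[i] and len(nn[i] & nn[j]) >= t:
                label[find(i)] = find(j)     # R_{k,t}(i, j): merge clusters
    return [find(i) for i in range(n)]

# toy example in the plane with the Euclidean distance
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
print(snn_clusters(pts, euclid, k=2, t=1))   # two clusters of three points each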
The problem of data sets exceeding internal memory can be addressed by adopting a two-phase clustering strategy, as follows. In phase one, the data set is partitioned and every partition is loaded and clustered in an internal memory area of fixed capacity. For every partition, a phase one labeling vector is stored in external memory and, from every cluster in the partition, a number of points proportional to its size is randomly sampled and stored with its label. In phase two, the union of all samples is clustered again, and a map of phase one labels onto new labels is generated. Finally, every point in the data set is given the new label its phase one label maps to. The following algorithm realizes in more detail the above approach. – Input: A data set B of points in a fixed number of dimensions. Integers k, t, SampleSize, MinClSize. Real δ. – Output: A clustering vector V. – Data Structures: – A list D of multidimensional points of capacity NMax . – A set of neighbour lists Li each of size k, for i = 1, . . . , NMax . – An array C to store cluster information for the current partition. The array is organized as a set of circular linked lists: The i-th element of the array contains the cluster label of the i-th point in D and the index of the next point in the cluster. – An array CS to store cluster information for the samples, organized as a set of circular linked lists. The i-th element of the array contains the cluster label of the i-th randomly sampled point and the index of the next point in the cluster. – A list S of data points to store sampled data. – A sequential structure P in external memory of partially labeled data points. – A list M of pairs of cluster labels, representing a mapping on partial cluster labels onto final cluster labels. – Algorithm: 1. Repeat until the end of the data set is reached: a) Load points of the data set into D until either NMax points have been loaded or the end of the data set is reached. b) Create a mvp-tree on D. c) For every point i in D, compute the neighbour list Li = i1 , . . . , iki , ki ≤ k, obtained as the result of the query NN(i, k) ∩ RANGE(i, δ), where RANGE(i, δ) denotes the result of a range query of radius δ centered in i. d) Initialize the array C. The i-th element has cluster label i and next point i. e) For all pairs i, j, i < j: i. Test whether i occurs in Lj , j occurs in Li , and Li , Lj have at least t points in common. ii. If the test succeeds, update C by setting the label of every point in c[j] (the cluster containing j) to i and merging in C the circular lists containing i and j.
f) For every cluster c in C: i. if the size of the cluster |c| is less than MinClSize, then update C by setting the label of every point in c to a unique “outlier” label. Continue with next cluster in C. ii. Compute the number of points to extract from the cluster as sc = dSampleSize|c|/NMax e. iii. Extract a sample s of size sc and store it in S. iv. Generate a new unique label `c for the cluster c and update C by setting the label of every point in c to `c . Create a new entry in M for `c . v. Create a new circular linked list in CS . Insert every sampled point j of s in the new list, setting the point’s label to `c . g) For every point i in D, read the partial label of i from C and append it to P . 2. a) Create a mvp-tree on S. b) For every point i in S, compute the neighbour list Li = i1 , . . . , ik , obtained as the result of the query NN(i, k). c) For all pairs i, j, i < j: i. Let `i , `j be the labels of i, j in CS . ii. Test whether i occurs in Lj , j occurs in Li , and Li , Lj have at least t points in common. iii. If the test succeeds, update CS by setting the label of every point in c[j] to `i , merge in CS the circular lists containing i and j, and update M by setting the final label to `i where the partial label equals `j . d) Scan P . For every point in P , read its partial current label from P , map it to the final cluster label through M and append to V the final label. Three new parameters SampleSize, MinClSize, and δ have been added (with respect to the original SNN algorithm) with the following meaning. SampleSize represents the number of randomly sampled points for every partition. MinClSize is the minimum size of clusters that will be considered. δ is the maximum distance of nearest neighbours that will be in the neighbours list for a point. MinClSize and δ may be used to detect outliers, when the notion of outlier is distance based, i.e. it implies an estimate of the maximum distance between points in a cluster. Setting δ to a value smaller than this distance will cause outliers to be grouped into clusters of small size, which can be eliminated by setting MinClSize to a suitably large cardinal.
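The two-phase strategy can be summarised as a short driver around any clustering routine with a points-to-labels interface (for instance a wrapper around the SNN sketch given earlier). The code below is our own simplification, not the algorithm above: it keeps the phase-one labelling in memory, ignores the outlier handling, and resolves conflicting label mappings arbitrarily.

import random

def two_phase_cluster(data, cluster_fn, partition_size, sample_frac=0.05):
    """Phase one: cluster each partition and sample every cluster.
    Phase two: cluster the union of the samples and map the phase-one
    labels onto the final labels of their sampled representatives."""
    phase_one = []                      # (point, partition id, local label)
    samples = []                        # (point, partition id, local label)
    for p, start in enumerate(range(0, len(data), partition_size)):
        part = data[start:start + partition_size]
        labels = cluster_fn(part)       # e.g. an SNN clustering of the partition
        clusters = {}
        for point, lab in zip(part, labels):
            phase_one.append((point, p, lab))
            clusters.setdefault(lab, []).append(point)
        for lab, members in clusters.items():
            k = max(1, int(sample_frac * len(members)))
            for point in random.sample(members, k):
                samples.append((point, p, lab))
    # Phase two: recluster the samples and build the label map; if samples of
    # one phase-one cluster end up in different final clusters, the last wins
    # (a simplification of the mapping M described above).
    final = cluster_fn([s[0] for s in samples])
    label_map = {(p, lab): f for (_, p, lab), f in zip(samples, final)}
    return [label_map.get((p, lab), -1) for (_, p, lab) in phase_one]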
3 Experimental Results and Conclusions
We have evaluated the performance of our algorithm by conducting experiments on synthetic two dimensional data sets. In all tests, the distance function is Euclidean, the partition size is NMax = 10000, the architecture is Intel i586
Fig. 1. Test data sets: (a) set 1; (b) results (t = 5); (c) results (t = 10); (d) set 2; (e) results
at 133 MHz with 128 MByte of internal memory, and the operating system is Windows NT 4.0. Data set 1 in figure 1 is based on a data set introduced to test the WaveCluster algorithm [5]. It is composed of 64188 points in 31 spherical clusters. The algorithm recognizes two ring shaped clusters for t = 5, and 37 mostly spherical clusters for t = 10. k is set to 16. Data set 2 has been introduced as a test for the CURE algorithm [6]. It has 100000 points distributed as follows: Three spherical clusters of different density, a dense region made by two ellipsoids connected by a dense chain of points, and randomly placed outliers. The clustering result was obtained by setting k = 10, t = 3, δ = 0.27, MinClSize = 20, SampleSize = 50. The algorithm recognizes four clusters and groups outliers correctly. The connected ellipsoids are merged since the distance between points in the chain does not differ substantially from the distance between points in the regions. We compared the speed performance of our algorithm with one of the fastest and widely referenced available algorithms, DBSCAN [7]. Three data sets with 20000, 60000 and 100000 points have been used for the comparison. In all data sets, 10% of the points are outliers and the remaining points are equally distributed among three spheres of different radius. Since the performance of
DBSCAN is affected by the values of the two input parameters, we chose the values that resulted in the best performance, among the values giving the correct clustering. On the three data sets, DBSCAN terminated the tree creation phase in 59.0, 181.0, and 310.0 seconds, respectively, and the clustering phase in 49.4, 151.3, and 282.8 seconds, whereas our algorithm finished in 39.3, 118.0, and 212.6 seconds. These results show that the performance of the algorithm is not inferior to the performance of DBSCAN in the clustering phase. Therefore, the algorithm is suitable for employment when an access structure is not available. When compared to CURE, the algorithm is not immune to the well-known chaining problem (see figure 1). However, the robustness of CURE is obtained within explicit limitations in the authors’ approach, namely that data are numeric. Furthermore, in a metric approach, the very notion of chaining seems problematic. Future work will include a validation with high-dimensional data sets, experiments with real data benchmarks and the extension of the algorithm for incremental clustering. Acknowledgments The authors would like to thank Andrea Lodi, Rossella Miglio, Angela Montanari, Marco Patella, and Gabriele Soffritti for helpful discussions on the subject.
References 1. R. Jarvis and E. Patrick. Clustering using a similarity measure based on shared near neighbours. IEEE Transactions on Computers, 22(11):1025–1034, November 1973. 2. R. Jarvis and I. Hofman. Robust and efficient cluster analysis using a shared near neighbours approach. In ICPR’98, Proc. of the 14th Int’l Conference on Pattern Recognition, pages 243–247. IEEE Computer Society, 1998. 3. Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 25 November 1991. ¨ 4. T. Bozkaya and Z. M. Ozoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 357–368. ACM Press, 1997. 5. G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In VLDB’98, Proceedings of 24th International Conference on Very Large Data Bases, pages 428–439. Morgan Kaufmann, 1998. 6. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), pages 73–84. ACM Press, 1998. 7. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD96), page 226. AAAI Press, 1996.
Discovery of “Interesting” Data Dependencies from a Workload of SQL Statements S. Lopes
J-M. Petit
F. Toumani
Université Blaise Pascal Laboratoire LIMOS Campus Universitaire des Cézeaux 24 avenue des Landais 63 177 Aubière cedex, France Tel.: 04-73-40-74-92 Fax: 04-73-40-74-44 {slopes,jmpetit,ftoumani}@libd2.univ-bpclermont.fr
Abstract. Discovering data dependencies consists in producing the whole set of a given class of data dependencies holding in a database, the task of selecting the interesting ones being usually left to an expert user. In this paper we take another look at the problems of discovering inclusion and functional dependencies in relational databases. We rigorously define the so-called logical navigation from a workload of SQL statements. This assumption leads us to devise tractable algorithms for discovering “interesting” inclusion and functional dependencies.
1 Introduction and related works
The problem of discovering data dependencies can be formulated as follows: given a database instance, find all non-trivial data dependencies satisfied by that particular instance [9]. These problems are known to be hard, at least in the worst case [9]. Existing KDD approaches consist in producing the whole set of dependencies holding in a database [4, 1, 8, 3]. Nevertheless, some of the discovered dependencies can be accidental or erroneous: The task of selecting the “interesting” dependencies is left to an expert user. More generally, the issue of interestingness of discovered knowledge is known to be quite hard in data mining [6, 10]. Example 1. Consider the relations Instructor and Department of Table 1. An inclusion dependency holds from Instructor[rank] to Department[dnumber]. However, the attributes rank and dnumber do not represent the same information: rank gives the level of qualification of instructors whereas dnumber is an integer denoting the chronology of creation of departments. This kind of inclusion dependency is erroneous and dangerous to use in database applications: for instance, it does not make sense to enforce a referential integrity constraint from Instructor[rank] to Department[dnumber]. •
Discovery of "Interesting" Data Dependencies from a Workload of SQL Statements Instructor ssn status rank 1 Assoc. Prof 1 2 Prof. 2 3 Assist. 1 4 Assist. 2 5 Prof. 2 6 Prof. 1 7 Assoc. Prof. 1
TeachesIn ssn dnum 1 1 1 5 2 2 3 2 4 3 5 1 6 5
year depname 85 Biochemistry 94 Admission 92 C.S 98 C.S 98 Geophysics 75 Biochemistry 88 Admission
431
Department dnumber dname mgr 1 Biochemistry 5 2 C.S. 2 3 Geophysics 2 4 Medical center 10 5 Admission 12 6 Genetic 6
Table 1. A relational database
In this paper, we consider the problem of finding "interesting" dependencies from the set of all functional and inclusion dependencies holding in a database. We argue that "interesting" dependencies concern duplicate attribute sequences (an attribute sequence being an ordered set of attributes) that are used to link together the relation schemas in a database schema. Such attribute sequences form the so-called logical navigation in a database schema. Duplicate attribute sequences are usually used as access paths to navigate in a relational database schema. When querying a relational database, users have to explicitly specify logical access paths between relation schemas [7]. Given a workload of SQL statements, the intuition is that we have to capture communicating attributes, i.e., attribute sequences involved in SQL join statements. Obtaining such a workload is rather simple in recent RDBMSs (cf. the next section).

Example 2. The inclusion dependency Instructor[rank] ⊆ Department[dnumber] given in Example 1 cannot be deduced using the logical navigation. Indeed, it is reasonable to assume that the attributes rank and dnumber will probably never be used together in a join condition to relate instructors to departments. Conversely, the pairs of attributes (Instructor[ssn], TeachesIn[ssn]) and (Department[dnumber], TeachesIn[dnum]) will certainly occur in a join condition. Therefore, the inclusion dependencies TeachesIn[ssn] ⊆ Instructor[ssn] and TeachesIn[dnum] ⊆ Department[dnumber] can be found.

We therefore formally define the logical navigation inherently available in relational databases thanks to an SQL workload. In this way, two hard problems are solved at the same time:
– the number of candidate dependencies is reduced drastically,
– no expert user has to be involved to distinguish interesting dependencies.

Paper organization. The logical navigation is formally defined in Section 2. We show how to discover inclusion and approximate dependencies in Section 3, and functional dependencies in Section 4. We conclude in Section 5. Due to lack of space, we refer the reader to textbooks such as [9] for relational database concepts.
2  The logical navigation
We define the logical navigation w.r.t. a set of join queries on a given database. Informally, a logical navigation is a binary relation which associates an attribute sequence Ri.X with another attribute sequence Rj.Y such that X and Y appear together in a join condition. Details of discovering a logical navigation from an operational database go beyond the scope of this paper; in this section, we only give some clues on how to cope with this task. To find the logical navigation, a simple solution is to have a workload of SQL statements which is representative of the system. In modern DBMSs, such a representative workload can be generated by logging activity on the server and filtering the events we want to monitor [2]. For instance, this task can be achieved using the Profiler under MS SQL Server 7.0 or the trace utilities under Oracle 8. We uniformly represent the pairs of attribute sequences used in a join condition in a set called Q. An element of Q is denoted by Ri[A1..Ak] ⋈ Rj[B1..Bk] and indicates that each attribute Ri[Al], for l ∈ [1, k], is compared with an attribute Rj[Bl] in a join condition. We now give a formal definition of the logical navigation.

Definition 1. Let R be a relational database schema and U be the set of its attributes. Let Q be the set of pairs of attribute sequences extracted from a representative SQL workload on a database r over R. The logical navigation of R, denoted by nav, is a binary relation over 2^U × 2^U defined by:

    nav(Ri.X, Rj.Y)  ⇔def  ∃q ∈ Q s.t. q = Ri[X] ⋈ Rj[Y]

nav is symmetric but neither reflexive nor transitive. Let nav* be the reflexive transitive closure of nav; nav* is then an equivalence relation, and let πnav* be the set of equivalence classes of nav*. In this way, each equivalence class captures duplicated attribute sequences. Note that computing the reflexive transitive closure of nav can be achieved in O(n³) (e.g. with Warshall's algorithm), where n is the number of attribute sequences involved in nav.

Example 3. Consider the database given in Table 1. From a representative workload of this database, the set Q should be:

    Q = { Instructor[ssn] ⋈ TeachesIn[ssn],
          Department[dnumber] ⋈ TeachesIn[dnum],
          Instructor[ssn] ⋈ Department[mgr] }

Thus, the set of equivalence classes is:

    πnav* = { {Instructor[ssn], Department[mgr], TeachesIn[ssn]},
              {Department[dnumber], TeachesIn[dnum]} }
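As an illustration of how πnav* can be obtained from Q, the sketch below (our own illustration, not code from the paper) uses a union-find structure rather than Warshall's closure; attribute sequences are represented as plain strings, which is an assumption made for brevity.

    # Illustrative sketch: equivalence classes of nav* from the pairs in Q,
    # computed with union-find instead of the O(n^3) Warshall closure.
    def equivalence_classes(Q):
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        def union(x, y):
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[rx] = ry
        for left, right in Q:                   # each q = Ri[X] joined with Rj[Y]
            union(left, right)
        classes = {}
        for x in parent:
            classes.setdefault(find(x), set()).add(x)
        return list(classes.values())

    Q = [("Instructor[ssn]", "TeachesIn[ssn]"),
         ("Department[dnumber]", "TeachesIn[dnum]"),
         ("Instructor[ssn]", "Department[mgr]")]
    print(equivalence_classes(Q))   # the two classes of Example 3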
Discovery of "Interesting" Data Dependencies from a Workload of SQL Statements
433
In the following two sections we sketch how the logical navigation gives valuable clues for discovering interesting inclusion dependencies (Section 3) and interesting functional dependencies (Section 4). In the sequel, the input parameters of these discovery tasks are a database r over a database schema R and the logical navigation πnav*.
3  Application to inclusion and approximate dependencies discovery
Binary combinations of the attribute sequences of each equivalence class of πnav* deliver approximate inclusion dependencies. Whatever the database instance, we know exactly the number of pairs (R.X, S.Y) from which approximate inclusion dependencies will be inferred: each equivalence class with n elements delivers C(n, 2) = n(n − 1)/2 candidates. The number of database accesses is bounded by the following property.

Property 1. Let l be the number of equivalence classes of πnav*. Let Ci be the i-th equivalence class of πnav*, i ∈ [1, l], and ki = |Ci|. The number of pairs implied by πnav* is equal to Σ_{i=1..l} ki(ki − 1)/2.

Database accesses. Let us introduce approximate inclusion dependencies based on the error measure g3 [5]. The idea is to count the minimal number of tuples that have to be removed from the relation of the left-hand side to obtain a relation that satisfies the dependency. The error g3 of an approximate inclusion dependency is then defined as:

    g3(Ri[Y] ⊆ Rj[Z]) = 1 − max{ |s| : s ⊆ ri and Ri[Y] ⊆ Rj[Z] holds in s and rj } / |ri|

In the sequel, approximate inclusion dependencies are parameterized by their error g3, i.e. Ri[Y] ⊆g3 Rj[Z]. From a pair of attribute sequences, it remains to find the direction of the associated inclusion dependency and to compute its error. We determine the direction of an approximate inclusion dependency between two attribute sequences induced by πnav* by comparing their numbers of distinct values. Let C be an equivalence class of πnav* and let Ri.X and Rj.Y be two attribute sequences of C. If |πX(ri)| ≤ |πY(rj)| then the direction of the approximate inclusion dependency is from Ri to Rj. It remains to compute the error measure g3. The maximal subset s of the relation ri for which the inclusion dependency Ri[X] ⊆ Rj[Y] holds is given by the semi-join (denoted by ⋉) between X and Y, i.e. g3(Ri[X] ⊆ Rj[Y]) = 1 − |ri ⋉(X=Y) rj| / |ri|. Given a pair of attribute sequences, we can thus effectively determine the existence of an approximate inclusion dependency by performing SQL queries against the database.
Example 4. Assume that we want to compute the error associated with the inclusion dependency Department[mgr] ⊆g3 Instructor[ssn]. From Table 1, we have |Department ⋉(mgr=ssn) Instructor| = 4 and |Department| = 6. Therefore, the error g3 is equal to 1/3.

Within this framework, an algorithm has been devised in [11] to discover approximate inclusion dependencies from πnav*.

Example 5. For instance, the following approximate inclusion dependencies can be discovered:

    I = { TeachesIn[ssn] ⊆0 Instructor[ssn],
          TeachesIn[dnum] ⊆0 Department[dnumber],
          Department[mgr] ⊆1/3 Instructor[ssn],
          Department[mgr] ⊆1/3 TeachesIn[ssn] }
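As a rough illustration (ours, not the authors' implementation, which works through SQL queries), the following sketch computes g3 via a semi-join-style membership test on in-memory copies of the relevant columns of Table 1; it reproduces the error 1/3 of Example 4.

    # Illustrative sketch: g3 error of Ri[X] included in Rj[Y], computed as
    # 1 - |ri semijoin(X=Y) rj| / |ri| on small in-memory relations.
    def g3(ri, X, rj, Y):
        rhs = {tuple(t[a] for a in Y) for t in rj}
        kept = sum(1 for t in ri if tuple(t[a] for a in X) in rhs)
        return 1 - kept / len(ri)

    instructor = [{"ssn": s, "rank": r}
                  for s, r in [(1, 1), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]]
    department = [{"dnumber": d, "mgr": m}
                  for d, m in [(1, 5), (2, 2), (3, 2), (4, 10), (5, 12), (6, 6)]]

    print(g3(department, ["mgr"], instructor, ["ssn"]))   # 0.333..., i.e. 1/3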
4  Application to functional dependencies discovery
Each attribute sequence of each equivalence class of πnav* is a possible left-hand side of an interesting functional dependency. As for inclusion dependencies, the following property bounds the number of candidates.

Property 2. Let l be the number of equivalence classes of πnav*. Let Ci be the i-th equivalence class of πnav*, i ∈ [1, l], and ki = |Ci|. The number of left-hand sides implied by πnav* is equal to Σ_{i=1..l} ki.

Database accesses. For a given left-hand side X of a relation schema R, the candidate right-hand sides are in R \ X. Let A ∈ R \ X; the functional dependency R : X → A holds in r iff |πX∪A(r)| = |πX(r)| [3]. Such tests can easily be performed with SQL queries.

Example 6. From the set πnav* given in Example 3, the following exact functional dependencies can be found by querying the database given in Table 1:

    Instructor : ssn → status, rank
    TeachesIn : dnum → depname
    Department : dnumber → dname, mgr
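The cardinality test |πX∪A(r)| = |πX(r)| translates directly into a small sketch (again ours, operating on an in-memory copy of TeachesIn rather than through SQL); it confirms the dependency dnum → depname of Example 6.

    # Illustrative sketch: R : X -> A holds in r iff |pi_{X u {A}}(r)| == |pi_X(r)|.
    def fd_holds(r, X, A):
        proj_X  = {tuple(t[a] for a in X) for t in r}
        proj_XA = {tuple(t[a] for a in X + [A]) for t in r}
        return len(proj_X) == len(proj_XA)

    teaches_in = [{"ssn": s, "dnum": d, "year": y, "depname": n}
                  for s, d, y, n in [(1, 1, 85, "Biochemistry"), (1, 5, 94, "Admission"),
                                     (2, 2, 92, "C.S"), (3, 2, 98, "C.S"),
                                     (4, 3, 98, "Geophysics"), (5, 1, 75, "Biochemistry"),
                                     (6, 5, 88, "Admission")]]

    print(fd_holds(teaches_in, ["dnum"], "depname"))   # True
    print(fd_holds(teaches_in, ["ssn"], "dnum"))       # False: ssn 1 teaches in dnum 1 and 5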
5  Conclusion
The main contribution of this paper is to define rigorously the logical navigation inherently available in relational databases. From it, tractable algorithms for discovering "interesting" functional and inclusion dependencies can be devised. In [12], we showed that such dependencies are useful to reverse engineer first normal form relational databases.

It must be clear that only a subset of all possible functional and inclusion dependencies is discovered from the logical navigation. In our opinion, the missing inclusion dependencies (those which cannot be deduced from the logical navigation) seem to be of little interest in database applications. However, some missing functional dependencies can be interesting: for example, the dependency TeachesIn : ssn, dnum → year, which holds in the relation TeachesIn given in Table 1, is not revealed by the logical navigation.

We are currently working on the design of efficient algorithms to discover a small cover of the functional dependencies holding in a relation. We are also working on an implementation of a tool to manage the logical navigation, aiding the database administrator both in re-organizing the database schema thanks to inclusion and functional dependencies and in checking data consistency from such dependencies.
References

1. S. Bell and P. Brockhausen. Discovery of Data Dependencies in Relational Databases. Technical report, LS-8 Report 14, University of Dortmund, 18p, April 1995.
2. S. Chaudhuri and V. Narasayya. AutoAdmin "What-if" Index Analysis Utility. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 367–378, Seattle, June 1998.
3. Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. In Proceedings of the 14th IEEE International Conference on Data Engineering, Orlando, USA, 1998.
4. M. Kantola, H. Mannila, K-J. Räihä, and H. Siirtola. Discovering Functional and Inclusion Dependencies in Relational Databases. International Journal of Intelligent Systems, 7(1):591–607, 1992.
5. J. Kivinen and H. Mannila. Approximate Inference of Functional Dependencies from Relations. Theoretical Computer Science, 149(1):129–149, 1995.
6. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proceedings of the 3rd International Conference on Information and Knowledge Management, pages 401–407, December 1994.
7. D. Maier, J.D. Ullman, and M.Y. Vardi. On the Foundations of the Universal Relation Model. ACM Transactions on Database Systems, 9(2):283–308, June 1984.
8. H. Mannila and H. Toivonen. Levelwise Search and Borders of Theories in Knowledge Discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
9. Heikki Mannila and Kari-Jouko Räihä. The Design of Relational Databases. Addison-Wesley, second edition, 1994.
10. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, 1999.
11. J-M. Petit and F. Toumani. Discovery of inclusion and approximate dependencies in databases. To appear in BDA'99 (French conference on databases), 20 pages, October 1999.
12. J-M. Petit, F. Toumani, J-F. Boulicaut, and J. Kouloumdjian. Towards the Reverse Engineering of Denormalized Relational Databases. In S. Su, editor, Proceedings of the 12th IEEE International Conference on Data Engineering, pages 218–227. IEEE Computer Society, New Orleans, February 1996.
Learning from Highly Structured Data by Decomposition⋆

René Mac Kinney-Romero and Christophe Giraud-Carrier

Department of Computer Science, University of Bristol
Bristol, BS8 1UB, UK
{romero,cgc}@cs.bristol.ac.uk

⋆ This work is funded in part by grants from CONACYT and UAM, México.
Abstract. This paper addresses the problem of learning from highly structured data. Specifically, it describes a procedure, called decomposition, that allows a learner to automatically access the subparts of examples represented as closed terms in a higher-order language. This procedure maintains a clear distinction between the structure of an individual and its properties. A learning system based on decomposition is also presented and several examples of its use are described.
1  Introduction
Machine Learning (ML) deals with the induction of general descriptions (e.g., decision trees, neural networks) from specific instances. That is, given a set of examples of a target concept, a ML system induces a representation of the concept, which explains the examples and accurately extends to previously unseen instances of the concept. Most ML systems use an attribute-value representation for the examples. Although this simple representation allows the building of efficient learners, it also hinders the ability of such systems to handle directly examples and concepts with complex structure. To overcome this limitation, two approaches have been used. The first one is based on data transformation or pre-processing. Here, a structured (e.g., first-order) representation is mapped into an equivalent attribute-value representation by capturing subparts of structures and n-ary predicates (n > 1) as new attributes, which can then be manipulated directly by standard attribute-value learners (e.g., see the LINUS system [7]). The second approach consists of "upgrading" the learner. Here, the learner is designed so as to be able to manipulate structured representations directly (e.g., see the ILP framework for first-order concepts [9]). From a practitioner's standpoint, there is a clear advantage in the second approach since it allows the user to represent the problem in a "natural" way (i.e., consistent with domain-specific standards or practices), without recourse to a pre-processing phase, which is tedious, prone to loss of information and often costly as it may require expert intervention. In keeping with this approach to make ML techniques more readily available to practitioners, our research
focuses on the use of sufficiently expressive representation languages so that highly-structured data can be manipulated transparently by the corresponding learning systems. In particular, a novel, higher-order framework, in which examples are closed terms and concepts are Escher programs, has been developed [5]. This paper describes a learning algorithm for the above higher-order framework based on a decomposition method. Decomposition can be regarded as an extension of flattening [10] to higher-order representations, which maintains a clear distinction between the structure of an individual and its properties.
2  Decomposition
The application of machine learning to real-world problems generally requires the mapping of the user's conceptual view of the data to a representation suitable for use by the learner. In addition to being tedious and often costly, this paradigm is undesirable as it may cause a loss of information. We argue that learners should be tailored to representations, rather than the other way around. In other words, the user should be able to describe the problem as is most natural to him/her and leave the learner to manipulate the representation and induce generalisations. To this end, we have proposed an "individuals-as-terms" representation [4], where examples are highly-structured closed terms in a higher-order logic language.

As an example, consider the mutagenicity problem, which is concerned with identifying nitroaromatic and heteroaromatic mutagenic compounds. In this problem, the examples are molecules together with some of their properties. A relatively natural representation for such compounds, as proposed in [2], consists of closed terms, here tuples, of the following form.

    type Molecule = (Ind1, IndA, Lumo, {Atom}, {Bond})

where Ind1 and IndA are boolean values relating to structural properties, Lumo is the energy level of the lowest unoccupied molecular orbital, and {Atom} and {Bond} capture the structure of the molecule as a graph, i.e., a set of atoms and the bonds between them. Atoms and bonds are defined as follows.

    type Atom = (Label, Element, AtomType, Charge)
    type Bond = ({Label}, BondType)

Hence, a sample molecule has the following representation.

    (False, False, −1.034,
     {(1, C, 22, −0.128), (10, H, 3, 0.132), (11, C, 29, 0.002), ...},
     {({1, 2}, 7), ({1, 7}, 1), ...})

Standard transformational techniques, such as flattening [10], do not apply to such data. We extend these through a technique, called decomposition, whose task is to extract the structural information of the examples, thus allowing the learning tool to access directly sub-parts of the structures being represented. Decomposition takes a highly structured term and a maximum depth, and returns the set of all possible variables up to the specified depth. A depth of
zero returns only one variable representing the whole structure. A depth of one returns one variable for each component at the top level (e.g., each element of a tuple or set), and so on. The decomposition set consists of tuples of the form

    ⟨name, type, value, structural predicates⟩

so that each variable is qualified by its name, type, value and the structural predicates needed to obtain it. The following illustrates the decomposition technique on the above sample molecule. Assuming a maximum depth of 1, decomposition will return the following set, where v0 refers to the top-level term.

    {⟨v1, Ind1, False, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v2, IndA, False, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v3, Lumo, −1.034, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v4, {Atom}, {. . .}, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v5, {Bond}, {. . .}, v0 == (v1, v2, v3, v4, v5)⟩}

With a depth of 2, v4 and v5 would be further decomposed. The decomposition of v4, which has value {(1, C, 22, −0.128), (10, H, 3, 0.132), (11, C, 29, 0.002), . . .}, for example, would yield

    {⟨v6, Atom, (1, C, 22, −0.128), v0 == (v1, v2, v3, v4, v5) ∧ v6 ∈ v4⟩,
     ⟨v7, Atom, (10, H, 3, 0.132), v0 == (v1, v2, v3, v4, v5) ∧ v7 ∈ v4⟩, . . .}

The current implementation of the decomposition algorithm supports only lists, tuples and sets. It is possible, however, to devise a syntax in which the user of the learning system is able to give new types along with the information on how to decompose them.
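As a rough, hand-written illustration of the procedure (not the ALFIE implementation, and with Python tuples and frozensets standing in for Escher closed terms; variable names are generated and types are omitted), the sketch below extracts sub-terms together with the structural predicates that reach them, up to a maximum depth.

    # Illustrative sketch: decomposition of a term built from tuples and sets,
    # returning (name, value, structural predicates) entries up to a given depth.
    def decompose(term, max_depth):
        out, counter = [], [0]
        def fresh():
            counter[0] += 1
            return "v%d" % counter[0]
        def visit(name, value, preds, depth):
            out.append((name, value, " and ".join(preds) or "top-level"))
            if depth == max_depth:
                return
            if isinstance(value, tuple):
                parts = [fresh() for _ in value]
                eq = "%s == (%s)" % (name, ", ".join(parts))
                for part, sub in zip(parts, value):
                    visit(part, sub, preds + [eq], depth + 1)
            elif isinstance(value, (set, frozenset)):
                for sub in value:
                    part = fresh()
                    visit(part, sub, preds + ["%s in %s" % (part, name)], depth + 1)
        visit("v0", term, [], 0)
        return out

    molecule = (False, False, -1.034,
                frozenset({(1, "C", 22, -0.128), (10, "H", 3, 0.132)}),
                frozenset({(frozenset({1, 2}), 7)}))
    for entry in decompose(molecule, 2):
        print(entry)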
3  Learning by Decomposition
The approach presented here has been implemented in ALFIE, an Algorithm for Learning Functions In Escher. This implementation is based on the framework presented in [3], which uses the Escher language [8] as the representation vehicle for examples and concepts. ALFIE induces concepts in the form of decision lists, i.e.,

    if E1 then t1 else if E2 then t2 else . . . if En then tm else t0

where each Ei is a boolean expression and the tj's are class labels. The class t0 is called the default and is generally, although not necessarily, the majority class. The algorithm uses sequential covering to find each Ei. It uses the decomposition set of the first example and from it finds the E1 with the highest accuracy (measured as the information gain on covering). It then computes the set of
examples that are not yet covered, selects the first one, and repeats this procedure until all examples are covered. To create the Ei's, ALFIE builds conjunctions of atomic conditions from the decomposed examples, as shown in Figure 1.

    Input: decomposition set D of E and set of properties P = {p : σ → Bool}
    Output: set of atomic conditions C
    C = ∅
    For all t ∈ D such that value(t) == k
        C = C ∪ {name(t) == k}
    For all t1, t2 ∈ D such that type(t1) == type(t2) and value(t1) == value(t2)
        C = C ∪ {name(t1) == name(t2)}
    For all t1, . . . , tn ∈ D that match σ, where p : σ → Bool ∈ P
        C = C ∪ {p(t1, . . . , tn)}
    return C

    Fig. 1. Algorithm to Generate Atomic Conditions

ALFIE uses information about the values of variables to create all possible equality conditions between a variable's value and a constant, and between two variables' values. In addition, the user may provide a set of boolean functions that can be applied to the elements of the decomposition set (the set P in Figure 1). These functions represent properties which may be present in the sub-parts of the structure of the data. They constitute the background knowledge of the learner and may be quite complex. For example, a function to test whether a molecule contains less than 5 oxygen atoms would have the following (higher-order) form.

    (card(filter(v, v == (l, e, a, c) ∧ a == O)) < 5)

To illustrate the algorithm of Figure 1, consider the following partial decomposition of a molecule (the set D).

    {⟨v3, Bool, False, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v4, IndA, −1.387, v0 == (v1, v2, v3, v4, v5)⟩,
     ⟨v53, Atom, H, v0 == (v1, v2, v3, v4, v5) ∧ v51 ∈ v4 ∧ v51 == (v52, v53, v54, v55)⟩,
     ⟨v98, Atom, H, v0 == (v1, v2, v3, v4, v5) ∧ v96 ∈ v4 ∧ v96 == (v97, v98, v99, v100)⟩,
     ⟨v238, Bond, {3, 4}, v0 == (v1, v2, v3, v4, v5) ∧ v237 ∈ v5 ∧ v237 == (v238, v239)⟩}

Assume the following function is also given as background knowledge (the set P).

    (> −2.368) : IndA → Bool

Then the set C produced by the algorithm is:

    {v3 == False, v4 == −1.387, v53 == H, v98 == H, v238 == {3, 4}, v53 == v98, v4 > −2.368}
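For concreteness, here is a small sketch (ours) of the generator of Figure 1, with properties supplied as (name, argument type, predicate) triples and restricted to unary properties; on the partial decomposition set above it produces the same seven conditions as the set C, up to ordering.

    # Illustrative sketch of Figure 1: constant equalities, equalities between
    # same-typed variables with equal values, and user-supplied properties.
    def atomic_conditions(D, P):
        C = set()
        for name, typ, value, _preds in D:
            C.add("%s == %s" % (name, value))
        for n1, t1, v1, _ in D:
            for n2, t2, v2, _ in D:
                if n1 < n2 and t1 == t2 and v1 == v2:
                    C.add("%s == %s" % (n1, n2))
        for pname, ptype, pred in P:
            for name, typ, value, _ in D:
                if typ == ptype and pred(value):
                    C.add("%s %s" % (name, pname))
        return C

    D = [("v3", "Bool", "False", ""), ("v4", "IndA", -1.387, ""),
         ("v53", "Atom", "H", ""), ("v98", "Atom", "H", ""),
         ("v238", "Bond", "{3, 4}", "")]
    P = [("> -2.368", "IndA", lambda x: x > -2.368)]
    print(atomic_conditions(D, P))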
4  Experiments
Although they do not require decomposition, a number of experiments were carried out with attribute-value problems to check ALFIE's ability to generate accurate theories. The results obtained on these problems compare favourably with those of standard learners, such as C4.5. The experiments detailed here focus on learning from highly structured data.

4.1  Bongard
The 47th pattern recognition problem from [1] aims at inducing the concept "there is a circle inside a triangle" from examples consisting of a set of shapes that may contain other shapes inside them. One of the simplest representations consists of encoding a figure as a pair, where the first element is the shape of the figure and the second a set of the figures contained in it.

    Figure = Null | (Shape, {Figure})
    Shape = Circle | Triangle

The examples are sets of such figures together with a truth value. Given such a representation, ALFIE produces the solution

    f(v1) = if (v5 ∈ v1 ∧ v5 == (v6, v7) ∧ v8 ∈ v7 ∧ v8 == (v9, v10)
                ∧ v9 == Circle ∧ v6 == Triangle) then True else False

whose English equivalent is "if there is a Circle inside a Triangle then True else False."
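As a point of comparison only (this is a hand-written Python rendering, not ALFIE output), the induced concept can be read as the following test over figures encoded as (shape, set of contained figures) pairs; Null figures are simply omitted from the sets in this sketch.

    # Illustrative sketch: "there is a Circle directly inside a Triangle",
    # mirroring the structure of the induced decision-list condition.
    def circle_in_triangle(example):
        for shape, inside in example:                         # v5 == (v6, v7)
            if shape == "Triangle":                           # v6 == Triangle
                if any(s == "Circle" for s, _ in inside):     # v8 in v7, v9 == Circle
                    return True
        return False

    positive = {("Triangle", frozenset({("Circle", frozenset())})),
                ("Circle", frozenset())}
    negative = {("Circle", frozenset({("Triangle", frozenset())}))}
    print(circle_in_triangle(positive), circle_in_triangle(negative))   # True False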
4.2  Mutagenicity
Mutagenicity is a well-known dataset in the ILP community and a number of experiments with this problem are reported in [6]. The dataset consists of 230 chemical compounds, 42 of which have proven difficult to classify automatically. The experiment was carried out on the remaining 188. A chemical compound is represented as a highly-structured closed term consisting of the atoms and bonds, as discussed above. An atom has a label, an element, a type (there are 233 such types) and a partial charge (which is a real number). A bond connects a pair of atoms and has one of 8 types. A ten-fold cross-validation experiment was carried out and gave an average accuracy of 87.3% with a standard deviation of 4.99%. This is comparable to the results obtained on the same data by others [6]. Unfortunately, the induced theories seem to rely rather heavily on the additional properties of the molecules (i.e., Ind1, IndA and Lumo) rather than on their molecular structure. To further test the value of the decomposition technique, we intend to repeat this problem with only the molecular structure to describe the examples (i.e., leaving the three "propositional properties" out).
5  Conclusion and Future Work
A new approach to dealing with highly structured examples has been presented. This approach relies on a technique called decomposition that is able to extract the structural predicates from an example, thus allowing the learning system to concentrate on the properties that the components of such a structure may present. A learning system, ALFIE, based on this idea has been implemented and preliminary experiments demonstrate promise. A number of issues remain open as the subject of future work. In particular:

– ALFIE, like other similar learning systems, is susceptible to the order of the examples. In ALFIE's case, it is interesting to point out that the effect of ordering seems more evident in problems whose examples have little or no structure.
– It is possible to have more operators for the construction of the Ei's, such as negation and disjunction.
– Decomposition can be improved further by analysing the sets that are produced. It is likely that information is being repeated; this should be easy to check and correct. It would also be interesting to see how the depth of the decomposition affects the accuracy of the learning system.

Finally, more experiments are needed with problems presenting a high degree of structure in their examples and the associated concepts.
References

1. M. Bongard. Pattern Recognition. Spartan Books, 1970.
2. A.F. Bowers. Early experiments with a higher-order decision-tree learner. In Proceedings of the COMPULOGNet Area Meeting on Computational Logic and Machine Learning, pages 42–48, 1998.
3. A.F. Bowers, C. Giraud-Carrier, C. Kennedy, J.W. Lloyd, and R. MacKinney-Romero. A framework for higher-order inductive machine learning. Compulog Net meeting, September 1997.
4. A.F. Bowers, C. Giraud-Carrier, and J.W. Lloyd. Higher-order logic for knowledge representation in inductive learning. 1999 (in preparation).
5. P.A. Flach, C. Giraud-Carrier, and J.W. Lloyd. Strongly typed inductive concept learning. In Inductive Logic Programming: ILP-98, 1998.
6. R.D. King, S. Muggleton, A. Srinivasan, and M. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and bonds and their connectivities to predict mutagenicity in inductive logic programming. Proceedings of the National Academy of Sciences, 93:438–442, 1996.
7. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
8. J.W. Lloyd. Declarative programming in Escher. Technical report, University of Bristol, 1995.
9. S. Muggleton, editor. Inductive Logic Programming. Academic Press Ltd., 24-28 Oval Road, London NW1 7DX, 1992.
10. C. Rouveirol. Flattening and Saturation: Two Representation Changes for Generalisation, volume 14. Kluwer Academic Publishers, Boston, 1994.
Combinatorial Approach for Data Binarization⋆

Eddy Mayoraz and Miguel Moreira

IDIAP — Dalle Molle Institute for Perceptual Artificial Intelligence
P.O. Box 592, CH-1920 Martigny, Switzerland
[email protected], [email protected]

Abstract. This paper addresses the problem of transforming arbitrary data into binary data. This is intended as preprocessing for a supervised classification task. As a binary mapping compresses the total information of the dataset, the goal here is to design such a mapping so that it maintains most of the information relevant to the classification problem. Most of the existing approaches to this problem are based on correlation or entropy measures between one individual binary variable and the partition into classes. On the contrary, the approach proposed here is based on a global study of the combinatorial properties of a set of binary variables.

Keywords: Data binarization, classification, logical analysis of data, data compression.
1  Introduction
Supervised classification learning addresses the general problem of finding a plausible K-partition into classes of an input space Ω, given a K-partition of a set of training examples X = X1 ⊎ . . . ⊎ XK ⊂ Ω. In practical applications of data mining, the input spaces Ω are usually very large and they combine features of different natures. Therefore, for most mining tools to be usable, it is convenient to preprocess the data, and an important research effort is now spent on problems such as feature selection or feature discretization. Some mining technologies even require purely binary data. This is the case of Logical Analysis of Data (LAD), which is a general approach for knowledge discovery and automated learning proposed in the mid eighties [5]. Classification is one particular usage of this theory, which was extensively developed and implemented in the mid nineties and which showed great potentialities [3]. Thus, besides data compression, there is a need for data binarization in view of mining, where the most relevant information for further processing has to be maintained (Sect. 2). In Sect. 3, the binarization problem is stated and some classical approaches are briefly presented. Section 4 presents the algorithm IDEAL, specially designed to fit the needs of LAD. Some experimental results are discussed in Sect. 5, and Sect. 6 concludes and discusses further work.

⋆ The support of the Swiss National Science Foundation under grant 2000-053902.98 is gratefully acknowledged.
2  Requirements for Binarization
Given a set of training examples X ⊂ Ω partitioned into K classes X1 ⊎ . . . ⊎ XK, the binarization problem consists of finding a mapping m : Ω → {0, 1}^d with the following properties: (i) most of the information relevant to the classification problem should be preserved through m; (ii) the size d of the binary codes is not too large.

The first property is translated into a sharp and a soft constraint. The former states that the mapping should be consistent with the training examples, i.e. m(Xi) ∩ m(Xj) = ∅, ∀i ≠ j. The latter asks that two points of Ω, close to each other according to a reasonable metric, should have their images through m close to each other in the Hamming distance metric.

The second property has to be taken with some care. Clearly, the size of the binary codes should be small in order to reduce the complexity of the processing of binarized data. The search for a binary mapping of minimal size satisfying the consistency constraint is a challenging combinatorial problem proven to be NP-hard in most of its forms [2]. However, experience has shown that the final performance of any learning method applied to the binarized data can drop whenever d is too small. This suggests that the consistency constraint is not sufficient to ensure that the relevant information is not lost in the binarization of the data. In practice, it is useful that the method determining the binarization also provides a way to control the size of the binary codes produced. For this purpose, the consistency constraint can be extended in a natural way as follows. A mapping m is c-consistent with the training examples if and only if for any two examples x ∈ Xi and y ∈ Xj, i ≠ j, the Hamming distance between m(x) and m(y) is at least c. Clearly, 1-consistency is identical to plain consistency. Experiments with LAD showed that binary mappings c-consistent with the training examples, with c = 2 or 3, are still of reasonable size for most of the datasets and allow a strong improvement of the overall behavior of the method.

For the sake of generality, binarization methods must be able to handle input spaces Ω composed of attributes of different kinds: binary, nominal (ordered and unordered), or continuous. For the purpose of interpretation simplicity, each one of the d binary functions of the mapping m : Ω → {0, 1}^d involves only one original attribute of the input space Ω. In the sequel, each binary function mi : Ω → {0, 1} composing the binary mapping m is called a discriminant and is restricted to the following types. When associated to an unordered attribute (binary or nominal), a discriminant is identified with one possible value of this attribute (e.g. "color = yellow"). In the case of an ordered attribute (nominal or continuous), a discriminant is a comparison to a threshold value (e.g. "age > 45").

To be usable on real-life datasets, a binary mapping must handle properly unknown data and noisy data, as well as a priori knowledge such as a monotonic relationship between attributes and the target. The algorithm proposed hereafter addresses these issues; however, space constraints prevent us from going into further details.
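A minimal sketch (ours) of the c-consistency test just defined: every pair of examples from different classes must be mapped to binary codes at Hamming distance at least c. The codes and labels below are invented toy values.

    # Illustrative sketch: is a binary mapping c-consistent with a labelled sample?
    def c_consistent(codes, labels, c):
        for i in range(len(codes)):
            for j in range(i + 1, len(codes)):
                if labels[i] != labels[j]:
                    dist = sum(a != b for a, b in zip(codes[i], codes[j]))
                    if dist < c:
                        return False
        return True

    codes  = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
    labels = ["pos", "pos", "neg", "neg"]
    print(c_consistent(codes, labels, 2), c_consistent(codes, labels, 3))   # True False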
3  Existing Binarization Methods
Some learning methods generate, as a by-product, a discriminant set D that defines a binary mapping m. For example, when a decision tree is built to learn a classification task, each internal node of the tree is a discriminant. Moreover, if no early stopping criterion is used and if the tree is not pruned, all examples associated with one particular leaf are of the same class. Thus the binary mapping of size d given by the number of nodes is consistent. In this paper, the focus is put on global binarization algorithms, which usually assume a large (implicit or explicit) initial discriminant set, from which a small subset has to be extracted. For comparisons between global binarization methods and local approaches such as decision trees, please refer to [6].

Given a training set X of examples partitioned as before into K classes, and given a large set D of discriminants defining a binary mapping consistent with X, the problem of finding a small subset of D, still consistent with X, can be formalized as a minimum set covering problem. For each discriminant τ ∈ D there is one variable z_τ ∈ {0, 1} indicating whether τ belongs to the resulting subset or not. The constraint matrix A has a row for each pair of examples x ∈ Xi, y ∈ Xj, i ≠ j. A_{(x,y),τ} = 1 if the discriminant τ distinguishes the example x from y (i.e. ∗ ≠ mτ(x) ≠ mτ(y) ≠ ∗) and is 0 otherwise. A subset of discriminants defines a binary mapping c-consistent with X if and only if its characteristic vector z ∈ {0, 1}^|D| satisfies Az ≥ c.

The minimum set covering problem is NP-complete, but for our purpose optimality is not critical and thus any good heuristic is satisfactory. The most obvious heuristic for the resolution of the minimum set covering problem is the incremental greedy approach. It consists, at each iteration, in selecting the column of A with the highest number of 1s, introducing the corresponding discriminant τ into the solution (i.e. switching z_τ from 0 to 1), and suppressing the rows i of A whenever A_i z ≥ c.

A more critical issue is related to the computational complexity of this approach. If D denotes the initial number of discriminants and if there are n examples in X, the construction of the constraint matrix A is in O(n²D). A naive implementation of this greedy heuristic has a complexity in O(n²Dd) and has demonstrated its limitations in the experiments reported in [3]. A very nice solution proposed in [1] (denoted "Simple-Greedy" in Sect. 5) consists of solving this minimum set covering problem using the same greedy heuristic, but without enumerating any column of A. A clever data structure is used that allows the number of conflicts solved by a discriminant at a given time to be determined in O(n). The total complexity of this approach is O(nDd), where d is the size of the final subset of discriminants. However, this approach is designed to solve the problem of the 1-consistent discriminant set and is not easily generalizable to the c-consistency case. The algorithm proposed in the next section is an alternative to this problem as it addresses the c-consistency issue. Even though its worst-case complexity is in O(D log D + n²D), it is shown to be quite efficient in practice even with large training samples.
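The greedy heuristic can be sketched as follows (our own naive version, which enumerates the cross-class pairs explicitly and therefore runs in O(n²Dd), unlike Simple-Greedy); discriminants are given as 0/1 functions on examples, and columns are selected until every pair is separated at least c times.

    # Illustrative sketch: incremental greedy selection of a c-consistent
    # discriminant subset from an initial set 'discs' of 0/1 functions.
    def greedy_c_cover(X, labels, discs, c):
        pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
                 if labels[i] != labels[j]]
        need = {p: c for p in pairs}                 # remaining coverage per pair
        separates = {name: {p for p in pairs if d(X[p[0]]) != d(X[p[1]])}
                     for name, d in discs.items()}
        chosen = []
        while any(v > 0 for v in need.values()):
            if not separates:
                raise ValueError("no c-consistent subset exists for this c")
            best = max(separates, key=lambda t: sum(need[p] > 0 for p in separates[t]))
            if sum(need[p] > 0 for p in separates[best]) == 0:
                raise ValueError("no c-consistent subset exists for this c")
            for p in separates.pop(best):
                need[p] -= 1
            chosen.append(best)
        return chosen

    X, labels = [1.0, 2.0, 3.5, 4.0], [0, 0, 1, 1]
    discs = {"x>1.5": lambda v: int(v > 1.5), "x>2.2": lambda v: int(v > 2.2),
             "x>2.7": lambda v: int(v > 2.7), "x>3.7": lambda v: int(v > 3.7)}
    print(greedy_c_cover(X, labels, discs, c=2))     # two cuts between the classes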
4  An Eliminative Approach
The algorithm Ideal (Iterative Discriminant Elimination Algorithm) is an eliminative procedure for finding a minimal c-consistent discriminant set. As for the other global methods discussed in Section 3, the initial discriminant set D is obtained by placing: along each ordered attribute, one discriminant between every two projected examples of different classes and of consecutive values; and for each unordered attribute, one discriminant for each of its possible values. Ideal iteratively selects discriminants from D minimizing a merit function w(τ). Each selected discriminant is eliminated if it passes a redundancy test (checking whether the elimination of this discriminant would still leave at least c other discriminants discriminating every pair distinguished by it), and is otherwise kept for the final solution. This process is repeated until all the discriminants have been tested once.

Choosing as w(τ) the total number of pairs of examples from different classes discriminated by τ would lead to an algorithm very similar to the greedy heuristic described in Section 3; the only difference would be that in the former case the solution is built iteratively while here it is pruned iteratively. Among the various merit functions w(τ) experimented with [6], the one finally selected for Ideal measures the number of local conflicts, defined as follows. If the discriminant τ is based on the original attribute a, w(τ) is the number of pairs of examples from different classes discriminated by τ and by no other discriminant based on a. This choice for w(τ) has several advantages, the most important one being the computational complexity: the merit of each discriminant is computed (cheaply) once at the beginning and then, whenever a discriminant τ is pruned, the merit changes for only a few discriminants (each one associated with the same original attribute a as τ, and in the case of an ordered attribute, only the two discriminants just below and just above τ along a). The second advantage of this merit function is that it makes Ideal sensitive to the relation between discriminants and original attributes. Consequently, this introduces a bias in the final solution towards sets of discriminants well spread over the different attributes, i.e. it avoids (if possible) having many discriminants related to the same attribute. We consider that in many applications this is a desirable property.
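A stripped-down sketch (ours) of the eliminative loop follows. For brevity the merit function is simplified to the plain number of separated pairs instead of Ideal's per-attribute local-conflict count, so this is the pruning counterpart of the greedy sketch above rather than a faithful Ideal implementation.

    # Illustrative sketch: eliminate a discriminant only if every cross-class
    # pair it separates remains separated by at least c other discriminants.
    def eliminate(separates, pairs, c):
        remaining = {name: set(s) for name, s in separates.items()}
        cover = {p: sum(p in s for s in remaining.values()) for p in pairs}
        # test discriminants in increasing order of (simplified) merit
        for name in sorted(remaining, key=lambda t: len(remaining[t])):
            if all(cover[p] - 1 >= c for p in remaining[name]):
                for p in remaining[name]:
                    cover[p] -= 1
                del remaining[name]
        return set(remaining)

Starting from the full initial set, this procedure prunes towards a small c-consistent subset, whereas the greedy heuristic grows one from the empty set.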
5  Experiments
The response of Ideal to the rise of the consistency constraint has been studied empirically. The algorithm was tested on 21 datasets from the UCI repository of machine learning databases [4], with different values of c, from 1 to 4. Table 1 contains the results, including the test done with Simple-Greedy, for comparison purposes. Table 1 shows that, against expectations, the obtained number of discriminants increases more than linearly with c for 6 of the datasets. We find explanation for this in the fact that there is a set of important discriminants providing large amounts of separations and thus covering a significant part of
the plain-consistency solution, but as the consistency constraint is tightened, discriminants of increasing specificity are added to the solution, resulting in a faster increase of the latter. Nevertheless, for the majority of the remaining datasets the increase is less than linear, corresponding to the expected behavior.

In terms of execution time, although it generally decreases with c, this is not a universal behavior. Both increases and decreases can be explained; however, no element allows the particular evolution for a given dataset to be predicted, as it depends on the dataset's intrinsic, non-observable characteristics. The redundancy test of Ideal is composed mainly of two nested loops, the outermost dedicated to the pairs of examples to be tested and the innermost to the search, in other dimensions, for at least c alternative discriminants separating those pairs. Raising c shortens the outermost loop, since less time will probably be needed to find a non-compliant pair, but it prolongs the innermost loop because more alternative dimensions must be analyzed until the minimal separability is found. The observed decreasing tendency in execution time is an argument in favor of eliminative procedures, as their search path is shortened when the consistency constraint is strengthened, as opposed to constructive approaches.

Table 1. Evolution of Ideal with the rise of the minimal consistency level c. Simple-Greedy (SG), a 1-consistent, constructive procedure, is provided for reference. The left-hand part of the table shows the size of the obtained discriminant sets; the D column gives the initial size. The right-hand part shows the execution times.

                        Final size d (Ideal)             Execution time (Ideal)
  dataset         D     c=1   c=2   c=3   c=4    SG       c=1      c=2      c=3     c=4      SG
  abalone       5779    192   319   513   861   171      21.1     16.4     11.4     8.2   195e3
  allhyper       440     21    36    59   110    18      53.7     43.1     31.6    22.2    16.3
  allhypo        548     20    57   110   207    19      60.2     43.7     26.9    15.5    14.6
  anneal         134     31    60    88   103    33       3.0      1.3      0.8     0.6     6.2
  audiology       92     22    47    70    76    22       0.8      0.9      0.6     0.5    10.4
  car             15     14    15    15    15    14       2.2      0.3      0.2     0.2     0.3
  dermatology    141     13    21    27    34    13       5.0      7.3      7.9     8.2     1.2
  ecoli          301     24    47    93   203    20       0.3      0.3      0.2     0.1     5.9
  glass          692     17    30    47    71    15       0.4      0.4      0.4     0.3     5.6
  heart-dise.    309     14    21    34    51    12       0.8      0.6      0.7     0.5     1.4
  krkopt          39     34    34    34    34    34      35.3      3.8      3.2     3.2   217.6
  letter         234     59    90   128   151    58    3598.3   2531.5   1368.2   395.9  2586.9
  mushroom       112      7    14    29    42     6    1469.2   1853.2   1453.5   718.3     4.6
  nursery         19     17    19    19    19    17     110.7      1.8      1.8     1.8     2.5
  page-blocks   3378     45    82   126   193    39     112.0     88.8     64.0    45.1   396.1
  diabetes       856     24    40    66   127    22       2.2      1.8      1.5     1.0    17.0
  segmentati.   9817     28    53    74   100    24     109.2    147.6    200.9   177.6   698.8
  soybean         97     25    35    44    52    22       6.7      9.5      8.0     8.2    43.9
  vehicle       1215     34    49    71    92    26      15.0     20.2     17.3    18.4    44.9
  vowel         7077     26    38    59    86    22       9.7     10.1     10.3     9.6   773.6
  yeast          374     39    82   173   271    41       4.0      2.6      1.0     0.3   139.4

  average ratio d/D    20.4   27.1  34.8  42.9  20.1
  (std)               ±30.3  ±32.4 ±33.5 ±34.2 ±30.4
  average evolution           0.7   1.7   3.1                    -0.15    -0.28   -0.43
  (std)                      ±0.4  ±1.1  ±2.4                    ±0.4     ±0.5    ±0.5
6  Conclusions and Further Research
We have described the basic concepts of Logical Analysis of Data and highlighted the need for finding a suitable binary mapping that can transform data of arbitrary form into binary data, the only format tractable by LAD. Ideal, an eliminative algorithm for finding a minimal discriminant set consistent with a set of training examples, has been described. The relation between the enlargement of the minimal differentiability among binarized objects and the resulting growth of their size was also examined. It has been shown that the growth rate depends on the data, although in the majority of the tested cases less than linear growth has been observed. For comparison, an alternative, constructive approach has been briefly described and tested with the 1-consistency constraint. We speculate that constructive procedures are, in principle, less adapted to this constraint tightening, due to the consequent longer search path. No empirical evidence has been provided, though, due to the absence of a constructive approach of satisfactory efficiency that is able to deal with the problem.

Concerning further work, we note that an early stopping criterion could accelerate the execution of the proposed algorithm without major deterioration of the results. This aspect is discussed in [6], although a suitable solution is yet to be developed. In fact, the time complexity of the redundancy tests tends to O(n²) as the elimination of discriminants proceeds, while in this latter phase only small decreases in the discriminant set size are observed. A natural goal for follow-up activity consists of measuring the quality of the obtained binary mappings applied to LAD in classification tasks.
References

1. Hussein Almuallim and Thomas G. Dietterich. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1–2):279–306, 1994.
2. E. Boros, P. L. Hammer, Toshihide Ibaraki, and A. Kogan. Logical analysis of numerical data. Technical Report RRR 4-97, RUTCOR, 1997.
3. E. Boros, P. L. Hammer, Toshihide Ibaraki, A. Kogan, E. Mayoraz, and I. Muchnik. An implementation of logical analysis of data. RRR 22-96, RUTCOR – Rutgers University's Center For Operations Research, http://rutcor.rutgers.edu:80/~rrr/, July 1996. To appear in IEEE Transactions on Knowledge and Data Engineering.
4. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
5. Peter L. Hammer. Partially defined Boolean functions and cause-effect relationships. International Conference on Multi-Attribute Decision Making Via OR-Based Expert Systems, University of Passau, Germany, April 1986.
6. Miguel Moreira, Alain Hertz, and Eddy Mayoraz. Data binarization by discriminant elimination. In Ivan Bruha and Marco Bohanec, editors, Proceedings of the ICML-99 Workshop: From Machine Learning to Knowledge Discovery in Databases, pages 51–60, 1999. ftp://ftp.idiap.ch/pub/reports/1999/rr99-04.ps.gz.
Extending Attribute-Oriented Induction as a Key-Preserving Data Mining Method

Maybin K. Muyeba and John A. Keane

Department of Computation, UMIST, P O BOX 88, Manchester, UK
muyeba, [email protected]
Abstract. Attribute-Oriented Induction (AOI) is a set-oriented data mining technique used to discover descriptive patterns in large databases. The classical AOI method drops attributes that either possess a large number of distinct values or have no concept hierarchies; this includes the keys of relational tables. This implies that the final rule(s) produced have no direct link to the tuples that form them, and therefore the discovered knowledge cannot be used to efficiently query specific data pertaining to this knowledge in a relation other than the learning relation. This paper presents the key-preserving AOI algorithm (AOI-KP) with two implementation approaches. The order complexity of the algorithm is O(np), which is the same as for the enhanced AOI algorithm, where n and p are the number of input and generalised tuples respectively. An application of the method is illustrated, and prototype tool support and initial results are outlined together with possible improvements.
1  Introduction
Data mining is the extraction of interesting patterns concealed in large databases [1]. There are various algorithms for extracting these patterns, such as association, sequencing and classification [2]. Attribute-Oriented Induction (AOI) [3] is a set-oriented generalisation technique used to find various types of rules. The method integrates learning-from-examples techniques with database procedures. The AOI method basically involves three primitives that specify the learning task: collection of initial task-relevant data (data collection), use of background knowledge (domain knowledge) [9] during the mining process, and representation of the learning result (rule formation). The fundamental principle in AOI is to generalise the initial relation to a prime relation and then to a final relation using background knowledge and user-defined threshold(s). Tuples found to be identical are merged into one, with counts accumulated. The AOI method is widely applicable and this underlines its importance: for example, mining characteristic and classification rules [3], multiple-level generalisation [9] and cooperative or intelligent query answering [8][10].
The contribution of this paper is two-fold. Firstly, AOI is extended by preserving keys from the initial task-relevant relation and associating them with the final rules, while at the same time keeping the complexity of the key-preserving algorithm (AOI-KP) at O(np), the same as for the enhanced AOI algorithm [4]. Secondly, the approach allows a user to make more efficient data queries on large data relations other than the learning relation, provided the preserved keys index this data. We are unaware of other specific work on AOI using a key-preserving method for performing such queries. This paper is organised as follows: Section 2 considers related work; Section 3 presents the proposed algorithm, AOI-KP, and its complexity is discussed in Section 4; Section 5 illustrates the application of the approach; Section 6 describes results from a prototype support tool for the method together with further improvements; and Section 7 concludes.
2  Related Work
This work builds on the classical AOI method introduced in [3]-[6], the rule-based approach [11] and intelligent query answering [10]. In [11], the information loss problem due to generalisation in rule-based AOI is addressed by introducing a backtracking method using a generalised tuple called a covering tuple. However, the method cannot uniquely distinguish each covered tuple and so may be inefficient for performing queries directly on the discovered knowledge. In [7], a framework is proposed for answering both data and knowledge queries to a knowledge base of substantial volume; our work mainly concerns data queries on the usual database data [11]. In [8], key-preserving and key-altering generalisations are discussed: if a key is generalised, information is lost when joins are made on the generalised relations. Two memory-based AOI algorithms are reported in [3][4] and run in O(n log n) and O(np) time respectively, where n and p are the number of tuples in the input and generalised relations respectively. Further, two extensions to AOI that run in O(n) have been implemented [5,6]. However, in none of these approaches is preservation of the key attribute identifying each tuple addressed in the way our approach requires.
3  The Key-Preserving AOI Method

When data is retrieved, the system stores the table in memory and applies the induction process within the user-supplied thresholds. An implementation concern is how to determine the size of the key list a priori. The default value assumed could be the number of input tuples in the initial relation; the actual size can only be determined dynamically.

We introduce definitions relevant to the key-preserving algorithm. We assume an input relation with n tuples, where each tuple is uniquely identified by a key, which is an integer value. Keys are stored in a static or dynamic array, here called a key list for convenience. We also assume that the keys are unaffected by database updates. For any attribute Ai (i = 1, .., n), let tp and tq be any two tuples with keys p and q such that tp(A1) = p, tq(A1) = q, p < q.
4  Algorithm Complexity
AOI can be applied to large databases as it progressively reduces the search space (see Figure 3). The initial problem is that all the examples forming the initial relation will be loaded into memory, and data mining algorithms should be both space and time efficient [5]. To save time, the construction of attribute concept hierarchies is done dynamically as the input is read, to avoid re-scanning the input data or retrieving it from disk. The algorithm proposed here works in the same way as AOI [4] but preserves the keys of the initial table. The order complexity of AOI-KP is calculated as follows, assuming p = n*k, 0 < k < 1. The key list space requirement for dynamic arrays at any point during program execution is O(n/m + c), where c is a small space increment due to equivalent tuples' key insertions, which is much less than the O(np) of the static key approach.
¹ See Figure 4 of Section 6.
STEP 1: Collect data and determine distinct values of Ai
BEGIN
  Declare Key_list array with two dimensions
  Assign first key to Key_list as an indexing key and insert the table
  FOR EACH attribute Ai (1 ≤ i ≤ n, Ai != A1) DO
    Construct hierarchy for attribute Ai
  WHILE attribute threshold not reached DO
  BEGIN
    IF Ai has no hierarchy THEN
      Remove attribute Ai
    ELSE
      Substitute Ai by its next level generalised concept;

STEP 2: Merge any identical tuples tp, tq; propagate counts, keys
    IF any two tuples tp, tq are equivalent THEN
    BEGIN
      Assign key p of tuple tp to variable KEY
      Look for indexing key KEY in Key_list to determine its row
      IF KEY is found THEN
      BEGIN
        Insert key(s) of tuple tq as ordinary key(s) in KEY's row
        Increase tuple count of inserted row in table
        Merge the tuples
      END
      ELSE
      BEGIN
        Insert indexing key KEY, ordinary key q in next available row
        Increase tuple count of inserted row in table
        Merge the tuples
      END // end if KEY is found
    END
    ELSE IF next row in table is empty THEN
    BEGIN
      Insert tuple with key q in table
      Insert key q as indexing key in next empty row of Key_list
      Assign one to tuple count of inserted row in table
    END // end if any two tuples
  END // end while
  Repeat STEP 2 until the entire table and key list are checked.

STEP 3: Check tuple threshold and produce final table
  WHILE number of tuples is more than rule threshold DO
  BEGIN
    Selectively generalise an attribute with distinct values > rule threshold
    Merge tuples, propagate keys and increase counts using STEP 2
  END
END. // Key-preserving AOI

Fig. 1. The Key-Preserving Algorithm (AOI-KP)
This gives a total order complexity of O(np) for large values of n. The space requirement is O(n + p) plus O(n/m + c), or simply O(n). The dynamic key propagation approach therefore executes much faster and uses less memory than the static approach².
5  Application of the Approach
Consider a University database with the following schema:

  Student (Sno, Sname, Sex, Major, Department, Birth_place, Residence)
  Course (Cunit, Ctitle, Department, Ts, Te)
  Exam (Cunit, Sno, GPA)
  Calendar (Period, Acc_year, Ts, Te)
  Registration (Sno, Rstatus, Ctitle, Acc_year)
where Cunit is course unit, Ctitle is course title, Acc_year is a user-defined time for the academic year, Sno is student number, Rstatus is registration status, and Ts and Te are the start and end times of the respective periods of study and courses. In addition, assume the instance-based concept hierarchies for the attributes Sex, Major and Birth_place given in Figure 2. The initial table is retrieved with attributes Sno, Sex and Birth_place for learning about postgraduate students in a chosen academic year; the temporal attribute Acc_year is used in the WHERE clause of the SQL statement, thus making the discovered rules temporal.
{M.Sc, Ph.D, ..} ⊂ Postgraduate, {Postgraduate, Undergraduate} ⊂ ANY (Major)
{Female, Male} ⊂ ANY (Sex)
{Bradford, Liverpool, Manchester, Edinburgh, .., London} ⊂ UK
{Chicago, Boston, New York, .., Washington} ⊂ USA
{Calcutta, Bombay, .., New Delhi} ⊂ India
{Shanghai, Nanjing, .., Beijing} ⊂ China, {India, China, ..} ⊂ Asia
{UK, France, Germany, ..} ⊂ Europe
{USA, Canada, ..} ⊂ America, {America, Asia, Europe, ..} ⊂ ANY (Birth_place)

Fig. 2. Concept hierarchies for Sex, Major and Birth_place
After choosing a threshold of 3, the generalisation process preserves the key column ‘Sno’ associated with each tuple. The generalisation process continues until thresholds are reached. 2
See execution times for both methods on Figure 4 of section 6
Table 1. (a) Initial Table
  Sno  Sex  Birth_place
  1    M    USA
  2    F    UK
  3    F    USA
  4    M    INDIA
  5    M    INDIA
  6    F    UK
  7    F    UK
  8    M    CHINA
  9    M    CHINA
  10   M    USA

Table 2. (b) Prime Table
  Sno  Sex  Birth_place  Count
  1    M    USA          2
  2    F    UK           3
  3    F    USA          1
  4    M    INDIA        2
  8    M    CHINA        2

Table 3. (c) Final Table
  Sno  Sex  Birth_place  Count
  1    ANY  AMERICA      3
  2    F    EUROPE       3
  4    M    ASIA         4

(d) Key lists linked to the rules
  Rule 1: 1, 3, 10
  Rule 2: 2, 6, 7
  Rule 3: 4, 5, 8, 9

Fig. 3. Key propagation in the AOI process
The discovered rule would then be expressed in logic form as:

  ∀(x) Postgraduate(x) ∧ Acc_year(x) = "Year 1" →
      Birth_place(x) ∈ America [30%] [Rule keys = 1, 3, 10]
    ∨ Sex(x) = "Female" ∧ Birth_place(x) ∈ Europe [30%] [Rule keys = 2, 6, 7]
    ∨ Sex(x) = "Male" ∧ Birth_place(x) ∈ Asia [40%] [Rule keys = 4, 5, 8, 9]
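For illustration, the merge-and-propagate step underlying Fig. 3 can be sketched as follows (our own toy code, not TAGET); applied to the initial table it reproduces the prime table (b), its counts and its key lists.

    # Illustrative sketch: merge identical tuples, accumulate counts and
    # propagate Sno keys, as in the (a) -> (b) step of Fig. 3.
    def merge_with_keys(rows):
        groups = {}                                  # (Sex, Birth_place) -> key list
        for sno, sex, place in rows:
            groups.setdefault((sex, place), []).append(sno)
        return [(keys[0], sex, place, len(keys), keys)
                for (sex, place), keys in groups.items()]

    initial = [(1, "M", "USA"), (2, "F", "UK"), (3, "F", "USA"), (4, "M", "INDIA"),
               (5, "M", "INDIA"), (6, "F", "UK"), (7, "F", "UK"), (8, "M", "CHINA"),
               (9, "M", "CHINA"), (10, "M", "USA")]
    for row in merge_with_keys(initial):
        print(row)       # e.g. (2, 'F', 'UK', 3, [2, 6, 7]) for the F/UK tuple

A further pass that climbs the Fig. 2 hierarchies and selectively generalises the over-threshold attributes yields the final table (c) and the rule key lists (d).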
Knowledge discovered by the AOI algorithm could be used to provide answers to data and knowledge queries [7,10]. For example, the rule "All postgraduate students lived in private accommodation in year one" would be necessary to answer the queries: "Did any postgraduate students live in University accommodation then?", "Who were they?", "Which of them had an excellent GPA?". The first query can be answered from the rule base. The latter queries, which involve listing and aggregation, can be answered efficiently with rule keys by querying the relevant tables. Moreover, the final generalised table, termed a knowledge table, can be used to join other data tables with at least the remaining tuples.
6  Prototype and Results

A prototype generalisation tool, TAGET (Temporal Attribute-oriented GEneralisation Tool), to mine temporal characteristic rules using the AOI approach is under development, and initial results are shown in Figure 4. TAGET is a memory-based approach that uses data structures to store and manipulate data and concept hierarchies.

Fig. 4. TAGET performance measures (time in seconds against number of tuples, for the dynamic and static approaches)

Performance measures contrasting the dynamic and the static approaches for AOI-KP are shown in Figure 4, using an Intel P267 MHz with 64MB of memory and an attribute threshold of 3 as shown in Figure 3. Each set of input tuples was run five times and the average time computed. The initial results show that, even with key propagation, the performance of AOI-KP is not drastically affected by the dynamic approach, apart from memory limitations; we briefly explain how to handle this problem in the conclusion.
7. Conclusion This paper has described an extension to AOI with key preservation (AOI-KP) that maintains the order complexity O(np) of AOG [4]. It has been shown using the
preserved keys, the database of discovered knowledge can be interrogated more efficiently. For large databases, the output from data queries that involve listing would be enormous for a single rule; this would need a threshold value on either the size of the key list or the number of keys required by the user. TAGET can be further improved by using concurrency mechanisms during tuple and key insertion as well as file I/O for keys. In addition, assuming each tuple read will be generalised at least once, the order complexity can be improved to O(n) [5] by pre-generalisation. Using this method, the number of generalised tuples in memory will be orders of magnitude smaller than the original task-relevant data.
References
1. Frawley, W.J. and Piatetsky-Shapiro, G. (1991) "Knowledge Discovery in Databases", AAAI/MIT Press, pp. 1-27.
2. Agrawal, R., Imielinski, T. and Swami, A. (1993) "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December.
3. Han, J., Cercone, N. and Cai, Y. (1991) "Attribute-Oriented Induction in Relational Databases", in G. Piatetsky-Shapiro and W.J. Frawley (eds.), Knowledge Discovery in Databases, pp. 213-228.
4. Han, J., Cai, Y. and Cercone, N. (1993) "Data-Driven Discovery of Quantitative Rules in Relational Databases", IEEE Transactions on Knowledge and Data Engineering, 5(1):29-40, February.
5. Carter, C.L. and Hamilton, H.J. (1998) "Efficient Attribute-Oriented Generalisation for Knowledge Discovery from Large Databases", IEEE Transactions on Knowledge and Data Engineering, 10(2):193-208, March.
6. Hwang, H. and Fu, W. (1995) "Efficient Algorithms for Attribute-Oriented Induction", in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pp. 168-173, Montreal, Quebec.
7. Motro, A. and Yuan, Q. (1990) "Querying Database Knowledge", in Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 173-183, Atlantic City, NJ.
8. Han, J., Fu, Y. and Ng, R.T. (1994) "Cooperative Query Answering Using Multiple Layered Databases", in Proceedings of the International Conference on Cooperative Information Systems (COOPIS '94), pp. 47-58, Toronto, Canada.
9. Fu, Y. (1996) "Discovery of Multiple-Level Rules from Large Databases", Ph.D. thesis, Computing Science, Simon Fraser University, July.
10. Han, J., Huang, Y., Cercone, N. and Fu, Y. (1996) "Intelligent Query Answering by Knowledge Discovery Techniques", IEEE Transactions on Knowledge and Data Engineering, 8(3):373-390, June.
11. Cheung, D.W., Fu, A.W.-C. and Han, J. (1994) "Knowledge Discovery in Databases: A Rule-Based Attribute-Oriented Approach", in Proceedings of the 1994 International Symposium on Methodologies for Intelligent Systems (ISMIS '94), pp. 164-173, Charlotte, North Carolina.
Automated Discovery of Polynomials by Inductive Genetic Programming Nikolay Nikolaev1 and Hitoshi Iba2 1
Department of Computer Science, American University in Bulgaria, Blagoevgrad 2700, Bulgaria, [email protected] 2
Department of Information and Communication Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan, [email protected]
Abstract. This paper presents an approach to automated discovery of high-order multivariate polynomials by inductive Genetic Programming (iGP). Evolutionary search is used for learning polynomials represented as non-linear multivariate trees. Optimal search performance is pursued by balancing the statistical bias and the variance of iGP. We reduce the bias by extending the set of basis polynomials for better agreement with the examples. Possible overfitting due to the reduced bias is counteracted by a variance component, implemented as a regularizing factor of the error in an MDL fitness function. Experimental results demonstrate that regularized iGP discovers accurate, parsimonious, and predictive polynomials when trained on practical data mining tasks.
1 Introduction
Inductive Genetic Programming (iGP) is considered a specialization of the Genetic Programming (GP) paradigm [7] for automated knowledge discovery from data. The reasons for this specialization are [9]: 1) inductive knowledge discovery is a search problem and GP is a versatile framework for exploration of large search spaces; 2) GP provides genetic operators that can be tailored to the particular data mining task; and 3) GP flexibly reformulates program solutions. An advantage of iGP is that it automatically discovers the size of the solutions. Previous research showed that iGP is successful for various data mining applications such as financial engineering [7], classification [2], and time-series prediction [11], [5]. A commonality in these evolutionary systems is that they discover non-linear model descriptions and construct non-linear discriminant boundaries among the example data. This observation inspires us to consider Kolmogorov-Gabor polynomial models represented as non-linear multivariate trees. An iGP system for evolutionary discovery of multivariate high-order polynomials, STROGANOFF [5], is enhanced. The intention is to achieve optimal search performance which acquires solutions that are not only parsimonious and accurate but also highly predictive. Our strategy for improving the performance trades off between the statistical bias and the statistical variance of iGP. Statistical bias is the set of basis polynomials with which iGP constructs the target polynomials. Statistical variance is the deviation of the learning efficacy from one sample of examples to another sample that suggests the same target polynomial.
The iGP control balances the statistical bias and variance in the following way: 1) an extended set of basis polynomials is used to reduce the statistical bias, thus to fit flexibly the examples; 2) a regularization technique is applied to the fitness function to diminish the variance, in the sense of tendency to overfit the particularities and noise in the examples. The effect of balancing the statistical bias with a variance factor is increasing the degree of generalization and improving the global search performance. We implement an iGP system with the regularized MDL function, proportional selection, and application of the crossover and mutation operators dependent on the tree size [9]. Mutation and crossover points are chosen with recombinative guidance by the largest error in the tree nodes [5]. Empirical evidence for the efficiency of this regularized iGP on data mining tasks is provided. The next section of the paper gives the representation of multivariate highorder polynomials as trees and explains how an extended basis set impacts the search performance. Section three defines the regularized MDL-based fitness function. The iGP performance with this fitness formula is investigated in section four. Finally, a discussion is made and conclusions are derived.
2 Knowledge Discovery and Regression
The problem of inductive knowledge discovery can be formulated as a nonparametric regression problem. Given examples D = {(x_i, y_i)}_{i=1}^N of instantiated vectors of independent variables x_i = (x_{i1}, x_{i2}, ..., x_{il}) ∈ R^l, and the dependent variable y_i ∈ R, the goal is to find models y = f(x). Due to noise perturbations, the goal becomes to find the best approximation f(x) of the regression function f~(x) = E[y|x] by minimizing the empirical error on the examples. When the normal distribution is considered, the least squares fitting criterion is used to search for the function f(x) that minimizes the average squared residual ASR:

ASR = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2    (1)
where y_i is the true outcome of the i-th example, f(x_i) is the estimated outcome for this example x_i, and N is the sample size.

2.1 Polynomials as Trees
We develop a knowledge discovery system that deals with high-order multivariate polynomials represented by non-linear multivariate trees [5]. They are interpreted as Kolmogorov-Gabor polynomials:

f(x) = a_0 + \sum_i a_i x_i + \sum_i \sum_j a_{ij} x_i x_j + \sum_i \sum_j \sum_k a_{ijk} x_i x_j x_k + \ldots    (2)
There are three main issues in the automated discovery of polynomial models: 1) how to find the polynomial coefficients; 2) which terms with which variables to select; and 3) how to avoid overfitting with the example data.
2.2 Set of Basis Polynomials
The polynomials are modeled according to the Group Method of Data Handling (GMDH) [6]. A polynomial is a composition of bivariate basis polynomials in the nodes and independent variables in the leaves. We extend the basis Φ into a set with all complete and incomplete, first and second order polynomials in order to increase the flexibility of iGP to fit the data. The complete basis polynomial is:

f_j(x) = a^T h(x)    (3)

which, if all coefficients in a = (a_0, a_1, ..., a_5) are non-zero, i.e. each a_i ≠ 0, 0 ≤ i ≤ 5, expands to:

f_j(h(x)) = a_0 h_0(x) + a_1 h_1(x) + ... + a_5 h_5(x)    (4)

The vector h = (h_0(x), h_1(x), h_2(x), h_3(x), h_4(x), h_5(x)) consists of simple functions h_i that produce the polynomial terms: h_0(x) = 1, h_1(x) = x_1, h_2(x) = x_2, h_3(x) = x_1 x_2, h_4(x) = x_1^2, and h_5(x) = x_2^2. The basis set is derived from the complete second-order polynomial (4). The total number of incomplete polynomials is 25, out of all 2^5 combinations of monomials a_i h_i(x), 1 ≤ i ≤ 5, plus the leading constant term a_0, that contain both variables x_1 and x_2. We use a subset of |Φ| = 17 of them after elimination of the symmetric polynomials. The target polynomial f(x) is built by bottom-up tree traversal, composing the basis polynomials in the nodes. The benefit of this cascaded tree-like representation of the polynomials is that it allows tractable evaluation of complex, high-order models due to the composition of simple functions with parameters that are computed fast. The coefficients at each tree node are calculated with the matrix formula:

a = (H^T H)^{-1} H^T y    (5)

where H is an N × 6 matrix of vectors h_i = (h_{i0}, h_{i1}, ..., h_{i5}), i = 1..N, and y is an N × 1 output vector. This is a solution of the ordinary least-squares (OLS) fitting problem by the method of normal equations.
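As an illustration of how a single tree node could be fitted, the following sketch builds the six-term basis vector h(x) of (4) and solves the normal equations (5) with NumPy. It is a hedged reconstruction, not the STROGANOFF code; the toy target and the function names are our own.

```python
import numpy as np

def h(x1, x2):
    """Terms of the complete bivariate second-order basis polynomial (4)."""
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def fit_node(X, y):
    """Coefficients of one tree node by the normal equations (5).
    X is an (N, 2) array with the node's two input series, y the target."""
    H = np.array([h(x1, x2) for x1, x2 in X])   # N x 6 design matrix
    a = np.linalg.solve(H.T @ H, H.T @ y)       # a = (H^T H)^{-1} H^T y
    return a, H @ a                             # coefficients and node output

# Toy usage: the output of a fitted node would feed its parent node in the tree.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] * X[:, 1] + 0.5 * X[:, 1] ** 2
a, out = fit_node(X, y)
print(np.round(a, 3))   # recovers approximately [1, 0, 0, 2, 0, 0.5]
```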
2.3 Genetic Operators
The second issue in discovering high-order multivariate polynomials, raised in subsection 2.1, concerns the organization of the iGP search for polynomial terms. We use specific mutation and crossover operators. The mutation operator is context-preserving [9]. It modestly transforms a tree with three elementary submutations: 1) substitution of an arbitrary node by another one; 2) insertion of a node as a parent of a subtree, so that the subtree becomes the leftmost child of the new node; and 3) deletion of a node only when no subtree below is to be cut. This mutation is applied with probability p_m = m|g|^2, where m is a free parameter and |g| is the size of tree g. The iGP crossover splices two trees with probability p_c = c/\sqrt{|g|}, where c is a free parameter, or swaps them. This crossover operator produces offspring of larger size than their parents if the parents are of very small size. The convergence of the evolutionary iGP search process is accelerated by recombinative guidance [5]. The tree nodes in which the basis polynomials have the largest ASR error are deterministically chosen for mutation and crossover.
3 Regularization Approach to iGP
The third issue in discovering polynomials, raised in subsection 2.1, concerns overfitting avoidance. The problem is to determine the polynomial terms and coefficients with which the polynomial optimally approximates the data without overfitting. Criteria for learning parsimonious and accurate models are given by the Minimum Description Length (MDL) principle. Regularization theory provides heuristics for increasing the predictability.

3.1 The MDL-Based Fitness Function

Adapted for the purpose of polynomial discovery, the MDL principle can be stated as follows: given a set of examples and an effective enumeration of their polynomial models, prefer with greatest confidence the polynomial which together has high learning accuracy and low structural complexity. We adopt the following MDL-based fitness function [1]:

MDL = ASR + \sigma^2 \frac{A}{N} \log(N)    (6)

where ASR is the average squared residual, A is the number of coefficients, N is the number of examples, and \sigma^2 is a rough estimate of the error variance.

3.2 Statistical Bias and Variance

When trying to find polynomials from a fixed and finite example set, the ASR error may be low, but this is not enough to anticipate high generalization. The reason is that the examples are often noisy. This may be combated by decomposing the error into a statistical bias and a variance component [4]. The statistical bias, proportional to (E_D[f(x)] - E[y|x])^2, accounts only for the degree of fitting the examples, but not for the level of extrapolation. It is the variance, proportional to E_D[(f(x) - E_D[f(x)])^2], that accounts for the generalization. The risk of overfitting the examples can be minimized if a variance factor is added to the error component of the fitness function. We introduce a correcting complexity term that penalizes large coefficients in a regularized average error RAE:

RAE = \frac{1}{N} \left( \sum_{i=1}^{N} (y_i - f(x_i))^2 + k \sum_{j=1}^{A} a_j^2 \right)    (7)

where k is a regularization parameter. The motivation for this definition is that large coefficients imply a fluctuating polynomial with large amplitudes that overfits the examples, while small coefficients imply a more "regular" approximation. We use RAE in the MDL function instead of ASR, and call the result MDL_R from now on. The manner of calculating the polynomial coefficients can be derived from the minimum of the function \sum_{i=1}^{N} (y_i - f(x_i))^2 + k \sum_{j=1}^{A} a_j^2 with respect to the coefficients a_j, 1 ≤ j ≤ A, assuming the least squares fitting criterion:

a = (H^T H + kI)^{-1} H^T y    (8)
where I is the identity matrix. We select values for k relying on a proof that, as long as 0 < k < 2\sigma^2 / a^T a, the mean squared error of the identified polynomial is smaller than that of the best estimator without correction.
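A corresponding sketch of the regularized fit is given below, again as an assumed illustration rather than the authors' implementation: eq. (8) for the coefficients, eq. (7) for RAE, and the MDL_R fitness obtained by substituting RAE for ASR in (6). The value k = 0.01 mirrors the setting later reported in Table 1; H is the node design matrix of Sect. 2.2.

```python
import numpy as np

def fit_node_regularized(H, y, k=0.01):
    """Regularized node coefficients, eq. (8): a = (H^T H + kI)^{-1} H^T y."""
    A = H.shape[1]
    return np.linalg.solve(H.T @ H + k * np.eye(A), H.T @ y)

def rae(H, y, a, k=0.01):
    """Regularized average error, eq. (7)."""
    residuals = y - H @ a
    return (residuals @ residuals + k * (a @ a)) / len(y)

def mdl_r(H, y, a, sigma2, k=0.01):
    """MDL_R fitness: eq. (6) with RAE substituted for ASR."""
    A, N = len(a), len(y)
    return rae(H, y, a, k) + sigma2 * (A / N) * np.log(N)
```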
4 Search Performance
We present experimental results aiming to answer two questions: can the regularization approach to iGP find polynomials with better predictive capacities than the ordinary iGP, i.e. the old system STROGANOFF? And how does the regularization affect the generalization quality? The iGP performance is studied with five data mining tasks from the machine learning repository [8]: Iris, Ionosphere, Glass, Credit, and Vehicle. Because of the character of the system, experiments were conducted with iGP to find one polynomial for each particular class of each of these tasks. In Table 1 we give results, measured by the percentage of correctly recognized training examples. The two iGP systems are compared with the knowledge discovery system Ltree [3], which uses oblique decision trees and also finds non-linear approximations of the data, and with the decision tree learning system C4.5 [10]. An oblique or a decision tree classifies the data into all classes, so the comparison with the polynomials is not straightforward. That is why we computed the average iGP variances from each group of polynomials that together learn all classes of a task. The variances are evaluated by 10-fold cross-validation.

Data / measure           Ordinary iGP   Regularized iGP   Ltree        C4.5
Iris        variance(%)  99.74±0.16     99.95±0.21        97.15±2.85   95.15±4.85
            accuracy(%)  99.965         99.982            100          100
Ionosphere  variance(%)  89.23±0.75     90.14±0.26        90.6±4.0     90.9±5.0
            accuracy(%)  92.44          93.37             94.6         95.9
Glass       variance(%)  72.55±5.04     78.10±3.62        65.5±8.0     67.7±12.0
            accuracy(%)  80.05          82.18             73.7         79.7
Credit      variance(%)  77.15±3.48     81.06±2.52        73.6±5.0     70.9±4.0
            accuracy(%)  81.53          83.75             78.6         74.9
Vehicle     variance(%)  75.11±5.12     79.47±2.85        77.5±5.0     71.2±4.0
            accuracy(%)  80.22          83.44             82.5         75.2

Table 1. Variance of the discovered tree-models by Regularized iGP using RAE with k = 0.01, Ordinary iGP, Ltree and C4.5, and accuracy of the best tree-models.
Table 1 shows that the regularized iGP converges to more accurate polynomials than the ordinary STROGANOFF, and also the variances of the regularized polynomials are smaller, therefore they feature higher generalization. One may observe that iGP discovers better solutions of complex tasks, like Glass, Credit, and Vehicle. The reason is that iGP may evolve very high-order polynomials that closely approximate the data. The non-evolutionary systems Ltree and C4.5 are slightly better on the simpler data sets Iris and Ionosphere. On the complex tasks iGP discovers shorter trees, for example of size 8 with 24 coefficients on the Glass data while Ltree produces trees of size approximately 34 and C4.5 acquires trees of size 44. Although the iGP system induces trees of almost equal size to these learned by Ltree and C4.5 on the simple tasks, 5 on the Iris data set and respectively 12, 15, 19 on the Ionosphere data set, the evolved polynomials include more terms.
5 Discussion
An essential advantage of this iGP is that the polynomial coefficients are rapidly computed as least-squares solutions by the method of normal equations. The regularization improves the previous ordinary iGP STROGANOFF by providing a reliable scheme for computing the coefficients and, thus, avoiding problems of ill-posedness of the example matrix. The iGP approaches from the STROGANOFF family, as well as GMDH [10] and MAPS [2], have the ability to find the structure of the polynomials automatically. iGP performs search in the space of whole polynomials, while the others iteratively grow a single polynomial. When GMDH and MAPS learn one polynomial layer by layer, they constrain the feeding of higher tree layers, since they restrict the search to subsets of some plausible basis polynomials.
6 Conclusion
This paper contributes to the research into discovery of high-order multivariate polynomials by iGP. It demonstrated an iGP system that can be used for applied nonparametric approximation due to the following advantages: 1) it discovers the model structure automatically; 2) it generates explicit analytical representations in the form of polynomials; and 3) it makes the polynomials well-conditioned and suitable for practical purposes.
References
1. Barron, A.R., and Xiao, X. (1991) Discussion on MARS. Annals of Statistics, 19:67–82.
2. Freitas, A.A. (1997) A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Regression. In Genetic Programming 1997: Proc. of the Second Annual Conference, 96–101, Morgan Kaufmann, CA.
3. Gama, J. (1997) Oblique Linear Tree. In X. Liu, P. Cohen, and M. Berthold (Eds.), Advances in Intelligent Data Analysis IDA-97, 187–198, Springer, Berlin.
4. Geman, S., Bienenstock, E., and Doursat, R. (1992) Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4(1):1–58.
5. Iba, H., and de Garis, H. (1996) Extending Genetic Programming with Recombinative Guidance. In Advances in Genetic Programming 2, The MIT Press, 69–88.
6. Ivakhnenko, A.G. (1971) Polynomial Theory of Complex Systems. IEEE Trans. on Systems, Man, and Cybernetics, 1(4):364–378.
7. Koza, J.R. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge.
8. Merz, C.J., and Murphy, P.M. (1998) UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Dept. of Information and Computer Science [www.ics.uci.edu/~mlearn/MLRepository.html].
9. Nikolaev, N., and Slavov, V. (1998) Concepts of Inductive Genetic Programming. In EuroGP'98: First European Workshop on Genetic Programming, LNCS 1391, 49–59, Springer, Berlin.
10. Quinlan, R. (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
11. Zhang, B.-T., and Mühlenbein, H. (1995) Balancing Accuracy and Parsimony in Genetic Programming. Evolutionary Computation, 3(1):17–38.
Diagnosing Acute Appendicitis with Very Simple Classification Rules Aleksander Øhrn and Jan Komorowski Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
Abstract. A medical database with 257 patients thought to have acute appendicitis has been analyzed. Binary classifiers composed of very simple univariate if-then classification rules (1R rules) were synthesized, and are shown to perform well for determining the true disease status. Discriminatory performance was measured by the area under the receiver operating characteristic (ROC) curve. Although an 1R classifier seemingly performs slightly better than a team of experienced physicians when only readily available clinical variables are employed, an analysis of cross-validated simulations shows that this perceived improvement is not statistically significant (p < 0.613). However, further addition of biochemical test results to the model yields an 1R classifier that is significantly better than both the physicians (p < 0.03) and an 1R classifier based on clinical variables only (p < 0.0003).
1 Introduction
Acute appendicitis is one of the most common problems in clinical surgery in the western world, and its diagnosis is sometimes difficult to make, even for experienced physicians. The costs of the two types of diagnostic errors in the binary decision-making process are also very different. Clearly, it is desirable to avoid unnecessary operations. But failing to operate at an early enough stage may lead to perforation of the appendix. Perforation of the appendix is a serious condition, and leads to morbidity and occasionally death. Therefore, a high rate of unnecessary surgical interventions is usually accepted. Analysis of collected data with the objective of improving various aspects of diagnosis is therefore potentially valuable. This paper reports on an analysis of a database of patients thought to have acute appendicitis. The main objective of this study has been to address the following two questions: (1) Based only upon readily available clinical attributes, does a computer model perform better than a team of physicians at diagnosing acute appendicitis? and (2) Does a computer model based upon both clinical attributes and biochemical attributes perform better than a model based only upon the clinical attributes? These two issues have previously been addressed in the medical literature by Hallan et al. [3, 4], using the same database of patients as presently considered. Multivariate logistic regression (MLR), the de facto standard method for analysis of binary data in the health sciences, was used in
those studies. This paper addresses the same issues, but using one of the simplest approaches to rule-based classification imaginable, namely a collection of univariate if-then rules. Univariate if-then rules are also referred to as 1R rules.
2 Preliminaries
Let U denote the universe of patients, let A denote the set of classifier input attributes, and let d denote the outcome attribute. The set of 1R rules is defined as all rules on the form “if (a = a(x)) then (d = d(x))”, where a ∈ A and x ∈ U . 1R rules have previously been investigated by Holte [7]. A binary classifier realizes a composed decision function θτ ◦ φ, where φ(x) ∈ [0, 1] measures the classifier’s certainty that a patient x has outcome 1. The function θτ is a simple threshold function that evaluates to 0 if φ(x) < τ , and 1 otherwise. By varying τ and plotting the resulting true positive rates against the false positive rates, one obtains a receiver operating characteristic (ROC) curve. ROC analysis is a graphical method for assessing the discriminatory performance of a binary classifier [5], independent of both error costs and the prevalence of disease. The area under the ROC curve (AUC) is of particular interest, as it equals the Wilcoxon-Mann-Whitney statistic. An AUC of 0.5 signifies that the classifier performs no better than tossing a coin, while an area of 1.0 signifies perfect discrimination.
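To make the definitions concrete, the sketch below generates 1R rules from a training set and realizes a certainty function φ by voting among the rules that fire on a case. This is one plausible reading of the construction, not the ROSETTA implementation used later; the dictionary representation and the outcome key "d" are assumptions of the sketch.

```python
from collections import Counter

def one_r_rules(examples, attributes):
    """1R rules 'if a = v then d = c', one per (attribute, value) pair seen in
    the training data, with c the majority outcome among cases having a = v."""
    rules = []
    for a in attributes:
        by_value = {}
        for x in examples:
            by_value.setdefault(x[a], Counter())[x["d"]] += 1
        for v, counts in by_value.items():
            rules.append((a, v, counts.most_common(1)[0][0]))
    return rules

def certainty(rules, case):
    """phi(x): fraction of firing 1R rules that vote for outcome 1;
    theta_tau then thresholds this value at tau to obtain the classifier."""
    votes = [c for a, v, c in rules if case.get(a) == v]
    return sum(votes) / len(votes) if votes else 0.0
```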
3 Methodology
For employing 1R rules, discretization of numerical attributes is a necessary prerequisite. In this study, for simplicity, all numerical attributes were discretized using an equal frequency binning technique with three bins, intuitively corresponding to labeling the values “low”, “medium” or “high” relative to the observations. To make the most out of scarce data, k-fold cross-validation (CV) was employed. In the training stage of the CV pipeline, the union of the k − 1 blocks were first discretized. 1R rules were subsequently computed from the discretized union of blocks. In the testing stage, the hold-out block was first discretized using the same bins that were computed in the training stage, and the cases in the discretized hold-out block were then classified using standard voting among the previously computed 1R rules. The results from the voting processes among the 1R rules were used to construct ROC curves. Performance measures for each iteration were harvested by computing the area under the ROC curves (AUC), computed using the trapezoidal rule for integration, as well as their associated standard errors as determined by the Hanley-McNeil formula [5]. Two variations of k-fold CV were applied. First, a single 10-fold CV replication was performed, corresponding to how CV is traditionally employed.
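The building blocks of this pipeline can be sketched as follows (an assumed illustration, not the ROSETTA code): equal-frequency cut points for the three bins, the trapezoidal-rule AUC, and the Hanley-McNeil standard error of an AUC [5]. For brevity, the AUC routine ignores ties among certainty scores.

```python
import numpy as np

def equal_frequency_bins(values, n_bins=3):
    """Cut points for equal-frequency binning ('low'/'medium'/'high');
    returns a function mapping a value to its bin index."""
    qs = np.quantile(values, [i / n_bins for i in range(1, n_bins)])
    return lambda v: int(np.searchsorted(qs, v, side="right"))

def auc_trapezoid(certainties, labels):
    """Area under the ROC curve by the trapezoidal rule (ties ignored)."""
    order = np.argsort(-np.asarray(certainties))
    y = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (len(y) - y.sum())))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC (Hanley-McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1 - auc) + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return var ** 0.5
```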
464
A. Øhrn and J. Komorowski
Additionally, five different replications of 2-fold CV was performed, as proposed by Alpaydin [1]. The outlined procedure was done for the three different classifiers below, with identical divisions into k blocks across all classifiers. – Simple 1R: A 1R computer model, based only upon readily available clinical attributes. – Extended 1R: A 1R computer model, based upon the same attributes as the simple 1R model, but with additional access to the results of certain biochemical tests. – Physicians: A classifier realized by probability estimates given by a team of physicians, based upon the same attributes as the simple 1R classifier. Lastly, a statistical analysis comparing their differences was performed using the methods of Hanley and McNeil [6] and Alpaydin [1].
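For the later analysis, Alpaydin's combined 5x2CV F statistic [1] can be computed from the per-fold, per-replication performance differences; a minimal sketch follows. The test is usually stated for error-rate differences; using AUC differences, as in this study, is an assumption of the sketch.

```python
def alpaydin_5x2cv_f(diffs):
    """Combined 5x2CV F statistic (Alpaydin [1]).
    diffs[i][j] is the performance difference between two classifiers (here:
    the difference in AUC) on fold j of replication i, i = 0..4, j = 0..1.
    The statistic is compared against the F distribution with (10, 5) df."""
    num = sum(p ** 2 for rep in diffs for p in rep)
    den = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        den += (p1 - mean) ** 2 + (p2 - mean) ** 2
    return num / (2.0 * den)
```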
4 Experiments
The methodology outlined in Sec. 3 has been applied to a medical database with 257 patients thought to have acute appendicitis, summarized in Tab. 1. The 257 patients were referred by general practitioners to the department of surgery at a district general hospital in Norway, and were all suspected to have acute appendicitis after an initial examination in the emergency room. Attributes {a1 , . . . , a14 } are readily available clinical attributes, while attributes {a15 , . . . , a18 } are the results of biochemical tests. The outcome attribute d is the final diagnosis of acute appendicitis, and was based on histological examination of the excised appendix. After the clinical variables were recorded the physician also gave an estimate of the probability that the patient had acute appendicitis, based on these. Nine residents with two to six years of surgical training participated in the study. These estimates directly define a realization of the certainty function φ. For a detailed description of the patient group and the attribute semantics, see [3, 4].
5 Results
The mean AUC scores from the 10-fold CV simulation with mean standard errors in parentheses were 0.823 (0.089) for the physicians, 0.858 (0.083) for the simple 1R classifier, and 0.920 (0.060) for the extended 1R classifier. The same scores from the five 2-fold CV simulations were 0.818 (0.041), 0.838 (0.039) and 0.910 (0.030), respectively. All simulations were carried out using the ROSETTA software system [8]. On average, the extended 1R classifier seemed to perform somewhat better than both the simple 1R classifier and the team of physicians. The simple 1R classifier and the physicians seemingly perform approximately the same, with the former achieving a slightly better average score.
Diagnosing Acute Appendicitis with Very Simple Classification Rules Attribute a1 AGE a2 SEX a3 DURATION a4 ANOREXIA a5 NAUSEA a6 PREVIOUS a7 MOVEMENT a8 COUGHING a9 MICTUR a10 TENDRLQ a11 REBTEND a12 GUARD a13 CLASSIC a14 TEMP a15 ESR a16 CRP a17 WBC a18 NEUTRO d DIAGNOSIS
Description Age (years) Male sex? Duration of pain (hours) Anorexia? Nausea or vomiting? Previous surgery? Aggravation of pain by movement? Aggravation of pain by coughing? Normal micturation? Tenderness in right lower quadrant? Rebound tenderness in right lower quadrant? Guarding or rigidity? Classic migration of pain? Rectal temperature (◦ C) Erythrocyte sedimentation rate (mm) C-reactive protein concentration (mg/l) White blood cell count (×109 ) Neutrophil count (%) Acute appendicitis?
465
Statistics 3–86 (22) 55.3% 2-600 (22) 69.3% 70.8% 9.3% 61.5% 59.9% 87.2% 86.0% 55.3% 30.7% 49.4% 36.4–40.3 (37.7) 1–90 (10) 0–260 (12) 2.9–31 (12.1) 38–93 (80) 38.1%
Table 1: Summary of attributes recorded for the 257 patients thought to have acute appendicitis. For binary attributes, the prevalence is given. For numerical attributes, the range and median are given. It is trivial to produce a classifier that classifies the training data perfectly. Although this would be a very optimistically biased estimate, 1R rules are so simple they do not possess enough degrees of freedom to overfit the data much. Reference ROC curves obtained when applying the classifiers to the full set of 257 patients from which they were constructed are displayed in Fig. 1. The exact same set of 257 patients has been previously analyzed by Hallan et al. [3] using MLR. With a slightly different resampling scheme than the one presently employed, an MLR model based upon only the clinical attributes had a mean AUC of 0.854, while an MLR model based on both the clinical attributes and the biochemical attributes had a mean AUC of 0.920. Carlin et al. [2] have also analyzed the same set of patients, but used rough set (RS) methods. Using a similar resampling scheme as Hallan et al., the same scores were 0.850 and 0.923, respectively1 .
6 Analysis
In order to draw any trustworthy conclusions from the results in Sec. 5, a statistical analysis has been performed. The standard tool for comparing correlated AUC values is Hanley-McNeil's method [6]. However, this method is usually employed for a single two-way split only and not in a CV setting. The five 2-fold CV results have been analyzed using the method of Hanley and McNeil on a per-fold per-replication basis. Considering the median p-values, there is no significant difference between the physicians and the simple 1R classifier (p < 0.585). On the other hand, the extended 1R classifier is significantly better than both the physicians (p < 0.026) and the simple 1R classifier (p < 0.018).

[Fig. 1: Reference ROC curves (true positive rate against false positive rate). The middle solid line represents the simple 1R classifier, the top dotted line the extended 1R classifier, and the bottom dashed line the physicians. The AUC values and their standard errors of the three classifiers are 0.817 (0.029) for the physicians, 0.859 (0.026) for the simple 1R classifier, and 0.924 (0.019) for the extended 1R classifier.]

Simply averaging the p-values as done above to obtain a summary p-value does not capture any systematic variation in performance differences across folds and replications, information which is obviously of importance. There are, however, statistical tests that have been specifically designed for combining CV with detection of differences in performance. Applying the 5x2CV F test of Alpaydin [1] to the five 2-fold CV results again yields similar conclusions. There is no significant difference between the physicians and the simple 1R classifier (p < 0.613), but the extended 1R classifier is significantly better than both the physicians (p < 0.03) and the simple 1R classifier (p < 0.0003).
7 Discussion
In Sec. 1, it was argued that performing a large number of unnecessary operations was preferable to missing any cases of acute appendicitis. This corresponds to prioritizing test sensitivity before test specificity. As can be seen from Fig. 1, the simple 1R classifier and the physicians display virtually identical performance in the area of ROC space of interest, while the extended 1R classifier outperforms them both everywhere. Simulations by Holte [7] showed that the best individual 1R rules were usually able to come within a few percentage points of the error rate that more complex models can achieve, on a spread of common benchmark domains. The present study suggests that this might be true for other performance measures, too.
8 Conclusions
Based on the results in Sec. 5 and the analysis in Sec. 6, the answers to the two main questions raised in Sec. 1 are: (1) No, not significantly, at least not with a set of very simple 1R classification rules as the computer model, and (2) Yes, even with a set of very simple 1R classification rules as the computer model there is a significant improvement when biochemical attributes are additionally taken into account. It hardly seems likely that the almost identical results reported in the literature and repeated in Sec. 5 based on MLR [3, 4] or complex RS models [2] are statistically significantly different from the 1R results reported in this study. Hence, based on the principle of parsimony, a collection of very simple 1R classification rules seems like a good rule-based candidate for diagnosing acute appendicitis as measured by the area under the ROC curve.

Acknowledgments
Thanks to Stein Hallan and Arne Åsberg for sharing the appendicitis data, and to Tor-Kristian Jenssen and Ulf Carlin. This work was supported in part by grant 74467/410 from the Norwegian Research Council.
References
[1] E. Alpaydin. Combined 5x2CV F test for comparing supervised classification learning algorithms. Research Report 98-04, IDIAP, Martigny, Switzerland, May 1998. To appear in Neural Computation.
[2] U. Carlin, J. Komorowski, and A. Øhrn. Rough set analysis of patients with suspected acute appendicitis. In Proc. Seventh Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'98), pages 1528–1533, Paris, France, July 1998. ÉDK Éditions Médicales et Scientifiques.
[3] S. Hallan, A. Åsberg, and T.-H. Edna. Additional value of biochemical tests in suspected acute appendicitis. European Journal of Surgery, 163(7):533–538, July 1997.
[4] S. Hallan, A. Åsberg, and T.-H. Edna. Estimating the probability of acute appendicitis using clinical criteria of a structured record sheet: The physician against the computer. European Journal of Surgery, 163(6):427–432, June 1997.
[5] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, Apr. 1982.
[6] J. A. Hanley and B. J. McNeil. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148:839–843, Sept. 1983.
[7] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, Apr. 1993.
[8] A. Øhrn, J. Komorowski, A. Skowron, and P. Synak. Rough sets in knowledge discovery 1: Methodology and applications. Volume 18 of Studies in Fuzziness and Soft Computing, chapter 19, pages 376–399. Physica-Verlag, Heidelberg, Germany, 1998.
Rule Induction in Cascade Model Based on Sum of Squares Decomposition Takashi Okada Kwansei Gakuin University, Center for Information & Media Studies Uegahara 1-1-155, Nishinomiya 662-8501 Japan [email protected]
Abstract. A cascade model is a rule induction methodology using levelwise expansion of an itemset lattice, where the explanatory power of a rule set and its constituent rules are quantitatively expressed. The sum of squares for a categorical variable has been decomposed to within-group and between-group sum of squares, where the latter provides a good representation of the power concept in a cascade model. Using the model, we can readily derive discrimination and characteristic rules that explain as much of the sum of squares as possible. Plural rule sets are derived from the core to the outskirts of knowledge. The sum of squares criterion can be applied in any rule induction system. The cascade model was implemented as DISCAS. Its algorithms are shown and an applied example is provided for illustration purposes.
1 Introduction
The main subject of this paper is the explanatory power of a rule. Conventional rule induction systems give accuracy and coverage [1] or support and confidence [2] to a rule, but we cannot use a single measure for the explanatory power of a rule. Furthermore, we cannot know what portion of a problem has been solved by the rule. The author proposed a cascade model [3] to solve this problem in discrimination rule induction. The model expanded the itemset lattice used in association rule mining [2, 4], where each item was expressed as an [attribute: value] pair of explanation attributes. The class distribution of instances was attached to each itemset in the lattice. A link in the lattice was employed as a rule:
IF item-0 added on [item-1, item-2, ...] THEN Class = class-1
where item-0 is the item added along the link, and the other items in the LHS (left hand side) are those on the upper end of the link. The latent discrimination power of the rule was defined as the product of the number of instances and the potential difference between the nodes,

L-power(U \rightarrow L) = N(L) \cdot (potential(U) - potential(L))    (1)

where U and L denote the upper and the lower nodes, respectively. The gini-index calculated from class distributions was employed as the potential of a node.
The total power of the problem was defined as the root-node-power in (2). Then, the explanatory power of the rule was quantitatively evaluated with respect to the root-node-power.

root-node-power = N(root) \cdot potential(root)    (2)
A view of cascades appeared when nodes and links are regarded as lakes and waterfalls, respectively. The problem of finding discrimination rules was transformed into the problem of selecting a set of powerful waterfalls explaining most of the rootnode-power. The method was implemented as DISCAS software and successfully applied to an analysis of House voting records. In this paper, we give more sound definitions to the power of a waterfall and to the potential of a lake based on sum of squares (SS). The next section gives a formulation of SS decomposition in the case of a categorical variable. Section 3 discusses the rule expression. Algorithms for the selection of rules are described in Sect. 4, and the results of an application to a simple illustrative dataset are shown in Sect. 5.
2 Decomposition of Sum of Squares
It is well known that the total sum of squares (TSS) of a continuous variable can be decomposed into the sum of within-group sums of squares (WSS^g) and between-group sums of squares (BSS^g) by the following equation [5]. If we derive the same equation for a categorical variable, BSS^g will be suitable as the power definition of a link to group g.

TSS = \sum_g \left( WSS^g + BSS^g \right)    (3)
Following the description in [6], Gini showed that the SS definition for x could be transformed to (4), and that the distance definition of a categorical variable by (5) leads to the expression of SS by (6),

SS = \frac{1}{2n} \sum_a \sum_b (x_a - x_b)^2    (4)

|x_a - x_b| = 1 \;\text{if}\; x_a \neq x_b, \qquad |x_a - x_b| = 0 \;\text{if}\; x_a = x_b    (5)

SS = \frac{n}{2} \left( 1 - \sum_\alpha p(\alpha)^2 \right)    (6)
where n is the number of instances, a and b each designate an instance, and p(α) is the probability of class α. The formula in parentheses in (6) is the gini-index. This definition of SS leads to the expressions of TSS and WSS^g in (7) and (8),

TSS = \frac{n}{2} \left( 1 - \sum_\alpha p(\alpha)^2 \right)    (7)

WSS^g = \frac{n_g}{2} \left( 1 - \sum_\alpha p_g(\alpha)^2 \right)    (8)
If we give the definition (9) to BSS^g, the decomposition of TSS in (3) is shown to be valid [7]. Thus, we can employ BSS^g as the latent power of a link in cascades. Furthermore, the sample variance of a group, the quotient of WSS^g by n_g, can be considered as the potential of a node.

BSS^g = \frac{n_g}{2} \sum_\alpha \left( p_g(\alpha) - p(\alpha) \right)^2    (9)

An example of SS decomposition is shown in Fig. 1, where TSS: 160 is decomposed into BSS and WSS for groups 1 and 2. The class distribution of the lower right group is opposite to that of the upper node, and hence the link between these nodes obtains a large BSS^g value, as expected.
[Fig. 1. Example of SS decomposition. The root node holds 800/200 instances of the two classes (TSS: 160); it splits into group 1 with 760/40 instances (WSS(1): 38, BSS(1): 18) and group 2 with 40/160 instances (WSS(2): 32, BSS(2): 72). The top line at each node shows the numbers of instances for the two classes.]
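The quantities in Fig. 1 are easy to reproduce; the following sketch (our own illustration, not part of DISCAS) computes TSS, WSS and BSS for categorical class distributions and verifies the decomposition (3) on the Fig. 1 numbers.

```python
def tss(counts):
    """Sum of squares of a categorical variable, eqs. (6)/(7): n/2 (1 - sum p^2).
    Applied to a group's distribution this is also its WSS, eq. (8)."""
    n = sum(counts)
    return n / 2.0 * (1.0 - sum((c / n) ** 2 for c in counts))

def bss(group_counts, parent_counts):
    """Between-group sum of squares of a link, eq. (9)."""
    ng, n = sum(group_counts), sum(parent_counts)
    return ng / 2.0 * sum((g / ng - p / n) ** 2
                          for g, p in zip(group_counts, parent_counts))

root = [800, 200]                 # class distribution at the upper node of Fig. 1
groups = [[760, 40], [40, 160]]   # the two lower nodes

print(tss(root))                                    # 160.0
for g in groups:
    print(bss(g, root), tss(g))                     # 18.0, 38.0 and 72.0, 32.0
# decomposition (3): TSS = sum over groups of (WSS^g + BSS^g)
print(sum(bss(g, root) + tss(g) for g in groups))   # 160.0 again
```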
[Fig. 2. Example of SS conservation. Node-0 (8/11/2, WSS: 6.0) is decomposed into node-1 (4/7/0, WSS: 2.545), which is further decomposed into node-2 (3/0/0, WSS: 0.0) and node-3 (1/7/0, WSS: 0.875). Link BSS values: node-0 to node-1: 0.121, node-1 to node-2: 1.215, node-1 to node-3: 0.456; the dotted links node-0 to node-2: 1.0 and node-0 to node-3: 0.792 show the alternative direct decomposition. Numbers of instances for the 3 classes are shown at each node.]
Another interesting example of SS decomposition is shown in Fig. 2. A part of SS at node-0 is decomposed to node-1, of which WSS is again divided to nodes-2, 3. Another possible decomposition is shown by the dotted lines, where the same SS is directly decomposed into nodes-2, 3. In both cases, summation of the relevant BSSs gives the same value: 1.792. However, addition of the BSSs between nodes -1, 2 and between nodes -0, 3 gives 2.007. This result indicates that conservation of SS is guaranteed only when BSSs are taken from the links in a tree.
SS decomposition can be applied recursively. Then, the SS at the root node of a decision tree is decomposed into the BSSs of all the links and into the WSSs of all the terminal nodes. Hence, the SS at the root node may be regarded as the total power of the problem to be solved.
3 Rule Expression
Let us suppose a rule link between nodes U and L as shown in Fig. 3. The itemset on U consists of items [A: a1], [B: b1], and the item [C: c1] is added along the link. The attributes X, Y, which do not appear in these itemsets, are called veiled attributes at the node. The items derived from veiled attributes are called veiled items. The distributions of these veiled items are attached at the right of each node.

[Fig. 3. Link, distributions and BSSs of veiled items, and the resulting rule expression. Upper node U: itemset [A: a1, B: b1], veiled item distributions [X] x1:50, x2:50; [Y] y1:50, y2:50. Lower node L: itemset [A: a1, B: b1, C: c1], distributions [X] x1:45, x2:5; [Y] y1:2, y2:48. BSS along the link: BSS(X): 8.0, BSS(Y): 10.6. Resulting rule: IF [C: c1] added on [A: a1, B: b1] THEN [X: x1, Y: y2].]
The previous section introduced the SS decomposition for the class attribute, which is a veiled attribute. However, there are no obstacles to applying the scheme to other veiled attributes. In fact, we can compute the BSS of X and Y, as shown in Fig. 3. In this example, the percentage of veiled items [X: x1] and [Y: y2] shows a sharp increase along this link, giving high BSS values to these attributes. We can say that the interactions between the added item [C: c1] and the veiled items [X: x1], [Y: y2] are large in the extension of node U. This recognition of strong interaction leads to the rule expression in the right textbox in Fig. 3. The rule expression is the same as an association rule, if we merge the items in the LHS of the rule. However, an association rule miner generates a rule by a comparison of two itemsets: [A: a1, B: b1, C: c1] and [A: a1, B: b1, C: c1, X: x1, Y: y2], whereas the current method compares two itemsets, [A: a1, B: b1] and [A: a1, B: b1, C: c1] to describe the LHS, and the items in the RHS are derived from the sample distribution of veiled items along the rule link. Selection of items in the RHS consists of two steps. First, we choose veiled attributes with BSS values that exceed a given parameter. Then, we select a value, α, of the attribute that corresponds to the maximum term in the summation of (9). If there are plural values with the same contribution, the value with the largest pg(α) is selected. All [attribute: value] pairs thus derived are written as items in the RHS.
The selection of rules based on BSS values allows recognition of negative interactions among items, that is, the percentage of instances may decrease along a link, which is also important to know.
4 Extraction of Rules
We have developed DISCAS (version 1.3) software that can detect interactions using the BSS of a class variable and of explanatory variables. The resulting rules are called discrimination rules and characteristic rules, respectively. An efficient lattice generation algorithm appears separately [8]. Here we show two key concepts in the extraction of rules.

Selection of Candidate Rule Links by Voting. We associate an instance to a link if the item added along the link is contained in the instance. An instance selects the l links with the highest BSS values among the associated links. The selected links receive a vote from this instance. Links with at least m votes are chosen as candidate rule links. Default values of l and m were set to 3 and 1. As for a BSS value, we use that of the class attribute when we generate discrimination rules. The sum of the BSS values for all veiled explanation attributes was used to generate characteristic rules.

Plural Rule Sets and Rule Selection. Rules are expressed as a group of rule sets. The rules in a rule set are selected so that their supporting instances cover all instances. We have employed a greedy algorithm to extract rules from candidate links, as sketched below. That is, the link with the highest BSS value is selected as the first rule in the first rule set. Its supporting instances are deleted from the support of the remaining candidate links. This process is successively applied to the candidates, until there are no candidates with supporting instances. The rules selected constitute the first rule set. After a rule set is created, we recover the initial supporting instances for the remaining candidate links, and creation of the next rule set proceeds as for the first. The rule induction process is complete when all the candidates are expressed as rules.
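The greedy construction of plural rule sets can be sketched as follows; this is an assumed reconstruction of the selection loop, not the DISCAS source, and the candidate representation (name, BSS value, support set) is our own.

```python
def build_rule_sets(candidates):
    """Greedy construction of plural rule sets. `candidates` is a list of
    (name, bss, support) triples; support is the set of supporting instance
    ids, bss is the class BSS (discrimination rules) or the summed attribute
    BSS (characteristic rules)."""
    rule_sets = []
    pool = [c for c in candidates if c[2]]          # keep links with support
    while pool:
        current = {name: set(sup) for name, _, sup in pool}
        bss_of = {name: b for name, b, _ in pool}
        rule_set = []
        while any(current.values()):
            # highest-BSS candidate that still has uncovered supporting instances
            best = max((n for n in current if current[n]), key=lambda n: bss_of[n])
            rule_set.append(best)
            covered = set(current[best])
            for n in current:                       # delete covered instances
                current[n] -= covered
        rule_sets.append(rule_set)
        # restore original supports for the remaining candidates and repeat
        pool = [c for c in pool if c[0] not in rule_set]
    return rule_sets
```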
5 Application to Cars Dataset
DISCAS has been applied to a simple cars dataset to illustrate the capability of the cascade model. This data set consists of only 21 instances and 10 categorical attributes [9]. The problem is to predict the mileage of cars from the other 9 attributes. The entire lattice was created except the itemsets containing mileage attributes. Default values are used for the selection of rule links. The first rule sets for discrimination and characteristic rules are shown in Tables 1 and 2, respectively. Discrimination Rules. The eight rules in Table 1 are ordered according to their BSS values. For example, we should read the fifth rule as follows,
IF [cyl: 6] added on [fuelsys: EFI, trans: manual] THEN [mileage: high].
Vowels are omitted from attribute names in the table. The number of supporting instances and the confidence levels are 11 (54.5%) and 4 (0%) on the upper and the lower nodes, respectively. The three BSS values have the following meanings. The L-BSS (latent BSS) column shows the BSS values defined in Sect. 2. Of the supporting instances for the fifth rule, two have already been used for the third and the fourth rules, and hence the number of effective instances remaining is 2. The E-BSS (effective BSS) column shows the effective part of the L-BSS. The R-BSS (BSS from root node) column denotes the BSS of a link from the root node to the virtual node consisting of these effective instances. We can sum R-BSS in the rule set, and compare the value with the SS at the root node. The RHS of this rule should be read as [mileage: not high], because its confidence level decreases to 0.0%.

Table 1. Discrimination rules from cars example (see text for explanation of terms used)

No  Item added      Items on upper node              RHS           confidence (%)  support   L-BSS  E-BSS  R-BSS
1   sz: subcompact  null                             mlg: high     38.1 : 100      21 : 6    2.00   2.00   2.00
2   wght: heavy     null                             mlg: low      9.5 : 100       21 : 2    1.24   1.24   1.24
3   cyl: 6          flsys: EFI; cmp: high            mlg: medium   28.6 : 100      7 : 2     1.02   1.02   0.38
4   sz: compact     cyl: 4; trb: no                  mlg: medium   45.5 : 83.3     11 : 6    0.86   0.86   0.45
5   cyl: 6          flsys: EFI; trns: manual         mlg: high     54.5 : 0.0      11 : 4    0.94   0.47   0.38
6   flsys: EFI      sz: compact; cyl: 4; cmp: high   mlg: high     40.0 : 100      5 : 2     0.72   0.36   0.33
7   trns: auto      flsys: EFI                       mlg: medium   50.0 : 100      14 : 3    0.66   0.22   0.19
8   cmp: medium     cyl: 4; flsys: EFI               mlg: medium   25.0 : 66.7     8 : 3     0.52   0.17   0.19
Table 2. Characteristic rules from cars example (see text for explanation of terms used)

No  item added      items on upper node  RHS             confidence (%)  support   L-BSS  R-BSS  sum
1   dsplc: small    null                 cyl: 4          66.7 : 100      21 : 9    1.00   1.00   7.27
                                         cmp: high       52.4 : 88.9               1.20   1.20
2   pwr: high       trb: no              cyl: 6          35.3 : 80.0     17 : 5    1.00   1.09   2.57
                                         dsplc: medium   52.9 : 100                1.11   0.92
3   dsplc: medium   null                 -               -               21 : 12   -      -      5.14
The SS of the mileage at the root node is 6.0, while the sum of R-BSS and E-BSS of the rules in the rule set are 5.17 and 6.34, respectively. The difference (0.83) between the sum of R-BSS and SS at the root node is equal to the WSS of the lower node of rule-4. The top five rules in this rule set cover 18 out of all 21 instances, and their R-BSS explains 74% of the SS at the root node. We do not need to worry about numerous rules, because we can stop rule inspection when the sum of the BSS values reaches close to the SS at the root node. If our interest in a discrimination mechanism is from a different viewpoint, we can browse rules with high latent BSS values in the succeeding rule sets. Characteristic Rules. The first rule set among the explanatory attributes consists of the three rules in Table 2. The RHS column shows the veiled items where L-BSS exceeds 1.0. The column at the right end shows the sum of the R-BSS values of all items except the class attribute: mileage. In rule-3, there are no items with high LBSS. The sum of all R-BSS in this rule set is 15.0. We can say that 36% of the SS at the root node (41.81) is covered by these 3 links. This suggests a moderate correlation among the 9 explanatory attributes. On the other hand, the sum of R-BSS values for the class variable: mileage explains only 20% of its SS at the root node in these 3 rules. If our object is the segmentation of cars, these 3 links will lead to good starting clusters, but they are not necessarily good for mileage prediction.
6 Related Works
In statistics, considerable effort has been directed to detecting interactions among variables [10] and to generating decision trees [1]. Many clustering methods use the sum of squares criterion for continuous variables [5]. However, there has been no attempt to use the sum of squares definition of categorical variables as a criterion in tree formation or in clustering. Levelwise construction of lattices is now an accepted method [2, 4], and there have been attempts to extract classification rules using association rule miners [11, 12, 13]. Many measures to select rules have already been discussed [14]. Among them, our method has common conceptual points with maximal guess [15] and exceptional knowledge discovery [16]. Detection of interactions using the χ2 statistic is especially relevant to the current work [17]. However, no research to date has recognized the importance of SS.
7 Concluding Remarks
Decomposition of SS has been proposed to give a criterion of rule power in the cascade model. The DISCAS system has been applied to the induction of discrimination and characteristic rules, giving satisfactory results. The method is
expected to be a powerful tool for data analysis. Further, we have shown that the RHS of a rule can be found from the distribution of veiled items at the LHS nodes. This finding has led to an efficient pruning methodology [8]. The algorithms employed are subject to change, but the SS criterion can easily be adapted to other systems and will provide a useful measure. SS is one of the core concepts in statistics, so bridging between statistics and rule induction is expected.

Acknowledgements
The author would like to thank Prof. Sugihara, Prof. Oyama and Dr. Jimichi of Kwansei Gakuin University for their encouragement and their valuable discussions about statistics.
References 1. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth, Belmont CA (1984) 2. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD (1993) 207-216 3. Okada, T.: Finding Discrimination Rules Based on Cascade Model. submitted to J. Jpn. Soc. Artificial Intelligence (1999) 4. Agarawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. Proc. VLDB (1994) 487-499 5. Takeuchi, K. et al. (eds.): Encyclopedia of Statistics (in Japanese). Toyo Keizai Shinpou (1990) 388-390 6. Light, R.J., Margolin, B.H.: An Analysis of Variance for Categorical Data. J. Amer. Stat. Assoc. 66 (1971) 534-544 7. Okada, T.: Sum of Squares Decomposition for Categorical Data. to appear in Kwansei Gakuin Studies in Computer Science 14 (1999). http://www.media.kwansei.ac.jp/home/ kiyou/kiyou99/kiyou99-e.html 8. Okada, T.: Efficient Detection of Local Interactions in Cascade Model. submitted to Discovery Science 99 9. Ziarko, W.: The Discovery, Analysis, and Representation of Data Dependencies in Databases. In: Piatetsky-Shapiro, G., Frawley, W. (eds.): Knowledge Discovery in Databases. AAAI Press (1991) 195-209 10. Hawkins, D.M., Kass, G.V.: Automatic Interaction Detection. In: Hawkins D.M. (ed.): Topics in Applied Multivariate Analysis. Cambridge Univ. Press (1982) 269-302 11. Ali, K., Manganaris, S., Srikant, R.: Partial Classification using Association Rules. Proc. KDD-97 (1997) 115-118 12. Bayardo, R.J.: Brute-Force Mining of High-Confidence Classification Rules. Proc. KDD97 (1997) 123-126 13. Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. Proc. KDD-98 (1998) 80-86 14. Silberschatz, A. and Tuzhilin, A.: What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Trans. on Know. and Data Eng. 8(6) (1996) 970-974 15. Washio, T., Matsuura, H., Motoda, H.: Mining Association Rules for Estimation and Prediction. In: Wu, X., Kotagiri, R., Korb, K.B. (eds.) Research and Development in Knowledge Discovery and Data Mining. Lecture Notes in A.I. Vol. 1394. Springer-Verlag (1998) 417-419 16. Suzuki, E., Shimura, M.: Exceptional Knowledge Discovery in Databases based on Information Theory. Proc. KDD-96 (1996) 275-278 17. Silverstein, C., Brin, S., Motwani, R.: Beyond Market Baskets: Generalizing Association Rules to Dependence Rules. Data Mining and Knowledge Discovery, 2 (1998) 39-68
Maintenance of Discovered Knowledge
Michal Pěchouček, Olga Štěpánková and Petr Mikšovský
{pechouc,step,miksovsp}@labe.felk.cvut.cz
The Gerstner Laboratory for Decision Making and Control
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Technická 2, CZ 166 27, Prague 6
Abstract. The paper addresses the well-known bottleneck of knowledge based system design and implementation – the issue of knowledge maintenance and knowledge evolution throughout lifecycle of the system. Different machine learning methodologies can support necessary knowledge-base revision. This process has to be studied along two independent dimensions. The first one is concerned with complexity of the revision process itself, while the second one evaluates the quality of decision-making corresponding to the revised knowledge base. The presented case study is an attempt to analyse the relevant questions for a specific problem of industrial configuration of TV transmitters. Inductive Logic Programming (ILP) and Explanation Based Generalisation (EBG) within the Decision Planning (DP) knowledge representation methodology, have been studied, compared, and tested on this example.
1
Introduction
Configuration, as a specific kind of decision making, is defined [9] as a problem of assembling basic elements of the considered system in such a way that internal logic constraints are not violated. We address the issue of gradual evolution of configuration meta-knowledge that forms the knowledge base of a knowledge-based system developed for a Czech producer of TV transmitters. The considered production undergoes some changes during its life cycle, e.g. due to modification of the product range. This transition has to be reflected in the knowledge base of the system, of course. Some machine learning techniques can support the corresponding evolution/revision of the knowledge base. We have applied Inductive Logic Programming (ILP) and Explanation Based Generalisation (EBG) complemented by the Decision Planning (DP) knowledge representation methodology. Our experiments and analysis show that both methodologies exhibit different behaviour with respect to the two angles introduced for the characterisation of revision processes. This behaviour should be taken into account when choosing the proper tool for supporting knowledge-base evolution. Inductive logic programming (ILP) enriched the repository of available machine learning methods in the 1990s. Its goal is to induce knowledge in the form of a general
logic program using in its body predicates from the background knowledge defined in advance. This is an obvious difference when compared to knowledge represented by a decision tree or a decision list, for example. In ILP, the language of first-order logic is used (1) to describe the training examples, (2) to express the induced knowledge, and (3) to formulate background knowledge, which is a natural part of the problem definition. Though ILP methods work with a significantly richer language than the attribute-value propositional language of the classical ML methods, both approaches share intuition and some basic paradigms. A number of different ILP systems have been developed and implemented recently [2]. Our experiments have been conducted in the systems FOIL [7] and KEPLER [10]. Decision Planning is a declarative knowledge representation methodology based on proof planning, a well-known theorem proving technique. This technique has been successfully used for formalisation of the process of industrial configuration (see [5, 6]). The meta-level inference knowledge is captured here by means of an oriented graph called a decision graph. The decision graph describes in meta-terms a decision space, a space of all legal configurations. We distinguish between the abstract level of the decision graph, where only the abstract inference knowledge (strategic knowledge) is specified, and the specific level with all the context-related pieces of knowledge incorporated. A knowledge engineer specifies the strategic knowledge within the abstract level of the decision plan and an EBG (Explanation Based Generalisation) [3, 4] machine learning algorithm deduces the specific level of the decision plan from the object-level knowledge. The basic EBG procedure starts from an attempt to prove that a concept example is positive within the complete theory the system is aware of. The constructed proof (explanation) is then generalised and appended to the theory so that a new, better definition of the concept is learned. We have thoroughly tested both methodologies for creating and maintaining the inference meta-knowledge to be used in the process of configuration [8]. Meta-knowledge is supposed to embody original domain knowledge from the field. To generate this meta-knowledge, we have analysed a set of all positive examples of legal variations with the aim of finding knowledge that will cover exactly the given set of legal variations. This set remains fixed for a certain period. Machine learning techniques are used to elicit the relevant knowledge. The learned knowledge has to be sound and complete with respect to the present set of all legal variants representing a noise-free training set. From time to time, the set of all legal variants is extended, while the task domain remains fixed. This is a signal for the knowledge revision/update - a central point of this case study.
2
Meta-knowledge Lifecycle
Suppose that the original theory T covers a set of all positive examples E. Let the set of positive examples be extended by a new set S such that E ∩ S = ∅. The theory T has to be updated/revised in such a way that the resulting theory T1 covers E ∪ S. To do so, two alternative approaches can be taken:
STRONG UPDATE re-computes the entire inference knowledge base, considering a new, updated field theory (i.e. the current field theory enriched by the recent change) as the knowledge induction parameter. A strong update is the result of applying the knowledge extraction process to the full set E ∪ S. WEAK UPDATE re-computes just the relevant parts of the inference knowledge base in order to ensure that the result remains a sound and complete representation of the updated field theory. Most often it is much simpler to generate a weak update than a strong one. On the other hand, it is not clear what efficiency the system using the changed (weakly updated) knowledge base will exhibit. A weak update patches the local requirements of the decision space but it does not consider its overall shape. That is why a weak update is computationally less expensive, but the resulting space of configuration-like meta-level knowledge can be expected to be quite messy. Consequently, inefficient consultations can often appear in such a space. When deciding between a strong and a weak update, the efficiency of the resulting system has to be considered simultaneously with the requirements for creating the updated system. This last criterion should meet the demands of the dynamics of the application domain, i.e. the estimated frequency of updating and its effect on the quality of the decision space described by the inference knowledge. When the revision is not required very often and a strong update is not extremely time consuming, then the strong update is generally recommended. When the field theory changes frequently, one has to bear in mind that we might be lacking resources or time for the strong update. In order to illustrate this we may view the problem from a purely AI point of view. Let us consider a simplified meta-knowledge decision space in the form of a decision tree, where each product configuration is a branch from the root of the tree to a certain leaf node (goal). Here, a weak update means appending a simple branch to the root of the tree. By doing so, the number of nodes considered as well as the branching factor of the new state space increases. In this way the knowledge hidden within the original decision tree loses its significance. Using a tree with an increased branching factor or with more decision nodes results in higher memory requirements; moreover, it makes the process of state space search slower and more difficult. That is why we try to update the decision space in such a way that (1) the time needed and (2) the increase of the state space size and of its branching factor are as small as possible. Neither a pure weak update nor a pure strong update guarantees optimal knowledge base maintenance. Rather than implementing a 'middling' update we seek a compromise by playing with the frequencies of both weak and strong updates. If the weak update does not result in a significant decrease of efficiency, the strong update can be done only occasionally (after several steps of weak updates). The need for strong updates and their frequency has to be determined with respect to the conditions of the application domain. We will compare experimental results concerning revision processes in the DP environment with the theoretical estimates computed for Prolog programs resulting from the ILP analysis. These experiments try to answer the following questions: "Which knowledge is easier to maintain?
Is it knowledge the system formulated automatically (this is what ILP does), or knowledge elicited and expressed by a knowledge expert (the case of DP)? What is the efficiency of the system after a sequence of weak updates?"
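As a rough illustration of the contrast (not the authors' implementation), the following Python sketch treats the inference knowledge base as an ordered list of rules; the function induce is a hypothetical stand-in for the actual ILP or EBG learner.

```python
# Illustrative sketch only: the knowledge base is an ordered list of rules,
# and `induce` is a hypothetical placeholder for the ILP/EBG learner.

def strong_update(induce, covered_examples, new_examples):
    """Re-induce the whole inference knowledge base from the full set E u S."""
    return induce(covered_examples + new_examples)

def weak_update(rules, new_examples):
    """Patch the knowledge base locally: each new positive example is simply
    added as an exception in front of the existing rules."""
    exceptions = [("exception", example) for example in new_examples]
    return exceptions + rules   # consultation tries the exceptions first
```

The weak update is a constant-time prepend per example, but every exception lengthens later consultations – exactly the trade-off analysed in the rest of this section.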
2.1
Inductive Logic Programming Approach
ILP is applied in this context to induce inference meta-knowledge from object-level data. Our ILP analysis starts from pure data without any expert inference knowledge. The set of positive examples represents all legal variants (valid combinations) of input transmitter parameters to be covered by induced rules representing the meta-knowledge. The negative examples are created using the closed-world assumption. The induced set of rules should provide a significant compression of the positive examples, i.e. it should be as small as possible. The rules represent the inference knowledge needed during the transmitter configuration consultation. Their utilisation is straightforward – any question concerning the legality of a certain variant of a product is answered using a standard Prolog query. Moreover, these rules are further analysed to optimise the number of questions that have to be asked during the process of configuration. The induced Prolog program is used whenever the question about the legality of a specific variant arises. We know the exact number of different relevant queries that can appear. Let us denote this number N – it is the size of the considered task domain. This number is constant until a new possible value for one of the attributes is added (its domain is extended). Let us concentrate on the case when all attributes have fixed domains. Then N is fixed and any member of the considered space can become an issue. That is why we are interested in the average-case complexity of the program measured in terms of the average number of attempted unifications. Let us try to predict the change of this complexity during the process of gradual weak updates when each new positive example is added as an exception in front of the actual program. Let us start with the program P induced from positive examples E+ and negative examples E− (both sets E+ and E− together have just N elements). The program P works with average complexity p, it has d rules and its structure is not recursive. At the first step, one single example f is shifted from the set E− into the set E+. What will be the complexity of the program P(1) that results from the weak update of P after this single step? Let us denote this complexity by C(p,1). The program P(1) will generate the answer for most members of the task domain attempting just one more unification than the original program P. The only exception will be the single example f – this will be answered in a single unification. Thus we can estimate the upper limit for the considered update as follows:

C(p,1) ≤ (Np − d + (N − 1) + 1)/N = p + 1 − d/N                    (1)
By induction we can prove that after a sequence of k gradual weak updates (each removing a single example from E−) the complexity of the resulting program P(k) is

C(p,k) ≤ (1 + 2 + ... + k + Np − kd + (N − k)k)/N = p + k − (k² − k + 2kd)/(2N)                    (2)
This bound is strongly influenced by the compression coefficient d/N, which is in our case much smaller than 1/5. For any k < N/2 in this range it holds that the degradation of performance of P(k) lies within the interval ⟨k/2, k⟩, and the increase of the average-case complexity after a sequence of k weak revisions is linear in k.
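As a quick numerical illustration of this bound, equation (2) can be evaluated directly; N and d below echo the 2658-example / 42-rule case reported later in this section, while the average complexity p = 40 is an invented placeholder.

```python
# Evaluate the upper bound (2) on the average-case complexity after k weak
# updates. N and d echo the 2658-example / 42-rule case quoted in the paper;
# p = 40 attempted unifications is an invented placeholder value.

def weak_update_bound(p, d, N, k):
    """C(p, k) <= p + k - (k**2 - k + 2*k*d) / (2*N)."""
    return p + k - (k**2 - k + 2 * k * d) / (2 * N)

N, d, p = 2658, 42, 40.0
for k in (1, 10, 50, 100):
    degradation = weak_update_bound(p, d, N, k) - p
    print(k, round(degradation, 2))   # stays inside <k/2, k> for k < N/2
```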
Assertion 1: Degradation of performance after k weak updates is acceptable provided k is a fraction of the average complexity p of the original program. The induced rules are not used only to compress the set of valid variants of transmitters. A specific algorithm was implemented to present a valid process of configuration consultation using the induced rules. For implementation reasons the task has been "inverted" and the positive and negative examples were exchanged, so that no restrictions are placed on the configuration parameters at first. During consultation the state space of possible variants is cut with respect to disabled combinations of parameters and engaged values. The length of a configuration depends on the process of consultation, as the algorithm skips all decision nodes offering a single option only. Experiments verified that the time complexity of the algorithm is linear with respect to the number of rules used within the consultation. This is why the minimal number of rules is desired. The maintenance of inference meta-knowledge is carried out by applying the strong or weak update strategies. Whereas the weak update strategy can be applied as a fast strategy, the time complexity of the corresponding configuration consultation algorithm grows inevitably. Though this degradation can be characterised as graceful, see equations (1) and (2), it can eventually exceed acceptable limits. Moreover, the weak update strategy is applicable only if we need to add a new legal variant within the fixed task domain. In the opposite case the strong update is inevitable. The strong update is performed by the ILP analysis, which induces new rules from the entire data set. This strategy minimises the number of rules to be used within consultation but it is more time consuming. The strong update can significantly improve the efficiency of the considered system. This is best visible on an extreme case of a strong update applied to 2658 examples and resulting in 42 rules [8]. (Figure 1: ILP knowledge evolution lifecycle.) Since the time complexity is proportional to the size of the program, we can claim that the complexity has been decreased and the efficiency improved about 63 times due to the strong update. We have failed to produce a rigorous time requirement analysis for the strong update (it took about 2 minutes). According to our observations, the time necessary for a strong update depends on the internal similarity of the examples. It took considerably more time to analyse a small set of utterly unrelated examples than a bigger set of similar clusters of examples. The point we are trying to make here is that not only the frequency of strong versus weak updates matters. It is highly recommended to structure the incoming examples in clusters with consideration of some internal logic. This is the case with real data, as the requirement for a strong update usually arises when a new product with a number of variations is introduced.
2.2
Decision Planning Solution
Decision planning takes a constant number of steps for any consultation within the fixed task domain. This is due to the topology of the decision plan (the abstract decision
plan), which remains fixed in such a case. No matter how many examples are covered, the user has to be taken through an identical decision path, i.e. through the same number of decision nodes. The requested time does not in practice depend on the cardinality of the set of field theory examples, which represent the set of all positive examples. Checking the preconditions of a decision node can become more time consuming, but a well-designed abstract decision plan minimises the extent to which the time requirements grow with an increasing number of covered examples. The nature of the decision planning knowledge representation methodology is such that it does not offer any means for a strong update. The system knowledge base can be either directly supplied with inference knowledge specified by the user, or step-by-step induced from the field theory using the Explanation Based Generalisation (EBG) technique, which cares for weak updates. The lifecycle is then a sequence of weak updates. To ensure the functionality of a strong update (Fig. 2), some processes of decision space filtering have been implemented, aimed at reducing the number of decision nodes and the branching factor. We distinguish between the following two filters:
CONJUNCTIVE FILTER clusters decision nodes with the same effect; they get unified through the conjunction of the corresponding preconditions.
FRAME FILTER checks the relevancy of each of the preconditions in order to avoid the frame problem ([1]). The algorithm has to be provided with the entire possible range of values each attribute can take. Possible clusters of decision nodes having the same effect, and preconditions describing the entire discourse, are eliminated.
(Figure 2: Decision planning knowledge evolution lifecycle.)
The time needed to learn a single example depends on the structure of the actual decision graph (compared to the example to be accepted) and on the number of nodes to be parsed. The first factor makes an example more difficult to learn at the beginning of the learning process, and the other makes learning more time consuming with an increasing number of accepted examples. Apart from the first 30 examples, the time needed for learning is linearly proportional to the size of the decision space for our data set. When filtering the decision space the nodes have to be compared one to another (to put it simply), thus the complexity is almost proportional to the square of the number of nodes. The space of decision nodes does not grow proportionally with the increasing number of positive examples. It saturates at a certain point, and consequently filtering takes just a certain time (seconds) but does not grow up to infinity. Experiments showed that the best practice in keeping the knowledge base consistent with respect to the frequency of updates corresponds to the rhythm in which the object-level data come. As each new piece of product usually has about 30 new variants (4 – 6 new attributes), we were filtering the knowledge base after each 32 arrivals of new object-level data. This mechanism provides the knowledge base with filtered data ready for consultation whereas it minimises the time needed for knowledge induction. The solid line in Fig. 3 gives the size of the decision space after several tens of weak update iterations (in terms of the number of decision nodes). The effect of the proposed filtering can be seen on the dashed line in Fig. 3.
(Figure 3: Positive accepted examples / number of nodes in the decision space.)
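Purely as an illustration (not the authors' DP implementation), the two filters can be sketched in Python under the assumption that a decision node is a pair (precondition, effect), with the precondition represented as a set of (attribute, value) literals.

```python
from collections import defaultdict

def conjunctive_filter(nodes):
    """Cluster decision nodes with the same effect and unify them through
    the conjunction (set union of literals) of their preconditions."""
    by_effect = defaultdict(list)
    for precondition, effect in nodes:
        by_effect[effect].append(precondition)
    return [(frozenset().union(*preconditions), effect)
            for effect, preconditions in by_effect.items()]

def frame_filter(nodes, domains):
    """Drop precondition literals that enumerate an attribute's entire value
    range - they describe the whole discourse and carry no information."""
    cleaned = []
    for precondition, effect in nodes:
        seen = defaultdict(set)
        for attribute, value in precondition:
            seen[attribute].add(value)
        kept = {(a, v) for a, v in precondition if seen[a] != domains[a]}
        cleaned.append((frozenset(kept), effect))
    return cleaned
```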
3
Conclusion and Future Work
The issue of knowledge maintenance is of immense importance throughout the entire lifecycle of a knowledge-based system. Since a general methodology for updating knowledge is missing, we have compared two machine learning methodologies for addressing this issue – Inductive Logic Programming (ILP) and Explanation Based Generalisation (EBG) within the framework of Decision Planning (DP). The two approaches differ in a number of aspects. The ILP approach generates a logic program, which can be directly used by the reasoning mechanism that mimics the expert's configuration process using standard methods of logic programming. In the case of DP the configuration mechanism is more elaborate: graphs of interrelated decision nodes (decision graphs) on various levels of specificity are parsed, each representing either an action or another lower-level graph. Originally, we suggested the ILP approach as an alternative for intelligent maintenance of the knowledge base through a strong update, which corresponds to an ILP analysis of all positive examples. Later, the weak update proved to be simple to achieve due to the knowledge representation used: the new positive example is added as an exception in front of the system's base of rules. This takes just one single step, and the system using the rule base after a sequence of weak update steps exhibits an acceptable decrease of efficiency (a linear function of the number of consecutive weak updates). Moreover, we have found a simple method (Assertion 1) to estimate the frequency of strong updates in dependence on the requested efficiency of our target KB system. In contrast, the central point of the decision planning approach is the weak update achieved by EBG. EBG updates the decision graph so that a new example is accepted. As this example is incorporated within the already existing decision space structure, the extent to which its complexity increases is minimised. The time needed for the weak update here is considerably bigger than in the case of the ILP approach. Decision planning implements the strong update by means of filters that reorganise the decision graphs in order to maintain the efficiency of the knowledge representation. As decision planning requires strategic inference knowledge to be acquired from the user in order to formalise the initial decision graph, its utilisation is envisaged mainly in areas where human expertise is available. The DP methodology then ensures that the decision graph stays consistent with the evolving task domain. Our experience
verified that ILP is good at digging inference knowledge out of the domain data and that it is well suited even for areas with no or little pre-specified human expertise. The efficiency of the resulting logic program can be estimated and used to find a proper balance between weak and strong updates. This makes ILP a robust and reliable mechanism for the maintenance of domain knowledge throughout the life cycle of the system. The suggested properties of revision processes should be studied more carefully since they have to be taken into account whenever choosing an appropriate knowledge representation and knowledge extraction process for a dynamic application. Even here decisions have to be made concerning the frequency with which the discovered theory has to be recomputed using a strong update instead of applying only a weak update of the discovered theory. It is important to choose a frequency of strong updates which best suits the dynamics of the system and which takes into account the time and memory demands of the strong update. Finding a balance between the efficiency of a weakly updated knowledge base and the result of a time-consuming strong update belongs to the key problems within this area and deserves the development of a rigorous methodology. Our case study shows that it is worth considering ILP tools in this context. One of the advantages they offer is that some useful upper estimates can be derived for the efficiency of the resulting programs. Acknowledgements.
The research has been partially supported by the Esprit Project 20237 ILP2 and by INCO 977102 ILPNET2. References 1. Davis, E.: Representation of Common-sense Knowledge, Morgan Kaufmann Publishers, Inc., California, 1990 2. Lavrač, N. et al.: ILPNET repositories on WWW: Inductive Logic Programming Systems, Datasets and Bibliography, AI Communications, Vol. 9, No. 4, 1996, pp. 157-206 3. Kodratoff, Y.: Machine Learning, In: Adeli, H. (ed.): Knowledge Engineering, McGraw-Hill Publishing Company, USA, 1990 4. Michalski, R., Kaufmann, K.: Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach, In: Machine Learning and Data Mining: Methods and Applications (Michalski, Bratko, and Kubat, Eds.), John Wiley & Sons Ltd, 1997 5. Pěchouček, M.: Advanced Planning Techniques for Project Oriented Production, Ph.D. Dissertation, Prague, 1998 6. Pěchouček, M.: Meta-level structures for industrial decision making, In: Proceedings of the Second International Conference on Practical Application of Knowledge Management, London, April 1999 7. Quinlan, J.R., Cameron-Jones, R.M.: Induction of logic programs: FOIL and related systems, New Generation Computing, Special Issue on ILP, 13 (3-4), 287-312, 1995 8. Štěpánková, O., Pěchouček, M., Mikšovský, P.: ILP for Industrial Configuration, Tech. Rep. of Gerstner Laboratory GL-74/98 9. Tansley, D.S.W. and Hayball, C.C.: Knowledge-based Systems Analysis and Design, Prentice-Hall, 1993 10. Wrobel, S., Wettschereck, D., Sommer, E., Emde, W.: Extensibility in Data Mining Systems, In: Proceedings of the 2nd International Conference on KDD, Menlo Park, CA, USA, August 1996, pp. 214-219
A Divisive Initialisation Method for Clustering Algorithms Clara Pizzuti, Domenico Talia, and Giorgio Vonella ISI-CNR c/o DEIS, Univ. della Calabria 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it
Abstract. A method for the initialisation step of clustering algorithms is presented. It is based on the concept of cluster as a high density region of points. The search space is modelled as a set of d-dimensional cells. A sample of points is chosen and located into the appropriate cells. Cells are iteratively split as the number of points they receive increases. The regions of the search space having a higher density of points are considered good candidates to contain the true centers of the clusters. Preliminary experimental results show the good quality of the estimated centroids with respect to the random choice of points. The accuracy of the clusters obtained by running the K-Means algorithm with the two different initialisation techniques - random starting centers chosen uniformly on the datasets and centers found by our method - is evaluated and the better outcome of the K-Means by using our initialisation method is shown.
1
Introduction
Clustering is an unsupervised data analysis technique which seeks to separate data items (objects) having similar characteristics into constituent groups. The importance of this technique has been recognized in diverse fields such as sociology, biology, statistics, artificial intelligence and information retrieval. In the last few years clustering has been identified as one of the main data mining tasks [7]. Many of the proposed definitions of what constitutes a cluster are based on the concepts of similarity and distance among the objects. A more general approach consists in considering a data item as a point in a d-dimensional space, with each of the d variables being represented by one of the axes of this space. The variable values of each data item thus define a d-dimensional coordinate in this space. Clusters can be described as continuous regions of this space with a relatively high density of points, separated from other clusters by regions with a relatively low density of points. Clusters described in this way directly reflect the process we use to detect clusters visually in two or three dimensions and lead us to see clustering as a density estimation problem. Density estimation has been addressed in statistics and has been recognized as a challenging problem [14].
Several approaches have been defined for data clustering. Among them the partitioning methods produce a partition of a set of objects into K clusters, where the number K of groups is given a priori. The performance of these methods is strongly influenced by the choice of the initial clusters. The nearer the initial starting points are to the true centers of the clusters, the better the quality of the clustering is and the faster the method converges. In this paper a divisive method for the initialisation step of clustering algorithms is presented. The method provides an estimation of the centroids of clusters. It is based on the concept of cluster as a high density region of points and on the intuition that subsampling the data will naturally bias the sample to representatives near the more dense regions [3,8]. The search space is modelled as a set of d-dimensional cells. A sample of points is chosen and located into the appropriate cells. Cells are iteratively split as the number of points assigned to them increases. In such a way the attention is focused on those regions of the search space having a higher density of points, which are thus candidates to contain the true centers of the clusters. Preliminary experimental results show the good quality of the estimated centroids with respect to a random choice of points. Furthermore, the accuracy of the clusters obtained by running the K-Means algorithm with the two different initialisations - random starting centers chosen uniformly on the datasets and centers found by our method - is evaluated. The results point out the better performance of K-Means when estimated centroids are provided by our method. The paper is organized as follows: section 2 gives a brief survey of clustering methods; section 3 presents our divisive method to estimate centroids; section 4 shows the effectiveness of the proposed approach on a set of synthetic data.
2
Background and Related Work
Several algorithms for clustering data have been proposed [1,5,11,13,15]. In [10] an extensive overview of clustering algorithms is presented. One of the approaches to clustering is the partitional approach. A partitional clustering algorithm produces a partition of a set of objects into K clusters by optimizing a criterion function. Most of these approaches assume that the number K of groups has been given a priori. They can be formulated as three main steps: 1) initialisation of clusters, 2) allocation of the objects to clusters, 3) reallocation of objects to other clusters after the initial grouping process has been completed. The differences between the methods mainly lie in (1) and (3). The techniques for initialisation of clusters find K points in the d-dimensional space and use them as initial estimates of cluster centers. The choice of these points is an important aspect of this approach since it strongly biases the performance of the methods [10]. Among the partitioning methods, the direct K-Means clustering [12,1] is a well-known and effective method for many practical applications. In the direct K-Means clustering the n data points must be partitioned into C1, ..., Ck clusters, where each cluster Cj contains nj data points and each data point belongs to only one cluster. The criterion function adopted in this technique is the
squared error criterion. The centroid of the cluster Cj is defined as the mean of the objects it contains. The square-error e_j² of each partition Cj is the sum of the squared Euclidean distances between each data point in Cj and its centroid. The square-error of the partition is the sum of the square errors of all partitions, E = Σ_{j=1}^{K} e_j². The smaller the error, the better the quality of the partitioning. The direct K-Means algorithm starts by randomly choosing the initial centroids and then by assigning each data point to the cluster Cj with the nearest centroid. The new centroids and the square-error E are computed and the method stops either when E does not change significantly or when no movement of points from one cluster to another occurs. A good choice of the initial centroids, sufficiently near to the actual centroids, can drastically reduce the number of steps needed by the K-Means to converge and can avoid the problem of empty clusters due to the choice of starting points far from the centroids. In the literature there are few proposals on initialisation methods [6,2,3,8]. The most recent is the one proposed by Bradley and Fayyad in [3], where they present an algorithm for refining an initial starting condition. The method chooses a given number of subsamples from the dataset and groups them by using the K-Means. Each subsample gives a solution CM_i. The set of such solutions is then clustered via the K-Means initialised with CM_i, giving a solution FM_i. Finally, the FM_i having the minimal squared error is chosen as the refined initial point. In the next section we present an initialisation method which, analogously to [3], works on a sample of data, but exploits the concept of density to focus attention on those parts of the search space that contain a high frequency count of the sampled points. These parts are iteratively divided into smaller subparts and assumed as possible candidates to contain the true centers.
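For reference, a minimal NumPy sketch (ours, not the paper's) of the direct K-Means loop with explicitly supplied starting centroids; any initialisation method — a random draw or the divisive procedure of the next section — only changes the centroids argument.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-6):
    """Direct K-Means: assign each point to the nearest centroid, recompute
    centroids as cluster means, stop when the square-error E stabilises."""
    centroids = np.asarray(centroids, dtype=float).copy()
    E_prev = np.inf
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        E = d2[np.arange(len(X)), labels].sum()   # squared-error criterion
        for j in range(len(centroids)):           # update step
            members = X[labels == j]
            if len(members):                      # empty clusters keep their centroid
                centroids[j] = members.mean(axis=0)
        if abs(E_prev - E) < tol:
            break
        E_prev = E
    return centroids, labels, E
```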
3
The Method
We now present a general method for providing a good estimation of the centroids of clusters for clustering algorithms. The method exploits the concept of density in the search space, by focusing the attention on those regions having a high density of data points. Given the dataset D of n data points X_i in d dimensions, a cell I_D ⊂ R^d containing all the points of D is defined in the following way:

I_D = [X_min, X_max] = { X_i ∈ D | X_min^(j) ≤ X_i^(j) ≤ X_max^(j), 1 ≤ j ≤ d }

where X_min^(j) = min{ X_i^(j), 1 ≤ i ≤ n } and X_max^(j) = max{ X_i^(j), 1 ≤ i ≤ n }. The search space can thus be represented by I_D and the problem of finding K centroids can be formulated as the problem of finding K subcells of I_D having the highest density of points. The Divisive Initialisation algorithm, shown in Figure 1, initially partitions the cell I_D into 2^d subcells by dividing each side of I_D into two equal parts. It takes as input the data points D, the number J of different subsamplings to do, the frequency F with which the cells must be considered for splitting, and the number K of clusters.
Fig. 1. Divisive method and Centroids found for each subsample.
It randomly chooses J subsamples of the data points D, S_i, i = 1, ..., J, and, for each of them, finds C_i centroids by calling the function findCentroids, shown in Figure 1. The sets C_i, i = 1, ..., J, are then considered for a possible fusion of points that could be representatives of the same cluster. The function findCentroids at the beginning assigns to each cell a level of splitting equal to zero. This level is incremented by one, every F points read, only for the subcells generated by splitting the cell that received the highest number of points. A graphical view of the iterative splitting of the dataset DS2 (described in the next section) with one subsample is shown in Figure 2. It is worth noting that regions not containing points from the subsample are completely ignored during the division process. In fact, since no point from the subsample has been located in these regions, they will never be considered for the division process. The points belonging to S_i are located in the appropriate subcell and, when all the sample points have been considered, the cells having a splitting level equal to maxlevel or maxlevel−1 are selected and the mean points of the points they contain are computed. These mean points are considered as possible representatives of clusters.
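A simplified Python sketch of findCentroids, under stated assumptions: cells are axis-aligned boxes, the densest cell is split every F sample points, and the fusion of the J subsample results into K centroids is omitted; the level bookkeeping is coarser than in Figure 1.

```python
import numpy as np

def find_centroids(sample, F):
    """Simplified sketch: every F sample points, split the cell that has
    received the most points into 2**d subcells one level deeper; finally
    return the means of the deepest non-empty cells."""
    sample = np.asarray(sample, dtype=float)
    lo, hi = sample.min(axis=0), sample.max(axis=0)
    cells = [{"lo": lo, "hi": hi, "level": 0, "points": []}]

    def locate(x):                         # first cell whose box contains x
        for c in cells:
            if np.all(c["lo"] <= x) and np.all(x <= c["hi"]):
                return c

    def split(cell):
        mid = (cell["lo"] + cell["hi"]) / 2
        d = len(mid)
        cells[:] = [c for c in cells if c is not cell]   # drop the split cell
        for corner in range(2 ** d):       # 2**d subcells of the next level
            mask = np.array([(corner >> j) & 1 for j in range(d)], dtype=bool)
            cells.append({"lo": np.where(mask, mid, cell["lo"]),
                          "hi": np.where(mask, cell["hi"], mid),
                          "level": cell["level"] + 1, "points": []})
        for x in cell["points"]:           # re-locate the points of the split cell
            locate(x)["points"].append(x)

    split(cells[0])                        # initial partition into 2**d subcells
    for i, x in enumerate(sample, 1):
        locate(x)["points"].append(x)
        if i % F == 0:
            split(max(cells, key=lambda c: len(c["points"])))

    maxlevel = max(c["level"] for c in cells)
    return [np.mean(c["points"], axis=0)
            for c in cells if c["level"] >= maxlevel - 1 and c["points"]]
```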
Fig. 2. Iterative splitting of the cells for DS2.
4
Experimental Results
The goal of the experiments was to evaluate the quality of the centroids estimated by our method with respect to the actual centers of two synthetic datasets, DS1 and DS2, described in [15]. Both datasets consist of K clusters of 2-dimensional data points. DS1 has the grid pattern and its centers are placed on a √K × √K grid. DS2 has the sine pattern and places the cluster centers on a curve of the sine function. Both DS1 and DS2 are constituted by 100000 data points grouped in 100 clusters. The size of the random subsamples we used was 1% of the full dataset and the number of subsamplings was taken to be 10. The first set of experiments evaluated the closeness of the centroids estimated by the Divisive algorithm to the true centers of the synthetic data. Figure 3 draws the actual (identified with the symbol X), estimated (with +) and random centers (with *) for DS1. As the figure shows, the centroids computed by our method are very near to the true centers. Next, we ran the K-Means algorithm with two different initialisations: random starting centers chosen uniformly on the datasets and centers found by the Divisive Initialisation algorithm. Figure 4 visualizes the clusters obtained at the first step with random start and with estimated centers for the dataset DS1 and the same clusters after 10 steps. Comparing these results with the actual clusters, we can observe that: (1) the location of centers obtained by using the random initialisation is far away from the true location, whereas the location of the estimated centers is very close to the actual one; (2) the number of points contained in the clusters can be remarkably different from the actual number as regards the random start; for the estimated centers, on the contrary, only a few clusters move away from them. The same results have been obtained for DS2, but they have been omitted due to the lack of space. Finally, Table 1 shows, for the first five iterations and then for the 10th, 20th and 30th iterations, the square-error E and the difference from the square-error E' of the previous step, for both the datasets DS1 and DS2, when the K-Means is initialised by our method and by a random choice of the centers. The value of the column reduc compares the reduction of the square-error E for the K-Means initialised by our method with respect to a random start.
Fig. 3. Centers of DS1: X : true, + : estimated, * : random. Table 1. Square-errors for DS1 and DS2.
              DS1                                          DS2
      Divisive             random start                Divisive           random start
iter  E         E-E'       E         E-E'      reduc   E        E-E'      E        E-E'      reduc
1     22341823             50670020             2.26   601370             2081689             3.46
2     19255196  3086627    30930736  19739284   1.60   280649   320721    774035   1307654    2.75
3     18716145  539051     26999328  3931408    1.44   251671   28978     587834   186201     2.33
4     18518756  197389     25230965  1768363    1.36   236701   14970     539755   48079      2.28
5     18457450  61306      24211837  1019128    1.31   234800   1901      536525   3230       2.28
10    18313930  31143      22277027  322189     1.24   232709   427       447485   8352       1.92
20    17377078  10225      20228928  87325      1.16   231422   6         445371   404        1.92
30    16931927  1062       19706827  57233      1.16   231428   0         444606   3          1.92

5
Conclusions and Future Work
Initialisation methods that estimate the first values of the centers of clusters can play an important role in supporting clustering algorithms to accurately and efficiently identify groups of items in large data sets. Preliminary results on two synthetic datasets showed the good quality of the estimated centers. Further work, however, is still necessary to assess the effectiveness of our method. A comparison of our approach with that of Bradley and Fayyad, by using their synthetic data, needs to be done, and the application to real-world data is to be performed.
Fig. 4. Clusters of DS1 with random ((a) and (c)) and estimated ((b) and (d)) centers for the 1st ((a), (b)) and 10th ((c), (d)) iterations.
References 1. K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. Proceedings of the First Workshop on High Performance Data Mining, Orlando, Florida, 1998. 2. P.S. Bradley, O.L. Mangasarian, W.N. Street. Clustering via Concave Minimisation. Advances in Neural Information Processing Systems 9, Mozer, Jordan and Petsche Eds., pp. 368-374, MIT Press, 1997. 3. P.S. Bradley, U.M. Fayyad. Refining Initial Points for K-Means Clustering. Proceedings of the Int. Conf. on Machine Learning, pp. 91-99, Morgan Kaufmann, 1998. 4. R.C. Dubes, A.K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988. 5. M. Ester, H. Kriegel, J. Sander, X. Xu. A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1996. 6. B. Everitt. Cluster Analysis. Heinemann Educational Books Ltd, London, 1977. 7. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery: an overview. In U.M. Fayyad et al. (Eds.) Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996. 8. U.M. Fayyad, C. Reina, P.S. Bradley. Initialisation of Iterative Refinement Clustering Algorithms. Proceedings of the Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998. 9. A.K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988. 10. A.K. Jain, M.N. Murty, P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, June 1999. 11. D. Judd, P. McKinley, A. Jain. Large-Scale Parallel Data Clustering. Proceedings of the Int. Conf. on Pattern Recognition, 1996. 12. L. Kaufman, P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. 13. R.T. Ng, J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proceedings of the 20th Int. Conf. on Very Large Data Bases, pp. 144-155, 1994. 14. B.W. Silverman. Density Estimation for Statistics and Data Analysis. London, Chapman and Hall, 1986. 15. T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, pp. 103-114, June 1996.
6. B. Everitt. Cluster Analysis. Heinemann Educational Books Ltd, London, 1977. 7. U.M. Fayyad, G. Piatesky-Shapiro, P. Smith. From Data Mining to Knowledge Discovery: an overview. In U.M. Fayyad & al. (Eds) Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996. 8. U.M. Fayyad, C. Reina, P.S. Bradley. Initialisation of Iterative Refinement Clustering Algorithms. Proceedings of the Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998. 9. A.K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988. 10. A.K. Jain, M. N. Murty, P. J. Flynn. Data Clustering : A Review. ACM Computing Surveys, June 1999. 11. D. Judd, P. McKinley, A. Jain. Large-Scale Parallel Data Clustering. Proceedings of the Int. Conf. on Pattern Recognition, 1996. 12. L. Kaufman, P.J. Rousseew. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. 13. R.T. Ng, J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proceedings of the 20th Int. Conf. on Very Large Data Bases, pp. 144-155, 1994. 14. B.W. Silverman. Density Estimation for Statistics and Data Analysis. London, Chapman and Hall, 1986. 15. T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the ACM SIGMOD Int. Conf. on Managment of Data, Montreal, Canada, pp. 103-114, June 1996.
A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series Thorsten Poddig and Claus Huber University of Bremen, Chair of Finance FB 7, Hochschulring 4 D-28359 Bremen, Germany [email protected] [email protected]
Abstract. The aim of this paper is to compare the influence of different model selection criteria on the performance of ARMA- and VAR-models to predict turning points in nine financial time series. As the true data generating process (DGP) in general is unknown, so is the model that mimics the DGP. In order to find the model which fits the data best, we conduct data mining by estimating a multitude of models and selecting the best one optimizing a well-defined model selection criterion. In the focus of interest are two simple in-sample criteria (AIC, SIC) and a more complicated out-of-sample model selection procedure. We apply Analysis of Variance to assess which selection criterion produces the best forecasts. Our results indicate that there are no differences in the predictive quality when alternative model selection criteria are used.
1
Introduction
Forecasting turning points (TP) in financial time series is one of the most fascinating (and possibly rewarding) aspects of finance. In this paper, we implement a Monte-Carlo-based regression approach introduced by Wecker [1] and enhanced by Kling [2] to produce probabilistic statements for near-by TPs in monthly financial time series. This method needs forecasts of future values of the time series. Those can be predicted by an econometric model, which is assumed to mimic the true data generating process (DGP). The performance of forecasting models can be judged in two different ways. In-sample predictive accuracy is measured with the data already used for model development and estimation of the coefficients. In contrast, out-of-sample assessment focuses on the ability of the model to predict unknown datapoints. Since forecasting models are usually needed to predict unknown datapoints, their out-of-sample predictions are the focus of interest in this paper. To evaluate whether the out-of-sample forecasts from the models are reliable, backtesting is performed: using a simulation period of 68 months, out-of-sample predictions are produced for each month, using for model estimation only the data available until this month. Unfortunately, in
most applications the true DGP is unknown, and so is the specification of the model. The problem is that a multitude of models exists, and one out of these models is able to fit the DGP best. In order to find this model that produces the best out-of-sample predictions in the backtesting period, two model selection procedures can be applied. In-sample model selection can be implemented easily, using e.g. the Akaike or Schwarz information criterion (AIC, SIC). More complicated is out-of-sample model selection, which can be conducted by forming a separate cross-validation subset (CV) from the training data. The CV is not used to estimate the coefficients but only to validate the model on unknown data. The aim of this paper is to compare the performance of models selected by two classical in-sample model selection criteria (AIC, SIC) with the performance of an out-of-sample validation procedure. To evaluate the performance of the models in the backtesting period, we take the view of a participant in the financial markets. Here one is not interested in optimizing statistical criteria, like Mean Squared Error etc., but in obtaining an acceptable profit. A performance criterion in this spirit is the Cumulative Wealth (CW). We predict TPs in nine financial time series of monthly periodicity with ARMA- and VAR-models using rolling regressions. In order to obtain statistically significant results, we examine the TP predictions from the ARMA- and VAR-models created by the different model selection methods using Analysis of Variance (ANOVA). Due to space limitations, this paper had to be shortened considerably. The full version is available from the Internet [3].
2
The Detection of Turning Points in Financial Time Series
As a first step to obtain a probabilistic statement about a near-by turning point, one has to define a rule for when a TP in the time series is detected. The turning point (peak) indicator

z_t^P = 1, if x_t > x_{t+i} for i = −τ, −τ+1, ..., −1, 1, ..., τ−1, τ;   z_t^P = 0 otherwise                    (1)

is defined as a local extreme value of τ preceding and succeeding datapoints. The trough indicator z_t^T is defined in an analogous way. As we investigate monthly time series, we define τ = 2. At time t the economist knows only the current and past datapoints x_t, x_{t−1}, ..., x_{t−τ+1}, x_{t−τ}. The future values x_{t+1}, ..., x_{t+τ−1}, x_{t+τ} have to be estimated using e.g. ARMA- and VAR-models. With those estimates we applied a Monte-Carlo based procedure developed by Wecker [1] and Kling [2] to obtain probabilistic statements about near-by TPs. A TP is detected if the probability reaches or exceeds a certain threshold θ, e.g. θ = .5. A participant in the financial markets usually is not interested in MSE, MAE, etc. but in economic performance. Since our models do not produce return forecasts but probabilities for TPs, we have to measure performance indirectly by generating trading signals from those probabilities: a short position is taken when a peak is detected (implying the market will fall, trading signal s = −1), a long
position in the case of a trough (s = +1), and the position of the previous period is maintained if there is no TP. With the actual period-to-period return r_actual,t we can calculate the return r_m,t from a TP forecast of our model: r_m,t = s · r_actual,t. In this paper we deal with log-differenced data, so the Cumulative Wealth can be computed by adding the returns over T periods: CW = Σ_{t=1}^{T} r_m,t. To test the ability of the ARMA and VAR models to predict TPs, we investigate nine financial time series, namely DMDOLLAR, YENDOLLAR, BD10Y (performance index for the 10 year German government benchmark bond), US10Y, JP10Y, MSWGR (performance index for the German stock market), MSUSA, MSJPA, and the CRB-Index. The data was available in monthly periodicity from 83.12 to 97.12, equalling 169 datapoints. To allow for the possibility of structural change in the data, we implemented rolling regressions: after estimating the models with the first 100 datapoints and forecasting the τ succeeding datapoints, the data window of the fixed size of 100 datapoints was put forth for one period and the estimation procedure as well as the Monte-Carlo simulations were repeated until the last turning point was predicted for 97.10. Thereby we obtained 68 out-of-sample turning point forecasts. We estimated a multitude of models for each model class: 15 ARMA-models from (1,0), (0,1), (1,1), ..., to (3,3) and 3 VAR-models VAR(1), (2), and (3) comprising all nine variables. We do not specify a single model and estimate all rolling regressions with this model. Rather we specify a class of models (ARMA and VAR). Within a class the best model is selected for forecasting. As an extreme case, a different model specification could be chosen for every datapoint (within the ARMA class e.g. the ARMA(1,0) model for the first rolling regression, ARMA(2,2) for the second etc.). Popular in-sample model selection criteria are AIC and SIC. Applying AIC and SIC for model selection within the first rolling regression, we estimated a multitude of e.g. ARMA-models with 100 datapoints and chose the model with the lowest AIC to forecast the τ future datapoints. In contrast to the simple implementation of AIC and SIC, the out-of-sample procedure for model selection is more complicated. Therefore we divided the training data into two subsequent, disjunct parts: an estimation (= training) subset (70 datapoints) and a validation subset (30 datapoints, see Figure 1). The first 70 datapoints from t−99 to t−30 were used to estimate the models, which were validated with respect to their ability to predict TPs on the following 30 datapoints from t−29 to t. The decision which model is the "best" within the out-of-sample selection procedure was made with respect to CW: the model with the highest CW was selected. The specification of this model, e.g. ARMA(2,2), was then re-estimated with the 100 datapoints from t−99 to t to forecast the τ values of the time series unknown at time t, which are necessary to decide whether there is a turning point at time t. As a result of model selection with the two in-sample criteria AIC, SIC, and the out-of-sample procedure with regard to CW, we obtain three sequences of TP forecasts each for ARMA- and VAR-models for the out-of-sample backtesting period of 68 months. Two ARMA-sequences with a threshold θ = .5 could look like Table 1. The first four columns refer to the number of the rolling regressions and the training, validation, and forecast period, respectively.
Fig. 1. Division of the database
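Purely as an illustration of the evaluation chain just described (τ = 2 turning-point indicators, trading signals, cumulative wealth on log-differenced returns), the sketch below uses our own simplified timing of signals versus returns; it is not the authors' backtesting code.

```python
import numpy as np

def tp_indicators(x, tau=2):
    """Peak (trough) indicator: x[t] is a peak (trough) if it is strictly
    greater (smaller) than the tau preceding and tau succeeding datapoints."""
    peaks = np.zeros(len(x), dtype=int)
    troughs = np.zeros(len(x), dtype=int)
    for t in range(tau, len(x) - tau):
        window = np.r_[x[t - tau:t], x[t + 1:t + tau + 1]]
        peaks[t] = int(np.all(x[t] > window))
        troughs[t] = int(np.all(x[t] < window))
    return peaks, troughs

def cumulative_wealth(returns, peaks, troughs):
    """Short (s = -1) after a detected peak, long (s = +1) after a trough,
    otherwise keep the previous position; CW is the sum of signed returns."""
    s, cw = 0, 0.0
    for t in range(len(returns)):
        if peaks[t]:
            s = -1
        elif troughs[t]:
            s = +1
        cw += s * returns[t]
    return cw
```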
For AIC and SIC, model selection was performed on the 100 datapoints of the training and validation subset as a whole. The 5th (7th) column gives the specification of the ARMA-model selected by CW (AIC), and the 6th (8th) column gives the corresponding CW- (AIC-) value.

Table 1. ARMA-sequence as an example for the rolling regressions

RR    training      validation    forecast       CW Spec.   CW-value   AIC Spec.   AIC-value
1     83.12-89.9    89.10-92.3    92.4-92.5      (2,2)      .179       (3,3)       -5.326
2     84.1-89.10    89.11-92.4    92.5-92.6      (1,0)      .253       (1,1)       -5.417
...   ...           ...           ...            ...        ...        ...         ...
68    89.8-95.4     95.5-97.10    97.11-97.12    (3,0)      .815       (2,3)       -5.482
The first TP forecast was produced for 92.4 (with the unknown values of 92.5 and 92.6), the last for 97.10. The 68 out-of-sample forecasts of the model sequences generated this way are finally evaluated with respect to CW . To judge whether the econometric models are valuable forecasting tools, one would like to test if the model class under consideration is able to outperform a simple benchmark in the backtesting period. When forecasting economic time series, a simple benchmark is the naive forecast. Using the last certain TP statement can be regarded as a benchmark in this sense. As τ =2, the last certain TP statement can be made for t-2, using the datapoints from t-4 to t. A valuable forecasting model should be able to outperform this Naive TP Forecast (NTPF) in the backtesting period. In order to produce a statistically significant result when comparing the model sequences generated by the different model selection criteria, we apply Analysis of Variance (ANOVA). The forecasts for ARMA-models, θ=.5, with respect to the evaluation criterion CW can be exhibited as in table 2 (the last column
contains the NTPF-results). The entry -.115 in the 3rd column of row 3 means that ARMA-models selected by AIC produced a CW of -.115 in the backtesting period when predicting turning points for MSWGR. Looking to the last row, column 3 reveals that the mean CW over all nine time series from the ARMA forecasts is -.192. Table 2. Example for the exhibition of the results from the ARMA turning point forecasts
           Selection criteria
           CW        AIC       SIC       NTPF
MSWGR      -.262     -.115     -.020     -.029
BD10Y      -.856     -.515     -.898     -.291
...        ...       ...       ...       ...
mean:      -.232     -.192     -.264     -.196
The block experiment of ANOVA can be used to test whether the means of the columns (here the means from the TP predictions) and the means of the rows (the means from the TP predictions for one of the time series) are identical. Thereby it is possible to compare the performance of the different model selection criteria. Additionally, the NTPF is included in the test to make sure that the models outperform the benchmark. The basic model of ANOVA is y_ij = μ + α_i + β_j + e_ij, where y_ij represents the element in row i and column j of Table 2, μ is the common mean of all y_ij, α_i is the block effect due to the analysis of different time series in the r rows of Table 2, β_j the treatment effect of the p selection criteria (incl. NTPF) in the columns of Table 2, and e_ij an iid N(0, σ²) random factor. We want to test whether the treatment effects β_j are zero: β_1 = β_2 = ... = β_p = 0. In other words, we want to test the null hypothesis that there are no statistically significant effects due to the use of different model selection criteria on the TP forecasts from ARMA- and VAR-models. An F test statistic is based on the idea that the total variation SST of the elements in Table 2 can be decomposed into the variation between the blocks SSA, the variation between the selection criteria SSB, and the random variation SSE: SST = SSA + SSB + SSE. Estimators for SST, SSA, SSB and SSE can be computed as shown in [3]. Then an F-statistic can be computed (see [3]). The null is rejected if F exceeds its critical value. The next section presents empirical results.
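A hedged NumPy/SciPy sketch of this randomised-block F test (r time series as blocks, p selection criteria including NTPF as treatments); the paper defers its exact estimator formulas to [3], so the sketch uses the standard textbook decomposition.

```python
import numpy as np
from scipy.stats import f as f_dist

def block_anova(y):
    """Randomised-block ANOVA on an r x p table y (rows = time series as
    blocks, columns = selection criteria incl. NTPF as treatments).
    Standard textbook estimators; the paper's own formulas are in [3]."""
    y = np.asarray(y, dtype=float)
    r, p = y.shape
    grand = y.mean()
    sst = ((y - grand) ** 2).sum()
    ssa = p * ((y.mean(axis=1) - grand) ** 2).sum()   # block (time-series) effect
    ssb = r * ((y.mean(axis=0) - grand) ** 2).sum()   # treatment (criterion) effect
    sse = sst - ssa - ssb
    F = (ssb / (p - 1)) / (sse / ((r - 1) * (p - 1)))
    return F, f_dist.sf(F, p - 1, (r - 1) * (p - 1))  # F statistic and p-value
```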
3
Empirical Results and Conclusion
The following Table 3 exhibits the empirical results from the TP forecasts with ARMA- and VAR-models. The 2nd and 3rd columns show the value of the F-statistic and its corresponding p-value. The 4th to 7th columns contain the means
of the model sequence created by the selection criterion under consideration. E.g. the entry ".32" in row 2, column 2 gives the F-statistic for the null that the mean CW of ARMA-models created by the use of the selection criteria AIC, SIC, CW, and the NTPF are all the same. The p-value of .8087 indicates that the null cannot be rejected at the usual levels of significance (e.g. .10). Thus we have to conclude that there are no differences between TP forecasts from ARMA-models generated by different model selection criteria. Moreover, the ARMA forecasts do not differ significantly from the NTPF. Columns 4 to 7 exhibit the mean CW over all nine time series. The ARMA models selected by e.g. AIC managed to produce an average CW of .059 in the simulation period. This is only marginally higher than the mean CW from the NTPF (.057).

Table 3. Empirical ANOVA results from the TP predictions

Model   F     p       AIC     SIC     CW      NTPF
ARMA    .32   .8087   .059    -.022   -.010   .057
VAR     .16   .9242   -.002   -.002   .015    .057
In general, the results indicate that there are no statistically significant differences between TP predictions from ARMA- and VAR-models (p-values .8087 and .9242). With regard to ARMA-models, AIC seems to be the best selection criterion with respect to CW (mean CW = 0.059). This is only slightly better than the benchmark NTPF (mean CW = 0.057) and cannot be considered a reliable result. The other selection criteria even led to underperformance vs. the NTPF. Results are even worse for VAR-models. All VARs underperformed the NTPF. Thus it must be doubted that ARMA- and VAR-models are valuable tools for predicting TPs in financial time series. If they are employed despite the results achieved here, it might be a good choice to make use of the in-sample selection criteria AIC and SIC. They led to results comparable to the out-of-sample validation procedure suggested in this paper and are less expensive to implement. Whether those results hold for other forecasting problems, evaluation criteria, and selection procedures as well has to be investigated by further research.
References 1. Wecker, W. (1979): Predicting the turning points of a time series; in: Journal of Business, Vol. 52, No. 1, 35-50 2. Kling, J.L. (1987): Predicting the turning points of business and economic time series; in: Journal of Business, Vol. 60, No. 2, 201-238 3. Poddig, T.; Huber, C. (1999): A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series - Full Version, Discussion Papers in Finance No. 3, University of Bremen, available at: www1.uni-bremen.de/~fiwi/
Mining Lemma Disambiguation Rules from Czech Corpora Luboˇs Popel´ınsk´ y and Tom´ aˇs Pavelek Natural Language Processing Laboratory Faculty of Informatics, Masaryk University in Brno Czech Republic {popel,xpavelek}@fi.muni.cz
Abstract. Lemma disambiguation means finding a basic word form, typically nominative singular for nouns or infinitive for verbs. In Czech corpora it was observed that 10% of word positions have at least 2 lemmata. We developed a method for lemma disambiguation when no expert domain knowledge is available based on combination of ILP and kNN techniques. We propose a way how to use lemma disambiguation rules learned with ILP system Progol to minimise a number of incorrectly disambiguated words. We present results of the most important subtasks of lemma disambiguation for Czech. Although no knowledge on Czech grammar has been used the accuracy reaches 93% with a small fraction of words remaining ambiguous.
1
Disambiguation in Czech
Disambiguation in inflective languages, of which Czech is a very good instance, is a very challenging task because of their usefulness as well as its complexity. DESAM, a corpus of Czech newspaper texts that is now being built at Faculty of Informatics, Masaryk University, contains more than 1 000 000 word positions, about 130 000 different word forms, about 65 000 of them occuring more then once, and 1665 different tags. DESAM is now being tagged – partially manually, partially by means of different disambiguators – into 66 grammatical categories like a part-of-speech, gender, case, number etc., about 2 000 tags, combinations of category-value couples. E.g. for substantives, adjectives and numerals there are 4 basic grammatical categories. For pronouns 5 categories, for verbs 7 and for adverbs 3 categories, and some number of subcategories. The large number of tags is made by combination of those categories. It was observed [11] that there is in average 4.21 possible tags per word. It is impossible to perform the disambiguation task manually and any tool that can decrease the amount of human work is welcome. DESAM is still not large enough. It does not contain all Czech word forms – compare 132 000 different word forms in DESAM with more than 160 000 stems of Czech words that morphological analysers are able to recognise (each of them can have a number of both prefixes and suffixes). Thus DESAM does not ˙ J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 498–503, 1999. c Springer-Verlag Berlin Heidelberg 1999
Mining Lemma Disambiguation Rules from Czech Corpora
499
contain the representative set of Czech sentences. In addition DESAM contains some errors, i.e. incorrectly tagged words. Another problem is that the significant amount of word positions (words as well as interpunction) are untagged. For the word form “se” nearly one fifth of words are untagged (16,8%) and 93.4% of contexts contain an untagged word. It is similar for other classes of words with an ambigoues lemma. It should be noticed here that the disambiguation task in Czech language is much more complex than in e.g. English also for another reason. For English there are tagged corpora covering a majority of common English sentences. The known grammar rules cover a significant part of English sentence syntax. Unfortunately, neither of those statements hold for Czech. It makes our task quite difficult.
2
Lemma Disambiguation
Lemma disambiguation which we address here, means assigning to each word form its basic form – nominative singular for nouns, adjectives, pronouns and numerals, infinitive for verbs. E.g. in the sentence Od r´ ana je m´ a Ivana se ˇzenou. (literarily since (the) morning my Ivana(female) has been with (my) wife.) each of words except the preposition ”od” has two basic forms. E.g. “r´ ana” can be genitive of “r´ ano”(morning) as well as nominative of a substantive “r´ ana”(bang). In Czech corpora it was observed that 10% of word positions – i.e. each 10th word of a text – have at least 2 lemmata and about 1% word forms of the Czech vocabulary has at least 2 lemmata. The most frequent ambiguous word forms are se and je. Disambiguation of the word “se” would be welcome as it is the 3rd most frequent word in DESAM corpus. Actually the lemma disambiguation (almost always) leads to a disambiguation of sense. In the example, m´ a means either ”my” (daughter) or ”has” (s/he has), se is either preposition ”with” (my wife) or the reflexive pronoun “self” (like ”elle se lave” in French). We use here a novel approach to lemma disambiguation based on a combination of memory-based learning, namely weighted k-Nearest Neighbor method [8] and inductive logic programming(ILP) [9]. Inductive logic programming aims at finding first-order logic rules that cover positive examples (and uncover negative ones) using given domain knowledge predicates. The rest of the paper is organised as follows. In Section 3 we explain how to build basic domain knowledge if no sufficient linguistic knowledge is available. In the Section 4 we present the results obtained with ILP system Progol for the most frequent lemma-ambiguous word form “se”. Rule set accuracy on a disambiguated context is displayed. Section 5 brings the results of disambiguation when correct tags in a context are unknown. We conclude with a discussion of results and a summary of relevant works.
500
3
L. Popel´ınsk´ y and T. Pavelek
Domain Knowledge
There is no complete formal description of Czech grammar. We decided to build domain knowledge predicates without any need of deep linguistic knowledge. We only exploit information about particular tags in a context. The general form of domain knowledge predicates is p(Context, Focus, Condition) where Context is a variable bound with either left context in a reverse order or with right context, Focus, Condition are terms. Focus defines a subpart of the Context. It has a form first(N) (N=1..max length, a sublist of the Context of length N neighboring with the word. max length is a maximal length of a context). Condition says what condition must hold on the Focus. Condition is an unary term of the form somewhere(List) (tags from the List appear somewhere in the Context) or always(List) (tags from the List appear in all positions in the Context). E.g. a goal p(X,first(2), always([c7,nS])) succeeds if tags c7,nS appear in each of the first two words in the context X – e.g. a pronoun and a noun in singular instrumental as in “(se) svou sestrou” – ”(with) his sister”.
4
Learning Disambiguation Rules with Progol
We will demonstrate our method on disambiguation of the word form “se”. It may have either the lemma “s” (preposition like “with” in English) or the lemma “sebe” (reflexive pronoun ”self”). For generation of learning sets we use the part of DESAM corpus which was manually disambiguated (about 250 000 word positions). The left and right contexts have been set to 5 words. Untagged words in context has been tagged as ’unknown part-of-speech’ (tag kZ). Negative examples have been built from sentences where the word has the second lemma. Using P-Progol [10] version 2.2 we have learned rules for both of the two lemmata. It means that for each task we obtained two rule sets that should be complementary. However, we have found it useful to use both of them. Number of sentences was 232 (preposition) and 2935(pronoun). 80% of examples was used for learning. We tested each rule set on the rest of data. Learning time reached 14 hours. It is caused by the enormous number of 4536 literals that may appear in a rule body. It must be mentioned that the default accuracy, i.e. assigning the reflexive pronoun lemma to each occurrence of “se” is 92.7%. Then the reached rule accuracies 92.84% (pronoun) and 94.48%(preposition) are not too impressive. In the next section we will show that even such “poor” rule sets are usable for lemma disambiguation.
Mining Lemma Disambiguation Rules from Czech Corpora
5
501
Disambiguation
The goal then was to find such a criterion that would allow to find the correct lemma for the word “se”. The learning and the testing sets contained sentences not used for learning the disambiguation rules. We limited both left and right context to the length of 3 words. Then we removed all sentences that contained commas, dots, parentheses etc. 50% of the sentences were used for estimation of parameters, the rest for testing. All possible grammatical categories were found for each sentence employing LEMMA morphological analyser1 . Then all variations of categories was generated for each sentence. Both theories learned by Progol were run on those data so that for each sentence we had two success rates, i.e. the relative number of correctly covered positive examples and correctly uncovered negative examples to the number of all examples. Time needed for disambiguation of 1 sentence was 6 seconds in average, very rarely it was more than 10 seconds. If the disambiguation lasted more than 30 seconds (because of the enormous number of variations of tags), the process was killed. It concerned less than 2% of cases. Two success rates obtained for a sentence are as (x,y)-coordinates. The new example is then classified into class(lemma) of the nearest neighbor(s) in that learning set. We computed the distance between two instances (x1 , y1 ) and (x2 , y2 ) as an Euclidian distance. As mentioned above, 50% new sentences have been used for building the set of instances and for parameter estimation. On the new learning set we tried values of k (the number of neighbors) in the range 1..10. It was observed that increasing value of k did not increase accuracy of disambiguation. Therefore for all experiments below k was set to 1. Then we found the nearest point (xi , yi ). Let s1 , s2 be the number of instances with lemma “s” and the number of instances with the lemma “sebe” for this point. If si is greater than sj we would expect that the i-th lemma is the right one. We lemma := if s1 > s2 ∧ succesRatelemma1 > tlemma1 then lemma1 else if s1 < s2 ∧ succesRatelemma2 > tlemma2 then lemma2 else unresolved Fig. 1. kNN algorithm
also observed that if a success rate for a word in a particular context is smaller than a threshold the word cannot be disambiguated. Thus the correct lemma was assigned using the rules in Fig. 1. Values of (tlemma1 , tlemma2 ) was tested in the range (0,0)..(1,1). The best settings of thresholds on the learning set was tlemma1 = 0, tlemma2 = 0.8. Results of disambiguation are in Table 1.
1
copyright Lingea Brno 1995
502
L. Popel´ınsk´ y and T. Pavelek
preposition learn test pronoun learn test
disambiguation #ex correct wrong accuracy(%) 99 80 4 97.5 112 93 7 93.0 297 214 2 99.1 310 236 6 97.5
unresolved # % 17 17.2 14 12.5 82 27.6 44 14.2
Table 1. Results of kNN algorithm
6
Conclusion
The presented results are the first obtained by ILP techniques in disambiguation of inflective languages, as far as we know. It must be stressed that the Czech corpus is under development and therefore it contains about 17% of untagged words as well as incorrectly tagged words. Moreover, there were no usable formal grammar rules for Czech that would make the domain knowledge building easier. We described the systematic way of building domain knowledge if no sufficient linguistic knowledge is available. A new method for lemma disambiguation was introduced that reached an accuracy 93%, leaving a small part of words ambiguous. Similar accuracy was obtained for Prague Tree Bank corpus [13]. The lemma disambiguation task is not solved here completely. The main reason is that the Czech corpora are still too small and therefore cardinality of learning sets is not sufficient for most of the tasks. Our approach was also used for disambiguation of unknown words (not existing in the corpus). We defined similarity classes for lemma-ambiguous words in terms of grammatical categories. First results can be found in [12]. Results obtained by ILP for tag disambiguation can be found in [13]. So far statistical techniques (accuracy 81.64%) and neural nets (75.47%) have been applied to DESAM [11]. See also [5,6,14] for other results with another Czech corpus. It should be pointed out that our results are not quite comparable as we focus only on lemma disambiguation. In the past, ILP has been applied for inflective languages in the field of morphology. LAI Ljubljana [2] applied ILP for generating the lemma from the oblique form of nouns as well as for generating the correct oblique form from the lemma, with average accuracy 91.5 % . Learning nominal inflections for Czech and Slovene (among others) is described in [7]. James Cussens [1] developed POS tagger for English that achieved per-word accuracy of 96.4 %. Martin Eineborg and Nikolaj Lindberg [3,4] induced constraint grammar-like disambiguation rules for Swedish with accuracy 98%. Our approach differs significantly in two points. We do not exploit any information on particular words as in [3]. Such knowledge would improve an accuracy significantly. Neither we use any hand-coded grammatical domain knowledge as in [1].
Mining Lemma Disambiguation Rules from Czech Corpora
503
Out method, although developed for Czech language, is actually language-independent except of the set of tags. It means that it is possible to use our approach also for other languages. Acknowledgements This paper is a brief version of [13]. We thank a lot to anonymous referees of LLL workshop for their comments. We would like to thank to Karel Pala and ˇ ep´ Olga Stˇ ankov´ a for their help with earlier versions of this paper. We thank, too, Tom´ aˇs Pt´ aˇcn´ık, Pavel Rychl´ y, Radek Sedl´ aˇcek and Robert Kr´ al for fruitful discussions and assistance. This work has been partially supported by VS97028 grant of Ministry of Education of the Czech Republic ”Natural Language Processing Laboratory” and ESPRIT ILP2 Project.
References 1. Cussens J.: Part-of-Speech Tagging using Progol. In Proc. of ILP’97, LNAI 1297, Springer-Verlag 1997. 2. Dˇzeroski S., Erjavec T.: Induction of Slovene Nominal Paradigms. In Proc. of ILP’97, LNAI 1297, Springer-Verlag 1997. 3. Eineborg M., Lindberg N.: Induction of Constraint Grammar rules using Progol. In Proc. of ILP’98, 1998. 4. Lindberg N., Eineborg M.: Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming. In: COLING/ACL’98. 5. Hajiˇc J., Hladk´ a B.: Probabilistic and rule-based tagger of an inflective language – a comparison. In Proceedings of the 5th Conf. on Applied Natural Language Processing, 111-118, Washington D.C., 1997. 6. Hajiˇc J., Hladk´ a B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of EACL 1998. 7. Manandhar S., Dˇzeroski S., Erjavec T.: Learning multilingual morphology with CLOG. In Proc. of ILP’98, 1998. 8. Mitchell, T.M.: Machine Learning. McGraw Hill, Newy York, 1997. 9. Muggleton S., De Raedt L.: Inductive Logic Programming: Theory And Methods. J. Logic Programming 1994:19,20:629-679. 10. Muggleton S.: Inverse Entailment and Progol. New Generation Computing Journal, 13:245-286, 1995. 11. K. Pala , P. Rychl´ y and P. Smrˇz: DESAM - annotated corpus for Czech. In Pl´ aˇsil F., Jeffery K.G.(eds.): Proceedings of SOFSEM’97, Milovy, Czech Republic. LNCS 1338, Springer-Verlag 1997. 12. Pavelek T., Popel´ınsk´ y L.: Towards lemma disambiguation: Similarity classes (submitted) 13. Popel´ınsk´ y L., Pavelek T., Pt´ aˇcn´ık T.: Towards disambiguation in Czech corpora. Workshop Notes of “Learning Language in Logic”(LLL) ICML’99 Workshop, Bled, Slovenija, 1999. 14. Zavrel J., Daelemans W.: Recent Advances in Memory-Based Part-of-Speech Tagging. ILK/Computaional Linguistics, Tilburg University, 1998.
Adding Temporal Semantics to Association Rules Chris P. Rainsford1 and John F. Roddick2 1
Defence Science and Technology Organisation, DSTO C3 Research Centre Fernhill Park , Canberra, 2600, Australia. [email protected] 2 Advanced Computing Research Centre, School of Computer and Information Science University of South Australia, The Levels, Adelaide, 5095, Australia. [email protected]
Abstract. The development of systems for knowledge discovery in databases, including the use of association rules, has become a major research issue in recent years. Although initially motivated by the desire to analyse large retail transaction databases, the general utility of association rules makes them applicable to a wide range of different learning tasks. However, association rules do not accommodate the temporal relationships that may be intrinsically important within some application domains. In this paper, we present an extension to association rules to accommodate temporal semantics. By finding associated items first and then looking for temporal relationships between them, it is possible to incorporate potentially valuable temporal semantics. Our approach to temporal reasoning accommodates both point-based and intervalbased models of time simultaneously. In addition, the use of a generalized taxonomy of temporal relationships supports the generalization of temporal relationships and their specification at different levels of abstraction. This approach also facilitates the possibility of reasoning with incomplete or missing information.
1 Introduction Association rules have been widely investigated within the field of knowledge discovery q.v. ([1],[2],[7],[8],[9],[12],[14]). In this paper we present an extension to association rules to allow them to exploit the semantics associated with temporal data and particularly temporal interval data. Some work in the discovery of common sequences of events has been conducted q.v. ([3],[10],[15]). However, these algorithms are aimed at finding commonly occurring sequences rather than associations. Moreover, the algorithms only accommodate point-based events and this restricts both the potential semantics of knowledge that may be discovered and the data that can be learnt from. Other investigations have examined the discovery of association rules from temporal data, such that each discovered rule is weakened by a temporal dependency q.v. [5]. Özden et al. extend this to cyclic domains to describe associations that are strong during particular parts of a specified cycle [11]. This may be used to describe the behavior of rules that only hold true in summer or winter or during some other part of a given cycle. J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 504-509, 1999. © Springer-Verlag Berlin Heidelberg 1999
Adding Temporal Semantics to Association Rules
505
Conventional association rules do not accommodate the temporal relationships that may be intrinsically important in some application domains. Importantly, each basket of items is treated individually with no record of the associated customer or client who purchased these goods. However, where client histories exist, temporal patterns may be associated with their purchasing behaviour over time. Therefore it would useful to provide organisational decision-makers with this temporal information. Existing association rule algorithms do not support such temporal semantics. This paper addresses this issue by presenting an extension to association rules that accommodates both point-based and interval-based models of time simultaneously. In addition, the use of a generalised taxonomy of relationships supports the generalisation of temporal relationships and specification at different levels of abstraction. This approach also facilitates the possibility of reasoning with incomplete or missing information. This flexibility makes the proposed approach applicable to a wide array of application domains. Although initially motivated by the desire to analyse large retail transaction databases, the general utility of association rules makes them applicable to a wide range of different learning tasks. For consistency the entities with which a history of transactions is associated are described here as clients. Clients may be associated with non-temporal properties such as sex, specific events such as item purchases or with attributes manifested over intervals such as bank balances, outstanding debts or insurance classifications. These properties are described as items and the set of items associated with a client is said to be their basket of items. Association rules may be able tell us that Investment_X is associated with Insurance_Y. Temporal associations may then tell us that Investment_X usually occurs after the start of Insurance_Y. This may indicate that customers start with an insurance policy and this becomes a gateway for other services such as A campaign marketing Investment_X to holders of Investment_X. Insurance_Y may then be suggested. In the next section temporal association rules are formally defined. Section 3 then discusses the temporal logic that underlies the proposed approach to learning temporal association rules. An overview of the learning algorithm is provided in Section 4. In Section 5 a summary of this paper and discussion of future research is provided.
2
Temporal Association Rules
A temporal association rule can be considered a conventional association rule that includes a conjunction of one or more temporal relationships between items in the antecedent or consequent. Building upon the original formalism in [1] temporal association rules can be defined as follows: Let I = I1, I2,...,Im be a set of binary attributes or items and T be a database of tuples. Association rules were first proposed for use within transaction databases, where each transaction t is recorded with a corresponding tuple. Hence attributes represented items and were limited to a binary domain where t(k) = 1 indicated that the item Ik had been purchased as part of the transaction, and t(k) = 0 indicated that it had not. However in a more general context t may be any tuple with binary domain attributes, which need not represent a transaction buy may simply represent the presence of some attribute value or range of
506
C.P. Rainsford and J.F. Roddick
values. Temporal attributes are defined as attributes with associated temporal points or intervals that record the time for which the item or attribute was valid in the modelled domain. Let X be a set of some attributes in I. It can be said that a transaction t satisfies X if for all attributes Ik in X, t(k) = 1. Consider a conjunction of binary temporal predicates P1 ∧ P2….∧ Pn defined on attributes contained in either X or Y where n ≥ 0. Then by a temporal association rule, we mean an implication of the form X ⇒ Y ∧ P1 ∧ P2….∧ Pn, where X, the antecedent, is a set of attributes in I and Y, the consequent, is a set of attributes in I that is not present in X. The rule X ⇒ Y ∧ P1 ∧ P2….∧ Pn is satisfied in the set of transactions T with the confidence factor 0 ≤ c ≤ 1 iff at least c% of transactions in T that satisfy X also satisfy Y. Likewise each predicate Pi is satisfied with a temporal confidence factor of 0 ≤ tcPi ≤ 1 iff at least tc% of Transactions in T that satisfy X and Y also satisfy Pi. The notation X ⇒ Y |c ∧ P1|tc ∧ P2|tc….∧ Pn |tc is adopted to specify that the rule X ⇒ Y ∧ P1 ∧ P2….∧ Pn has a confidence factor of c and temporal confidence factor of tc. As an illustration consider the following simple example rule: policyC ⇒ investA,productB | 0.87 ∧ during(investA,policyC) | 0.79 ∧ before(productB,investA) | 0.91 This rule can be read as follows: The purchase of investment A and product B are associated with insurance policy C with a confidence factor of 0.87. The investment in A occurs during the period of policy C with a temporal confidence factor of 0.79 and the purchase of product B occurs before investment A with a temporal confidence factor of 0.91 Binary temporal predicates are defined using Allen’s thirteen interval based relations and Freksa’s neighbourhood relations and this will be discussed in the next section.
3
Temporal Logic
The expressiveness of temporal association rules is determined by the set of temporal predicates available to describe relationships between items. For our work, Allen’s taxonomy of temporal relationships is adopted to describe the basic relationships between intervals [4]. These relationships become the basis for binary temporal predicates. Using these relations we are able to treat points as a special case of intervals where begin and end points are equal. To add extra expressive capability, Freksa’s generalised relationships have also been adopted [6]. Freksa’s neighbourhood relations generalise over Allen’s relations and this allows the proposed algorithm to describe temporal relationships at multiple levels. Therefore several commonly occurring relationships can be summarised into single strong relationships. Both of these taxonomies are depicted in Figure 1.
)UHNVD·V
(ol)
The name of the relationship
X older than Y
The relationship label
>? X succeeds Y (sd)
>? X younger contemporary of Y (yc)
?> X surviving contemporary of Y (sc)
?<>? X contemporary of Y (ct)
??>< X survived by and contemporary of Y (bc)
>? X older contemporary of Y (oc)
??(<=)? X precedes Y (pr)
Lines link Freksa’s neighbourhood relationships with the Allen’s interval relationships that they summarise. Dashed and solid lines are used alternately to aid visual clarity.
??>? X died after birth of Y (db)
?>?> X younger and survives Y (ys)
?< X older and survived by Y (ob)
?? X born before death of Y (bd)
)UHNVD·V1HLJKERXUKRRG5HODWLRQVKLSV
Fig. 1. The interrelations between the Allen’s and Freksa’s taxonomies of temporal relationships
??
>>>> After (>)
>=>> Met by (mi)
><>> Overlapped by (oi)
><>= Finishes (f)
><>< During (d)
=<>> Started by (si)
=<>= Equals (=)
=<>< Starts (s)
<<>> Contains (dl)
<<>= Finished by (fi)
<<>< Overlaps (o)
<<=< Meets (m)
<<<< Before (<)
5HODWLRQVKLSV
$OOHQ·V,QWHUYDOWR,QWHUYDO
For two intervals X(α,ω) and Y(Α,Ω), the symbols represent the relationships: greater than >, less than <, equal to =, or unknown ?, between endpoints: α&Α, α&Ω, ω&Α, and ω&Ω respectively.
.H\
???> X survives Y (sv)
>??? X younger than Y (yo)
???= X tail to tail with Y (tt)
=??? X head to head with Y (hh)
???< X survived by Y (sb)
?? X older than Y (ol)
5HODWLRQVKLSV
1HLJKERXUKRRG
Adding Temporal Semantics to Association Rules 507
508
C.P. Rainsford and J.F. Roddick
4 The Learning Algorithm The temporal association learning algorithm can be seen as a four-stage process as depicted in Figure 2. A detailed description of this process is provided in [13]. The first stage of the learning process is the use of an association rule learning algorithm to generate an initial set of association rules. This process is independent of any specific association rule learning algorithm and this has the advantage of allowing the user to utilise the algorithm of their choice to learn the association rules. During this phase the temporal attributes associated with the items are ignored. The separation between the first and second phase allows the user to prune out uninteresting associations before proceeding with a temporal analysis. In the second phase all of the possible pairings of temporal items in each rule in the set of discovered association rules are generated. These pairing can then be tested to see if any strong temporal relationships exist between them. In the third phase, the database is scanned to determine the temporal nature of the relationships between the candidate item pairings. Each tuple is checked to see if it supports a given rule. If support exists, then the temporal relationships between the items in that instance are recorded as one of Allen’s thirteen basic relationships. The fourth phase of the learning algorithm is the derivation of temporal association rules based upon the original ruleset and the aggregation of temporal relationships found to exist between items. During this phase Allen’s relationships may be generalised to Freksa’s more general relationships so that the temporal confidence threshold can be met. Candidate relationships that do not meet the confidence threshold are discarded.
Database
Stage 1 Learn Association Rules
Association Rules
Stage 3 Count Relationships
Relationship Counts
Stage 4 Create Temporal Association Rules
Candidate Relations Stage 2 Generate Candidates
Temporal Association Rules
Fig. 2. The four-stage temporal association rule learning process.
5 Summary In this paper we have presented an extension to association rules to accommodate both interval and point-based temporal semantics. This technique operates upon the output of existing association rule algorithms and therefore is capable of exploiting existing knowledge discovery resources. Our preliminary experiments have indicated that the performance of this algorithm is of a similar order to a comparatively implemented conventional association rule learning process q.v. [13]. The possibility
Adding Temporal Semantics to Association Rules
509
of further experimentation and improvements to this algorithm includes reimplementation with enhancements to increase performance as well as application to real-world problem domains. An interface to facilitate the discovery and presentation of temporal associations is currently being developed.
References 1. Agrawal, A., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. International Conference on Management of Data (SIGMOD’93), May (1993) 207-216. th 2. Agrawal, A., Srikant, R.: (1994). Fast Algorithms for Mining Association Rules. 20 VLDB conference, September, Santiago, Chile. (1994) 487-499. 3. Agrawal, R., Srikant, R.: Mining Sequential Patterns. International Conference on Data Engineering (ICDE), March, Taipei, Taiwan (1995). 4. Allen, J. F.: Maintaining knowledge about temporal intervals. Communications of the ACM Vol 26. No.11 (1983). 5. Chen, X., Petrounias, I., Heathfield, H.: Discovering Temporal Association Rules in Temporal Databases. International Workshop on Issues and Applications of Database Technology (IADT98), July, Berlin, Germany.(1998) 312-319. 6. Freksa, C.: Temporal reasoning based on semi-intervals. Artificial Intelligence 54, (1992) 199-227. 7. Han, J., Fu, Y.: Discovery of Multiple-Level Association Rules from Large Databases, Technical Report, Simon Fraser University. (1995). 8. Koperski, K., Han, J.: Discovery of Spatial Association Rules in Geographic Information th Databases. The 4 International Synopsum on Large Spatial Databases, August, Maine. (1995) 47-66. 9. Mannila, H., Toivonen, H., Verkamo, A.I.: Efficient Algorithms for Discovering Association Rules. AAAI Workshop on Knowledge Discovery in Databases, July, Seattle, washington, U. Fayyad, M. and R. Uthurusamy (Eds) (1994) 181-192. 10. Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of frequent episodes in event sequences. Report C-1997-15, University of Helsinki, Department of Computer Science, February (1997). 11. Özden, B., Ramaswamy, S., Silberschatz, A.: Cyclic Association Rules, International Conference on Data Engineering, April, (1998). 12. Park, J.S., Chen, M., Yu, P.S.: An Effective Hash-Based Algorithm for Mining Association Rules. ACM SIGMOD (1995). 13. Rainsford, C.P., Accommodating Temporal Semantics in Knowledge Discovery and Data Mining, PhD Thesis, University of South Australia, (submitted)(1998). st 14. Srikant, R., Agrawal, R.: Mining Generalized Association Rules. The 21 International Conference on Very Large Databases, September, Zurich, Switzerland. (1995). 15. Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations and Performance Improvements. Fifth International Conference on Extending Database Technology (EDBT), March, Avignon, France.(1996).
Vwxg|lqj wkh Ehkdylru ri Jhqhudol}hg Hqwurs| lq Lqgxfwlrq Wuhhv Xvlqj d P0ri0Q Frqfhsw U1 Udnrwrpdodod/ V1 Odoolfk dqg V1 Gl Sdopd HULF Oderudwru| 0 Xqlyhuvlw| ri O|rq 5 h0pdlo=~udnrwrpd/odoolfk/vglsdopdCxqly0o|rq51iu
Devwudfw Wklv sdshu vwxg| vsolwwlqj fulwhulrq lq ghflvlrq wuhhv xvlqj wkuhh ruljlqdo srlqwv ri ylhz1 Iluvw zh sursrvh d xqlhg irupdol}dwlrq iru dvvrfldwlrq phd0 vxuhv edvhg rq hqwurs| ri w|sh ehwd1 Wklv irupdol}dwlrq lqfoxghv srsxodu phd0 vxuhv vxfk dv Jlql lqgh{ ru Vkdqqrq hqwurs|1 Vhfrqg/ zh jhqhudwh duwlfldo gdwd iurp P0ri0Q frqfhswv zkrvh frpsoh{lw| dqg fodvv glvwulexwlrq duh frq0 wuroohg1 Wklug/ rxu h{shulphqw doorzv xv wr vwxg| wkh ehkdylru ri phdvxuhv rq gdwdvhwv ri jurzlqj frpsoh{lw|1 Wkh uhvxowv vkrz wkdw wkh glhuhqfhv ri shuirupdqfhv ehwzhhq phdvxuhv/ zklfk duh vljqlfdqw zkhq wkhuh lv qr qrlvh lq wkh gdwd/ glvdsshdu zkhq wkh ohyho ri qrlvh lqfuhdvhv1
4
Lqwurgxfwlrq
Lqgxfwlrq wuhh phwkrgv vxfk dv FDUW ^4`/ F718 ^43` ru jhqhudol}hg dssurdfk olnh VLSLQD ^47` nqrz d juhdw vxffhvv lq gdwd plqlqj uhvhdufk ehfdxvh wkh| duh yhu| idvw dqg hdv| wr xvh1 Wkh ohduqlqj jrdo lv wr surgxfh vxejurxsv dv krprjhqhrxv dv srvvleoh frqvlghulqj d sduwlfxodu dwwulexwh/ fdoohg wkh fodvv1 Wkh lqgxfwlrq dojrulwkp wkh| xvh wr dfklhyh wklv jrdo lv vlpsoh = vsolw hdfk qrgh ri wkh wuhh/ zklfk uhsuhvhqwv d vxevhw ri wkh zkroh srsxodwlrq/ xvlqj d suhglfwlyh dwwulexwh ri wkh ohduqlqj vhw xqwlo d vwrsslqj uxoh lv dfwlydwhg1 Wkh vhohfwlrq ri wkh suhglfwlyh dwwulexwh uholhv rqo| rq d vsolwwlqj phdvxuh wkdw doorzv wr rughu dwwulexwhv dffruglqj wr wkhlu frqwulexwlrq wr suhglfwlqj wkh ydoxh ri wkh fodvv dwwulexwh1 Pdq| zrunv kdyh ehhq ghyrwhg wr wklv fuxfldo hohphqw ri lqgxfwlrq judsk phwkrgv ^44`= vrph wu| wr fodvvli| wkh phdvxuhv xvhg lq sudfwlfh ^45` zkloh rwkhuv frpsduh wkhlu shuirupdqfhv rq ehqfkpdun gdwdedvhv ^5`1 Wkh ehkdylru ri wkhvh phdvxuhv lq d ohduqlqj surfhvv uhpdlqv krzhyhu odujho| xqnqrzq/ qrwdeo| ehfdxvh wkh vwxglhv riwhq xvh gdwdedvhv zkrvh fkdudfwhulvwlfv duh qrw vshflhg vr wkdw uhvxowv duh qdoo| rqo| ydolgdwhg rq vwxglhg gdwdedvhv1 Lq wklv sdshu/ zh ghhshq wkh vwxg| ri vsolwwlqj phdvxuhv iurp wkuhh ruljlqdo srlqw ri ylhz1 Iluvw zh dgrsw dq xqlhg irupdol}dwlrq ri vsolwwlqj phdvxuhv zklfk lqfoxghv prvw h{lvwlqj phdvxuhv vxfk dv Jlql lqgh{ ^4` ru Vkdqqrq hqwurs| ^43` e| prgxodwlqj d sdudphwhu 1 Vhfrqg/ zh jhqhudwh duwlfldo gdwdedvhv xvlqj d P0ri0Q frqfhsw zklfk doorzv xv wr wrwdoo| frqwuro fodvv glvwulexwlrq dqg wkh frpsoh{lw| ri wkh frqfhsw1 Ydulrxv ohyhov dqg nlqg ri qrlvh fdq wkxv eh xvhg wr vwxg| wkh ehkdylru ri jhqhudol}hg hqwurs|1 Wklug/ zh frpsduh shuirupdqfhv ri ghflvlrq wuhhv rq P0ri0Q frqfhswv ri jurzlqj frpsoh{lw|1 •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 510−517, 1999. Springer−Verlag Berlin Heidelberg 1999
Induction Trees Using a M−of−N Concept
511
Lq wkh iroorzlqj/ zh suhvhqw P0ri0Q frqfhswv dqg wkhlu wudqvirupdwlrq lqwr wuhhv1 Zh wkhq lqwurgxfh d jhqhudol}dwlrq ri qrupdol}hg jdlq hqwurs| phdvxuh dqg vwxg| lwv ehkdylru lq lqgxfwlrq wuhhv dojrulwkp xvlqj duwlfldo gdwdvhwv jhqhudwhg zlwk d P0ri0Q frqfhsw1
5
Iurp P0ri0Q Frqfhswv wr Ghflvlrq Wuhhv
Wkh xvh ri P0ri0Q frqfhswv doorzv xv wr frqwuro wkh gl!fxow| ri wkh ohduqlqj surfhvv1 Wkhvh frqfhswv duh hvshfldoo| kdug wr ohduq zlwk ghflvlrq wuhh ^;` ^<`1 Zh qrwh wkdw lw grhv qrw frqfhuq vlpso| dq duwlfldo frqfhsw/ P0ri0Q fdq rffxu lq uhdo sureohp ^7`1 Zh uvw uhfdoo wkh P0ri0Q frqfhsw ghqlwlrq1 Ohw xv frqvlghu Q lqghshqghqw errohdq yduldeohv zlwk wkh vdph suredelolw| s1 Zh fdoo P0ri0Q frqfhsw/ wkh errohdq yduldeoh zklfk wdnhv wkh ydoxh 4 li dqg rqo| dw ohdvw P ri wkh Q yduldeohv wdnh wkh ydoxh 41 Iru h{dpsoh d 50ri06 frqfhsw/ zlwk wkuhh errohdq dwwulexwhv +D/ E dqg F, lv orjlfdoo| htxlydohqw wr DE . DF . EF=
514 Fkhfnlqj wkh Frpsoh{lw| ri P0ri0Q Frqfhswv
P @ P $+QQ$ P ,$ frqmxqfwlrqv ri ohqjwk D P0ri0Q frqfhsw lv d glvmxqfwlrq ri FQ P 1 Wkh plqlpdo wuhh qhfhvvdu| wr ohduq P0ri0Q frqfhsw lv ri ghswk Q/ exw wkh qxpehu ri ohdyhv ri wklv wuhh uholhv rq P 1 Wr hydoxdwh wkh frpsoh{lw| ri P0ri0Q frqfhswv/ zh sursrvh wr fdofxodwh wkh qxpehu ri ohdyhv ri wkh orjlfdoo| htxlydohqw P / lv fdofxodwhg dv iroorz4 = wuhh1 Wklv qxpehu ri ohdyhv/ ghqrwhg e| IQ 41 Zh fdofxodwh wkh qxpehu ri ohdyhv dw wkh ohyho l iru wkh plqlpdo wuhh/ l @ 4/ 5/ 111/ Q041 D qrgh ri wkh plqlpdo wuhh lv d ohdi vlqfh wkh rqh ru wkh rwkhu ri wkh wzr h{foxvlyh frqglwlrqv eorz lv vdwlvhg = F4 = wkh qrgh fruuhvsrqgv wr d 4 dqg wkhuh duh P04 qrghv deryh lw rq lwv eudqfk fruuhvsrqglqj wr d 4 > F5 = wkh qrgh fruuhvsrqgv wr d 3 dqg wkhuh duh +Q P , qrghv deryh lw rq lwv eudqfk fruuhvsrqglqj wr d 3 > Dw wkh ohyho l ri wkh wuhh/ wkhuh duh F lP4 4qrghv vdwlvi|lqj F4> l p>zkloh wkhuh duh FlQ4 P qrghv vdwlvi|lqj F5 > l q p . 41 51 Wkh wrwdo qxpehu ri ohdyhv lv rewdlqhg e| vxpplqj iru wkh ydoxhv ri l =
QP @
I
[Q
lP @
lP
F
4
4
I
[Q
l Q P @
lQP @
F .4
[
Q P m
4
QP @ QQ P . QP F
F
4
@3
@
PP m 4
F
F
4.
[
P
4
m
@3
lQP
F
4
QP
.4
4 Q @ Q . 41 Zh fkrrvh wr zrun zlwk 60ri0Q frqfhswv/ zkhuh IQ @ IQ 6> 7> ===> :/ ehfdxvh wkh| vkrz d vx!flhqw udqjh ri frpsoh{lw|1
4
Q
@
Ixuwkhupruh/ zh fdq fdofxodwh wkh qxpehu ri qrghv qhfhvvdu| wr P0ri0Q frqfhsw ohduqlqj +h{fhswlrq ri wkh urrw qrgh, =
' 2E8 3
512
R. Rakotomalala, S. Lallich, and S. Di Palma Wdeoh41
Ydoxhv ri s/
8
dqg
Q 6 7 8 9 :
515
R6?
s 319633 317896 3168<7 315<9< 315864
iru %60ri0Q% frqfhswv/ Q @ 6/111/:
8 7 43 53 68 89
s6? 31479< 313948 313673 313548 31347<
Fkhfnlqj wkh Suredelolw| Glvwulexwlrq ri wkh rxwsxw fodvv
Zh ghflghg wr ohduq frqfhswv zlwk wkh vdph fodvv suredelolwlhv glvwulexwlrq lq rughu wr dyrlg wkh frpsdulvrq wr eh dowhuhg e| wkh qdwxuh pruh dqg ohvv xqedo0 dqfhg ri wkhvh glvwulexwlrqv1 Zh rswhg iru wkh suredelolwlhv glvwulexwlrq +31:8 > 3158, zkrvh lpedodqfh lv lqwhuphgldwh1 Wr jhqhudwh rxu gdwdvhwv/ zh xvh Q lqghshqghqw errohdq dwwulexwhv zklfk wdnh wkh ydoxh 4 +WUXH, zlwk wkh suredelolw| s1 Wr fdofxodwh wkh uhvxowlqj fodvv glvwulexwlrq ri wkh P0ri0Q frqfhsw/ ohw xv ghqh NQ wr eh wkh qxpehu ri ydul0 deohv zklfk wdnh wkh ydoxh 4 dprqj wkh Q dwwulexwhv1 Wkhq NQ kdv d elqrpldo glvwulexwlrq zlwk sdudphwhuv Q dqg s1 Wkxv/ wkh srvlwlyh fodvv suredelolw| lv
S +FP@Q
@ 4, @
S +NQ
N P , @ 4 s3P 4
zkhuh s3P lv wkh fxpxodwhg suredelolw| ri wkh elqrpldo glvwulexwlrq E +Q > s, iru wkh ydoxh P 1 Wdeoh 4 jlyhv s/ wkh qxpehu ri ohdyhv dqg wkh ydoxh ri splq+suredelolw| ri wkh ohdvw suredeoh uxoh, xvhg lq ohduqlqj vhw vl}h ghqlwlrq iru Q @ 6> 7> ===> :1
6
Dvvrfldwlrq Phdvxuh edvhg rq Hqwurs| ri W|sh q
Ohw xv frqvlghu d yduldeoh wr eh ohduqhg F/ pdgh xs ri N fdwhjrulhv fn / n@4/ 5/111/ N/ dqg d suhglfwlyh dwwulexwh [ pdgh xs ri O fdwhjrulhv {o / o@4/ 5/ 111/ O1 Zh ghqrwh e| no wkh mrlqw suredelolw| ri fn dqg {o / n. dqg .o wkh pdujlqdo suredelolwlhv ri F dqg [1 Wr phdvxuh wkh suhglfwlyh dvvrfldwlrq ehwzhhq F dqg [/ rqh xvxdoo| fdofx0 odwhv wkh uhodwlyh phdq uhgxfwlrq ri xqfhuwdlqw| rq wkh glvwulexwlrq ri F gxh wr wkh nqrzlqj ri [/ iroorzlqj wkh S1U1H frh!flhqwv ri dvvrfldwlrq +sursruwlrqdo uhgxfwlrq lq huuru, sursrvhg e| Jrrgpdq dqg Nuxvndo ^6`1 Zh sursrvh wr hvwde0 olvk d jhqhudol}hg phdvxuh ri dvvrfldwlrq O +F@[ ,/ edvhg rq wkh phdq uhodwlyh uhgxfwlrq ri jhqhudol}hg hqwurs| ri wkh glvwulexwlrq ri F jdlqhg zkhq nqrzlqj [1 Dfwxdoo|/ wkh hqwurs| ri F fdq rqo| ghfuhdvh zkhq uhdvrqlqj frqglwlrqdoo| wr [ @ {/ ehfdxvh ri wkh frqfdylw| ri wkh hqwurs| ri w|sh / A 31 Zh kdyh hvwdeolvkhg wkh jhqhulf irupxod ri wklv phdvxuh/ dqg suryhg lwv jrrg surshuwlhv1 N S 4 n. = Zh Wkh hqwurs| ri w|sh ri F lv ghqhg e|= K +F , @ 554 4 4 n@4
Induction Trees Using a M−of−N Concept
513
rewdlq iru wkh irupxod ri O +F2[,=
O +F@[ , @
SO SN no .4o SN n. o@4 n@4 n@4 N S n. 4
n@4
Ehfdxvh vdpsoh frxqwv kdyh d pxowlqrpldo glvwulexwlrq/ wkh fruuhvsrqglqj sursruwlrqv duh dv|pswrwlfdoo| qrupdo dv zhoo dv O +F@[ ,1 Zh kdyh fdofxodwhg lwv dv|pswrwlf yduldqfh e| dsso|lqj wkh ghowd phwkrg ^9`1 Wkh vshflf ydoxhv ri hqdeoh xv wr qg qrw rqo| wkh xvxdo phdvxuhv ri dvvrfldwlrq edvhg rq Vkdqqrq hqwurs| +iru @4/ e| sdvvdjh wr wkh olplw, dqg Jlql txdgudwlf hqwurs| +iru @5,/ exw dovr wkh rqhv edvhg rq wkh qxpehu ri fdwhjrulhv +@3, ru rq Ekdwwdfkdu|d d!qlw| +@318,1 Wkxv/ e| prgxodwlqj wkh sdudphwhu gxulqj wkh h{shulphqwdwlrq/ zh fdq v|qwkhwlfdoo| frpsduh wkh h!flhqf| ri xvxdo dvvrfldwlrq frh!flhqwv1
7
H{shulphqw = Frqfhsw/ Ohduqlqj Vhw Vl}h dqg Qrlvh
Wdnlqj lqwr dffrxqw rxu suhfhghqw sdudjudskv/ wkh xvh ri duwlfldo gdwdvhwv jlyhv vhyhudo dgydqwdjhv = zh fdq frqwuro wkh gl!fxow| dqg wkh suredelolw| glvwulexwlrq ri wkh frqfhsw wr ohduq/ ghqh wkhruhwlfdoo| wkh plqlpxp vl}h ri wkh ohduqlqj vhw dqg lqwurgxfh frqwuroohg udqgrp qrlvh1 714
Frqfhsw wr Ohduq
Lq rxu h{shulphqwdwlrq/ zh vwxg| yh errohdq frqfhswv ri jurzlqj frpsoh{lw|/ zklfk duh 60ri0Q frqfhswv zlwk Q @ 6> 7> ===> :1 Wkh| duh frqvwuxfwhg e| jhqhudw0 lqj errohdq dwwulexwhv/ wkdw wdnh wkh ydoxh WUXH zlwk wkh suredelolw| s ghqhg lq wdeoh 5/ dqg wkhq dsso|lqj wkh ixqfwlrq wr ohduq lq rughu wr rewdlq wkh ydoxh ri wkh yduldeoh wr suhglfw1 Wkhvh ixqfwlrqv duh glvmxqfwlyh qrupdo irupv/ vr wkh| duh yhu| kdug wr ohduq iru dq lqgxfwlrq wuhh1 Whq lqghshqghqw errohdq suhglfwlyh dwwulexwhv zhuh dgghg lq rxu gdwdvhwv1 Lqghhg/ li zh frqqh wr dwwulexwhv ri wkh frqfhsw ixqfwlrqv/ wkh| zloo dozd|v eh vhohfwhg/ pdnlqj eholhyh wr d idoodflrxv urexvwqhvv ^5`1 715
Vl}h ri wkh Ohduqlqj Vhw
Wkuhh sdudphwhuv duh wr eh vhw = vl}h q ri wkh wudlqlqj vhw/ vl}h n ri ydolgdwlrq vhw dqg wkh qxpehu u ri uhshwlwlrqv ri wkh h{shulphqw1 Wr whvw li wkh ydoxh ri wkh vl}h q ri wkh wudlqlqj vhw lqwhuihuh zlwk wkh rswlpdo ydoxh ri / zh wulhg glhuhqw vl}hv ri wkh wudlqlqj vhw iru hdfk frqfhsw1 Zh fdq vd| wkdw dq l0uxoh zklfk suhglfw 4 iru wkh frqfhsw kdv wkh sured0 elolw| sP +4 s,lP / zkloh dq l0uxoh zklfk suhglfw 3 iru wkh frqfhsw kdv wkh
514
R. Rakotomalala, S. Lallich, and S. Di Palma
suredelolw| slQ .P 4 +4 s,Q P .4 1 Frqvhtxhqwo|/ wkh ohdvw suredeoh uxoh ri wkh plqlpxp wuhh kdv wkh suredelolw| splq / jlyhq e| =
splq @ P lq sP +4
s,Q P > sP 4+4 s,qP .4
Vr/ dffruglqj wr wkh uhvxowv ri wdeoh 4 uhodwlyh wr wkh qxpehu ri uxohv dqg wkh suredelolw| ri wkh ohdvw suredeoh uxoh iru hdfk frqfhsw/ zh vhohfwhg irxu ydoxhv ri q iru hdfk frqfhsw/ l0h whq/ iwhhq/ wzhqw| dqg wzhqw| yh wlph wkh qxpehu ri uxohv1 Wkxv/ wkh vl}h ri ohdyhv ri rxu wuhhv lv qrw orzhu wkdq 8 h{dpsohv +3=479< 7 43 @ 8= ;:9,1 Frqfhuqlqj wkh whvw vhw/ zh jhqhudwh zlwk wkh vdph surfhvv n @ 5333 h{dp0 sohv/ zklfk lv vx!flhqw wr rewdlq dq dffxudwh hvwlpdwlrq ri wkh huuru udwh ^<`1 Dw odvw/ zh fkrrvh wr uhdol}h u @ 58 uhshwlwlrqv ri wkh h{shulphqw/ lq rughu wr eh deoh wr whvw wkh hhfw ri wkh idfwruv rq wkh jhqhudol}dwlrq dffxudf| udwh1 716
Nlqg dqg Ohyho ri Qrlvh
Wr frpsohwh wkh vwxg| ri wkh ehkdylru ri phdvxuhv/ zh lqwurgxfhg qrlvh lq rxu duwlfldo gdwdvhwv1 Rqo| wkh fodvv dwwulexwh zdv dgghg qrlvh dffruglqj wr wkh iroorzlqj udqgrp surfhgxuh = iru hdfk h{dpsoh/ wkh uhvxow ri wkh frqfhsw lv qrlvhg zlwk d suredelolw| zklfk ghshqgv rq wkh fodvv glvwulexwlrq1 Wkuhh nlqgv ri qrlvh +d/ e/ f, duh vwxglhg lq wklv sdshu1 Zh ghqrwh e| 4 dqg 5 wkh suredelolwlhv ri hdfk ydoxh ri wkh fodvv/ 4 dqg 5 wkhlu suredelolwlhv wr eh qrlvhg/ dqg wkh ryhudoo suredelolw| ri qrlvh/ @ 4 4 . 5 5 1 Zh uhfdoo wkdw lq rxu h{shulphqw/ 4 @ 3=:8 dqg 5 @ 3=581 Jlyhq wkdw qrlvlqj wkh gdwd prgli| fodvv glvwulexwlrqv/ 4 dqg 5 duh wkh suredelolwlhv ri hdfk ydoxh ri wkh fodvv diwhu qrlvh1 +d, wkh rffxuuhqfh ri qrlvh lv wkh vdph/ zkdwhyhu wkh ydoxh wdnhq e| wkh fodvv dwwulexwh1 Zh fdq frqvlghu lw dv d uhihuhqfh1 Lw rffxuv qrwdeo| lq lqgxvwuldo surfhvv zkhuh gdwd duh froohfwhg dxwrpdwlfdoo|1 4 @ 5 @ > l @ l . +qrw l
l , >
l @ 4> 5
+e, wkh rffxuuhqfh ri qrlvh lv sursruwlrqdo wr fodvv glvwulexwlrq1 Iru lqvwdqfh/ lq phglfdo whvwv/ wkh suredelolw| ri glvhdvh lv riwhq zhdn dqg wkh suredelo0 lw| ri idovh srvlwlyh lv juhdwhu wkdq wkh suredelolw| ri idovh qhjdwlyh1
l l @ 5 5 > l @ l . 4 .5
5 5 +qrw l l , > 5 5 4 5
.
l @ 4> 5
+f, wkh rffxuuhqfh ri qrlvh lv lqyhuvho| sursruwlrqdo wr wkh fodvv glvwulexwlrq1 Wklv sureohp lv yhu| kdug wr ohduq zkhq zh kdyh d yhu| xqedodqfhg fodvv glvwulexwlrq1 Lqghhg li qrlvh lv frqfhqwudwhg rq wkh uduh ydoxh/ lw lv yhu| gl!fxow wr h{fhhg wkh vlpsoh fodvvlhu frqfoxglqj dozd|v wr wkh prvw uhtxhvw ydoxh lq wkh ohduqlqj vhw1 > @ > l @ 4> 5 l @ 5 l l l
Zh lqwurgxfh qrlvh lq rxu gdwdvhwv zlwk wkh iroorzlqj surfhgxuh = iru hdfk nlqg ri qrlvh/ iru hdfk ryhudoo ohyho ri qrlvh +3/ 3=38/ 3=43/ 3=53,/ zh fdofxodwh wkh suredelolw| ri wkh h{dpsohv uhodwhg wr hdfk ydoxh ri wkh frqfhsw wr eh prglhg1 Wkhq wkh h{dpsohv duh prglhg dffruglqj wr wklv suredelolw|1 Hdfk h{dpsoh wr eh prglhg wdnhv wkh dowhuqdwlyh fodvv ydoxh1
Induction Trees Using a M−of−N Concept
8
515
Uhvxowv dqg frpphqwv
Wr fkhfn wkh ehkdylru ri jhqhudol}hg hqwurs| lq lqgxfwlrq wuhh dojrulwkp/ zh xvh wkh srsxodu F718 dojrulwkp ^43` zkhuh zh uhsodfh wkh vsolwwlqj fulwhulrq jdlq udwlr zlwk hqwurs| ri w|sh 1 Doo rwkhuv ihdwxuhv duh wkh vdph/ lq sduwlfxodu zh h{sdqg wkh pd{lpxp wuhhv ehiruh suxqlqj wkhp xvlqj wkh shvvlplvwlf huuru udwh1 Wkxv/ glhuhqfhv ehwzhhq wuhhv uho| rqo| rq wkh ydoxh ri iru wkh jhqhudol}hg hqwurs|1 Wr hydoxdwh wkh shuirupdqfh ri phdvxuhv/ zh xvh wkh jhqhudol}dwlrq huuru udwh1 Frpsdulvrqv irfxv rq ydoxhv ri @ 4=3> 4=8> 5=3> 6=3> 8=31 Lq wkh fdvh ri hdfk frqfhsw zh kdyh shuiruphg dq DQRYD +Dqdo|vlv ri yduldqfh, lq rughu wr whvw wkh pdlq hhfwv dqg wkh lqwhudfwlrqv ri wkh glhuhqw idfwruv rq wkh huuru udwh1 Ehfdxvh zh kdyh d yhu| elj qxpehu ri whvwv/ zh frqvlghu d uhvxow wr eh d vljqlfdqw rqh rqo| li lwv s0ydoxh lv orzhu wkdq 3=334 +, ru frpsulvhg ehwzhhq 3=334 dqg 3=34 +,= Zh glvwlqjxlvk wzr nlqgv ri uhvxowv/ uvwo| wkrvh frqfhuqlqj wkh fdvh ri qr qrlvh lq jhqhudwhg gdwdvhw xvhg iru ohduqlqj/ vhfrqgo| wkrvh frqfhuqlqj wkh fdvh ri qrlvhg gdwd1 814
Zlwkrxw qrlvh
Dv lw zdv iruhvhhdeoh/ wkh vl}h q ri wkh wudlqlqj vhw lv vljqlfdqw + , iru hdfk frqfhsw P0ri0Q/ Q @ 6> 7> ===> :1 Li wkh vl}h ri wkh ohduqlqj vhw lv vx!flhqw/ wuhhv duh deoh wr ohduq rxu ixqfwlrq1 Zh qrwh wkdw zh gr qrw ohduq dozd|v wkh uljkw frqfhsw1 Lq idfw wkh huuru udwh lqfuhdvhv zkhq wkh udwlr vl}h ri ohduqlqj vhw dqg frqfhsw frpsoh{lw| ghfuhdvhv1 Wkh ydoxh ri lv pruh dqg pruh vljqlfdqw dv wkh frpsoh{lw| ri wkh P0ri0Q frqfhsw lqfuhdvhv= qr vljqlfdqw iru Q @ 6> 7> 8/ yhu| vljqlfdqw +, iru Q @ 9 dqg yhu| kljko| vljqlfdqw + , iru Q @ :1 Wdeoh 5 vkrzv wkdw zkhq wkh frpsoh{lw| lqfuhdvhv/ wkh ehvw ydoxhv ri duh dw wkh rssrvlwh h{wuhphv + @ 4 dqg @ 8, dqg wkh shuirupdqfh ri wkh lqwhuphgldwh ydoxhv ri +h1j1 @ 4=8> 5> 6, ghwhulrudwhv1 Shukdsv zh fdq qg h{sodqdwlrq ri wkh hhfwlyhqhvv ri wkh phdvxuh zlwk wkhlu hpslulfdo yduldqfh = zkhq wkh yduldqfh ri wkh phdvxuh lv orz/ lw fdq eh ehwwhu wr fkrrvh wkh ehvw rqh dprqj fdqglgdwh dwwulexwhv/ hvshfldoo| iru uhmhfwlqj qrlv| dwwulexwhv1 Xqiruwxqdwho|/ wkh vwxg| ri wkh yduldqfh ehkdylru lv yhu| kdug khuh ehfdxvh lw uholhv rq / frqglwlrqdo dqg xqfrqglwlrqdo glvwulexwlrq/ exw dovr rq wkh qxpehu ri ydoxhv ri wkh fodvv dqg wkh vsolwwlqj dwwulexwhv +iru wkh vshfldo fdvh ri errohdq frqfhsw/ zh zrxog eh doorzhg wr vwxg| d 5 5 furvv0wdexodwlrq,1 Wklv uvw uhvxow lv yhu| lqwhuhvwlqj exw zh fdq dvn wr nqrz li lw lv xvhixo rq gdwd plqlqj sureohp1 Zh qrwh wkdw hyhq li wkh glhuhqfhv lv vwdwlvwlfdoo| vljqli0 lfdqw/ wkh| zhuh qrw sudfwlfdoo| vljqlfdqw1 Iru lqvwdqfh/ iru wkh 60ri0: frqfhsw/ huuruv ydu| iurp 3134<7 + @ 6, wr 3134:5 + @ 4,1 Lw lv reylrxv wkdw wklv lp0 suryhphqw zloo qrw eh xvhixo lq uhdo sureohp1 D ixuwkhu vwxg| rq rwkhuv frqfhswv zloo eh qhfhvvdu| wr frqup ru uhmhfw wkh zhdnqhvv ri wkhvh glhuhqfhv exw dv zh vhh ehorz/ rq uhdo gdwdvhw zklfk duh qdwxudoo| qrlv|/ zh zrqghu li lw lv uhdoo| qhfhvvdu| wr vhdufk wkh ehvw phdvxuh +wkh ehvw sdudphwhu , iru d sureohp1
516
R. Rakotomalala, S. Lallich, and S. Di Palma Wdeoh51
Huuru udwh +{ 43333, ri ghflvlrq wuhhv exlow rq gdwd zlwkrxw qrlvh Q
.q
4
418 5
6
8
6
4;
48
4;
47
7
439 437 <<
<3
<5
8
469
477 48< 48< 467
9
498
4:7 4;9 4;7 489
:
4:5
4;4 4<5 4<7 4:8
<
815 Zlwk Qrlvh
Dw wkh rssrvlwh/ lq fdvh ri qrlv| gdwd/ doo DQRYDv jlyh vlplodu uhvxowv iru s0ydoxh ri pdlq idfwru dqg wkhlu lqwhudfwlrqv/ vd|v = vl}h ri ohduqlqj vhw/ wkh ohyho ri qrlvh dqg wkhlu lqwhudfwlrq duh yhu| kljko| vljqlfdqw + , zkdwhyhu wkh frpsoh{lw| ri wkh frqfhsw P0ri0Q1 Wklv lq0 whudfwlrq uholhv rq wkh uhgxfwlrq ri wkh dfwlrq ri ohduqlqj vhw vl}h zkhq wkh ohyho ri qrlvh lqfuhdvhv1 qhlwkhu wkh ydoxh ri / qru lwv lqwhudfwlrqv zlwk wkh rwkhu idfwruv duh vljql0 fdqw1 Hvshfldoo|/ wkhuh lv qr lqwhudfwlrq ri wkh nlqg ri qrlvh zlwk wkh ohyho ri zkdwhyhu wkh ohyho ri qrlvh1
Dq lpsruwdqw uhvxow ri rxu vwxg| lv wr qrwh wkdw vwdwlvwlfdo glhuhqfhv ehwzhhq huuru udwhv dffruglqj wr wkh ydoxh ri revhuyhg lq wdeoh 5 glvdsshdu zkhq zh xvh qrlv| gdwd1
9
Frqfoxvlrq
Rxu sdshu suhvhqwv dq xqlhg iudphzrun iru vsolwwlqj phdvxuhv zklfk jhqhudol}hv wkh vwdqgdug rqh1 Wklv phwulf ghshqgv rq d sdudphwhu wkdw zh fdq ydu| wr rewdlq idprxv phdvxuhv vxfk dv Vkdqqrq hqwurs| ru Jlql lqgh{1 Qrupdol}hg phdvxuhv vxfk dv Jdlq udwlr ru Pdqwdudv glvwdqfhv fdq dovr eh ghgxfhg^:` ^43`1 Sdudphwulf lqglfdwruv/ hvshfldoo| wkh dv|pswrwlf yduldqfh/ zhuh fdofxodwhg1 Zh hydoxdwh wkh ehkdylru ri wklv jhqhudol}hg hqwurs| phdvxuh lq ghflvlrq wuhh dojrulwkp dffruglqj wr wkh sdudphwhu1 Rxu ruljlqdolw| ehvlgh suhylrxv vwxglhv ^5` ^46` lv wr xvh frqfhsw P0ri0Q zklfk surylghv d vfdoh ri ixqfwlrqv zkrvh zh frqwuro wkh jurzlqj frpsoh{lw| dqg wkh fodvv glvwulexwlrq1 Wkh uvw uhvxow ri rxu zrun lv wkdw wkh glhuhqfhv ri shuirupdqfhv ehwzhhq phdvxuhv/ zklfk duh vwdwlvwlfdoo| vljqlfdqw zkhq wkhuh lv qr qrlvh lq gdwdvhwv/ glvdsshdu zkhq wkh ohyho ri qrlvh lqfuhdvhv1 Wklv frxog h{sodlq wkh frqfoxvlrqv ri pdq| dxwkruv zkr qrwh wkdw phdvxuhv lq xhqfh wkh vl}h ri wuhhv udwkhu wkdq wkhlu shuirupdqfh ^8`1 Lqghhg/ prvw ri wkhp xvh uhdo gdwdvhwv zklfk duh qdwxudoo| qrlv|/ wkh| grq*w frqwuro fodvv glvwulexwlrq ru frqfhsw frpsoh{lw|1 Rxu vwxg|/ xvlqj v|qwkhwlf gdwdvhw/ lv pruh srzhuixo wr ghwhfw glhuhqfhv ehwzhhq phdvxuhv ehkdylru1
Induction Trees Using a M−of−N Concept
517
Krzhyhu/ dqg wklv lv rxu vhfrqg pdlq uhvxowv/ zh qrwh wkdw wklv glhuhqfhv duh vwdwlvwlfdoo| vljqlfdqw exw qrw sudfwlfdoo| vljqlfdqw/ hyhq rq duwlfldo gdwdvhw exlow iurp errohdq frqfhsw1 Zh h{shfw wkdw iru wkh prvw sduw ri uhdo gdwdvhwv zklfk duh riwhq pruh ru ohvv qrlv|/ doo phdvxuhv lvvxhg iurp jhqhudol}hg hqwurs| jlyh d jrrg dssurdfk wr vshfldol}h lqgxfwlrq wuhhv1
Uhihuhqfhv 41 O1 Euhlpdq/ M1K1 Iulhgpdq/ U1D1 Rovkhq hw F1M1 Vwrqh1 Fodvvlfdwlrq dqg Uhjuhv0 vlrq Wuhhv1 Fdoliruqld = Zdgvzruwk Lqwhuqdwlrqdo/ 4<;71 51 Z1 Exqwlqh hw W1 Qleohww1 D ixuwkhu frpsdulvrq ri vsolwwlqj uxohv iru ghflvlrq wuhh lqgxfwlrq1 Pdfklqh Ohduqlqj / ;=:8;8/ 4<<51 61 O1 D1 Jrrgpdq hw Z1 K1 Nuxvndo1 Phdvxuhv ri dvvrfldwlrq iru furvv fodvvlfdwlrqv1 Mrxuqdo ri wkh Dphulfdq Vwdwlvwlfdo Dvvrfldwlrq / 6:=87448/ 4<871 71 O1F1 Nlqjvodqg1 Wkh hydoxdwlrq ri phglfdo h{shuw v|vwhp = H{shulhqfh zlwk wkh dl2ukhxp nqrzohgjh0edvhg frqvxowdqw lq ukhxpdwrorj|1 Lq Surfhhglqjv ri wkh Qlqwk Dqqxdo V|psrvlxp rq Frpsxwhu Dssolfdwlrqv lq Phglfdo Fduh / 4<;81 81 L1 Nrqrqhqnr1 Rq eldvhv lq hvwlpdwlqj pxowl0ydoxhg dwwulexwhv1 Lq Surf1 Lqw1 Mrlqw Frqi1 Rq Duwlfldo Lqwhooljhqfh LMFDL*<8 / sdjhv 43674373/ 4<<81 91 V1 Odoolfk1 Frqfhsw gh glyhuvlwh hw dvvrfldwlrq suhglfwlyh1 Lq Surfhhglqjv ri [[[Lhphv Mrxuqhhv gh Vwdwlvwltxh / sdjhv 9:69:9/ Pd| 4<<<1 :1 U1O1 Gh Pdqwdudv1 D glvwdqfh0edvhg dwwulexwhv vhohfwlrq phdvxuhv iru ghflvlrq wuhh lqgxfwlrq1 Pdfklqh Ohduqlqj / 9=;4<5/ 4<<41 ;1 S1 Pxusk| hw P1 Sd}}dql1 Lg50ri06 = Frqvwuxfwlyh lqgxfwlrq ri p0ri0q frqfhswv iru glvfulplqdwruv lq ghflvlrq wuhhv1 Whfkqlfdo Uhsruw <406:/ Ghsduwphqw ri Lqirupd0 wlrq dqg Frpsxwhu Vflhqfh 0 Xqlyhuvlw| ri Fdoliruqld dw Luylqh/ 4<<41 <1 J1 Sdjdoor hw G1 Kdxvvohu1 Errohdq ihdwxuh glvfryhu| lq hpslulfdo ohduqlqj1 Pdfklqh Ohduqlqj / 8=:4< 4<<31 431 M1 U1 Txlqodq1 F718= Surjudpv iru Pdfklqh Ohduqlqj 1 Prujdq Ndxipdqq/ Vdq Pdwhr/ FD/ 4<<61 441 U1 Udnrwrpdodod1 Judskhv g*Lqgxfwlrq1 SkG wkhvlv/ Xqlyhuvlw| Fodxgh Ehuqdug 0 O|rq 4/ Ghfhpehu 4<<:1 451 O1 Zhkhqnho1 Rq xqfhuwdlqw| phdvxuhv xvhg iru ghflvlrq wuhh lqgxfwlrq1 Lq Sur0 fhhglqjv ri Lqir1 Surf1 dqg Pdqdj1 Ri Xqfhuwdlqw| / sdjhv 74674;/ 4<<91 461 D1S1 Zklwh hw Z1]1 Olx1 Eldv lq lqirupdwlrq0edvhg phdvxuhv lq ghflvlrq wuhh lqgxfwlrq1 Pdfklqh Ohduqlqj / 48+6,=65465 4<<71 471 G1D1 ]ljkhg/ M1S1 Dxud| hw J1 Gxux1 Vlslqd = Phwkrgh hw orjlflho 1 Odfdvvdjqh/ 4<<51
Discovering Rules in Information Trees Zbigniew W. Ras Univ. of North Carolina, Comp. Science, Charlotte, N.C. 28223, USA and Polish Academy of Sciences, Comp. Science, 01-237 Warsaw, Poland
Abstract: The notion of an information tree has been formulated and investigated in 1982-83 by K.Chen and Z. Ras [1,2,3]. In [1] we have defined a notion of an optimal tree (the number of edges is minimal) in a class of trees which are semantically equivalent and shown how optimal trees can be constructed. Rules can be discovered almost automatically from information trees. In this paper we propose a formalized language used to manipulate information trees, give its semantics and a complete and sound set of axioms. This complete and sound set of axioms is needed for discovering rules in information trees assuming that conditional attributes and the decision attribute are not arbitrary but they are both provided by the user.
1 Introduction Information trees and their query answering systems have been proposed and investigated in 1982-83 by K.Chen and Z. Ras [1,2,3]. The main difference between information trees and decision trees [5,6] lies in the interpretations and applications. Their structures and methods of constructions are often transferable to each other. In decision trees nodes are labeled by queries, edges by responses to these queries and leaves by some objects uniquely identified by the path from the root to the leaf. In an information tree, internal nodes are labeled by attributes and terminal nodes by sets of objects. A path from the root to a terminal node is interpreted as a description of objects labeling that node. In [1] we proposed a heuristic polynomial algorithm to construct a minimal information tree with respect to the storage cost (storage cost was defined by us as a number of edges in a tree). It is worth to note that the problem of constructing even a minimal binary tree with respect to the storage cost is to be known as NP-complete. Information trees are quite useful in KDD area. Many certain rules can be discovered from the information tree in 0(k) steps where k is the height of the tree (number of attributes). Many possible rules can be generated in 0(n) steps where n is the number of nodes in a tree. The label of any edge (let us say e) in an information tree can be seen as a decision value of a rule. The conjunction of labels of all edges forming a path from the root of the tree to the edge e gives the condition part of the rule. Questions we would like to state in this paper say: can we discover rules from an information tree if condition attributes and a decision attribute are given by a user? Can we manipulate information trees algebraically into forms that are more convenient for a rule discovery than the initial trees? To answer these questions we devise a representation of information trees as terms in a formal theory (theory of information trees) and present rules to manipulate J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 518-523, 1999. © Springer-Verlag Berlin Heidelberg 1999
Discovering Rules in Information Trees
519
them. Next we show that the rules proposed by us are complete in constructing equivalent information trees. We also give a strategy for constructing equivalent information trees to a given information tree which are more convenient for discovering rules when conditional and decision attributes are given.
2 Basic Definitions In this section we recall the definition of an information tree. Next we introduce the notion of equivalence of two information trees and the notion of one tree being covered by another. Let U be a finite set of attributes called the universe. For each a ∈ U , let VA be the set of attribute values of A. We assume that VA is finite, for any A ∈ U. By an information tree on the universe U, we mean a tree T=(N,E) such that: (a) each interior node is labeled by an attribute from U, (b) each edge is labeled by an attribute value of the attribute that labels the initial node of the edge, (c) along a path, all nodes (except the leaf) are labeled with different attributes, (d) all edges leaving a node are labeled with different attribute values (of the attribute that labels the node), (e) a subset Nl of N is given, each node in Nl is called an object node.
Figure 1. Information Tree So, an information tree can be thought of as a triple (T, l, Nl ) where T = (N,E) is a tree, Nl ⊆ N and l is the labeling function from Nl ∪E into U ∪ ( ∪{VA : A ∈U}). The set Nl is a set of internal nodes in T.
520
Z.W. Ras
Let m be an object node, n1 , n2 , n3 ,…, nk with nk = m be the path from the root n1 to m. Objects node m determines an object type 0(m)={[l(ni),l([ni , ni+1] )]: i=1,2,…,k-1 } where l(ni ) is the label of the node ni , which is an attribute. l([ni , ni+1] ) is the label of the edge [ni , ni+1] which is an attribute value of the attribute l(ni ). An information tree S = ((N,E), l, Nl ) determines a set of object types 0(S) = {0(m): m ∈ Nl }. Two information trees S1 , S2 are said to be equivalent if and only if 0(S1) = 0(S2). If 0(S1) ⊆ 0(S2), we say that S1 is covered by S2. Figure 1 represents an information tree S = ((N,E), l, Nl ) , where N={a,b,c,d,e, f,g,h,i,j}, E = {[a,b], [b,e], [b,f], [a,c], [a,d], [d,g], [g,i], [g,j], [d,h]}, l(a) = color, l([a,b]) = red , l([a,c]) = blue , l([a,d]) = yellow , l(b) = sex , l([b,e]) = male , l([b,f]) = female , l(d) = size , l([d,g]) = large , l([d,h]) = small , l(g) = sex , l([g,i]) = male , l([g,j]) = female , Nl = {e,f,c,g,i,j,h}. The object type of the node f is {[color, red], [sex, female]}. This information tree classifies seven different object types (determined by the seven nodes in Nl ). Assume that {[n,n1], [n,n2],…, [n,nk]} is the set of all outgoing edges from the node n. Node n is called a-active (a is an attribute) if l(n1) = l(n2) = l(nk) = a. Also, we say that k is the degree of the node n. The degree of an information tree is k if no node in the tree has the degree greater than k and there is a node in the tree of the degree k. Information tree is called semi-complete if on any path from the root to a leaf of the tree the same attributes are listed (their order is immaterial). Lemma 1. Assume that information tree is semi-complete. If a child of a node n is a leaf, then the parent of n is a-active for some attribute a. Rules can be automatically generated from an information tree. For instance, let us assume that object node i in the information tree represented by Figure 1 contains 6 objects. Also, assume that object node j contains 4 objects and the object node h contains 3 objects. The following rules can be automatically generated: 1. (color = yellow) ∧ (size = large) → (sex = male) with confidence 6/10 2. (color = yellow) → (size = large) with confidence 10/13. However, if the user looks for a definition of attribute color in terms of attributes size and sex, the tree in Figure 1 has to be transformed into an equivalent tree which will be more suitable for the required knowledge extraction.
3 Formal Theory of Information Trees
In this section we define a formal syntax for representing information trees, which is motivated by the LISP representation in [3] and the co-algebra representation in [4]. We will introduce axioms and rules of inference for the formal theory of information trees.
Let us use the information tree from Figure 1 as a starting point in this section. Assume Vcolor = {red, blue, green, yellow} is ordered as red, blue, green, yellow; Vsex = {male, female} is ordered as male, female; Vsize = {large, medium, small} is ordered as large, medium, small. Then the term color(sex(-, -), -, *, size(sex(-, -), *, -)) retains all the information about the information tree from Figure 1. An underline means the node is an object node. A star means the corresponding subtree is empty. To describe an information tree, we use the general scheme attribute(subtree1, subtree2, …, subtreen), assuming the attribute has n different values.
Now we are ready to introduce a formal theory of information trees over an attribute universe U. There are two constant symbols *, -, which have the standard interpretation: the empty tree and the single node tree (a tree with one node being an object node), respectively. For each attribute A in U with |VA| = n (where |VA| denotes the number of attribute values of A), there are two n-ary function symbols fA and f̲A (the second one written with an underline). The standard interpretation of fA(t1, t2, …, tn) is the information tree with the root labeled A (drawn as a circle) and the next level subtrees t1, t2, …, tn. The standard interpretation of f̲A(t1, t2, …, tn) is the information tree with the root labeled A which is an object node (drawn as a square) and next level subtrees t1, t2, …, tn. The function symbol fA is called a type 0 function symbol, and f̲A is called a type 1 function symbol. There is one predicate symbol ≡. The statement t1 ≡ t2 in the standard interpretation says that t1 and t2 are equivalent.
Terms are defined by the following recursive definition:
DEFINITION OF TERMS. (a) constant symbols are terms; (b) if g is an n-ary function symbol and t1, t2, …, tn are terms not containing g or its dual type function symbol, then g(t1, t2, …, tn) is a term.
Intuitively, each term represents an information tree. If a term does not contain any type 1 function symbol or the constant symbol -, it is called a null object term. The nested level h(t) of a term t is defined as follows: h(*) = h(-) = 0, h(fA(t1, t2, …, tn)) = h(f̲A(t1, t2, …, tn)) = max{h(ti) : i ≤ n} + 1. Let t be a term; we use I(t) to denote the standard interpretation of t, i.e., the information tree that t represents. We have: h(t) = n if and only if the height of I(t) is n. If t is a term, then by t̲ we mean a new term defined as follows: if t = *, then t̲ = -; if t = fA(t1, t2, …, tn), then t̲ = f̲A(t1, t2, …, tn); otherwise t̲ = t.
The formulas are defined by the following recursive definition: (a) t1 ≡ t2 is a formula for any two terms t1, t2; (b) p ∧ q, p ∨ q, p → q, p ↔ q, ¬p are formulas if p, q are formulas.
Our formal theory has the following axiom schemata:
A1. (reflexive) t ≡ t is an axiom for any term t,
A2. (nullity) * ≡ t for any null object term t,
A3. (change the order of branching) f(g(t1,1, t1,2, …, t1,m), g(t2,1, t2,2, …, t2,m), …, g(tn,1, tn,2, …, tn,m)) ≡ g(f(t1,1, t2,1, …, tn,1), f(t1,2, t2,2, …, tn,2), …, f(t1,m, t2,m, …, tn,m)) is an axiom for any two type 0 function symbols f, g, where f is n-ary, g is m-ary, and for any n·m terms ti,j (i ≤ n, j ≤ m) not containing f, g or their type 1 duals,
A4. p → (q → p) for any formulas p, q,
A5. (p → (q → r)) → ((p → q) → (p → r)) for any formulas p, q, r,
A6. (¬p → ¬q) → (q → p) for any formulas p, q.
The rules of inference for our formal system are the following:
R1. from p → q and p we deduce q, for any formulas p, q,
R2. from t1 ≡ t2 we deduce t(t1) ≡ t(t2), where t(t1) is a term containing t1 and t(t2) comes from t(t1) by replacing some of the occurrences of t1 with t2,
R3. from t1 ≡ t2 we deduce t̲1 ≡ t̲2.
Let t be a term; we shall use I(t) to denote the information tree represented by t under the standard interpretation. Then we have the following completeness theorem.
Theorem 1. t1 ≡ t2 is provable in the theory if and only if I(t1) is equivalent to I(t2).
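As an illustration of the term language (a hypothetical encoding chosen for this sketch, not the authors' implementation), terms can be stored as tuples and axiom A3 applied as a rewrite that exchanges the attributes at two adjacent levels of branching.

```python
# Terms as tuples: ("*",) = empty tree, ("-",) = single object node,
# (attr, is_object, children) = f_attr(children); is_object marks the type 1 dual.
STAR, DASH = ("*",), ("-",)

def axiom_A3(term):
    """Exchange the order of branching at the root (axiom A3), if applicable:
    f(g(t11..t1m), ..., g(tn1..tnm)) -> g(f(t11..tn1), ..., f(t1m..tnm)),
    provided f and g are type 0 and all next-level subterms share the root g."""
    if term in (STAR, DASH) or term[1] != 0:
        return term
    f, _, rows = term
    if not all(len(r) == 3 and r[1] == 0 and r[0] == rows[0][0] for r in rows):
        return term                      # A3 is not applicable here
    g = rows[0][0]
    grid = [r[2] for r in rows]          # n rows of m subterms each
    transposed = list(zip(*grid))        # m columns of n subterms each
    return (g, 0, tuple((f, 0, tuple(col)) for col in transposed))

# Toy example (attribute arities reduced for brevity):
# color(sex(-,-), sex(-,-))  ->  sex(color(-,-), color(-,-))
t = ("color", 0, (("sex", 0, (DASH, DASH)), ("sex", 0, (DASH, DASH))))
print(axiom_A3(t))
```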
4 Discovering Rules in Information Trees
In this section we suggest a strategy for discovering rules in information trees when condition and decision attributes are provided by the user. We start with an example of an information tree represented by the term A(B(C(-,-), C(-,-)), *, C(B(-,*), *), C(*, B(-,*))). Assume now that our plan is to describe A (the decision attribute) in terms of B and C (the classification attributes). We use axioms A2, A3 and rule R2 repeatedly to replace the term above by a new equivalent term B(C(A(-,*,-,*), A(-,*,*,-)), C(A(-,*,*,*), A(-,*,*,*))). The goal of our strategy is to move attribute A below all classification attributes. In the case of attributes B and C, we prefer to place B above C because there is an equal number of object nodes having property c1 and c2, whereas there are two object nodes having property b2 and four object nodes having property b1. Now, if a node labeled by attribute A has only one outgoing edge, then the edges along a path from the root to that node define the classification part of a rule. In our example we get two rules: b2 ∧ c1 → a1 and b2 ∧ c2 → a1.
Now, we apply axioms A2, A3 and rule R2 again to move up the nodes labeled by attribute A, assuming that these nodes both currently and in the resulting tree have only one outgoing edge. The term B(C(A(-,*,-,*), A(-,*,*,-)), A(C(-,-),*,*,*)) represents the final resulting information tree (called [A;B,C]-optimal), which is equivalent to the initial tree. We have only one rule describing A in terms of B and C, which is: b2 → a1. Clearly this rule is optimal.
Lemma 2. Assume that an information tree is semi-complete and its degree is s. Let n1, n2, …, nk be all children of the node n and l(n1) = a. The problem of converting n to an a-active node is in the worst case O(s^k), where k is the height of the complete subtree with root n1. We count here the number of times the rules of inference are applied.
5 Conclusion
Information trees investigated in this paper satisfy the assumption that on the path from the root of a tree to a leaf there cannot be two nodes labeled by the same attribute. Trees allowing repeated attributes on a path from the root to a leaf were investigated by Cockett [4] and used in the implementation of the CASCADE system. The formal theory of information trees and its completeness theorem with respect to the predicate ≡ allow us to manipulate information trees using axioms A2, A3 and rules R2, R3 while preserving their semantics. Also, we know that any two information trees which are semantically equivalent can be transformed into one another using only axioms A2, A3, A4, A5, A6 and rules R1, R2, R3. Assume that a quest q, which requires finding an optimal description of an attribute A in terms of attributes A1, A2, …, Ak, queries an information tree T. Our goal is to find an [A; A1, A2, …, Ak]-optimal information tree which is semantically equivalent to T. The example in the previous section gives some ideas of how such trees can be constructed using the formal theory of information trees.
References
1. Chen, K., Ras, Z.W., Homogeneous information trees, Fundamenta Informaticae, 8, 1985, 123-149.
2. Chen, K., Ras, Z.W., DDH approach to information systems, Proceedings of the 1982 CISS in Princeton, N.J., 521-526.
3. Chen, K., Ras, Z.W., Dynamic hierarchical data bases, Proceedings of the 1983 ICAA in Taipei, Taiwan, 450-456.
4. Cockett, R., The algebraic co-theory of decision processes, Technical Report CS-84-58, University of Tennessee.
5. Quinlan, J.R., Induction of decision trees, Machine Learning, 1, 1986, 81-106. Reprinted in J.W. Shavlik & T.G. Dietterich (Eds.), Readings in Machine Learning, San Francisco, CA, Morgan Kaufmann, 1990.
6. Quinlan, J.R., Generating production rules from decision trees, Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87), Morgan Kaufmann, 1987, 304-307.
Mining Text Archives: Creating Readable Maps to Structure and Describe Document Collections Andreas Rauber and Dieter Merkl Institute of Software Technology, Vienna University of Technology www.ifs.tuwien.ac.at/∼andi www.ifs.tuwien.ac.at/∼dieter
Abstract. With the ever-growing amount of unstructured textual data on the web, mining these text collections is of increasing importance for the understanding of document archives. In particular, the self-organizing map has been shown to be very well suited for this task. However, the interpretation of the resulting document maps still requires a tremendous effort, especially as far as the analysis of the features learned and the characteristics of identified text clusters are concerned. In this paper we present the LabelSOM method which, based on the features learned by the map, automatically assigns a set of keywords to the units of the map to describe the concepts of the underlying text clusters, thus making the characteristics of the various topical areas on the map explicit.
1 Introduction
Mining unstructured text data requires the analysis of very high-dimensional data spaces, which makes it a challenging application arena for neural networks. In particular the self-organizing map (SOM) [2], an unsupervised neural network model, has been experiencing increased popularity in a wide range of application arenas and is frequently used to produce a topologically ordered mapping of high-dimensional data spaces such as encountered in text document collections, cf. [1,3,5]. Among its many benefits we just refer to its stability and its ability to provide an intuitive visualization of high-dimensional data by producing a topology-preserving mapping from a high-dimensional input space to a usually two-dimensional map space. However, interpreting the resulting trained maps is not as intuitive as it could be. This is in spite of the recent advance in sophisticated cluster visualization methods for the self-organizing map like the U-Matrix [8], Adaptive Coordinates or Cluster Connections techniques [4]. While all of these methods focus on the visualization of cluster structure, it still remains a tedious task to interpret the mapping of the SOM as such, i.e. to analyze which attributes were relevant for a particular mapping, to describe the characteristics of the clusters identified or, in the case of text mining, to reveal the concepts of the various text document clusters. When we look at present applications of the SOM, we usually find it labeled manually, i.e. after inspection of the trained map a set of keywords is assigned
to each unit or cluster to provide the user with some hints on the contents of the map. This process, obviously, is highly labour intensive by requiring manual inspection of the data mapped onto the units. In this paper we present our LabelSOM method to automatically assign keywords to the units of a trained self-organizing map. We demonstrate the benefits of this method by using it to describe the topics of documents clustered by a self-organizing map. As a sample document archive we use the classic TIME Magazine article collection consisting of articles from the TIME Magazine from the 1960's. The LabelSOM method allows us to automatically describe the categories of documents using the features learned by the SOM and thus assists the user in understanding the data collection presented by the map.
2 The Self-Organizing Map and LabelSOM
The SOMLib library [6] is based on the self-organizing map [2] (SOM), a popular unsupervised neural network model. The SOM abstracts from the presented input data to provide a topology-preserving mapping from a high-dimensional input space to a usually two-dimensional output space. It consists of a 2-dimensional grid of units with n-dimensional weight vectors. During the training process, input signals are presented to the map in random order. An activation function is used to determine the winning unit as the unit showing the highest activation. Next, the weight vectors of the winner and its neighboring units are modified following some learning function to represent the input signal more closely. The result of this training process is a map providing a topology-preserving mapping insofar as similar input data are located close to each other on the 2-dimensional map display. The weight vectors, in turn, resemble as far as possible the input signals for which the respective unit serves as a winner. With no a priori knowledge on the data, even obtaining information on the cluster boundaries, as provided by a number of cluster boundary analysis methods for the SOM [4,8], does not reveal information on the relevance of single attributes for the clustering and classification process. In the LabelSOM approach we determine those vector elements (i.e. features of the input space) that are most relevant for the mapping of an input vector onto a specific unit. This is basically done by determining the contribution of every element in the vector towards the overall Euclidean distance between an input vector and the winner's weight vector, which forms the basis of the SOM training process. The LabelSOM method is built upon the observation that, after SOM training, the weight vector elements resemble as far as possible the corresponding input vector elements of all input signals that are mapped onto this particular unit, as well as, to some extent, those of the input signals mapped onto neighboring units. Vector elements having about the same value within the set of input vectors mapped onto a certain unit describe the unit insofar as they denote a common feature of all data signals of this unit. If a majority of input signals mapped onto a particular unit exhibit a highly similar input vector value for a particular feature, the corresponding weight vector value will be highly similar as
well. We can thus select those weight vector elements, which show, by and large, the same vector element value for all input signals mapped onto a particular unit, to serve as a descriptor for that very unit. This is done by calculating the so-called quantization error vector. It is computed for every unit i as the accumulated distance between the weight vector elements of all input signals mapped onto unit i and the unit's weight vector elements. More formally, this is done as follows: Let Ci be the set of input patterns xj = <ξj1, ..., ξjn> ∈ R^n mapped onto unit i. Summing up the distances for each vector element k over all the vectors xj (xj ∈ Ci) and the corresponding weight vector mi = <µi1, ..., µin> yields a quantization error vector qi for every unit i (Equation 1).

q_ik = Σ_{xj ∈ Ci} (µik − ξjk)²,   k = 1, ..., n.   (1)
The quantization error for each individual feature serves as a guide for its relevance as a class label. Selecting those weight vector elements that exhibit a corresponding quantization error close to 0 thus results in a list of attributes that are shared by all input signals on the respective unit and thus describe the characteristics of the data on that unit. These attributes thus serve as candidate labels for regions of the map for data mining applications. Based on the ranking provided by the quantization error vector, we can decide either to select a set of labels exhibiting a quantization error value below a certain threshold τ1, or simply to choose a set of up to n labels for every unit. While this selection of labels may be used for standard data mining applications, in the text mining arena we are usually faced with a further restriction. Due to the high dimensionality of the vector space and the characteristics of the tf × idf representation of the document feature vectors, we usually find a high number of input vector elements that have a value of 0, i.e. there is a large number of terms that are not present in a group of documents. These terms obviously yield a quantization error value of 0 and would thus be chosen as labels for the units. Doing so would result in labeling the units with attributes that are not present in the data on the respective unit. While this may be useful for some data analysis tasks, where even the absence of an attribute is a distinctive characteristic, it is definitely not the goal in text mining applications, where we want to describe the present features that are responsible for a certain clustering rather than describe a cluster via the features that are not present in its data. Hence, we need to determine those vector elements from each weight vector which, on the one hand, exhibit about the same value for all input signals mapped onto that specific unit and, on the other hand, have a high overall weight vector value indicating their importance. To achieve this we define a threshold τ2 in order to select only those attributes that, apart from having a very low quantization error, exhibit a corresponding weight vector value above τ2. In these terms, τ2 can be thought of as indicating the minimum importance of an attribute with respect to the tf × idf representation to be selected as a label.
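A minimal sketch of this labeling step (our illustration; the function name, the toy data and the threshold value are assumptions) computes the quantization error vector of Equation 1 for one unit and keeps only terms with a small error and a weight above τ2.

```python
import numpy as np

def label_unit(weight, mapped_inputs, n_labels=7, tau2=0.05):
    """LabelSOM-style labels for one unit (illustrative sketch).

    weight        : (n,) weight vector of the unit
    mapped_inputs : (m, n) tf-idf vectors of the documents mapped onto the unit
    Returns indices of up to n_labels terms, ranked by quantization error,
    keeping only terms whose weight exceeds tau2 (so absent terms are not chosen)."""
    q = ((mapped_inputs - weight) ** 2).sum(axis=0)   # Equation 1, per feature
    candidates = [k for k in np.argsort(q) if weight[k] > tau2]
    return candidates[:n_labels]

# Toy usage with 5 "terms"; real maps use thousands of tf-idf dimensions.
w = np.array([0.8, 0.0, 0.4, 0.02, 0.6])
docs = np.array([[0.7, 0.0, 0.5, 0.0, 0.6],
                 [0.9, 0.0, 0.3, 0.1, 0.6]])
print(label_unit(w, docs, n_labels=3))
```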
Fig. 1. a) SOM with labels; b) Clusters identified by labels.
3 Experimental Results
For the following experiments we use the TIME Magazine article collection. It consists of 420 articles from the TIME Magazine dating from the early 1960's. To be used for SOM training, the articles are transformed into a vector representation following the vector space model of information retrieval. We use full-term indexing to extract a list of all words present in the document collection while applying some basic stemming rules. Furthermore, we eliminate words that appear in more than 90% or in fewer than 3 articles, with this rule saving us from having to specify language- or content-specific stop word lists. The documents are thus represented by a set of vectors using a tf × idf, i.e. term frequency times inverse document frequency, weighting scheme [7]. This weighting scheme assigns high values to terms that may be considered important in terms of describing the contents of a document and discriminating between the various articles. For the 420 TIME Magazine articles, the indexing process identified 5923 terms, thus describing the articles by 420 feature vectors of dimensionality 5923. These feature vectors are further used to train a self-organizing map consisting of 10 × 15 units. The resulting SOM is depicted in Figure 1a where, due to space considerations, instead of listing the articles that are mapped onto a particular unit, we rather give the number of documents mapped onto the units. The SOM provides an ordered mapping of the articles in terms of the feature vector representation, meaning that articles exhibiting similar features and thus covering similar topics are located close to each other on the map. However, identifying which topics are covered in which part of the map requires the analysis of the individual articles. To be able to describe the clusters, we would now need to read the articles mapped onto the various units and then assign labels to the
units in order to make the map readable, which even for document collections of this size is a very time-consuming task. Figure 1a also represents the result of applying the LabelSOM method to assign keywords to the individual units. Due to the limited space we can only present the labels for a subset of all units. However, the quality of the labels for the remaining units is similar1. The automatically created labels nicely describe the various documents that are mapped onto the respective units. For example, the 4 articles mapped onto unit (9/10)2, labeled soviet, moscow, nuclear, khrushchev, treaty, pact, agreement, undergo, inspection, all deal with the talks taking place in Moscow under the supervision of Nikita Khrushchev concerning the ban of nuclear tests and the various conditions for an agreement like on-site inspections. Next to it, the 2 documents represented by unit (9/9), labeled moscow, kremlin, soviet, stalin, peking, chinese, sino, enemy, discuss the relationship between China and the Soviet Union. These units are part of a larger cluster dealing with the role of Moscow both in its relationship to the western world as well as to its neighbor China. While the labels for each unit serve as a detailed description of the articles on the particular unit, choosing only labels that are shared with neighboring units allows the creation of higher-level class labels. The units mentioned above, for example, are part of a larger cluster sharing labels like kremlin, moscow, khrushchev, soviet, china, chinese. Cluster labels like these can be found in other regions of the map, e.g. in the upper left area, where we have a cluster labeled west, germany and adenauer, for which the labels for the individual units reveal more detailed information by providing additional labels like chancellor, election, socialist, ludwig for unit (4/1) or poland, polish, refugee, bonn, berlin for unit (3/2). To give some more examples, we find a rather large cluster labeled viet, cong, diem, rocket in the lower left corner of the map dealing with the political situation and the war in Vietnam, followed by a cluster reporting on africa, with one subsection labeled ghana, nkrumah (3/15) dealing with the situation in Ghana under its dictator Kwame Nkrumah and the existence of Albert Schweitzer's primitive hospital under these conditions in the jungle of Ghana on unit (4/15), labeled nkrumah, primitive, patient, hospital, jungle. A second subsection of the Africa cluster is labeled with, amongst others, south, africa, black, white, apartheid and deals with the apartheid policy in this area. As a last example consider units (11/10) and (12/10). From the labels it is obvious that the corresponding documents deal with the British Profumo-Keeler scandal. The story behind this scandal is that British minister of defense John Profumo had an affair with Christine Keeler, who had connections to the Soviet secret service. From today's point of view it's kind of interesting how history repeats itself, as in document T170 Profumo is quoted "There has been no impropriety between myself and Miss Keeler".
1 The map is available for interactive exploration via our department webserver at http://www.ifs.tuwien.ac.at/ifs/research/ir/
2 We will use the notation (x/y) to refer to the unit located at column x and row y.
We can use the labels assigned to the various units to further identify clear cluster boundaries between topical clusters of documents. This can be achieved by identifying those labels of a unit that also appear as labels for one of the neighboring units, meaning that these units share some common features. Thus we can identify topical clusters by identifying areas in the map where the units have one or more labels in common. A representation of the resulting larger clusters with their corresponding labels is given in Figure 1b. Please note that the granularity of the clusters identified depends on the number of labels assigned to every unit and on the threshold τ2 for the minimum importance of a label in the tf × idf representation. If fewer labels are assigned, we tend to identify a larger number of small clusters comprising only units covering highly similar topics. On the other hand, when we assign a larger number of labels to every unit, we can identify fewer, but larger topical clusters at a higher level of abstraction.
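One possible implementation of this cluster identification (a sketch under the assumption that unit labels are available as sets on a rectangular grid; it is not the authors' code) connects neighboring units that share at least one label and returns the connected components as topical clusters.

```python
# Sketch: group SOM units into topical clusters when neighbors share labels.
# `labels` maps (column, row) -> set of label terms; grid size and labels are assumed.

def shared_label_clusters(labels, min_shared=1):
    clusters, seen = [], set()
    for start in labels:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            x, y = stack.pop()
            if (x, y) in seen:
                continue
            seen.add((x, y)); component.add((x, y))
            for nx, ny in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
                if (nx, ny) in labels and (nx, ny) not in seen \
                        and len(labels[(x, y)] & labels[(nx, ny)]) >= min_shared:
                    stack.append((nx, ny))
        clusters.append(component)
    return clusters

units = {(9, 9):  {"moscow", "kremlin", "soviet", "khrushchev"},
         (9, 10): {"soviet", "moscow", "nuclear", "khrushchev", "treaty"},
         (1, 1):  {"west", "german", "adenauer"}}
print(shared_label_clusters(units))   # two clusters: the Moscow units and the Adenauer unit
```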
4 Conclusions
We presented the LabelSOM method, which provides a straightforward way for assigning labels to the units of a self-organizing map. Attributes that are shared by the input signals mapped onto a particular unit are used to describe that unit. This facilitates the interpretation and understanding of the contents of a SOM and the features that it learned, which is hardly possible without the additional information provided by the automatically created labels. Using the LabelSOM method largely improves the applicability of the SOM both in the field of data mining and data representation. The clusters identified by the map not only become intuitively visible as with the enhanced cluster visualization methods developed so far, but are also characterized, allowing the interpretation of the features the SOM has learned. Producing a labeled SOM of a text collection allows it to be read and understood in a way one would expect from manually indexed document archives.
References
1. S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM—self-organizing maps of document collections. In Elsevier Publications, 1997.
2. T. Kohonen. Self-Organizing Maps. Springer Verlag, Berlin, Germany, 1995.
3. D. Merkl. Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21(1–3), 1998.
4. D. Merkl and A. Rauber. Alternative ways for cluster visualization in self-organizing maps. In Proc. of the Workshop on Self-Organizing Maps (WSOM97), Helsinki, Finland, 1997.
5. A. Rauber and D. Merkl. Creating an order in distributed digital libraries by integrating independent self-organizing maps. In Proc. Int'l Conf. on Artificial Neural Networks (ICANN'98), Skövde, Sweden, 1998.
6. A. Rauber and D. Merkl. The SOMLib digital library system. In Proc. European Conference on Digital Library Systems, Paris, France, 1999. LNCS, Springer Verlag.
7. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
8. A. Ultsch. Self-organizing neural networks for visualization and classification. In Information and Classification. Concepts, Methods and Application, 1993.
Neuro-fuzzy Data Mining for Target Group Selection in Retail Banking Johannes Ruhland and Thomas Wittmann Lehrstuhl für Wirtschaftsinformatik, Friedrich-Schiller-Universität Jena Carl-Zeiß-Str. 3, 07743 Jena, Germany Tel.: +49 - 3641 - 943313, Fax: +49 - 3641 - 943312 email: [email protected], [email protected]
Abstract: Data Mining algorithms are capable of ‘pressing the crude data coal into diamonds of knowledge’. Neuro-Fuzzy Systems (NFS) in particular promise to combine the benefits of both fuzzy systems and neural networks, and are thus able to learn IF-THEN rules, which are easy to interpret, from data. Hence, they are a very promising Data Mining approach. In this case study we describe how to support a bank's new direct mailing campaign with a Neuro-Fuzzy System, based on data about the bank's customers and their reactions to a past campaign. We will describe how Neuro-Fuzzy Systems can be used as Data Mining tools to extract descriptions of interesting target groups for this bank. We will also show which preprocessing and postprocessing steps are indispensable to make this Neuro-Fuzzy Data Mining kernel work.
1 Introduction
Companies doing database marketing experience target group selection as a core problem. At the same time they are often confronted with a huge amount of data stored in their data banks. These could be a rich source of knowledge, if only properly used. The new field of research called Knowledge Discovery in Databases (KDD) aims at closing this gap by developing and integrating Data Mining algorithms, which are capable of ‘pressing the crude data coal into diamonds of knowledge’. In this case study we describe how to support a bank's new direct mailing campaign based on data about their customers and their reactions to a past campaign. To promote a new product, one of Germany's leading retail banks had conducted a locally confined but otherwise large mailing campaign. To efficiently extend this action to the whole of the country, a forecast of reaction probability based on demographic and customer history data is required. The database consists of 186,162 cases (656 of them being respondents and the rest non-respondents) and 28 selected attributes, e.g. date of birth, sum of transactions etc., as well as the responding behaviour. Neuro-Fuzzy Systems (NFS) promise to combine the benefits of both fuzzy systems and neural networks, because the hybrid NFS architecture can be adapted and
also interpreted during learning as well as afterwards [3]. Here we focus on a specific approach called NEFCLASS (NEuro Fuzzy CLASSification), which is available as freeware from the Institute of Information and Communication Systems, Faculty of Computer Science at the University of Magdeburg, Germany (http://fuzzy.cs.uni-magdeburg.de/welcome.html). NEFCLASS is an interactive tool for data analysis and determines the correct class or category of a given input pattern by a fuzzy system that is mapped on a feed-forward three-layered Multilayer Perceptron [14]. The effectiveness of NFS has been empirically proven for small files like the iris data (see for example [2] or [4]). But when it comes to analyzing real-life problems, their main advantages, the ease of understandability and the capability to automatically process large databases, remain a much acclaimed desire rather than a proven fact. Therefore, in a previous study [7] we have identified the benefits and shortcomings of NFS for Data Mining based on a real-life data file and criteria which we have derived from practitioners' requirements. Regarding classification quality, we have found Neuro-Fuzzy Systems like NEFCLASS as good as other algorithms tested. Their advantages are ease of interpretation (fuzzy IF-THEN rules) and their ability to easily integrate a priori knowledge to enhance performance and/or classification quality. However, two severe problems may jeopardize their success, especially for large databases: the inability to handle missing values and to identify the most relevant attributes for the rules, which leads to a combinatorial explosion in both run time and in breadth and depth of the rule base. It has turned out that pre- and postprocessing efforts are indispensable for NFS (like most Data Mining algorithms) to be adequate for Data Mining. The Knowledge Discovery in Databases paradigm as described by [1] offers a conceptual framework for the design of a knowledge extracting system that consists of preprocessing steps, a Data Mining kernel and postprocessing measures.
2 Data Preprocessing
Missing values can cause severe problems for Data Mining. This holds true for Data Warehouses in particular, where data are collected from many different, often heterogeneous sources. Different solutions have been developed in the past to solve this problem. To us, the most promising approach is to explicitly impute the missing data. We have tried to use the information of a decision tree for imputation. This decision tree is developed by the C5.0 algorithm [5], which is by far more efficient than traditional clustering approaches. We interpret the leaves of the tree as homogeneous groups of cases. To put widespread experience in a nutshell, this method can efficiently and robustly model data dependencies. When building the tree, the algorithm can also utilize cases with missing values by making assumptions about the distribution of the variable affected, which leads to the fractional numbers in the leaves, as shown in Figure 1 [5]. To use this tree for imputation we first have to compute a quality score for each eligible imputation value (or imputation value constellation in case more than one attribute is missing), which will be based on the similarity between the case with the
missing data and each path down the tree to a leaf. In detail, we compute similarity as the weighted sum of the distances at every node (top nodes receive a higher weight), including a proxy node for the class variable to generate absolutely homogeneous end leaves. This score is multiplied by a "significance value", a heuristic quality index considering the length of the path and the total number of cases in the leaf.
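The following sketch illustrates the idea of the path-based quality score (a simplification of our own; the concrete weighting scheme and the significance heuristic are assumptions, not the authors' exact formulas).

```python
# Sketch of a path-based quality score for decision-tree imputation.
# A path is a list of (attribute, value) tests plus the number of cases in its leaf.

def path_quality(case, path, leaf_size, total_cases):
    """Weighted agreement between an incomplete case and one root-to-leaf path."""
    score, depth = 0.0, len(path)
    for level, (attribute, value) in enumerate(path):
        weight = depth - level                     # top nodes weigh more (assumed scheme)
        observed = case.get(attribute)             # None for a missing attribute
        if observed is not None:
            score += weight * (1.0 if observed == value else 0.0)
    significance = (leaf_size / total_cases) / depth   # assumed heuristic index
    return score * significance

case = {"age group": 3, "profession group": None}      # profession group is missing
paths = [([("age group", 3), ("profession group", 1)], 11, 40),
         ([("age group", 3), ("profession group", 2)],  4, 40),
         ([("age group", 1)],                           9, 40)]
scored = [(path_quality(case, p, n, total), p[-1]) for p, n, total in paths]
print(max(scored))   # the best-scoring path suggests the imputation value
```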
age group 3
11.0 11.0 profession group
class: bad customer
1
1 2.3 1.3
2 0.2 1.0
3
2
6.5 2.3 age group 3 4.0 0.0
2.3 0.3
2.2 8.4
age group 1 2 0.5 good customer 1.0 4.1 1.0 bad customer
3 0.7 3.3
Fig. 1. C5.0-decision-tree based imputation.
Fig. 2. Quality plot and defuzzification by Global Max Dice.
Secondly, we have to identify the best imputation value (constellation) by the quality scores of all possible imputation values. To visualize this problem one can draw an (n+1)-dimensional plot for a case with n missing values (see Figure 2 a, displaying our score as a function of value constellations). To find a suitable imputation value set, different methods are possible, which we have borrowed from the defuzzification strategies in fuzzy set theory [8]. The most promising approach is to use the constellation with the maximum quality score for imputation. This will not always yield unique results, because the decision tree does not have to be developed completely. Hence, some variables might not be defined. In Figure 2 a, for example, attribute 2 will not enter the tree if attribute 1 equals 3. We can use the best constellation that is completely defined (Single Max) ((1,4) in Figure 2 a) or do an unrestricted search for the maximum and, in case this constellation is not unique, use a surrogate procedure on the not-yet-defined attribute(s). The surrogate value can be the attribute's global mean (Global Max Mean) (leading to (3,2) for the example in Figure 2 a) or the value with the highest average quality, the average being computed over the defined variables (Global Max Dice) ((3,4) in Figure 2 a). Applied to the case of Figure 2 a, this means averaging over variable 1 to get the (marginal) quality distribution displayed in Figure 2 b.
Q_n^subst = Q_n^orig · { 1 − t · [ (Q_max^plotII − Q_n^plotII) / (Q_max^plotII − Q_min^plotII) ] }   (1)

where
Q_n^subst = surrogate quality score of imputation value (constellation) n,
Q_n^orig = original quality score of the not completely defined constellation that covers constellation n,
Q_n^plotII = quality score of imputation value (constellation) n in plot II,
Q_max^plotII = maximum quality score in plot II,
Q_min^plotII = minimum quality score in plot II,
t = factor determining how much the optimal imputation value (constellation) found by Global Max dominates other imputation values (constellations).

In essence, this amounts to using the marginal score distribution of variable 2 to modulate branches where the decision tree is not fully developed. The exact definition of the surrogate value is not so essential, because the very fact that the decision tree algorithm did not take that attribute into consideration shows its small discriminating power. But the maximum approach has a distinctive benefit, because a simple extension of the method allows some kind of multiple imputation [6] without really imputing different values for one missing datum. For each missing value constellation, the n best value tuples are stored together with their quality scores (n can be chosen freely; n=1 corresponds to single imputation), and one of them is chosen randomly according to its quality score. In doing so we avoid the often criticized reduction of variance inherent to ordinary, deterministic imputation methods [6]. In SPSS, for example, this problem is solved by artificially adding variance-enlarging noise, whereas the method explained above seems much more elegant and appropriate.
In our case we had 4 attributes with missing data in 0.3%, 27.5%, 50.7% and 86.4% of all cases. Therefore, we would have lost 95% of all cases if using the listwise deletion technique. This would lead to a very high loss of information, especially because all but nine (= 1.4%) of all respondents would have to be dropped. The traditional imputation algorithms build their models only on the complete cases. But only 5% of all cases are complete; thus we can hardly trust the regression parameters or clusters. The C5.0 algorithm also uses cases with missing data, although this is done in a very brute-force way. This is why we have chosen this method. We have used Global Max Dice for defuzzification with n=3, i.e. multiple imputation.
Due to the reasons mentioned above, we had to focus on relevant attributes and cases. Different approaches to attribute selection have been proposed. To put our experience in a nutshell, we can state an efficiency-effectiveness dilemma, proving none of the single solutions to be optimal. To overcome this dilemma, we have developed a model combining different approaches in a stepwise procedure [9]. In each step we eliminate cases with methods that are more sophisticated, but also less efficient than the method in the predecessor step. The choice of methods in each step depends to a large extent on the data situation. In our application example we have used measures of variable entropy, a χ²-test for independence of the output variable from each potential influence
(considered in isolation), clustering of variables, identification of the most informative attributes by a C5.0 decision tree and a backward elimination with wrapper evaluation. After having selected the relevant attributes, we have drawn a training sample by pure random sampling consisting of about 300 respondents and 300 non-respondents.
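The stepwise selection could be organized as in the sketch below (our illustration; the cut-off values, the two filters shown and the toy data are assumptions, and the final wrapper backward elimination is only indicated).

```python
import numpy as np
from scipy.stats import chi2_contingency

def stepwise_select(X, y, attribute_names, entropy_cut=0.1, p_cut=0.05):
    """Illustrative two-stage filter before a (not shown) wrapper backward elimination."""
    survivors = []
    for j, name in enumerate(attribute_names):
        values, counts = np.unique(X[:, j], return_counts=True)
        p = counts / counts.sum()
        entropy = -(p * np.log2(p)).sum()           # step 1: drop near-constant attributes
        if entropy < entropy_cut:
            continue
        table = np.zeros((len(values), 2))
        for row, v in enumerate(values):            # step 2: chi-square test vs. the response
            table[row, 0] = np.sum((X[:, j] == v) & (y == 0))
            table[row, 1] = np.sum((X[:, j] == v) & (y == 1))
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < p_cut:
            survivors.append(name)
    return survivors    # these would then enter the wrapper backward elimination

# Toy data: 2 categorical attributes, binary response.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 2))
y = (X[:, 0] == 1).astype(int)                      # attribute 0 drives the response
print(stepwise_select(X, y, ["attr0", "attr1"]))
```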
3 Data Mining with NEFCLASS and Rule Postprocessing
NEFCLASS learns the number of rule units and their connections, i.e. the rules. If the user restricts the maximum number of rules, the best rules are selected according to an optimisation criterion [3]. This maximum rule number and the number of fuzzy sets per input variable are the most important user-defined parameters. The optimal number of rules heavily depends on the attribute subset used. Therefore, we have optimized this parameter after every fifth iteration of the variable backward elimination process. The number of fuzzy sets per input variable was generally set to two. For the binary variables this is an obvious choice. For continuous attributes we also identified two fuzzy sets as being sufficient. Using more fuzzy sets enhances classification quality only marginally, but exponentially increases the number of rules. After having identified the optimal parameters and the optimal attribute subset, we have finally trained NEFCLASS, coming up with the rule base shown in Figure 3.
a) Original rule base (matrix notation, one rule per row):
Cheqctq  Ichan  Itax   Ttlntrns  →  class
large    small  small  large     →  respondent
small    large  small  large     →  respondent
small    small  small  large     →  respondent
large    small  small  small     →  respondent
large    large  small  small     →  respondent
large    large  large  small     →  respondent
small    large  small  small     →  non-resp.
small    small  small  small     →  non-resp.
small    small  large  small     →  non-resp.
large    small  large  small     →  non-resp.
b) Aggregated rule base (five rules over Ttlntrns, Cheqctq, Ichan and Itax; three concluding respondent, two concluding non-resp.).
Attributes and membership functions (µ): Cheqctq: customer's quality score; Ichan: investments with high risks; Itax: tax-oriented investments; Ttlntrns: customer's total number of transactions.
Fig. 3. Original and aggregated rule base including membership functions.
The classification quality was quite good, with 72.7% of correctly classified cases in the validation set (respondents: 68.7%; non-respondents: 77.7%). The rules (for the sake of simplification in a matrix notation) and the membership functions are shown in Figure 3 a). The rules reflect some interesting knowledge about the bank's target groups. In Figure 3 b) we further aggregated the rule base manually. A tool is being developed that will not only aggregate rules automatically, but also present them to the user in an adequate and interactive way.
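To illustrate how such a rule base can be applied, the following sketch evaluates fuzzy IF-THEN rules with the minimum for AND and a winner-take-all decision; the membership functions, attribute ranges and the three rules shown are assumptions loosely modeled on Figure 3, not the learned NEFCLASS system.

```python
# Sketch: applying fuzzy IF-THEN rules with min for AND and a winner-take-all vote.
# Membership functions are assumed simple ramps; the real NEFCLASS functions are learned.

def ramp_small(x, low, high):
    return max(0.0, min(1.0, (high - x) / (high - low)))

def ramp_large(x, low, high):
    return 1.0 - ramp_small(x, low, high)

def membership(attribute, term, value):
    low, high = {"Ttlntrns": (0.0, 108.0), "Cheqctq": (0.0, 3.0),
                 "Ichan": (0.0, 1.0), "Itax": (0.0, 1.0)}[attribute]   # assumed ranges
    return ramp_large(value, low, high) if term == "large" else ramp_small(value, low, high)

def classify(case, rules):
    scores = {}
    for antecedent, consequent in rules:
        activation = min(membership(a, t, case[a]) for a, t in antecedent)
        scores[consequent] = max(scores.get(consequent, 0.0), activation)
    return max(scores, key=scores.get), scores

rules = [([("Ttlntrns", "large")], "respondent"),
         ([("Ttlntrns", "small"), ("Cheqctq", "large"), ("Ichan", "large")], "respondent"),
         ([("Ttlntrns", "small"), ("Cheqctq", "small"), ("Ichan", "small")], "non-respondent")]
print(classify({"Ttlntrns": 80.0, "Cheqctq": 2.5, "Ichan": 0.2, "Itax": 0.0}, rules))
```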
4 Conclusion
The results of this study affirmed the previous experience that Neuro-Fuzzy Systems are not able to outperform alternative approaches, e.g. neural nets and discriminant analysis, with respect to classification quality. But they provide a rule base that is very compact and well understandable. Extensive preprocessing activities have been necessary, especially concerning the imputation of missing values and selection of relevant attributes and cases. Besides, we believe that these tools are also of general relevance. Intelligent postprocessing can further enhance the resulting rule base’s power of expression. In this paper, we have shown some promising approaches for these steps as well as their effectiveness and efficiency. Of course, we are still far away from an integrated data flow. But our experience with the single modules described above is very promising. However, these first results have to be further validated for different Neuro-Fuzzy Systems and different data situations. Our final goal is to integrate these modular solutions into a comprehensive KDD tool box. This report is based on a project promoted by funds of the Thuringian Ministry for Science, Research and Culture. The authors are responsible for the content of this publication.
References
1. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Knowledge Discovery and Data Mining: Towards a Unifying Framework, in: Simoudis, E., Han, J. (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, August 2-4, 1996, Menlo Park 1996, p. 82-88
2. Halgamuge, S.K., Mari, A., Glesner, M., Fast Perceptron Learning by Fuzzy Controlled Dynamic Adaption of Network Parameters, in: Kruse, R., Gebhardt, J., Palm, R. (eds.), Fuzzy Systems in Computer Science, Wiesbaden 1994, p. 129-139
3. Nauck, D., Klawonn, F., Kruse, R., Neuronale Netze und Fuzzy-Systeme, 2. Aufl., Braunschweig/Wiesbaden 1996
4. Nauck, D., Nauck, U., Kruse, R., Generating Classification Rules with the Neuro-Fuzzy System NEFCLASS, in: Proceedings of the Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS'96), Berkeley, June 19-22, 1996
5. Quinlan, J.R., C4.5: Programs for Machine Learning, San Mateo, 1993
6. Rubin, D.B., Multiple Imputation for Nonresponse in Surveys, New York 1987
7. Ruhland, J., Wittmann, T., Neurofuzzy Systems in Large Databases - A Comparison of Alternative Algorithms for a Real-Life Classification Problem, in: Proceedings EUFIT'97, Aachen, September 8-11, 1997, p. 1517-1521
8. Wittmann, T., Knowledge Discovery in Databases mit Neuro-Fuzzy-Systemen. Entwurf für einen integrierten Ansatz zum Data Mining in betrieblichen Datenbanken, Friedrich-Schiller-Universität Jena, Wirtschaftswiss. Fakultät, Diskussionspapier Serie A, Nr. 98/07, Mai 1998
9. Wittmann, T., Ruhland, J., Untersuchung der Zusammenhänge zwischen Fahrzeugmerkmalen und Störungsanfälligkeiten mittels Neuro-Fuzzy-Systemen, in: Kuhl, J., Nissen, V., Tietze, M., Soft Computing in Produktion und Materialwirtschaft, Göttingen 1998, p. 71-85
Mining Possibilistic Set-Valued Rules by Generating Prime Disjunctions Alexandr A. Savinov GMD — German National Research Center for Information Technology Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany E-mail: [email protected], http://borneo.gmd.de/~savinov/
Abstract. We describe the problem of mining possibilistic set-valued rules in large relational tables containing categorical attributes taking a finite number of values. An example of such a rule might be "IF HOUSEHOLDSIZE={Two OR Three} AND OCCUPATION={Professional OR Clerical} THEN PAYMENT_METHOD={CashCheck (Max=249) OR DebitCard (Max=175)}". The table semantics is supposed to be represented by a frequency distribution, which is interpreted with the help of minimum and maximum operations as a possibility distribution over the corresponding finite multidimensional space. This distribution is approximated by a number of possibilistic prime disjunctions, which represent the strongest patterns. We present an original formal framework generalising the conventional boolean approach to the case of (i) finite-valued variables and (ii) continuous-valued semantics, and propose a new algorithm, called Optimist, for the computationally difficult dual transformation which generates all the strongest prime disjunctions (possibilistic patterns) given a table of data. The algorithm consists of generation, absorption and filtration parts. The generated prime disjunctions can then be used to build rules or for prediction purposes.
1. Introduction
One specific data analysis task consists in discovering hidden patterns that characterise the problem domain behaviour and then representing them in the form of rules, which can be used either for description or for prediction purposes. The analysed database consists of a number of records, each of which is a sequence of attribute values. In the case where the variables in condition and conclusion may take only one value, we obtain the well-known association rules, which describe dependencies (associations) among individual values rather than among sets of values. If the variables in rules may be constrained by any subset of possible values, then we obtain so-called possibilistic set-valued rules, e.g.:
IF x1 = {a13, a14} AND x2 = {a21, a27} THEN x3 = {a33 : p33, a36 : p36},
where aij are values of the i-th variable and pij are degrees of possibility expressed as maximal frequencies of the corresponding values within the interval. This rule means that if x1 is either a13 or a14, and x2 is either a21 or a27, then x3 is either
a33 or a36 with the possibilities p33 and p36, respectively, while the rest of the values, such as a31, are impossible within this condition interval (the frequency is 0). In this paper we consider the problem of mining set-valued rules for the case where all variables may take only a finite number of values, and the possibilistic semantics is represented by a frequency distribution (the number of observations belonging to each point). The problem is that the number of all possible conjunctive intervals of the multidimensional space is extremely large. However, most of them are not interesting, since the projection of the restricted distribution onto all variables does not have much information (it is highly homogeneous). Thus, informally, the more general the rule condition (the wider the selected interval), and the narrower the conclusion (the closer the conclusion distribution to the singular form), the more interesting and informative the rule is. To find such rules, maximally general in the condition and specific in the conclusion, we use an approach [1] according to which any multidimensional possibility distribution can be formally represented (Fig. 1) by a set of possibilistic disjunctions combined with the connective AND (possibilistic CNF). A disjunction (possibilistic pattern) is made up of several one-dimensional possibility distributions (propositions) over the values of individual variables combined with the connective OR (interpreted as maximum). Note that we use an original definition of possibilistic disjunction, which generalises the conventional boolean analogue in two directions: (i) the variables are finite-valued [2] (instead of only 2-valued), and (ii) the semantics is continuous-valued [1] (instead of only 0 and 1). Particularly, this feature distinguishes our approach from other methods including Boolean reasoning and rough sets. The strongest disjunctions, called primes, are used to form the optimal and the most interesting rules, i.e., possibilistic prime disjunctions allow us to reach both goals when generating rules: maximal generality of the condition and maximal specificity of the conclusion. In contrast to the previous version [3], which requires all records to be in memory, the Optimist algorithm is based on the explicit formula [4] of transformation from a possibilistic DNF representing the data into a CNF consisting of prime disjunctions (knowledge). The advantage is that all prime disjunctions are built in one pass through the record set by updating the current set of prime disjunctions each time a new record is processed. The algorithm efficiently solves the problem of computational complexity by filtering out too specific disjunctions, interpreted as noise or exceptions, and generating only the most informative of them. Once the patterns have been found they can be easily written as rules.
Fig. 1. The data semantics is approximated (upper bound) by prime disjunctions (patterns).
2. Data and Knowledge Representation
Let some problem domain on the syntactic level be described by a finite number of variables or attributes x1, x2, …, xn, each of which takes a finite number of values and corresponds to one column of the data table: xi ∈ Ai = {ai1, ai2, …, aini}, i = 1, 2, …, n, where ni is the number of values of the i-th variable and Ai is its set of values. The state space or the universe of discourse is defined as the Cartesian product of all sets of values: Ω = A1 × A2 × … × An. The universe of discourse is a finite set with a multidimensional structure. Each syntactic object (state) from the universe of discourse is represented by a combination of values of all variables: ω = ⟨x1, x2, …, xn⟩ ∈ Ω. The number of such objects is equal to the power of the universe of discourse: |Ω| = n1 × n2 × … × nn.
Formally the problem domain semantics is represented by a frequency distribution over the state space which assigns the number of occurrences to each combination of values. Then 0 is interpreted as the absolute impossibility of the state while all positive numbers are interpreted as various degrees of possibility. We do not map this distribution into the interval [0,1] since for rule induction it is simpler to work directly with frequencies. The semantics will be represented by a special technique called the method of sectioned vectors and matrices [1–3]. Each construction of this mechanism, along with interpretation rules, imposes constraints of a certain form on possible combinations of attribute values. The sectioned constructions are written in bold font with two lower indexes corresponding to the number of the variable and the number of the value, respectively. The component uij of the sectioned vector u is a natural number assigned to the j-th value of the i-th variable. The section ui of the sectioned vector u is an ordered sequence of ni components assigned to the i-th variable and representing some distribution over all values of one variable. The sectioned vector u is an ordered sequence of n sections for all variables. The total number of components in a sectioned vector is equal to n1 + n2 + … + nn. A sectioned matrix consists of a number of sectioned vectors written as its lines. For example, the construction u = 01.567.0090 or u = {0,1}.{5,6,7}.{0,0,9,0} is a sectioned vector written in different ways (with sections separated by dots), where u1 = {0,1}, u2 = {5,6,7}, u11 = 0 and so on.
There are two interpretations of sectioned vectors: as a conjunction and as a disjunction. If the sectioned vector d is interpreted as a disjunction, then it defines the distribution which is equal to the maximum of the vector components corresponding to the point coordinates:

d(ω) = d(⟨x1, x2, …, xn⟩) = d1(x1) ∨ d2(x2) ∨ … ∨ dn(xn) = max_{i=1,…,n} di(xi).
(The maximum is taken among n components, one from each section.) The conjunction is interpreted in the dual way by means of the minimum operation. Sectioned matrices can be interpreted as DNF or CNF. If the matrix K is interpreted as DNF, then its sectioned vector-lines are combined with the connective ∨ and interpreted as conjunctions (a disjunction of conjunctions). In the dual way, if
the matrix D is interpreted as CNF, then its sectioned vector-lines are combined with the connective ∧ and interpreted as disjunctions (a conjunction of disjunctions). The data can be easily represented in the form of a DNF so that each conjunction represents one record along with the number of its occurrences in the data set. The conjunction corresponding to one record consists of all 0's except for one component in each section, which is equal to the number of record occurrences. One distribution is said to be a consequence of another if its values in all points of the universe of discourse are greater than or equal to the values of the second distribution. We will also say that the first distribution covers the second one. The operation of elementary induction consists in increasing one component of a disjunction so that it becomes weaker. A disjunction is referred to as a prime one if it is a consequence of the source distribution but is not a consequence of any other disjunction covering the source distribution, except itself. The prime disjunctions are considered as possibilistic patterns expressing dependencies among attributes by imposing the strongest constraints on the possible combinations of values. Thus, formally, the problem of finding dependencies is reduced to the problem of generating possibilistic prime disjunctions.
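The following sketch (our illustration, with an assumed list-of-lists encoding of sectioned vectors) evaluates a vector as a disjunction or a conjunction at a point of the universe, checks the covering relation by brute force, and performs one elementary induction step.

```python
# Sketch: sectioned vectors as lists of sections, e.g. [[0,1],[5,6,7],[0,0,9,0]].
from itertools import product

def disjunction_value(d, point):
    """d(<x1,...,xn>) = max_i d_i(x_i); point gives one value index per section."""
    return max(section[x] for section, x in zip(d, point))

def conjunction_value(k, point):
    """k(<x1,...,xn>) = min_i k_i(x_i)."""
    return min(section[x] for section, x in zip(k, point))

def covers(d1, d2):
    """True if d1 is a consequence of d2, i.e. d1 >= d2 at every point.
    Brute force over the finite universe; fine for small illustrative examples."""
    universe = product(*(range(len(s)) for s in d1))
    return all(disjunction_value(d1, p) >= disjunction_value(d2, p) for p in universe)

def elementary_induction(d, i, j, new_value):
    """Increase component d[i][j] to new_value, producing a weaker disjunction."""
    weaker = [list(section) for section in d]
    weaker[i][j] = max(weaker[i][j], new_value)
    return weaker

d = [[0, 1], [0, 7, 0], [0, 1, 0, 2]]            # the disjunction 01.070.0102
w = elementary_induction(d, 1, 2, 5)             # -> 01.075.0102
print(w, covers(w, d))                           # w is weaker, so it covers d
```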
3. Generation, Absorption and Filtration of Disjunctions
To add the conjunction k (a record) to the matrix of the CNF D (the current knowledge), it is necessary to add it to all m disjunctions of the matrix:
k ∨ D = k ∨ (d1 ∧ d2 ∧ … ∧ dm) = (k ∨ d1) ∧ (k ∨ d2) ∧ … ∧ (k ∨ dm).
Addition of a conjunction to a disjunction is carried out by the formula
k ∨ d = (k1 ∨ d) ∧ (k2 ∨ d) ∧ … ∧ (kn ∨ d),
where
k1 ∨ d = k1 ∨ (d1 ∨ d2 ∨ … ∨ dn) = (k1 ∨ d1) ∨ d2 ∨ … ∨ dn,
k2 ∨ d = k2 ∨ (d1 ∨ d2 ∨ … ∨ dn) = d1 ∨ (k2 ∨ d2) ∨ … ∨ dn,
…
kn ∨ d = kn ∨ (d1 ∨ d2 ∨ … ∨ dn) = d1 ∨ d2 ∨ … ∨ (kn ∨ dn).
In the general case, n new disjunctions are generated from one source disjunction by applying the elementary induction, i.e., by increasing one component. For example, according to this formula the addition of the conjunction k = 05.005.0005 to the disjunction d = 01.070.0102 results in three new disjunctions (in each case one component of d has been increased): 05.070.0102, 01.075.0102, and 01.070.0105. If the elementary induction does not change one of the disjunctions, then the source disjunction already covers the conjunction. In this case the disjunction can simply be copied to the new matrix with no modifications. Thus the whole set of new disjunctions can be divided into two subsets: modified and non-modified.
As new disjunctions are generated and added to the new matrix, the absorption procedure should be carried out to remove the lines which are not prime and follow from others, e.g., d in Fig. 1. In general, each new disjunction can either be absorbed itself or absorb other lines. Thus the comparison of lines has to be carried out in both directions. To check for the consequence relation between two disjunctions we have to reduce them [1] and then compare all their components. Let us formulate properties which significantly simplify the absorption process.
Property 1. The disjunctions which cover the current conjunction, and hence are not modified, cannot be absorbed by any other disjunction.
This property follows from the fact that the matrix of disjunctions is always maintained in a state where it contains only prime disjunctions, which do not absorb each other. Let us suppose that u is a non-modified disjunction while v is modified in the component v_rs, and v′_rs is the old value of the modified component (u_ij = u′_ij since u is not modified). Then the following property holds.

Property 2. If u_rs ≤ v′_rs then v does not follow from u. (This property is valid only if the constant [1] of v has not been changed.) To use this property, each line has to store the old value v′_rs of the modified component and its position (r and s).

These properties are valuable since they frequently allow us to conclude that one line is not a consequence of another by comparing only one pair of components.

Property 3. If the sum of components in v, or in any of its sections v_i, is less than the corresponding sum in the disjunction u, then v does not follow from u. To use this property we have to maintain the sums of the vector and section components in the corresponding headers.

If all these necessary conditions are satisfied then we have to carry out a component-wise comparison of the two vectors in a loop consisting of n1 + n2 + … + nn steps. To cope with the complexity problem and to generate only interesting rules, the algorithm has been modified so that the number of lines in the matrix of prime disjunctions is limited by a special user-defined parameter while the lines are ordered by a criterion of interestingness. Before a new disjunction is generated we calculate its criterion value (the degree of interestingness), which is compared with that of the last line of the matrix. If the new disjunction does not go into the matrix (e.g., w in Fig. 1), it is simply not generated. Otherwise, if it is interesting enough, it is first generated, then checked for absorption, and finally inserted into the corresponding position in the matrix (the last line is removed). The Optimist algorithm uses the criterion of interestingness in the form of the impossibility interval size. Informally, the more points of the distribution have smaller values, the more general and stronger the corresponding disjunction is. Formally the following formula is used to calculate the degree of interestingness:
H = (1/n1) ∑_{j=1}^{n1} d_1j + (1/n2) ∑_{j=1}^{n2} d_2j + … + (1/nn) ∑_{j=1}^{nn} d_nj
according to which H is equal to the weighted sum of components, and the smaller this value, the stronger the disjunction. For example, changing one component from 0 to 1 in a two-valued section is equivalent to changing three components from 0 to 1 in a six-valued section. Generally, each attribute or even each attribute value may have its own user-defined weight, which influences the direction of induction and reflects its informative importance or subjective interestingness for the user. This mechanism provides the capability of more flexible control over the rule induction process. The set of generated prime disjunctions is an approximate semantic equivalent of the data. Once they have been generated they can be used for prediction purposes or to build rules. The patterns are rewritten in the form of rules in the conventional way by negating the propositions (sections) which should be in the condition, thus obtaining an implication. The only problem here is that we obtain conditions with
possibilistic weights, while it is preferable to have crisp conditions (without uncertainty). The most straightforward way to do this consists in negating the condition section d_i as follows: d_ij = d_max if d_ij ≤ d_min, and d_ij = d_min otherwise, where
d_min and d_max are the minimal and maximal components of the disjunction, respectively (d_max is usually mapped to 1 within the [0,1] interval). For example, the pattern d = {0,8} ∨ {0,6,0} ∨ {0,2,9,5} with d_min = 0 and d_max = 9 can be transformed into the implication {9,0} ∧ {9,0,9} → {0,2,9,5}, which is interpreted as the possibilistic rule IF x1 = {a11} AND x2 = {a21, a23} THEN x3 = {a31 : 0, a32 : 2, a33 : 9, a34 : 5}. This method can be generalised by applying any user-defined value instead of d_min. In addition, the values in the conclusion can easily be weighted by their frequencies (the sum of occurrences within the condition interval) or necessity degrees (the minimal number of occurrences), e.g., x3 = {a32 : (Min = 0)(Max = 2)(Sum = 4), a33 : …}.
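As a quick illustration of this transformation (a minimal sketch, not the author's implementation; it reuses the example pattern above), the condition sections can be made crisp as follows.

```python
# Illustrative sketch of the pattern-to-rule transformation described above.
# The pattern d = {0,8} | {0,6,0} | {0,2,9,5} from the text is used as input.

def to_rule(pattern, conclusion_index):
    """Turn condition sections into crisp sets by the negation
    d_ij = d_max if d_ij <= d_min else d_min; keep the conclusion section."""
    d_min = min(v for section in pattern for v in section)
    d_max = max(v for section in pattern for v in section)
    condition = [
        [d_max if v <= d_min else d_min for v in section]
        for i, section in enumerate(pattern) if i != conclusion_index
    ]
    return condition, pattern[conclusion_index]

cond, concl = to_rule([[0, 8], [0, 6, 0], [0, 2, 9, 5]], conclusion_index=2)
print(cond)    # [[9, 0], [9, 0, 9]]  -> IF x1 = {a11} AND x2 = {a21, a23}
print(concl)   # [0, 2, 9, 5]         -> THEN x3 = {a31:0, a32:2, a33:9, a34:5}
```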
4. Conclusion

The described approach to mining possibilistic set-valued rules has the following characteristic features: (i) it is based on an original formal framework generalising the boolean approach to the case of finite-valued attributes and continuous-valued semantics, (ii) the notion of a prime disjunction as a pattern allows us to reach optimality of rules (maximal generality of the condition and specificity of the conclusion), (iii) it guarantees finding only the strongest patterns (too weak ones are filtered out), (iv) all attributes as well as all rules have equal rights; in particular, we do not need a target attribute and all rules are interpreted independently, (v) the rules are generated in one pass, (vi) the patterns can easily be used for prediction as well as for other tasks since they approximately represent the data semantics in an intensional form, (vii) a drawback is the large number of generated rules, especially for dense distributions, which can be overcome by more sophisticated filtration and search.
References
1. A.A. Savinov. Fuzzy Multi-dimensional Analysis and Resolution Operation. Computer Sci. J. of Moldova 6(3), 252–285, 1998.
2. A.D. Zakrevsky, Yu.N. Pechersky and F.V. Frolov. DIES — Expert System for Diagnosis of Technical Objects. Preprint of the Institute of Mathematics and CC, Academy of Sciences of Moldova, Kishinev, 1988 (in Russian).
3. A. Savinov. Application of multi-dimensional fuzzy analysis to decision making. In: Advances in Soft Computing — Engineering Design and Manufacturing. R. Roy, T. Furuhashi and P.K. Chawdhry (eds.), Springer-Verlag London, 1999.
4. A. Savinov. Forming Knowledge by Examples in Fuzzy Finite Predicates. Proc. conf. “Hybrid Intellectual Systems”, Rostov-na-Donu—Terskol, 177–179, 1991 (in Russian).
Towards Discovery of Information Granules

Andrzej Skowron (1) and Jaroslaw Stepaniuk (2)

(1) Institute of Mathematics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland, E-mail: [email protected]
(2) Institute of Computer Science, Bialystok University of Technology, Wiejska 45A, 15-351 Bialystok, Poland, E-mail: [email protected]
Abstract. The amount of electronic data available is growing very fast and this explosive growth in databases has generated a need for new techniques and tools that can intelligently and automatically extract implicit, previously unknown, hidden and potentially useful information and knowledge from these data. Such tools and techniques are the subject of the field of Knowledge Discovery in Databases. Information granulation is a very natural concept, and appears (under different names) in many methods related to e.g. data compression, divide and conquer, interval computations, neighborhood systems, and rough sets among others. In this paper we discuss information granulation in knowledge discovery. The notions related to information granules are their syntax and semantics as well as the inclusion and closeness (similarity) of granules. We discuss some problems of KDD assuming knowledge is represented in the form of information granules. In particular we use information granules to deal with the problem of stable (robust) patterns extraction.
1 Introduction
Knowledge Discovery in Databases (KDD for short) has been defined as the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data [3]. It uses machine learning, statistical and visualization techniques, rough set theory and other methods to discover and present knowledge in a form which is easily comprehensible to humans. Notions of granule [12], [13], [9], [7], [10] and of similarity (inclusion or closeness) of granules are very natural in knowledge discovery. The necessity to consider similarity of granules instead of their equality is a consequence of the uncertainty under which granules are processed. The aim of the paper is to show that association rules [2] can be treated as simple examples of relations on information granules. The left and right hand sides of an association rule [2] describe granules; support and confidence specify the degree of inclusion of the granule represented by the formula on the left hand side of the rule in the granule represented by the formula on the right hand side.
We generalize the simple notion of granules represented by attribute value vectors, as well as the closeness (inclusion) relation, to the case of hierarchical granules of concepts. We claim that in KDD there is a need for algorithmic methods to discover much more complex relations between complex information granules. We discuss examples of information granules and consider two kinds of basic relations between them, namely (rough) inclusion and closeness. The relations between more complex information granules can be defined by extending the relations defined on parts of the information granules.
2 Granule Inclusion and Closeness
One of the main directions of research for the successful construction of intelligent systems is to develop a calculus of information granules [12], [13], [7]. We emphasize the fact that granules and their closeness are important for KDD. In this section we present several examples of granules, starting from the simplest ones. A general scheme for defining more complex granules from simpler ones can be explored using the rough mereological approach [6]. We add here a new aspect of the construction of more complex granules from simpler ones, and the problem of extending rough inclusion and closeness of granules to these complex granules. Closeness between granules G, G′ of degree at least p will be denoted by cl_p(G, G′) (or cl(G, G′) ≥ p). Similarly, inclusion between granules G, G′ of degree at least p will be denoted by ν_p(G, G′) (or ν(G, G′) ≥ p). cl and ν are functions with values in the interval [0, 1] of real numbers (or in [0, 1]^k). A general scheme for the construction of hierarchical granules and their closeness can be described recursively using the following metarule: if granules of order ≤ k and their closeness have been defined, then the closeness cl_p(G, G′) (at least in degree p) between granules G, G′ of order k + 1 can be defined by applying an appropriate operator F to the closeness of the components of G and G′, respectively.
2.1 Elementary Granules - Conjunctions of Selectors
The simplest case of granules can be defined using an information system IS = (U, A). In this case elementary information granules are defined by EF_B(x), where EF_B is a conjunction of selectors of the form a = a(x), B ⊆ A and x ∈ U. Let G_IS = {EF_B(x) : B ⊆ A & x ∈ U}. In the standard rough set model [5] elementary granules describe indiscernibility classes with respect to some subsets of attributes. In a more general setting (see e.g. [8], [11]) tolerance (similarity) classes are described. The crisp inclusion of α in β, where α, β ∈ {EF_B(x) : B ⊆ A & x ∈ U}, is defined by ‖α‖_IS ⊆ ‖β‖_IS, where ‖α‖_IS and ‖β‖_IS are the sets of objects from IS satisfying α and β, respectively. The non-crisp inclusion, known in KDD for the case of association rules, is defined by two thresholds t and t′ and the following conditions: supp_IS(α, β) = ‖α ∧ β‖_IS ≥ t and conf_IS(α, β) = supp_IS(α, β) / ‖α‖_IS ≥ t′. Granule inclusion can be defined using different schemes, e.g. by ν_{t,t′}^IS(α, β) if and only if supp_IS(α, β) ≥ t & conf_IS(α, β) ≥ t′, or
ν_t^IS(α, β) if and only if conf_IS(α, β) ≥ t. In the latter case one can define closeness of the granules by cl_t^IS(α, β) = min{t′, t″}, where ν_{t′}^IS(α, β) and ν_{t″}^IS(β, α) hold.
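The following sketch (an illustration with invented data, not code from the paper) computes supp_IS and conf_IS for two elementary granules given as object sets and derives the threshold-based inclusion and closeness described above.

```python
# Hypothetical example: granules are given by the sets of objects satisfying them.

def supp(objects_alpha, objects_beta):
    """Number of objects satisfying both alpha and beta."""
    return len(objects_alpha & objects_beta)

def conf(objects_alpha, objects_beta):
    """Fraction of objects satisfying alpha that also satisfy beta."""
    return supp(objects_alpha, objects_beta) / len(objects_alpha)

def inclusion_degree(objects_alpha, objects_beta):
    """nu^IS(alpha, beta): the largest t with conf >= t, i.e. the confidence itself."""
    return conf(objects_alpha, objects_beta)

def closeness(objects_alpha, objects_beta):
    """cl^IS(alpha, beta) = min of the two inclusion degrees."""
    return min(inclusion_degree(objects_alpha, objects_beta),
               inclusion_degree(objects_beta, objects_alpha))

alpha = {1, 2, 3, 4, 5}          # objects satisfying alpha
beta = {3, 4, 5, 6}              # objects satisfying beta
print(supp(alpha, beta))                  # 3
print(round(conf(alpha, beta), 2))        # 0.6
print(round(closeness(alpha, beta), 2))   # min(0.6, 0.75) = 0.6
```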
2.2 Decision Rules as Granules
One can define granules corresponding to rules if α then β as pairs (α, β) of elementary information granules together with information about the closeness of their components, measured e.g. using confidence coefficients (see e.g. [7]). Having such granules r = (α, β), r′ = (α′, β′) with degrees of closeness at least t, t′, respectively, one can define closeness of r and r′. This definition should satisfy the following natural condition: if r and r′ are sufficiently close (in a given information system IS) and α, β are close at least in degree t (in IS), then α′ and β′ are close in degree t′ (in IS), where |t − t′| is ”sufficiently small”. One simple attempt to define such closeness can be as follows: r and r′ are close in degree at least t if and only if α, α′ are close in degree at least t (in IS) and β, β′ are close in degree at least t (in IS). Another way of defining inclusion of granules is as follows: ν^IS((α, β), (α′, β′)) = w1 · conf_IS(α, α′) + w2 · conf_IS(β, β′), where w1 + w2 = 1 and w1, w2 ≥ 0 are some weights to be tuned for the given data.
2.3 Sets of Rules
An important problem related to association rules is that the number of such rules generated even from a simple data table can be large. Hence, one should search for methods of aggregating close association rules. We suggest that this can be defined as searching for some close information granules. Let us consider two finite sets of association rules R and R′. One can treat them as higher order information granules. These new granules R, R′ can be treated as close in degree at least t (in IS) if and only if there exists a relation rel between the rules of R and R′ such that:
1. for any r ∈ R there is r′ ∈ R′ such that (r, r′) ∈ rel and r is close to r′ (in IS) in degree at least t;
2. for any r′ ∈ R′ there is r ∈ R such that (r, r′) ∈ rel and r is close to r′ (in IS) in degree at least t.
One can consider the search problem for a granule R′ of minimal size such that R and R′ are close (see e.g. [1]). Another way of defining closeness of two granules G1, G2 represented by sets of rules can be described as follows. For example, if the granule G1 consists of the rules: if α1 then d = 1, if α2 then d = 1, if α3 then d = 1, if β1 then d = 0, if β2 then d = 0, and the granule G2 consists of the rules: if γ1 then d = 1, if γ2 then d = 0, then cl(G1, G2) = min{cl(G1^1, γ1), cl(G1^2, γ2)}, where G1^1 is the granule α1 ∨ α2 ∨ α3 and G1^2 is the granule β1 ∨ β2.
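A small sketch of this set-level closeness check is given below (an illustration under assumptions: helper names are invented and the rule-level closeness function is assumed to be given).

```python
# Hypothetical sketch: R and R2 are lists of rules; rule_closeness(r, r2)
# returns the closeness degree of two individual rules (assumed given).

def sets_close(R, R2, t, rule_closeness):
    """R and R2 are close in degree at least t iff every rule of each set
    has a counterpart in the other set that is close to it in degree >= t."""
    forward = all(any(rule_closeness(r, r2) >= t for r2 in R2) for r in R)
    backward = all(any(rule_closeness(r, r2) >= t for r in R) for r2 in R2)
    return forward and backward

# Toy usage: rules are identified by integers and closeness by a lookup table.
table = {(1, 10): 0.9, (2, 10): 0.8, (1, 20): 0.2, (2, 20): 0.85}
closeness = lambda r, r2: table.get((r, r2), 0.0)
print(sets_close([1, 2], [10, 20], t=0.8, rule_closeness=closeness))  # True
```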
2.4 Power Sets of Granules
Let G be a set of granules and let ϕ be a property of sets of granules from G (e.g. ϕ(X) if and only if X is a tolerance class of a given tolerance τ ⊆ G × G). Then P_ϕ(G) = {X ⊆ G : ϕ(X) holds}. Closeness of granules X, Y ∈ P_ϕ(G) can be defined by: cl_t(X, Y) if and only if inf_{g∈X, g′∈Y} cl(g, g′) ≥ t. Let us observe that the closeness definition can depend on the application. For example, instead of the above definition one can use the following: cl_t(X, Y) if and only if there exists r ⊆ X × Y such that
1. for any x ∈ X there is y ∈ Y such that (x, y) ∈ r and x, y are close in degree at least t;
2. for any y ∈ Y there is x ∈ X such that (x, y) ∈ r and x, y are close in degree at least t.
The stability of granules g, g′ ∈ G with respect to P_ϕ(G) is defined by: g, g′ are P_ϕ(G)-stable in degree at least t if and only if ∀X,Y∈P_ϕ(G) (g ∈ X & g′ ∈ Y → cl_t(g, g′)). Extracting stable patterns from data is an important task in KDD.
2.5 Stability of Granules
We consider granules specified by parameterized formulas α(p), β(p) together with a parameterized closeness relation cl_q(•, •) between the sets of objects representing the semantics of these granules. The parameter q ∈ [0, 1] represents the degree of closeness. We assume that values of the parameter p are points in a metric space with metric ρ. Hence the intended meaning of cl_q(α(p), β(p)) is that the granules α(p), β(p) (parameterized by p) are close at least in degree q. Let us assume that ε > 0 is a given stability threshold and that cl_{q0}(α(p0), β(p0)) holds for some p0, q0. We would like to check if there exists δ > δ0 (δ0 is a given threshold) such that for any p, if ρ(p0, p) ≤ δ then cl_q(α(p), β(p)) for some q such that q ≥ q0 − ε. The above condition specifies that the granules α(p0), β(p0) are not only close in degree q but that their closeness is also stable with respect to changes of the parameter p.
Example 1. Let us assume that the rule if a ∈ [1, 1.5) ∧ b ∈ [−1, 0) then d = 1 is an association rule in a given decision table DT = (U, A ∪ {d}) with coefficients supp_DT(a ∈ [1, 1.5) ∧ b ∈ [−1, 0), d = 1) = 0.6 · card(U), conf_DT(a ∈ [1, 1.5) ∧ b ∈ [−1, 0), d = 1) = 0.7. Let q0 = (0.7, 0.6), p0 = (0, 0) and ε = 0.1. We ask if there exists δ > δ0 = 0.01 such that for any p = (p1, p2), if p1² + p2² ≤ δ then the rule if a ∈ [1 − p1, 1.5 + p1) ∧ b ∈ [−1 − p2, p2) then d = 1 is true in the given decision table DT with coefficients
supp1 = supp_DT(a ∈ [1 − p1, 1.5 + p1) ∧ b ∈ [−1 − p2, p2), d = 1),
conf1 = conf_DT(a ∈ [1 − p1, 1.5 + p1) ∧ b ∈ [−1 − p2, p2), d = 1),
such that (supp1 / card(U) − 0.6)² + (conf1 − 0.7)² ≤ ε.
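A direct way to probe this kind of stability numerically is sketched below (not from the paper; the decision table layout, the grid over δ, and the helper names are assumptions made for illustration).

```python
# Hypothetical sketch: U is a list of (a, b, d) tuples; we perturb the interval
# bounds by (p1, p2) and test whether support/confidence stay within epsilon
# of the reference point (0.6, 0.7), as in Example 1.

def supp_conf(U, p1, p2):
    premise = [(a, b, d) for (a, b, d) in U
               if 1 - p1 <= a < 1.5 + p1 and -1 - p2 <= b < p2]
    hits = [row for row in premise if row[2] == 1]
    supp = len(hits) / len(U)
    conf = len(hits) / len(premise) if premise else 0.0
    return supp, conf

def stable(U, delta, eps, steps=10):
    """Check the closeness condition on a small grid of perturbations p
    with p1**2 + p2**2 <= delta."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            p1, p2 = i * delta**0.5 / steps, j * delta**0.5 / steps
            if p1**2 + p2**2 > delta:
                continue
            supp, conf = supp_conf(U, p1, p2)
            if (supp - 0.6)**2 + (conf - 0.7)**2 > eps:
                return False
    return True
```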
2.6 Dynamic Granules
For a given decision table DT and information vector α one can define a subtable DT_α of DT consisting of all objects from DT whose information vectors are extensions of α. For any such table one can generate the set of decision rules Rule(DT_α). In this way we obtain a new crisp information granule (α, Rule(DT_α)). Let us assume that a tolerance τ is defined on information vectors. Hence we obtain more complex information granules denoted by (τ(α), {Rule(DT_β)}_{β∈τ(α)}), where τ(α) = {β : (α, β) ∈ τ}. They can be treated as soft information granules. Obviously, the sets of rules from {Rule(DT_β)}_{β∈τ(α)} should be sufficiently close if one would like to represent such granules using their representatives, e.g. taking an arbitrary β from τ(α) and Rule(DT_β). An interesting problem arises in the search for a compact representation of such soft granules relative to τ. Let us consider one more example. The sentence ”in the situation specified by evidence α the algorithm Rule(DT_α) is performed, and when a new evidence β appears the algorithm Rule(DT_{α∧β}) is performed” can be represented by the crisp granule G(α, β, DT) = ((α, Rule(DT_α)), (α ∧ β, Rule(DT_{α∧β}))). The soft granule G(α, β, τ, DT) relative to a given tolerance τ can be represented by the family of all granules G(α′, β′, DT) such that α′ ∈ τ(α) and β′ ∈ τ(β). Assuming the closeness of granules of the form G(α, β, DT) has been defined, one can define the closeness of G(α, β, DT) and G(α, β, τ, DT) by taking the infimum of the closeness degrees between G(α, β, DT) and the components of G(α, β, τ, DT). Hence, if G(α, β, DT) and G(α, β, τ, DT) are close at least in degree t, it means that the crisp granule G(α, β, DT) can represent the soft granule G(α, β, τ, DT) at least in degree t.
Conclusions

Our approach can be treated as a step towards the understanding of complex information granules and their role in KDD. We have pointed out that information granules and their closeness (inclusion) are basic tools for data mining and knowledge discovery. Some higher order patterns important for KDD have been described by means of higher order granules. We also suggest that the optimization of information granules can be performed by tuning parameters in an optimization process searching for robust granules. In our next papers we show algorithmic methods for extracting from data the patterns described by the higher order granules discussed above.
Acknowledgments

This research was supported by the grant No. 8 T11C 023 15 from the State Committee for Scientific Research, the Bialystok University of Technology Rector's Grant W/II/3/98 and the Research Program of the European Union - ESPRIT CRIT 2 No. 20288.
References 1. ˚ Agotnes T.: Filtering Large Propositional Rule Sets While Retaining Classifier Performance, NTNU Research Report 1999. 2. Agrawal R., Mannila H., Srikant R., Toivonen H., Verkano A.: Fast Discovery of Association Rules, Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds.): Advances in Knowledge Discovery and Data Mining, The AAAI Press/The MIT Press 1996, pp. 307-328. 3. Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds.): Advances in Knowledge Discovery and Data Mining, The AAAI Press/The MIT Press 1996. 4. Lin T.Y.: Granular Computing on Binary Relations I Data Mining and Neighborhood Systems, L. Polkowski, A. Skowron (eds.), Rough Sets in Knowledge Discovery 1. Methodology and Applications, Physica–Verlag, Heidelberg, 1998, pp. 107-121. 5. Pawlak Z.: Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1991. 6. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate Reasoning, International Journal of Approximate Reasoning, Vol. 15, No 4, 1996, pp. 333–365. 7. Polkowski L., Skowron A.: Towards Adaptive Calculus of Granules, Proceedings of the FUZZ-IEEE’98 International Conference, Anchorage, Alaska, USA, May 5-9, 1998. 8. Skowron A., Stepaniuk J.: Tolerance Approximation Spaces, Fundamenta Informaticae, Vol. 27, 1996, pp. 245-253. 9. Skowron A., Stepaniuk J.: Constructive Information Granules, Proceedings of the 15th IMACS World Congress on Scientific Computation, Modelling and Applied Mathematics, Berlin, Germany, August 24-29, 1997. Artificial Intelligence and Computer Science 4, 1997, pp. 625-630. 10. Skowron A., Stepaniuk J.: Information Granules and Approximation Spaces, Proceedings of the 7th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, IPMU’98, Paris, France, July 6-10, 1998, pp. 1354-1361. 11. Stepaniuk J.: Approximation Spaces, Reducts and Representatives, L. Polkowski, A. Skowron (eds.), Rough Sets in Knowledge Discovery 2. Applications, Case Studies and Software Systems, Physica–Verlag, Heidelberg, 1998, pp. 109-126. 12. Zadeh L.A.: Fuzzy Logic = Computing with Words, IEEE Trans. on Fuzzy Systems Vol. 4, 1996, pp. 103-111. 13. Zadeh L.A.: Toward a Theory of Fuzzy Information Granulation and Its Certainty in Human Reasoning and Fuzzy Logic, Fuzzy Sets and Systems Vol. 90, 1997, pp. 111-127.
Classification Algorithms Based on Linear Combinations of Features

Dominik Ślęzak and Jakub Wróblewski

Institute of Mathematics, Warsaw University, ul. Banacha 2, 02-097 Warsaw, Poland. [email protected], [email protected], http://alfa.mimuw.edu.pl/~jakubw
Abstract. We provide theoretical and algorithmic tools for finding new features which enable better classification of new cases. Such features are proposed to be searched for as linear combinations of continuously valued conditions. Regardless of the choice of classification algorithm itself, such an approach provides the compression of information concerning dependencies between conditional and decision features. Presented results show that properly derived combinations of attributes, treated as new elements of the conditions’ set, may significantly improve the performance of well known classification algorithms, such as k-NN and rough set based approaches.
1 Introduction
Classification is the problem of forecasting the decision for new cases, based on their conditional features, by comparison with already known instances. An exemplary classification technique is the nearest neighbor approach [3]. Given some arbitrarily fixed distance measure ρ, defined over the Cartesian product of conditional features treated as real-valued dimensions, we can find for a new example the k ρ-nearest known cases u1, ..., uk and classify it as belonging to the decision class most supported by them. The efficiency of this approach obviously depends on the choice of distance type and the choice of conditions over which we define ρ. It turns out that sometimes it is even better to consider a smaller subset of conditions in order to obtain better classification results (see e.g. [1]). Appropriate selection of conditions is a very important task in practical applications, where it is more effective to rely on smaller (or more easily analyzed) groups of features. In the above k-NN approach such a selection is considered only with respect to classification performance. There are, however, approaches where it is regarded as the main paradigm, enabling a focus not only on classification but also on the representation of the dependencies between conditions and decisions. One of them is the decision-rule-based method developed within rough sets theory (see Section 2 for details and, e.g., [5] for further references). Although designed originally for discrete data, it can be applied to continuous conditions as well, by using discretization (see e.g. [4]) or tolerance
based techniques (see e.g. [7]), where by considering decision rules built with respect to similarity classes of ε-almost ρ-same objects, for a distance measure ρ and a similarity coefficient ε, we can obtain classification procedures with efficiency comparable to k-NN, having, however, more possibilities to express what actually happens in the data. It is worth noting that the mentioned approaches are based not only on proper usage of the already given conditional attributes but also try to search for completely new means of expression. In fact, one may claim that the choice of ρ in k-NN actually defines a new dimension for expressing the impact on the decision in a better way. Also in the rough set based approach, discretization techniques themselves can introduce hyperplanes (see Section 3 for their correspondence to the task of this work) into descriptors of decision rules or trees. Obviously, it depends on interpretation whether we treat the above examples as producing just classifiers proper for particular techniques or new attributes themselves. In this paper we propose an alternative, very simple and intuitive way of automatically extracting new features from data, as linear combinations of conditions, which keeps the original meaning of continuously valued attributes. The foundations for searching for such combinations are strictly correlated with the task of improving classification and decision representation. The presented algorithms optimize quality measures which have a strong theoretical background in the above-mentioned hyperplane-based approach (as devoted to linear combinations themselves) and in indiscernibility characteristics, one of the most expressive tools of classical rough sets theory [6]. The resulting new conditions can be applied not only to rough set based methods. They provide an intelligent preprocessing of data information rather than a final classification system, which can be concluded from the experiments described in Sections 4 and 5.
2 Rough Set Foundations
The main paradigm of rough sets theory [5] states that a universe of known objects is assumed to be the only source of knowledge used for the classification of cases outside the sample. In applications, reasoning is usually stated as a classification problem, concerning a distinguished decision attribute to be predicted under given conditions. By a decision table we understand a triple A = (U, A, d), where each attribute a ∈ A ∪ {d} is a function a : U → V_a from the universe U of objects into the set of all possible values of a. Classification of new objects outside U with respect to their membership in decision classes is performed by analogy with the elements of U. In the case of symbolic conditional attributes, we consider the indiscernibility relation IND(A) = {(u1, u2) ∈ U × U : Inf_A(u1) = Inf_A(u2)}. The information function Inf_A(u) = (a1(u), ..., a|A|(u)) yields a one-to-one correspondence between the equivalence classes of IND(A) and the elements of the set V_A^U = {w_A ∈ V_A : Inf_A^{-1}(w_A) ≠ ∅} of all vector values on A supported by objects in U. If for a given w_A ∈ V_A^U there is an inclusion Inf_A^{-1}(w_A) ⊆ d^{-1}(v_d) for some v_d ∈ V_d, we obtain a decision rule of the form A = w_A ⇒ d = v_d. Then,
given a new object with vector value w_A on A, we classify it as belonging to the decision class d^{-1}(v_d). There are two main reasons for trying to decrease the number of conditional attributes used in rules. First, for a smaller number of descriptors there is a higher chance of applicability to new cases, and we need to gather less information about them. The second reason is statistical – decision rules of the form B = w_B ⇒ d = v_d, for smaller B ⊆ A, are expected to classify new cases more properly, because of their larger support in the data. A lot of algorithms were developed to shorten descriptors A = w_A to B = w_A^{↓B} in a way maximizing the support of Inf_B^{-1}(w_A^{↓B}), while keeping (approximately) the inclusion Inf_B^{-1}(w_A^{↓B}) ⊆ d^{-1}(v_d) on the other hand (see e.g. [8], [9]).

3 Hyperplanes and Linear Combinations
One of the rough set based approaches to continuously valued conditions involves so-called discretization [4]. Here, we would like to focus on a special technique for decision tree generation, based on so-called hyperplane cuts. In the binary decision tree representation, the root is supported by the whole universe. Then, to each node we attach two sub-nodes, corresponding to its objects additionally satisfying the inequalities h(u) ≥ c and h(u) < c, respectively. The formula h(u) = h1·a1(u) + ... + hn·an(u) can be treated as describing a new continuously valued feature, being a linear combination of the attributes a1, ..., an. From this point of view, c ∈ (min(h), max(h)], for min(h) = min_{u∈U} h(u), max(h) = max_{u∈U} h(u), is a real cut generating a two-interval discretization over h = h1·a1 + ... + hn·an. The main aim of algorithms searching for decision trees with such hyperplane based cuts is to provide possibly the best discernibility between decision classes with respect to membership in particular nodes. The fundamental discernibility measure evaluating pairs of linear combinations and their cuts is the following:

Disc(h, c) = ∑_{v1 ≠ v2} ‖{u ∈ U : d(u) = v1, h(u) < c}‖ · ‖{u ∈ U : d(u) = v2, h(u) ≥ c}‖
Trying to focus on the optimization of linear combination parameters ”in general”, not concerned with any particular cut, one must provide a quality measure reflecting the potential ability to use them in various classifier systems. The first idea is thus to search for h corresponding to, on average, the Disc(h, ·)-best discretization cuts. The following measure

Q1(h) = ∑_{u1,u2 : d(u1) ≠ d(u2)} |h(u1) − h(u2)| / (max(h) − min(h))
has its own interpretation in the search for combinations putting objects from different decision classes possibly far from each other. Moreover, it turns out to have much in common with the average quality of hyperplane cuts. Note that:

Q1(h) = 1 / (max(h) − min(h)) · ∫_{min(h)}^{max(h)} Disc(h, x) dx
Let us rewrite U according to the increasing ordering U = ⟨u_{h,1}, ..., u_{h,N}⟩ induced by h. Assuming that there are no values of h which correspond to objects from different decision classes (otherwise, we delete all objects corresponding to such inconsistencies for a given h), we can easily set the minimal sequence of real values min(h) = c_{h,1} < ... < c_{h,k(h)} = max(h) such that for each i = 1, .., k(h)−1 there is an inclusion h((c_{h,i}, c_{h,i+1}]) ⊆ d^{-1}(v_i) for some v_i ∈ V_d, where h((c_{h,i}, c_{h,i+1}]) = {u ∈ U : c_{h,i} < h(u) ≤ c_{h,i+1}}. Intuitively, the number k(h) corresponds to the potential difficulty of handling decision rules based on h after discretization. Such a coefficient, however, cannot be used as an optimization factor in the search for proper combinations, because it abandons too much information. In our experiments we decided to consider the following measure:

Q2(h) = ∑_{i=1}^{k(h)−1} ‖h((c_{h,i}, c_{h,i+1}])‖²
Searching for combinations h which maximize the above formula for Q2(h) corresponds, actually, to searching for new features which discern a minimal number of pairs of objects from different decision classes after the necessary discretization.
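The measures above are easy to compute directly; the following sketch (illustrative only, with a made-up two-attribute data set) evaluates Disc, Q1 and Q2 for a given coefficient vector.

```python
# Hypothetical example: U is a list of (attribute_vector, decision) pairs.

def h_values(U, coeffs):
    return [sum(c * a for c, a in zip(coeffs, x)) for x, _ in U]

def disc(U, coeffs, cut):
    """Disc(h, c): pairs of objects from different classes separated by the cut."""
    hv = h_values(U, coeffs)
    below, above = {}, {}
    for (x, d), v in zip(U, hv):
        side = below if v < cut else above
        side[d] = side.get(d, 0) + 1
    return sum(nb * na for db, nb in below.items()
               for da, na in above.items() if db != da)

def q1(U, coeffs):
    hv = h_values(U, coeffs)
    span = max(hv) - min(hv)
    return sum(abs(hv[i] - hv[j]) / span
               for i in range(len(U)) for j in range(i + 1, len(U))
               if U[i][1] != U[j][1])

def q2(U, coeffs):
    """Sum of squared sizes of maximal pure blocks in the ordering induced by h."""
    ordered = sorted(zip(h_values(U, coeffs), (d for _, d in U)))
    total, block, prev = 0, 0, None
    for _, d in ordered:
        if d == prev:
            block += 1
        else:
            total += block * block
            block, prev = 1, d
    return total + block * block

U = [([0.1, 1.0], 0), ([0.2, 0.8], 0), ([0.9, 0.1], 1), ([1.1, 0.3], 1)]
w = [1.0, -1.0]
print(disc(U, w, cut=0.0), round(q1(U, w), 2), q2(U, w))   # 4 3.65 8
```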
4 Algorithmic Foundations
The problem of finding optimal linear combinations of attributes can be divided into two stages: determining which attributes the combination should be concerned with, and determining the proper coefficients of the linear combination. In our experiments we solve the first problem by choosing k attributes randomly and finding their optimal linear combination. The quality of particular combinations can be expressed by different modifications of the formulas Q1 and Q2. The experiments presented in the next subsection were performed for the original Q1 and for Q2 modified as follows, to improve results and take the best from both the distance and the discernibility based intuitions:

Qmod(h) = (70 + min_{d(u_i) ≠ d(u_{i+1})} |h(u_i) − h(u_{i+1})| / max_u |h(u)|) · ln Q2(h)

Given a quality measure, we repeat this algorithm several (20 in our experiments) times and keep the best linear combination found. The factor in brackets in the formula for Qmod is fixed, concerned with the minimal difference of h values between any two objects from different decision classes, which we should maximize. In fact, it was tuned to obtain possibly the best classification performance with respect to the search procedure described below.
Our task is to create an optimal (in the sense of the quality measure) linear combination of the k selected attributes. We used an algorithm based on evolution strategies (see e.g. [2]). Note that every (normalized) linear combination of k conditional attributes can be defined by k − 1 angles (concerned with the direction of the line representing this combination in k-dimensional space). Thus, the ”individual” was composed of k − 1 angle values. The objective function was based on the Q1 or Qmod quality measure.
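A minimal sketch of this encoding and search loop is given below (an assumption-laden illustration, not the authors' implementation): the k − 1 angles are mapped to a unit-length coefficient vector via spherical coordinates, and a simple (1+1) evolution strategy mutates the angles.

```python
import math
import random

def angles_to_coeffs(angles):
    """Map k-1 angles to a normalized k-dimensional direction (spherical coordinates)."""
    coeffs = []
    sin_prod = 1.0
    for a in angles:
        coeffs.append(sin_prod * math.cos(a))
        sin_prod *= math.sin(a)
    coeffs.append(sin_prod)
    return coeffs

def evolve(quality, k, iterations=1000, sigma=0.1):
    """(1+1) evolution strategy over the angle vector; 'quality' scores a coefficient vector."""
    angles = [random.uniform(0.0, math.pi) for _ in range(k - 1)]
    best = quality(angles_to_coeffs(angles))
    for _ in range(iterations):
        candidate = [a + random.gauss(0.0, sigma) for a in angles]
        score = quality(angles_to_coeffs(candidate))
        if score > best:
            angles, best = candidate, score
    return angles_to_coeffs(angles), best
```

Here quality could be, for instance, the Q1 or Qmod measure evaluated on the training table restricted to the k chosen attributes.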
5 Experimental Results
Two databases were used for the experiments: the sat image database (4435 training and 2000 test objects, 36 attributes) and the letter recognition database (15000 training and 5000 test objects, 16 attributes). Four new attributes were generated for each table: two of them as linear combinations of two selected attributes, the other two were created based on three selected attributes (experiments show that considering more than three attributes hardly improves results, whereas the computation time grows dramatically). Both the training and the test table were extended by the four new attributes; only the training tables, however, were used to choose the linear combinations. Then, the newly created data sets were analyzed using two data mining methods: k-NN (for k from 1 to 10; distances on all dimensions were normalized) and a rough set based analyzer using local reducts (see [9] for details). Table 1 presents the results of classification of the test tables of the databases extended by the new attributes as well as containing only these new ones. In the case of the local reducts based method the number of decision rules is given in the last column.

Table 1. Classification efficiency on the test data

Table name              Result (k-NN)  Result (local reducts)  No. of rules
sat image                    90.60%            81.30%                  5156
  extended, Q1               90.30%            79.50%                  3405
  extended, Qmod             91.05%            82.40%                  1867
  new attributes, Q1         81.65%            64.50%                   445
  new attributes, Qmod       84.30%            76.60%                   475
letter recognition           95.64%            79.64%                 21410
  extended, Q1               92.00%            81.64%                 17587
  extended, Qmod             95.90%            79.74%                 15506
  new attributes, Q1         50.40%            45.40%                  1765
  new attributes, Qmod       67.80%            70.84%                  4569
The results show that in the case of both the k-NN and the rough set based method a table extended with the four additional attributes can be analyzed more accurately. Moreover, even if only the four additional attributes are taken into account, classification can be done with reasonably good efficiency (e.g. 70.8% of correct answers in the case of letter recognition – this is a good result if one takes into account that
there are 26 possible answers). Note that in these cases we have 4 attributes instead of 36 or 16 – this is a significant compression of information. The best results obtained for both the sat image and the letter recognition database are better than the best results reported in [3]. However, the result on sat image is worse than the one obtained using k-NN on feature subsets (91.5%, see [1]). The computation time on sat image (calculation of the best set of four linear combinations) was 64 min (Q1) and 31 min (Qmod); on letter recognition it was 3 h (Q1) and 2 h 40 min (Qmod). Calculations were performed on a Pentium 200 MHz machine.
6 Conclusions
We provided a theoretical and algorithmic framework for finding new features which potentially enable better classification. They were proposed to be searched for as linear combinations of already known continuously valued conditions. Quality measures for the optimization of such combinations were shown to have a strong intuition based on rough sets theory. Their tuning resulted in interesting experimental outcomes, concerning the classification task itself as well as the representation of dependencies within the data.

Acknowledgements

This work was supported by KBN Scientific Research Grant 8T11C01011 and ESPRIT Project 20288 CRIT-2.
References 1. Bay, S.D.: Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. In: Proc. of the International Conference of the Machine Learning. Morgan Kaufmann Publishers, Madison, Wisc. (1998) 2. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer–Verlag (1994) 3. Michie, D., Spiegelhalter, D.J., Taylor, C.C. (eds.): Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited (1994) 4. Nguyen, H.S.: From Optimal Hyperplanes to Optimal Decision Trees. Fundamenta Informaticae. Vol. 34 No. 1-2 (1998) 145–174 5. Pawlak, Z.: Rough sets – Theoretical aspects of reasoning about data. Kluwer Academic Publishers, Dordrecht (1991) 6. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: R. Slowi´ nski (ed.), Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, Kluwer Academic Publishers, Dordrecht (1992) 311–362 7. Stepaniuk, J.: Approximation Spaces, Reducts and Representations. In: Polkowski, L., Skowron, A. (eds.), Rough Sets in Knowledge Discovery 2. Physica–Verlag, Heidelberg (1998) 109–126 8. Wr´ oblewski, J.: Genetic algorithms in decomposition and classification problem. In: Polkowski, L., Skowron, A. (eds.), Rough Sets in Knowledge Discovery 2. Physica– Verlag, Heidelberg (1998) 471–487 9. Wr´ oblewski J.: Covering with reducts – a fast algorithm for rule generation. In: Proc. of RSCTC’98, Warsaw, Poland. Springer–Verlag, Berlin, Heidelberg (1998) 402–407
Managing Interesting Rules in Sequence Mining

Myra Spiliopoulou

Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin, Spandauer Str. 1, D-10178 Berlin, [email protected], http//www.wiwi.hu-berlin.de/~myra
Abstract. The goal of sequence mining is the discovery of interesting sequences of events. Conventional sequence miners discover only frequent sequences, though. This limits the applicability scope of sequence mining for domains like error detection and web usage analysis. We propose a framework for discovering and maintaining interesting rules and beliefs in the context of sequence mining. We transform frequent sequences discovered by a conventional miner into sequence rules, remove redundant rules and organize the remaining ones into interestingness categories, from which unexpected rules and new beliefs are derived.
1 Introduction
Data miners pursue the discovery of new knowledge. But knowledge based solely on statistical dominance is rarely new. The expert needs means either for instructing the miner to discover only interesting rules or for ranking the mining results by “interestingness” [7]. Tuzhilin et al. propose interestingness measures based on the notion of belief [1,6]. A belief reflects the expert’s domain knowledge. Mining results that contradict a belief are more interesting than those simply confirming it. Hence, they propose methods to guide the miner in the discovery of interesting rules only. Most conventional miners did not foresee the need for such guidance when they were developed, so that much research has focussed on the filtering of the mining results in a postmining phase [4]. Sequence miners are no exception [2,5,10]: the goal is the discovery of frequent sequences, which are built by expanding frequent subsequences in an incremental way. The sequence miner WUM, presented in [9,8], does support more flexible measures than frequency of appearance. Still missing is a full system that administers the beliefs, compares and categorizes the mining results against them and uses the finally selected rules and beliefs as input to the next mining session. In this study, we propose such a postmining environment. Its goal is to prune and categorize the patterns discovered by the miner, confirm or update the beliefs of the expert and classify unexpected patterns into several distinct categories. Then, the expert may probe potentially unexpected rules using her updated beliefs or expected rules as a basis. This framework adds value to the functionality of existing miners and helps the expert to gain an overview of the results, to update her system of beliefs and to design the next mining session effectively.
2 Sequence Rules
In sequence mining, the events constituting a frequent sequence need not be adjacent [2]. Thus, a frequent sequence does not necessarily ever occur in the log. We remove this formal oddity by introducing the notion of a generalized sequence or “g-sequence” as a list of events and wildcards [8].
Definition 1. A “g-sequence” g is an ordered list g1 ∗ g2 ∗ . . . ∗ gn, where g1, . . . , gn are elements from a set U and ∗ is a wildcard. The number of non-wildcard elements in g is the length of g, length(g).
A sequence in the transaction log L matches a g-sequence if it contains the non-wildcard elements of the g-sequence in that order. Then:
Definition 2. Let L be a transaction log and let g ∈ U_+^∗ be a g-sequence. The hits of g, hits(g), is the number of sequences in L that match g.
2.1 Rules for g-Sequences
A conventional miner discovers frequent sequences, or “g-sequences” in our terminology. A “sequence rule” is built by splitting such a g-sequence into two adjacent parts: the LHS or premise and the RHS or conclusion. We denote a sequence rule as < LHS, RHS > or LHS ,→ RHS.
Definition 3. Let ζ =< g, s > be a sequence rule, with n = length(g) and m = length(s). Further, let |L| denote the cardinality of the transaction log L.
– “support” of ζ: support(ζ) ≡ support(g · s) = hits(g · s) / |L|
– “confidence” of ζ: confidence(ζ) ≡ confidence(g, s) = support(g · s) / support(g)
– “improvement” of ζ: improvement(ζ) ≡ improvement(g, s) = confidence(g, s) / support(s)
where g · s denotes the concatenation of g and s. We use the notion of support as in [2]. We borrow the concepts of confidence and improvement from the domain of association rules’ discovery [3]. An improvement value less than 1 means that the RHS occurs independently of the LHS and the rule is not really interesting.
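These three measures are straightforward to compute once the hits counts are known; the short sketch below (with invented counts) illustrates them.

```python
# Hypothetical counts for a log of 1000 sequences and a rule g -> s.
log_size = 1000
hits_gs = 120    # sequences matching the whole g-sequence g.s
hits_g = 200     # sequences matching the LHS g
hits_s = 300     # sequences matching the RHS s

support_rule = hits_gs / log_size              # 0.12
confidence_rule = hits_gs / hits_g             # 0.60
support_s = hits_s / log_size                  # 0.30
improvement_rule = confidence_rule / support_s # 2.0: RHS is twice as likely given LHS

print(support_rule, confidence_rule, improvement_rule)
```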
2.2 Extracting and Pruning Sequence Rules
We now turn to the problem of producing rules as the results of the mining process. A frequent sequence generates several rules by shifting events from the LHS to the RHS. Many of those rules are of poor statistics and must be removed. Let g = g1 ∗ g2 ∗ . . . gn be a frequent sequence output by a conventional sequence miner. To compute its statistics, we observe that since g is frequent, then all its elements g1 , g2 , . . . , gn and all its order preserving subsequences, like g1 ∗ g2 , g2 ∗ g3 , g1 ∗ g2 ∗ g3 are also frequent. Then, the support values are known and the confidence and improvement values for any sequence rules containing those subsequences can be computed.
Input: The set G of all g-sequences discovered by the miner, a confidence threshold tc and an improvement threshold timpr (default: 1)
Output: For each g ∈ G, all “acceptable” partitionings of g into an LHS and an RHS
Algorithm BuildSRules:
For each g ∈ G do:
  For i = n − 1, . . . , 1 do:
    Set LHS = g1 ∗ . . . gi and RHS = gi+1 ∗ . . . gn
    If confidence(LHS, RHS) < tc then discard the rule
    Else-If improvement(LHS, RHS) < timpr then discard the rule

Fig. 1. Selectively building sequence rules
Building Sequence Rules from a g-Sequence. From a g-sequence g = g1 ∗ . . . ∗ gn and for any i between 2 and n − 1, we can build two sequence rules ζ1 : g1 ∗ . . . gi−1 ∗ ,→ gi ∗ . . . gn and ζ2 : g1 ∗ . . . gi ∗ ,→ gi+1 ∗ . . . gn. From Def. 3, we can see that the confidences of ζ1 and ζ2 are ratios with the same numerator support(g) and with denominators support(g1 ∗ . . . gi−1) and support(g1 ∗ . . . gi), respectively. Since all sequences containing g1 ∗ . . . gi also contain g1 ∗ . . . gi−1, confidence(ζ1) ≤ confidence(ζ2). Thus, when shifting elements from the LHS to the RHS, the support of the LHS increases and the rule’s confidence decreases. For the same reason, the support of the RHS decreases, so that the improvement changes non-monotonically. High values of improvement are desirable. However, this measure favours rules with a rare RHS, so it must be used in combination with the confidence measure. Thus, our buildSRules algorithm in Fig. 1 only builds rules with confidence higher than a threshold tc and orders them by improvement. Rules with improvement less than 1 or some higher threshold timpr are eliminated. Default thresholds can be provided for both tc and timpr. However, analysts can be expected to provide such values, as is usual for association rules’ discovery.
Pruning Sequence Rules by Comparison. After building a first set of sequence rules with buildSRules, we remove rules that are implied by statistically stronger ones. We consider rules that have the same RHS and overlapping LHS contents. The algorithm pruneSRules is shown in Fig. 2.
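A compact Python rendering of buildSRules is sketched below (an illustration under assumptions: supports of all frequent subsequences are assumed to be precomputed in a lookup table, and g-sequences are tuples of events).

```python
# Hypothetical sketch of BuildSRules (Fig. 1). 'supports' maps tuples of events
# to their support; every prefix/suffix of a frequent g-sequence is assumed present.

def build_srules(gsequences, supports, t_c, t_impr=1.0):
    rules = []
    for g in gsequences:                       # g is a tuple of non-wildcard events
        for i in range(len(g) - 1, 0, -1):
            lhs, rhs = g[:i], g[i:]
            confidence = supports[g] / supports[lhs]
            if confidence < t_c:
                continue
            improvement = confidence / supports[rhs]
            if improvement < t_impr:
                continue
            rules.append((lhs, rhs, confidence, improvement))
    # order the accepted rules by improvement, as described in the text
    return sorted(rules, key=lambda r: r[3], reverse=True)

supports = {("a",): 0.4, ("b",): 0.3, ("c",): 0.15, ("a", "b"): 0.2,
            ("b", "c"): 0.1, ("a", "b", "c"): 0.09}
print(build_srules([("a", "b", "c")], supports, t_c=0.3))
```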
3 Beliefs and Unexpectedness
Thus far, we have established a set of sequence rules and removed those members that had poor statistics. We now build a system of beliefs and categorize the sequence rules according to their relationships to the beliefs.
3.1 Beliefs over Sequences
Informally, a belief is a sequence rule assumed to hold on the data. This assumption is expressed in the form of value intervals that restrict the support, confidence or improvement of the elements appearing in the rule.
Input: The set SR of all sequence rules built by buildSRules
Output: A subset SRout of SR
Algorithm PruneSRules:
1. Group sequence rules by RHS
2. Sort the sequence rules of each group by descending LHS length
3. For each rule ζ = lhs ,→ rhs :
   Find all sequence rules of the form lhs2 ,→ rhs such that:
   – lhs ⊇ lhs1 · lhs2
   – There is a sequence rule lhs1 ,→ lhs2
   For each triad (ζ, lhs1 ,→ lhs2, lhs2 ,→ rhs):
     Let c1 = confidence(ζ), c2 = confidence(lhs1 ,→ lhs2), c3 = confidence(lhs2 ,→ rhs)
     If c1 < min{c2, c3} then remove ζ
       (ζ is implied by the other rules and has lower confidence than them)
     Else-If c3 < min{c1, c2} then remove lhs2 ,→ rhs
       (rhs is predicted by ζ, while lhs2 is predicted by lhs1 ,→ lhs2, both with a higher confidence)
     Else-If c3 > max{c1, c2}, then retain lhs2 ,→ rhs and the one of the two other rules with the largest improvement

Fig. 2. Algorithm for sequence rule pruning
Definition 4. A “belief” β over g-sequences is a tuple < lhs, rhs, CL, C >:
– lhs ,→ rhs is a sequence rule.
– CL is a conjunction of “constraints” on the statistics of lhs.
– C is a conjunction of constraints involving elements of lhs and rhs.
– At least one of CL and C is not empty.
Expected and Unexpected Sequence Rules
Having defined the notion of belief over sequences, we now categorize sequence rules on the basis of their collision against beliefs.
558
M. Spiliopoulou
Definition 5. Let ζ = lhs ,→ rhs be a sequence rule found by mining over the transaction log L, such that: support(lhs) = s, support(ζ) = s0 , conf idence(ζ) = c and improvement(ζ) = i according to Def. 3. Let β =< lhs0 , rhs, CL, C >∈ B be a belief such that lhs = lhs0 · y (where y may be empty). The match of ζ towards β, match(β, ζ), is the pair of predicates (CL(ζ), C(ζ)), where: – CL(ζ) is the conjunction CL ∧ (support(lhs) = s) – C(ζ) is the conjunction CL ∧ C ∧ (support(lhs) = s) ∧ support(ζ) = s0 ∧ conf idence(ζ) = c ∧ improvement(ζ) = i. Here we compare the sequence rule to a belief. This is only possible for beliefs and rules having similar contents. Otherwise, the notion of match is undefined, i.e. “there is no match”. Definition 6. Let B be the collection of beliefs defined over L thus far. Further, let ζ = lhs ,→ rhs be a sequence rule found by mining over L. Then ζ is “expected” if there is a β =< lhs0 , rhs, CL, C >∈ B, such that match(β, ζ) is defined and CL(ζ) ∧ C(ζ) = true. If no such belief exists, ζ is “unexpected”. Thus, a sequence rule is expected if it conforms to a belief in terms of statistics and content. Depending on the reason that makes a rule unexpected, we categorize unexpected rules as follows: Definition 7. Let B be the collection of beliefs over a log L and let ζ =< lhs, rhs > be an unexpected sequence rule. – ζ is “statistically deviating” if there is a belief with same LHS and RHS β =< lhs, rhs, CL, C >∈ B, such that: CL(ζ) ∧ C(ζ) = f alse – ζ “makes unexpected predictions” if there is a β =< lhs, rhs0 , CL, C >∈ B, such that: 1. rhs0 6= rhs but length(rhs0 ) = length(rhs) We build a temporary belief βr =< lhs, rhs, CL, Cr > by 1-to-1 replacing references to rhs0 in C with references to rhs. 2. In the match(βr , ζ) it holds that CL(ζ) ∧ Cr (ζ) = true. – ζ “makes unexpected assumptions” if there is a belief with the same RHS β =< lhs0 , rhs, CL, C >∈ B, such that: 1. lhs0 6= lhs and length(lhs) − length(lhs0 ) = n > 0 a) There is a temporary belief βl =< lhs, rhs, CLl , Cl > built replacing references to lhs0 in CL and C with references to a subsequence of lhs of the same length. b) In the match(βl , ζ) it holds that CLl (ζ) ∧ Cl (ζ) = true If none of the above cases holds, then ζ is called “new”. We group unexpected rules by the semantics of their “unexpectedness”. To test whether a rule makes unexpected predictions or assumptions, we search for all beliefs having the same LHS, resp. RHS, and build temporary beliefs that match with the rule. It should be stressed that a rule is unexpected if there is no belief that makes it expected.
Managing Interesting Rules in Sequence Mining
559
Input: A set of beliefs B, the set of frequent sequences S Output: The modified set of beliefs B and 5 sets of rules: 1. the set of expected rules E 2. the set of statistically deviating unexpected rules D 3. the set of unexpected rules making unexpected predictions P 4. the set of unexpected rules making unexpected assumptions A 5. the set of unexpected new rules N Algorithm PostMine: 1. Invoke BuildSRules to construct the sequence rules from the frequent sequences 2. Invoke PruneSRules to reduce the set of sequence rules into SRout . 3. By comparing the beliefs in B to the sequence rules in SRout , identify the expected rules and place them to E. 4. For each ζ ∈ SRout − E: If there is a belief β ∈ B violated by ζ according to Def. 7, then: if ζ is statistically deviating then add ζ to D else-if ζ makes unexpected predictions then add ζ to P suggest extending β or replacing it by ζ else-if ζ makes unexpected assumptions then add ζ to A suggest transforming ζ to a belief Else add ζ to N Fig. 3. Algorithm PostMine for rule categorization
4
Organizing Sequence Rules by Unexpectedness
We now present our complete algorithm for the categorization of sequence rules by unexpectedness. Its input consists of a set of beliefs and a set of frequent sequences discovered by a conventional miner. Initially, the set of beliefs may be empty or rudimentary: it is reconstructed by the end of the post-mining phase by adding new beliefs and removing outdated ones. The algorithm PostMine is shown in Fig. 3. It first invokes buildSRules (Fig. 1) to generate sequence rules from frequent sequences and then activates pruneSRules (Fig. 2) for rule filtering. The remaining sequence rules are categorized by (un)expectedness. For sequence rules making unexpected predictions or assumptions, PostMine suggests a change in the belief system. The set E of expected rules output by PostMine can now be used as input to the mechanism for rule negation proposed in [6] or be generalized into pattern templates as proposed in [1] for the discovery of further interesting rules.
5
Conclusions
We have proposed a complete post-mining mechanism for the extraction of interesting sequence rules from frequent sequences according to a non-fixed set of
560
M. Spiliopoulou
beliefs. Our PostMine model transforms frequent sequences into a set of sequence rules and filters this set using statistical measures and heuristics based on content overlap and transitivity. The remaining sequence rules are categorized on the basis of their relationship to a collection of beliefs. Some of the expected rules become themselves beliefs, probably replacing old ones. PostMine helps in bringing order into the vast set of frequent sequences a miner can generate. This is fundamental for proper rule maintenance and for the stepwise formulation of a system of beliefs that combines the expert’s background knowledge with the rules hidden in the data. The expert may revise her knowledge by studying the unexpected rules of each category and considering the suggestions of our algorithm. We are currently implementing PostMine and intend to use it together with our sequence miner WUM. In this coupling, we want to investigate the automated extraction of beliefs based on variables, as supported by WUM, rather than constants. This would lead to a smaller set of generic beliefs, which the expert can inspect and manipulate easier. We are further interested in the reduction of a set of sequence rules into a minimal set with still reliable statistics.
References 1. Gediminas Adomavicius and Alexander Tuzhilin. Discovery of actionable patterns in databases: The action hierarchy approach. In KDD, pages 111–114, Newport Beach, CA, Aug. 1997. 2. Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In ICDE, Taipei, Taiwan, Mar. 1995. 3. Michael J.A. Berry and Gordon Linoff. Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley & Sons, Inc., 1997. 4. Alex Freitas. On objective measures of rule surprisingness. In PKDD’98, number 1510 in LNAI, pages 1–9, Nantes, France, Sep. 1998. Springer-Verlag. 5. Heikki Mannila and Hannu Toivonen. Discovering generalized episodes using minimal occurences. In KDD’96, pages 146–151, 1996. 6. Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In KDD’98, pages 94–100, New York City, NY, Aug. 1998. 7. Gregory Piateski-Shapiro and Christopher J. Matheus. The interestingness of deviations. In AAAI’94 Workshop Knowledge Discocery in Databases, pages 25–36. AAAI Press, 1994. 8. Myra Spiliopoulou. The laborious way from data mining to web mining. Int. Journal of Comp. Sys., Sci. & Eng., Special Issue on “Semantics of the Web”, Mar. 1999. 9. Myra Spiliopoulou and Lukas C. Faulstich. WUM: A Tool for Web Utilization Analysis. In extended version of Proc. EDBT Workshop WebDB’98, LNCS 1590. Springer Verlag, 1999. 10. Mohammed J. Zaki. Fast mining of sequential patterns in very large databases. Technical Report 668, University of Rochester, 1997.
Support Vector Machines for Knowledge Discovery
Shinsuke Sugaya (1), Einoshin Suzuki (1), and Shusaku Tsumoto (2)
(1) Division of Electrical and Computer Engineering, Faculty of Engineering, Yokohama National University, {shinsuke,suzuki}@slab.dnj.ynu.ac.jp
(2) Department of Medical Informatics, Shimane Medical University, School of Medicine, [email protected]
Abstract. In this paper, we apply the support vector machine (SVM) to knowledge discovery (KD) and confirm its effectiveness with a benchmark data set. SVM has been successfully applied to problems in various domains; however, its effectiveness as a KD method is unknown. We propose SVM for KD, which deals with a classification problem with a binary class, by rescaling each attribute based on z-scores. SVM for KD can sort attributes with respect to their effectiveness in discriminating classes. Moreover, SVM for KD can discover crucial examples for discrimination. We set up six discovery tasks with the meningoencephalitis data set, which is a benchmark data set in KD. A domain expert ranked the discovery outcomes of SVM for KD from one to five with respect to several criteria. The attributes selected in the six tasks are all valid and useful: their average scores are 3.8-4.0. Discovering the order of attributes with respect to usefulness is a challenging problem; nevertheless, our method achieved a score of at least 4.0 in three tasks. Besides, the crucial examples for discrimination and the typical examples for each class agree with medical knowledge. These promising results demonstrate the effectiveness of our approach.
1 Introduction
Support vector machine (SVM) [1,10] is an effective method for a classification problem with a binary class. Although SVM has been successful in various domains [2,5,6], its effectiveness as a knowledge discovery (KD) method is unknown. Since SVM has achieved high accuracy in various domains, we considered that applying SVM to KD and extracting information from its accurate model would be likely to produce useful knowledge. However, if SVM is directly applied to a data set, it is difficult to determine effective attributes for discrimination, since the optimal separating hyperplane (OSH) depends on the scales of the attribute values. In order to circumvent this problem, we propose SVM for KD, which adjusts these scales to an approximately equivalent length based on z-scores [3]. SVM for KD obtains effective attributes for discrimination, crucial examples for discrimination, and typical examples for each class.
We have applied SVM for KD to the meningoencephalitis data set [8], which is a common problem in KD created by one of the authors, Tsumoto (a medical doctor). We performed experiments on six problems by selecting six attributes as the class. Tsumoto judged quantitatively that all the results with respect to attributes are useful. He also confirmed that the results concerning crucial examples and typical examples agree with medical knowledge.
2 SuperKDD: Discovery Challenge Contest
The goal of SuperKDD (Supervised KDD) [8] is to compare KD methods under the supervision of a domain expert (Tsumoto) using a common medical data set such as the meningoencephalitis data set. The results of SuperKDD enable us to characterize KD methods more concretely and show the importance of interaction between KD researchers and domain experts. The common data set collects the data of patients who suffered from meningitis and were admitted to the departments of emergency and neurology in several hospitals. Tsumoto worked as a domain expert for these hospitals and collected the data from past patient records (1979 to 1989) and from the cases in which he made a diagnosis (1990 to 1993). The database consists of 140 cases, and all the data are described by 38 attributes, including present and past history, laboratory examinations, final diagnosis, therapy, clinical courses, and final status after the therapy.
3 SVM for Knowledge Discovery
3.1 Support Vector Machines
Support vector machines [1,10] are a new learning method introduced by V. Vapnik and AT&T Bell Laboratories. SVM is applied to a classification problem with a binary class and finds a classification model that discriminates between the vectors belonging to the two classes. In SVM, a vector that is the closest to the hyperplane is called a support vector (SV). The classification model, the optimal separating hyperplane (OSH), is given by maximizing the distance between the hyperplane and an SV. SVM has been applied to various problems and has achieved higher accuracy than conventional methods [2,6].
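For reference, a standard textbook formulation of the OSH (not spelled out in the paper) is the hard-margin optimization problem

\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \; i = 1, \ldots, m,

whose solution gives the classifier f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b); the support vectors are exactly the training vectors for which the constraint holds with equality.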
3.2 Description of SVM for KD
The information obtained from SVM consists of the OSH, the SVs, the non-SVs and the misclassified examples. This paper considers exploiting the normal vector of the OSH, the SVs and the non-SVs for KD. We first explain how to use the normal vector of the OSH. Consider an attribute whose coefficient in the normal vector has a large absolute value. Figure 1(a) illustrates the projection of the examples on the axis of this attribute in the example space. From figure 1(a), we see that the examples can be discriminated by this
attribute. We can conclude that the absolute value of a coefficient represents the discrimination power of its attribute. In figure 1(b), readers may suspect that a classification model represented by the dotted line is produced, so that an attribute whose coefficient has a small absolute value influences the discrimination. However, this case never happens, since SVM produces the classification model represented by the straight line, which has a larger margin than the dotted line.
Fig. 1. (a) Projection of the examples on the axis of an attribute whose coefficient in the normal vector has a large absolute value. (b) A counterexample for a discriminative attribute whose coefficient has a small absolute value; the straight line, not the dotted line, is produced as the OSH.
The above discussion holds true for attributes whose scales are approximately equivalent. The absolute value of a coefficient in the normal vector also depends on the scale of the attribute. In order to understand this, consider the example in figure 2. The left-hand side (a) and the right-hand side (b) represent the attribute "HEIGHT" expressed in inches and in centimeters respectively. We see that, although these represent the same attribute, the line in the right-hand side is flatter due to the smaller scale for "HEIGHT". Hence, if the scales of the attributes are totally different, it is difficult to determine effective attributes for discrimination.
Fig. 2. Effect of expressing the attribute "HEIGHT" in inches (a) or in centimeters (b). The line in (b) is flatter due to the smaller scale of its values. (Both panels plot AGE in years against HEIGHT; panel (a) spans 60-70 inches, panel (b) 160-190 centimeters.)
In order to circumvent this problem, we propose to normalize each attribute by z-scores [3]. The z-score is a method for obtaining the relative position of a value with respect to an attribute: the z-score z_i represents the number of standard deviations s between an attribute value x_i and the average value x_m. Attribute values are transformed by

z_i = \frac{x_i - x_m}{s}    (1)
Effective attributes for discrimination are determined after this transformation, which corresponds to adjusting the attributes to approximately equivalent scales. SVM for KD obtains the n most effective attributes for discrimination by calculating the coefficients of the normal vector of the OSH and sorting them with respect to their absolute values. Next, we explain how to use SVs and non-SVs. An SV, which is the closest example to the OSH, is an example that determines each class. Hence, an SV corresponds to a crucial example for discrimination. On the other hand, since a non-SV is farther from the OSH than an SV, it can be considered a typical example for its class. SVM for KD outputs SVs as crucial examples for discrimination and non-SVs as typical examples for each class.
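The procedure of this section can be summarized by the following minimal Python sketch, written with the modern scikit-learn library purely for illustration (the original experiments did not use it); X, y and attribute_names are assumed inputs.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svm_for_kd(X, y, attribute_names, n_top=10):
    Xz = StandardScaler().fit_transform(X)          # z-score rescaling, Equation (1)
    clf = SVC(kernel="linear").fit(Xz, y)           # linear SVM -> OSH
    w = clf.coef_.ravel()                           # normal vector of the OSH
    ranking = np.argsort(-np.abs(w))                # sort attributes by |coefficient|
    top_attributes = [attribute_names[i] for i in ranking[:n_top]]
    sv_idx = clf.support_                           # support vectors: crucial examples
    non_sv_idx = np.setdiff1d(np.arange(len(y)), sv_idx)   # non-SVs: typical examples
    return top_attributes, sv_idx, non_sv_idx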
4 Application to SuperKDD
4.1 Conditions
We applied SVM for KD to the meningoencephalitis data set [8]. We set up six tasks by selecting Diag2 (grouped attribute of diagnosis), EEG WAVE (electroencephalography), CT FIND (CT finding), CULT FIND (whether bacteria or virus is specified or not), COURSE (grouped attribute of clinical course at discharge) and RISK (grouped attribute of risk factor) as the class. Since SVM can only be applied to continuous attributes, nominal attributes with more than two values were ignored in the experiments. Attributes concerning therapy and courses, such as COURSE, can be measured only after diagnosis; they were therefore ignored in the four tasks with respect to Diag2, EEG WAVE, CT FIND and CULT FIND, since they are not available when determining these classes. Moreover, RISK was removed from the data analysis because it was selected as a decision attribute. Following the experiments reported in [9], we generalized several attributes, which have too many values and seem to lose essential information for data analysis. Also, categorical attributes were transformed into numerical attributes for SVM. We applied SVM after all preprocessing procedures, so the results obtained here cannot be compared directly with those shown in the former paper, where such preprocessing procedures were not applied.
4.2 Effective Attributes for Discrimination
We first justify our rescaling strategy for attributes by comparing results with and without the z-score transformation. The ten highest-ranking attributes for discrimination
with respect to class Diag2, with and without the z-score transformation, are {Cell Poly, CSF CELL, CRP, STIFF, KERNIG, Cell Mono, CT FIND, CSF GLU, COLD, WBC} and {CT FIND, FOCAL, STIFF, CRP, KERNIG, SEIZURE, BT, GCS, COLD, LOC} respectively. As described in section 3.2, attributes whose scales are small, such as Cell Poly and CSF CELL, are not obtained without the z-score transformation. However, domain knowledge indicates that these attributes are effective in this discrimination task. These results demonstrate the validity of our approach.

Table 1. Average scores of the ten highest-ranking attributes concerning three criteria in six tasks with different classes, where "unexpect." represents unexpectedness.

            Selection of Attributes              Order of Attributes
Class       Validness  Unexpect.  Usefulness     Validness  Unexpect.  Usefulness
Diag2       4.8        1.0        4.8            4.0        1.4        4.2
EEG WAVE    4.7        1.2        4.1            3.0        2.7        3.2
CT FIND     4.3        1.7        4.1            3.3        1.4        3.2
CULT FIND   4.4        1.8        4.1            3.3        2.6        3.6
COURSE      5.0        1.0        5.0            2.9        2.3        4.0
RISK        3.8        2.5        4.0            2.6        2.4        4.0
Next, the results obtained by applying SVM for KD to the six tasks are summarized in Table 1. Tsumoto, a domain expert, ranked, from one to five, the selected attributes and the order of attributes concerning the ten highest-ranking attributes with respect to validness, unexpectedness and usefulness. Here, validness indicates that the discovered results agree with the medical context, and unexpectedness represents that the discovered results can be partially explained by the medical context but are not accepted as common sense. Usefulness indicates that the discovered results are useful in the medical context. According to Tsumoto, it is relatively difficult to select ten relevant attributes in this problem. For the selection of attributes, we see that the results are satisfactory. All the average scores concerning usefulness are at least 4.0, and the scores concerning validness in five tasks are at least 4.3. Scores are rather low for unexpectedness; however, the meaning of this index is different from that of the other two. From the left-hand side of Table 1, we can conclude that our method always discovered useful attributes for discrimination in these six tasks, and that the selections were conservative rather than exploratory. Determining the order of attributes is a more difficult problem than simply selecting attributes. From the table, our method is still effective for this problem. For validness, scores are at least 3.0 except for the tasks for COURSE and RISK. In terms of usefulness, our SVM for KD always achieved a score of at least 3.2; in three out of six tasks, scores are at least 4.0. Scores for unexpectedness are higher than
in the attribute-selection problem. However, our method can still be classified as conservative. The results may be summarized as follows: the pieces of knowledge obtained by SVM for KD are all useful with respect to the selected attributes. Moreover, they are promising for a more difficult problem: determining the order of attributes. For both problems, SVM for KD has turned out to be conservative: the obtained results are expected. However, this tendency would be important in domains such as medical diagnosis.
4.3 Classification of Examples
Table 2. Performance for selecting crucial examples for discrimination and typical examples for each class. Each ratio indicates how many examples agree with medical context.

          Diag2  EEG WAVE  CT FIND  CULT FIND  COURSE  RISK
SV        90%    100%      100%     79%        89%     100%
Non-SV    100%   100%      100%     95%        100%    99%
As described in section 3, SVs are considered crucial examples for discrimination. Remarkably, from Table 2, Tsumoto judged that all the examples obtained in three tasks correspond exactly to crucial examples for discrimination in the medical context. Moreover, 79%-90% of the examples concerning the remaining tasks agree with the medical context. On the other hand, non-SVs are considered typical examples for each class. Remarkably, 95%-100% of the examples in the six tasks also agree with the medical context. These results clearly demonstrate the effectiveness of our approach for this problem. We also investigated the misclassified examples. There were relatively many misclassified examples in two tasks: CULT FIND and COURSE. Tsumoto explained that these tasks are difficult for a linear discrimination model such as our SVM, and that these misclassifications are reasonable. Since our method can discover effective attributes and crucial examples for discrimination, as demonstrated in section 4.2, we do not consider that the number of these misclassified examples should be decreased. However, outliers and exceptions are gaining increasing attention in the KD community [4,7]. We regard these examples as interesting and are currently planning to exploit them for KD.
5 Conclusions
This paper has explored the capability of the support vector machine [1,10] in knowledge discovery. Three types of knowledge for discrimination, namely effective attributes, crucial examples and typical examples, are discovered in our approach. We have proposed attribute rescaling based on z-scores [3] for this purpose.
The meningoencephalitis data set [8], which is used in a contest of knowledge discovery methods, was chosen to demonstrate the effectiveness of our approach. One of the authors, Tsumoto, who is also a medical doctor, quantitatively evaluated the results. From the results, we can safely conclude that our method discovers useful attributes for discrimination, as well as crucial examples and typical examples that agree with medical knowledge. Ongoing work is focused on the exploitation of misclassified examples in knowledge discovery, and on the investigation of non-linear models generated by kernel methods [1,10] with this data set.
Acknowledgement
This work was partially supported by the grant-in-aid for scientific research on priority area "Discovery Science" from the Japanese Ministry of Education, Science, Sports and Culture (11130207).
References
1. C. Cortes and V. Vapnik: "Support Vector Network", Machine Learning, Vol. 20, No. 3, pp. 1-25, 1995.
2. T. Joachims: "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proc. Tenth European Conf. Machine Learning (ECML), pp. 137-142, 1998.
3. L. Kaufman and P. Rousseeuw: Finding Groups in Data, John Wiley & Sons, 1990.
4. E. M. Knorr and R. T. Ng: "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. 24th Ann. Int'l Conf. Very Large Data Bases (VLDB), pp. 392-403, 1998.
5. E. Osuna, R. Freund and F. Girosi: "Training Support Vector Machines: an Application to Face Detection", Proc. Computer Vision and Pattern Recognition, pp. 130-136, 1997.
6. M. Pontil and A. Verri: "Support Vector Machines for 3D Object Recognition", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 20, No. 6, pp. 637-646, 1998.
7. E. Suzuki: "Autonomous Discovery of Reliable Exception Rules", Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 259-262, 1997.
8. S. Tsumoto et al.: "Comparison of Data Mining Methods using Common Medical Datasets", ISM Symp.: Data Mining and Knowledge Discovery in Data Science, pp. 63-72, 1999.
9. S. Tsumoto: "Knowledge Discovery in Clinical Databases: an Experiment with Rule Induction and Statistics", Proc. Eleventh Int'l Symp. Methodologies for Intelligent Systems (ISMIS), pp. 349-357, LNAI 1609, Springer-Verlag, 1999.
10. V. Vapnik: The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
Regression by Feature Projections*
İlhan Uysal and H. Altay Güvenir
Department of Computer Engineering and Information Sciences, Bilkent University, 06533 Ankara, Turkey, {uilhan,guvenir}@cs.bilkent.edu.tr
Abstract. This paper describes a machine learning method, called Regression by Feature Projections (RFP), for predicting a real-valued target feature. In RFP, training is based on simply storing the projections of the training instances on each feature separately. Prediction is computed through two approximation procedures. The first approximation process finds the individual predictions of the features by using the K-nearest neighbor (KNN) algorithm. The second approximation process combines the predictions of all features. During the first approximation step, each feature is associated with a weight in order to determine the prediction ability of the feature at the local query point. The weights, found for each local query point, are used in the second step and give the method an adaptive or context-sensitive nature. We have compared RFP with the KNN algorithm. Results on real data sets show that RFP is much faster than KNN, yet its prediction accuracy is comparable with that of the KNN algorithm.
1 Introduction
In this paper we describe a method for predicting a continuous target feature. Predicting a continuous feature is generally known as regression in related fields such as machine learning, statistics and pattern recognition, as well as in knowledge discovery in databases (KDD) and data mining. There are two different approaches to regression in the literature: eager and lazy learning. The term eager is used for learning systems that construct models representing the knowledge in the training data. After training, predictions are made by using this model, which is a compact representation of the data. In lazy learning, on the other hand, all processing is delayed to the prediction phase. We describe a lazy learning method called Regression by Feature Projections (RFP), to predict a continuous target, where the instances are stored as their projections on each feature dimension. In the RFP method, we use the KNN algorithm together with linear least squares approximation to find the prediction at each feature dimension. Then we find the precision of those features at the local
* This project is supported, in part, by TUBITAK (Scientific and Technical Research Council of Turkey) under Grant 198E015.
position of the query instance. We define the precision as a local weight, which gives the method an adaptive or context-sensitive nature. By adaptive, we mean that the contribution of each feature changes according to the local position of the query instance. The final prediction is obtained by combining the individual feature predictions using their local weights. RFP eliminates some problems met in real data sets, namely missing feature values, irrelevant features and the normalization of data. The major limitation of the algorithm is its assumption that the contribution of each feature to the final prediction is independent of the other features. The empirical results show that the RFP method is much faster than its natural competitor KNN, and achieves a comparable accuracy. The description of the weighted KNN regression algorithm we used for comparisons is given in [5]. For most data mining or knowledge discovery applications, where very large databases are of concern, RFP can be thought of as a solution because of its small computational complexity and its elimination of the above problems with real data sets. In Section 2 and Section 3, the description of RFP and its evaluation are given respectively. Finally, in Section 4, conclusions and future work are presented.
2 Regression by Feature Projections
In this section we introduce a lazy regression method based on feature projections, called Regression by Feature Projections (RFP). The main property of the algorithm is that a different approximation is done for each feature, where the training data is projected onto every feature. This approximation is done by using the instances nearest to the query point, and these instances may differ at each feature dimension, independently of the other features. The final prediction is found as the weighted combination of the feature predictions.
2.1 Training
As described above, training involves simply storing the training set as projections onto the features. This is done by associating a copy of the target value with each projection, then sorting the instances for each feature dimension according to their feature values. If there are missing feature values, they are simply ignored on the corresponding features.
2.2 Approximation at Feature Projections
In the first approximation step of the prediction algorithm, we employ the KNN algorithm at each feature dimension. Since the instances are sorted according to their feature values during training, the nearest neighbors can be found by a binary search. Then all K nearest neighbors are found by comparing the sorted feature values of neighboring instances. After determining the K nearest instances, a prediction is made for that feature using the feature and target values of these instances. We apply the linear least squares approximation, given by Equation (1),
to find the predicted target value at a particular feature dimension. The linear least squares algorithm is described in [4]; it minimizes the sum of squared errors over the instances, given by Equation (2):

\bar{y}_i = \beta_0 + \beta_1 x_{i1}    (1)

Error = \sum_{i=1}^{n} (y_i - \bar{y}_i)^2    (2)

where n is the number of instances and y_i is the actual target value. After constructing a linear equation by using the linear least squares algorithm, the prediction at a particular feature projection is done by simply substituting the feature value of the query instance into this equation.
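A minimal Python sketch of this per-feature step is given below (variable names are ours); it assumes the projection of one feature has already been sorted during training and that k is at least two with non-constant neighbor values.

import bisect
import numpy as np

def feature_prediction(xs, ys, xq, k):
    # xs: sorted feature values, ys: corresponding targets, xq: query value
    pos = bisect.bisect_left(xs, xq)                    # binary search for the query
    lo, hi = max(0, pos - k), min(len(xs), pos + k)     # candidate neighbors around it
    idx = sorted(range(lo, hi), key=lambda i: abs(xs[i] - xq))[:k]
    x_nn = np.array([xs[i] for i in idx], dtype=float)
    y_nn = np.array([ys[i] for i in idx], dtype=float)
    beta1, beta0 = np.polyfit(x_nn, y_nn, 1)            # linear least squares, Equation (1)
    return beta0 + beta1 * xq, x_nn, y_nn, beta0, beta1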
2.3 Local Weight
Some regions of a feature dimension may produce better approximations than others. In order to obtain a measure of the estimation quality at a particular feature, we employ a weighting measure in the prediction algorithm. If the region into which the query point falls is smooth, we give a high weight to that feature in the final prediction. In this way we eliminate both the effect of irrelevant features and the irrelevant regions of a feature dimension. This establishes an adaptive, or context-sensitive, nature, where at different locations in the instance space the contribution of the features to the final approximation differs. Since we employ the linear least squares approximation, smoothness is determined according to the constructed linear equation. In order to measure the degree of smoothness, we compute the distance-weighted mean of the squared differences between the target values of the nearest neighbors and their estimated values found by using the linear equation. We denote this measure by V_f, shown in Equation (4). By subtracting it from the variance of the target values of all instances, V_all, we find the explained variance for that region, and by normalizing it with the variance of the training set we obtain a measure called the prediction index (PI), given in Equation (6). We use the squared PI as the local weight (LW) for each feature (7).

V_{all} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2    (3)

V_f = \frac{\sum_{i=1}^{k} w_i (y_i - \bar{y}_i)^2}{\sum_{i=1}^{k} w_i}    (4)

where n is the number of instances, \bar{y} is the mean of the target values, \bar{y}_i is the estimate of the feature for the ith instance, and w_i is defined in Equation (5):

w_i = \frac{1}{\epsilon + (x_{if} - x_{qf})^2}    (5)

where \epsilon is a positive real number close to zero.
Training:
  Store each feature value with target separately
  Sort each input feature dimension according to feature values

Prediction(q, k):   /* q: query instance, k: number of neighbors */
  Let Sum_weight = 0 and prediction = 0
  for each feature f
    if feature value of q is not missing
      Find k nearest neighbors
        /* apply binary search to find the nearest neighbor */
      Find linear least squares estimate, Pf
      Find local weight, LW
      Sum_weight = Sum_weight + LW
      prediction = prediction + LW * Pf
  prediction = prediction / Sum_weight
  return (prediction)

Fig. 1. Training and Prediction Algorithms
PI_f = \frac{V_{all} - V_f}{V_{all}}    (6)

LW_f = \begin{cases} PI_f^2 & \text{if } PI_f > 0 \\ 0 & \text{otherwise} \end{cases}    (7)

2.4 Prediction
We find the final approximation by merging the predictions found for each feature dimension. This is done by averaging these results, where the local weights are employed. Figure 1 summarizes the prediction phase as well as the training. If the query instance has missing values, this situation is handled by the prediction algorithm by simply ignoring the prediction on any feature dimension whose query value is missing. Finally, a prediction is made by giving higher weights to the feature predictions whose local regions at the query location are smooth.
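Putting Equations (3)-(7) and Fig. 1 together, a minimal Python sketch of the prediction phase could look as follows; it reuses feature_prediction from the sketch in Sect. 2.2, and projections, query and v_all (the training-set variance of Equation (3)) are assumed inputs.

import numpy as np

def rfp_predict(projections, query, k, v_all, eps=1e-6):
    # projections: feature -> (sorted feature values, targets); query: feature -> value
    weight_sum, prediction = 0.0, 0.0
    for f, xq in query.items():                      # missing features are simply absent
        xs, ys = projections[f]
        pf, x_nn, y_nn, b0, b1 = feature_prediction(xs, ys, xq, k)
        w = 1.0 / (eps + (x_nn - xq) ** 2)                            # Equation (5)
        vf = np.sum(w * (y_nn - (b0 + b1 * x_nn)) ** 2) / np.sum(w)   # Equation (4)
        pi = (v_all - vf) / v_all                                     # Equation (6)
        lw = pi ** 2 if pi > 0 else 0.0                               # Equation (7)
        weight_sum += lw
        prediction += lw * pf
    return prediction / weight_sum if weight_sum > 0 else None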
3 Empirical Evaluation
RFP inherits most properties of other lazy approaches. The two most important benefits of lazy learning are a very small training complexity and the handling of local information in the instance space. RFP has these properties with the additional benefit of a small prediction time. The method also deals with
both types of input features, categorical and continuous, and handles irrelevant features. The single drawback of the method is its inability to deal with interactions or relations among the input features, which may lead to a decrease in prediction accuracy. However, we have observed that real-world data sets generally do not contain such interactions between features [1,2,3]. On the other hand, especially for large data sets with a large number of input features and instances, the RFP method can be considered a reliable solution, since it can eliminate the irrelevant features by assigning them lower weights. In order to evaluate the prediction performance of a regression method, we used the relative error (RE), computed by the following formula:

RE = \frac{MSE}{\frac{1}{t} \sum_{i=1}^{t} (y_i - \bar{y})^2}    (8)

where t is the number of test cases, \bar{y} is the median of the target values of the training instances, and the mean squared error (MSE) is defined below:

MSE = \frac{1}{t} \sum_{i=1}^{t} (y_i - prediction_i)^2    (9)
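For completeness, the two evaluation measures can be computed as in the short sketch below (argument names are ours):

import numpy as np

def relative_error(y_true, y_pred, y_train_median):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)                 # Equation (9)
    baseline = np.mean((y_true - y_train_median) ** 2)    # denominator of Equation (8)
    return mse / baseline                                 # Equation (8)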
Most of the real-world data sets were selected from the collection of data sets provided by the machine learning group at the University of California at Irvine. Information about the number of instances, features and missing values, and about the target features of the data sets used for the experiments, is summarized in Table 1.

Table 1. Datasets

Dataset    Instances  Features  Miss. Val.  Target Feature
abalone    4177       8         None        Rings (Age = Rings + 1.5)
cpu        209        9         None        Relative CPU Performance
housing    506        13        None        Housing Values in Boston
villages1  887        32        Many        Animal Resources
villages2  766        32        Many        Agriculture Area
We measured the error rate RE using 10-fold cross-validation. We compared the results of RFP with the results of distance-weighted KNN (WKNN), for K values of 5 and 10. From the results given in Table 2, we can easily conclude that RFP achieves a prediction performance comparable with the K-nearest neighbor algorithm. For the villages data sets, where there are many features and missing values, it achieves a lower prediction error. The computational complexity of RFP is O(m log(n)), which is better than the complexity of KNN, O(mn); this is apparent in the empirical results, especially for large data sets.
Note: the official villages dataset includes data about villages around the same region. It can be obtained from the authors.
Table 2. RE Rates of RFP

Data:                K    abalone  cpu   housing  villages1  villages2
RFP                  5    0.56     0.30  0.60     0.94       0.90
RFP                  10   0.57     0.25  0.60     0.95       0.90
TEST TIME (ms)       10   768      24    94       430        529
WKNN                 5    0.51     0.52  0.39     1.46       1.33
WKNN                 10   0.47     0.52  0.39     1.14       1.14
TEST TIME (ms)       10   8047     17.2  143      635        853
4 Conclusions
We have described a regression method called RFP, based on feature projections, which achieves a fast computation time while preserving an accuracy comparable to the most popular lazy method, KNN. The method inherits most of the properties of lazy regression methods and has some additional benefits. It handles missing values appropriately, simply by ignoring them, and handles both nominal and continuous feature values. Besides this, it does not require a normalization of the data, which is an important preprocessing step required for the KNN algorithm. Finally, RFP is an appropriate method for data sets having irrelevant features, since it employs a weighting for them. The performance results and fast computation time of RFP encourage us to present this method as a data mining solution for high-dimensional databases of very large size. On the other hand, the major limitation of RFP is its assumption that the features are independent. Future work can be directed towards new methods which inherit the advantages of RFP and deal with interactions, in order to reach a better prediction performance. New methods can also be developed for regression that make generalizations on feature projections, in order to enable the interpretation of the data.
References ˙ 1. G¨ uvenir, H. A. and Sirin, I.,Classification by Feature Partitioning, Machine Learning, 23:47-67, 1996. 2. G¨ uvenir, H. A. and Demiroz, G., Ilter N., Learning Differential Diagnosis of Erythemato Squamous Diseases using Voting Feature Intervals, Artificial Intelligence in Medicine, 13:147-165, 1998. 3. Holte, R. C., Very Simple Classification Rules Perform Well on Most Commonly Used Data Sets, Machine Learning, 11:63-91, 1993. 4. Mathews,J.H.,Numerical Methods for Computer Science, Engineering and Mathematics Prentice-Hall, 1987. 5. Mitchell,T.M.,Machine Learning, McGraw Hill, 1997.
Generating Linguistic Fuzzy Rules for Pattern Classification with Genetic Algorithms
N. Xiong and L. Litz
Institute of Process Automation, University of Kaiserslautern, Postfach 3049, D-67653 Kaiserslautern, Germany
{xiong,Litz}@e-technik.uni-kl.de
Abstract. This paper presents a new genetic-based approach to automatically extracting classification knowledge from numerical data by means of premise learning. A genetic algorithm is utilized to search for the premise structure, in combination with the parameters of the membership functions of the input fuzzy sets, to yield optimal conditions for the classification rules. The consequence under a specific condition is determined by choosing, from all possible candidates, the class which leads to a maximal truth value of the rule. The major advantage of our work is that a parsimonious knowledge base with a low number of classification rules is made possible. The effectiveness of the proposed method is demonstrated by simulation results on the Iris data.
1. Introduction

Fuzzy logic based systems have found various applications for solving pattern classification problems [8]. The task of building a fuzzy classification system is to find a set of linguistic rules, based on which the class of an unknown object can be inferred through fuzzy reasoning. The premise determines the structure of a rule, and it corresponds to a fuzzy subspace of the input domain. A simple way to define fuzzy subspaces as rule conditions is to partition the pattern space by a simple fuzzy grid [7, 9]. This is, however, not suitable in cases of high attribute dimension, because the rule number increases exponentially with the number of attributes. This paper aims at learning general premises of rules by a genetic algorithm (GA) in order to treat problems with multiple attribute inputs. The upper limit of the number of rules is predetermined by the user in advance; it can be considered as an estimate of the amount of rules sufficient to achieve a satisfactory classification. The proposed modeling procedure for a fuzzy classifier consists of two loops. In the outer loop, the GA is utilized to search for the optimal premise structure of the rule base and to optimize the parameters of the input fuzzy sets at the same time. The task of the inner loop is to determine the classes in the conclusions by maximizing the truth values of the rules. The effectiveness of our method is examined on the well known example of Iris data [3] classification using linguistic fuzzy rules.
2. Fuzzy Classification System

Let us consider a K-class classification problem as follows. The objects or cases in the universe are described by a collection of attributes x_1, x_2, ..., x_n. The fuzzy sets of the attribute x_j (j = 1...n) are represented by A(j,1), A(j,2), ..., A(j, q[j]), where q[j] is the number of linguistic terms for x_j. Denote by p() an integer function mapping from {1, 2, ..., s} (s ≤ n) to {1, 2, ..., n} satisfying p(x) ≠ p(y) whenever x ≠ y. The fuzzy rules to be generated for classification can be formulated as:

\text{if } [x_{p(1)} = \bigcup_{j \in D(1)} A(p(1), j)] \text{ and } \ldots \text{ and } [x_{p(s)} = \bigcup_{j \in D(s)} A(p(s), j)] \text{ then Class } B    (1)

where D(i) ⊆ {1, 2, ..., q[p(i)]} for i = 1...s, and B ∈ {C_1, C_2, ..., C_K}.
If the rule premise includes all input variables (i.e. s = n), we say that this rule has a complete structure; otherwise its structure is incomplete. An important feature of the rules in form (1) is that a union operation of input fuzzy sets is allowed in their premises. Rules containing such OR connections in their conditions can cover a group of related rules which use complete AND connections of single linguistic terms as rule premises. By substituting the premise description of the rule in (1) with the symbol A, i.e.

A = [x_{p(1)} = \bigcup_{j \in D(1)} A(p(1), j)] \text{ and } \ldots \text{ and } [x_{p(s)} = \bigcup_{j \in D(s)} A(p(s), j)]    (2)

the rule can be abbreviated as "If A Then B". The condition A, on the other hand, can be regarded as a fuzzy subset of the training set U_T = {u_1, u_2, ..., u_m}. The membership value of an object in this fuzzy subset is equal to the degree to which A is satisfied by its attribute vector. Thus we write:

A = \left\{ \frac{u_1}{\mu_A(u_1)}, \frac{u_2}{\mu_A(u_2)}, \ldots, \frac{u_m}{\mu_A(u_m)} \right\}    (3)

\mu_A(u_i) = \mu_A(x_{i1}, x_{i2}, \ldots, x_{in}), \quad i = 1 \ldots m    (4)
Here (x_{i1}, x_{i2}, ..., x_{in}) is the attribute vector of the object u_i in the training set. Similarly, the conclusion B is treated as a crisp subset of the training set. An object belongs to this crisp set if and only if its class is the same as B. Therefore the subset for B is represented as:

\mu_B(u_i) = \begin{cases} 1 & \text{if } class(u_i) = B \\ 0 & \text{otherwise} \end{cases}, \quad i = 1 \ldots m    (5)
The rule "If A Then B" corresponds to the implication A → B, which is equivalent to the proposition that A is a subset of B, i.e. A ⊆ B. In this view, the measure of the subsethood of A in B is utilized as the truth value of the rule. So we obtain

truth(A \Rightarrow B) = \frac{M(A \cap B)}{M(A)} = \frac{\sum_{u_i \in U_T} \mu_A(u_i) \wedge \mu_B(u_i)}{\sum_{u_i \in U_T} \mu_A(u_i)} = \frac{\sum_{u_i \in U_T,\, class(u_i) = B} \mu_A(u_i)}{\sum_{u_i \in U_T} \mu_A(u_i)}    (6)
where M(A) and M(A ∩ B) indicate the cardinality measures of the sets A and A ∩ B respectively. Given a condition A, we choose the consequent class B from the finite set of candidates such that the truth value of the considered rule reaches its maximum. This means that the rule consequent can be selected in the following two steps:

Step 1: Calculate the competition strength of each class as

\alpha(c) = \sum_{u_i \in U_T,\, class(u_i) = c} \mu_A(u_i), \quad c = C_1, C_2, \ldots, C_K    (7)

Step 2: Determine the consequent class B as the class c* that has the maximal competition strength, i.e.

\alpha(c^*) = \max(\alpha(C_1), \alpha(C_2), \ldots, \alpha(C_K))    (8)
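A small Python sketch of Equations (6)-(8), with illustrative argument names, shows how the consequent is selected once the membership degrees of the training objects in a premise A are known:

def select_consequent(mu_A, classes, class_labels):
    # mu_A[i]: membership of training object u_i in premise A; classes[i]: its class
    alpha = {c: sum(m for m, cl in zip(mu_A, classes) if cl == c)   # Equation (7)
             for c in class_labels}
    best = max(class_labels, key=lambda c: alpha[c])                # Equation (8)
    total = sum(mu_A)
    truth = alpha[best] / total if total > 0 else 0.0               # Equation (6)
    return best, truth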
3. Genetic Learning of Rule Premises

Genetic algorithms [4] are global search algorithms based on the mechanics of natural genetics and selection. They have been theoretically and empirically proven to be an effective and robust means of finding optimal or desirable solutions in complex spaces. A GA starts with a population of randomly or heuristically generated solutions (chromosomes). New offspring are created by applying genetic operators to selected parents. The selection of parents is performed based on the principle of "survival of the fittest". In this manner, a gradual improvement of the quality of the individuals in the population can be achieved. A suitable coding scheme, genetic operators and a fitness function must be defined for the GA to search for the optimal premise structure and the membership functions of the input fuzzy sets simultaneously. The information concerning the structure of the rule premises can be considered as a set of discrete parameters, while the information about the input fuzzy sets is described by a set of continuous parameters. Owing to the different nature of the information about rule structure and about fuzzy set membership functions, a hybrid string consisting of two substrings is proposed here. The first substring codes the premise structure of a knowledge base and the second substring corresponds to the parameters of the fuzzy sets used by the rules. For coding the parameters of the membership functions, the classical "concatenated binary mapping" method is certainly feasible, but for the sake of a shorter string length, an integer substring is used in this paper. Let a parameter for the fuzzy sets be quantised in the integer interval {0, 1, ..., K_max} and denote by V_min, V_max the user-determined minimal and maximal parameter values; then the relationship between the corresponding integer K in the substring and the parameter value V is as follows:
V = V_{min} + \frac{K}{K_{max}} (V_{max} - V_{min})    (9)
From (1), we can see that the premise structure of such rules is characterized by the integer sets D(i) (i = 1, 2, ..., s). This fact suggests that a binary code is a suitable scheme for encoding the structure of premises, since an integer from {1, 2, ..., q[p(i)]} is either included in the set D(i) or excluded from it. For an attribute x_j which is included in the premise (i.e. p^{-1}(j) ≠ ∅), q[j] binary bits must be introduced to depict the set D(p^{-1}(j)) ⊆ {1, 2, ..., q[j]}, with bit "1" representing the presence of the corresponding fuzzy set in the OR-connection and vice versa. If attribute x_j does not appear in the premise, i.e. p^{-1}(j) = ∅, we use q[j] one-bits to describe the wildcard "don't care". For instance, the condition "if [x1 = (small or large)] and [x3 = middle] and [x4 = (middle or large)]" can be coded by the binary group (101; 111; 010; 011). Further, the whole substring for the premise structure of a rule base is the concatenation of the bit groups for all individual rule premises. It is worth noting that the following two cases lead to an invalid premise encoded by a binary group: 1) all the bits in the group are equal to one, meaning that no attribute is considered in the premise; 2) all the bits for an attribute are zero, so that this attribute takes no linguistic term in the premise, resulting in an empty fuzzy set for the condition part of that attribute. Through the elimination of invalid rule premises, the actual rule number can be reduced from the upper limit given by the user. This also provides a possibility to adjust the size of the rule base by the GA.

By the crossover operation, parent strings mix and exchange their characters through a random process, with an expected improvement of fitness in the next generation. Owing to the distinct nature of the two substrings, it is preferable that the information in both substrings be mixed and exchanged separately. Here a three-point crossover is used. One breakpoint of this operation is fixed to be the splitting point between the two substrings, and the other two breakpoints can be randomly selected within the two substrings respectively. At the breakpoints, the parent bits are alternately passed on to the offspring. This means that the offspring take bits from one of the parents until a breakpoint is encountered, at which they switch and take bits from the other parent.

Mutation is a random alteration of a bit in a string so as to increase the variability of the population. Because of the distinct substrings used, different mutation schemes are needed. Since the parameters of the input membership functions are essentially continuous, a small mutation with a high probability is more meaningful. Therefore each bit in the substring for the membership functions undergoes a disturbance whose magnitude is determined by a Gaussian distribution function. For the binary substring representing the structure of the rule premises, mutation simply inverts a bit, replacing '1' with '0' and vice versa. Every bit in this substring undergoes a mutation with a small probability.

An individual in the population is evaluated according to the performance of the fuzzy classifier coded by it. Therefore the following four steps must be done to obtain an appropriate numerical fitness value for a string: 1) decode the binary substring into the premise structure of the rules and the integer substring into the membership functions of the input
fuzzy sets; 2) determine the consequent classes under the rule conditions by maximizing the truth values of the rules; 3) classify the training patterns with the rules generated in the above two steps; 4) compute the rate of correctly classified training patterns as the fitness value.
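The following sketch illustrates, under our own naming conventions, how such a hybrid string can be decoded (the integer part via Equation (9), the binary part via the group coding described above, including the two invalidity checks) and how the fitness is computed; classify is an assumed routine that applies the decoded rule base to a pattern.

def decode_parameter(K, K_max, V_min, V_max):
    return V_min + (K / K_max) * (V_max - V_min)            # Equation (9)

def decode_premise(bits, q):
    # bits: concatenated bit groups of one rule; q[i]: number of terms of attribute i
    premise, pos = [], 0
    for qi in q:
        group = bits[pos:pos + qi]
        pos += qi
        if all(group):                       # all ones: attribute not used (don't care)
            premise.append(None)
        elif not any(group):                 # all zeros: empty fuzzy set -> invalid rule
            return None
        else:
            premise.append({j for j, b in enumerate(group) if b})   # the set D(i)
    if all(p is None for p in premise):      # no attribute considered at all -> invalid
        return None
    return premise

def fitness(rule_base, training_patterns, classify):
    correct = sum(1 for x, c in training_patterns if classify(rule_base, x) == c)
    return correct / len(training_patterns)  # rate of correctly classified patterns

For the example condition above, decode_premise([1,0,1, 1,1,1, 0,1,0, 0,1,1], [3,3,3,3]) returns [{0,2}, None, {1}, {1,2}] (term indices are 0-based in this sketch).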
4. Simulation Results

We applied the proposed approach to build classification rules for the Iris data [3]. The task is to classify three species of iris (setosa, versicolor and virginica) from four-dimensional attribute vectors consisting of sepal length (x1), sepal width (x2), petal length (x3) and petal width (x4). There are 50 samples of each class in this data set. By randomly taking 10 patterns from each class as test data, the total data set was divided into a training set (80%) and a test set (20%). We built the fuzzy classifier based on the training data and then verified its performance on the test data that were not used for learning.
Fig. 1. The membership functions of the attributes (fuzzy sets short, middle and long over the attribute range normalized to [0.0, 1.0])
Each attribute of the classifier was assigned three linguistic terms: short, middle and long. After normalizing the values of each attribute into the interval between zero and one, the membership functions of the input fuzzy sets are as depicted in Figure 1. The upper limit of the rule number in the rule base was set to 16, meaning that 16 rules were supposed to be sufficient to achieve a desirable classification accuracy. The GA was put to work to search for the premise structure of the possible rules and to optimize the parameters (corresponding to the circle in Fig. 1) of the input fuzzy sets at the same time. As a result of the search process, 10 rule premises were identified as invalid, so that the rule base in fact contains only six rules. Clearly, the size of the rule base in this example was not prescribed exactly by the user; instead, it was automatically adjusted by the learning algorithm within an upper limit. The fuzzy classifier learned by the GA performs very well despite its relatively small number of rules. On the training set it correctly classifies 119 patterns (i.e. 99.2% of the 120 patterns). On the test data, 30 patterns (i.e. 100% of the 30 patterns) are properly classified. The results of some other machine learning algorithms on the Iris flower problem are given in Table 1 for comparison. It is evident that our method outperforms the other learning algorithms on this well known benchmark problem.
Table 1. Accuracy of the other five learning algorithms on the Iris Data

Algorithms   Hirsh [5]  Aha [1]  Dasarathy [2]  C4 [10]  Hong [6]
Setosa       100%       100%     100%           100%     100%
Virginica    93.33%     93.50%   98%            91.07%   94%
Versicolor   94.00%     91.13%   86%            90.61%   94%
Average      95.78%     94.87%   94.67%         93.89%   96.00%
5. Conclusions

The method proposed in this paper provides an effective means for automatically acquiring classification knowledge from sample examples. The GA is used to derive an appropriate premise structure for the classification rules as well as to optimize the membership functions of the fuzzy sets at the same time. The size of the rule base is adapted by the search algorithm within an upper limit on the rule amount assumed by the user in advance. The resulting knowledge base is compact in the sense that it usually contains a much smaller number of rules than one enumerating all canonical AND connections of linguistic terms as input situations. A small number of rules makes it simple for human users to check and interpret the contents of the knowledge base.
References
1. Aha, D.W. and Kibler, D.: Noise-tolerant instance-based learning algorithms. In Proc. 11th Internat. Joint Conf. on Artificial Intelligence, Detroit, MI (1989), 794-799
2. Dasarathy, B.V.: Noising around the neighbourhood: a new system structure and classification rule for recognition in partially exposed environments. IEEE Trans. Pattern Analysis and Machine Intelligence 2 (1980) 67-71
3. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7 (1936) 179-188
4. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, New York (1989)
5. Hirsh, H.: Generalizing version spaces. Machine Learning 17 (1994) 5-46
6. Hong, T.P. and Tseng, S.S.: A generalized version space learning algorithm for noisy and uncertain data. IEEE Trans. Knowledge Data Eng. 9 (1997) 336-340
7. Ishibuchi, H. et al.: Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets and Systems 52 (1992) 21-32
8. Meier, W., Weber, R. and Zimmermann, H.-J.: Fuzzy data analysis - methods and industrial application. Fuzzy Sets and Systems 61 (1994) 19-28
9. Nozaki, K. et al.: Adaptive fuzzy rule-based classification systems. IEEE Trans. Fuzzy Systems 4 (1996) 238-250
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
Data Mining for Robust Business Intelligence Solutions
Jan Mrazek
Bank of Montreal, Global Information Technology, 4100 Gordon Baker Road, Toronto, Ontario, M1W 3E8, Canada
[email protected]
Data Mining has quickly matured from isolated, small-scale, PC-based, single-algorithm techniques into robust analytical solutions which utilize a combination of various artificial intelligence algorithms, massively parallel technology, direct two-way access to relational databases, and open systems with published Application Programming Interfaces. For an organization, this opens new opportunities, but it also generates a number of integration challenges.

This tutorial will introduce Data Mining from the perspective of a large organization. It will describe why Data Mining is essential for new business development and will present its enablers and drivers. The typical, historically diversified organizational structure of Business Intelligence competency in a large organization will be introduced, and the challenges posed to a large-scale implementation of Data Mining will be portrayed. Core technologies and techniques will also be briefly introduced (classification, segmentation, value prediction, basket analysis, time series), and some hints on using time-greedy techniques will be given. Data Mining processes will be discussed in detail, along with a brief genealogy of Data Mining technology and the criteria used for selecting the right technology. There will be a special focus on models for segmentation and cross-selling, accompanied by brief case study demonstrations. The necessity of analyzing the interaction of different business life cycles will be explained. A number of business applications of Data Mining will be listed. Special attention will be paid to the ability to determine account profitability at the lowest possible level of detail, and a technically sophisticated profitability system will be introduced.

Data Mining will be presented in integration with other Business Intelligence solutions. The architecture of a large corporate Business Intelligence complex will be discussed from the technological as well as the business perspective. The delegates will be presented with the problem of integrating Data Mining with the corporate Data Warehouse, Data Marts and OLAP. Challenges will be listed along with some feasible solutions.
Data Warehouses are usually designed in third normal form or close to it. Data Marts use a star or snowflake design (to support OLAP solutions). Neither really suits Data Mining purposes. A design for the Data Warehouse/Data Mart that satisfies all needs, including Data Mining, will be discussed. In relation to Data Mining needs, some important data modeling techniques and modeling challenges will be explained, for example horizontal vs. vertical data structures, star and snowflake schemas, handling of slowly changing dimensions, etc. Data Mining starts with good data; therefore special attention will be paid to data quality issues: insufficient data, outliers, missing values, data enrichment and data transformation. Data Mining techniques will be discussed from the perspective of their requirements on data, its quality, and its structure and type.

In a case study, an advanced concept for Behavior Oriented Segmentation will be shown. This will support the author's conviction that behavioral segmentation is more important than segmentation based on demographics and psychographics. It will be shown how these approaches complement each other, and a link to customer profitability will be emphasized. The challenges of mass deployment of Data Mining capabilities in an organization will be shown from the perspective of metadata management, decision timing and quality of results. Data Mining is a dangerous weapon, which may cause self-injury; therefore some corporate safety rules will be introduced. Finally, a vision of the near future of Data Mining and Data Mining technology will be portrayed.

WEB References:
http://www.dw-institute.com/montreal.html
http://www.software.ibm.com/data/solutions/customer/montreal/montreal.htm
http://www.cio.com/archive/051598_mining_content.html
http://www2.computerworld.com/home/print9497.nsf/All/SL42mont16896
http://www.bmo.com/
http://sites.netscape.net/drjanmrazek/
Query Languages for Knowledge Discovery in Databases
Jean-Francois Boulicaut
Institut National des Sciences Appliquees de Lyon, LISI, Bâtiment 501, F-69621 Villeurbanne cedex, France
mailto:[email protected]
http://www.insa-lyon.fr/People/LISI/jfboulic
Abstract. Discovering knowledge from data appears as a complex iterative and interactive process containing many steps: understanding the data, preparing the data set, discovering potentially interesting patterns (mining phase), postprocessing of discovered patterns and finally putting the results to use. Different kinds of patterns might be used, and therefore different data mining techniques are needed (e.g., association and episode rules for alarm analysis, clusters and decision trees for sales data analysis, inclusion and functional dependencies for database reverse engineering, etc.). This tutorial addresses the challenge of supporting KDD processes following a querying approach. Following Imielinski and Mannila [6], second generation data mining systems might support the whole process by means of powerful query languages. We propose not only a state of the art in that field but also introduce a research perspective, the so-called inductive database framework [3]. It is out of the scope of our presentation to consider coupling architectures between data mining algorithms and database management systems. Instead, we focus on user-written queries that capture the users' needs during a KDD process. The popular association rule mining process is used to illustrate most of the concepts. The use of query languages to select data for a mining task seems obvious. However, crucial issues like cleaning, sampling, and supporting multidimensional data manipulation or on-line analytical processing queries when preparing a data set are still to be discussed; indeed, available query languages offer rather poor support for these. Next, mining phases quite often provide a huge amount of extracted patterns, though just a few of them can be of practical interest for the end-user. Within the rule mining domain, one copes with that problem by selecting patterns w.r.t. interestingness measures, objective ones (e.g., confidence [1], J-measure [10], conviction [5]) or subjective ones (e.g., templates [2] or, more generally, queries). Indeed, it is possible to use standard query languages like SQL3 or OQL to query rule databases and perform typical postprocessing like inclusive/exclusive selection of rules or rule cover computations (elimination of "redundancy"). However, more or less specialized query languages have been
proposed, like M-SQL [7] or MINE RULE [8]. These languages make it possible to select the data, to specify mining tasks and to perform some postprocessing as well. The main ideas concerning query evaluation are considered. This leads us to the concept of an inductive database and to general-purpose query languages for KDD applications. An inductive database is a database that, in addition to data, contains intensionally defined generalizations about the data. Using the simple formalization from [3], it is possible to define query languages that satisfy the closure property, i.e., the result of a query on an inductive database is, again, an inductive database. The whole discovery process can then be modeled by means of a sequence of queries (on data, on patterns, or linking data to patterns). This gives rise to optimization techniques (compiling schemes) for KDD processes. Among others, we demonstrate the possibilities according to a research proposal for a rule-based language [4]. Implementing inductive databases for various classes of patterns is still an open problem. However, the research about association rule mining has demonstrated that a technology (see e.g., [9]) is now available for such a simple but still important class of patterns.
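As a toy illustration of this querying view of postprocessing (written in plain Python rather than in a dedicated language such as M-SQL or MINE RULE), selecting rules with respect to an objective measure and a simple template could look like this; the tuple layout of the rules is our own assumption.

def select_rules(rules, min_support, min_confidence, head_template=None):
    # rules: iterable of (body, head, support, confidence) tuples (assumed layout)
    return [r for r in rules
            if r[2] >= min_support and r[3] >= min_confidence
            and (head_template is None or r[1] == head_template)]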
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In: Proc. SIGMOD'93, pages 207-216, May 1993. ACM Press.
2. M. Klemettinen. A Knowledge Discovery Methodology for Telecommunications Network Alarm Databases. PhD thesis, Report A-1999-1, University of Helsinki (FIN), January 1999.
3. J.-F. Boulicaut, M. Klemettinen, and H. Mannila. Modeling KDD Processes within the Inductive Database Framework. In: Proc. DaWaK'99, August 29 - September 2, 1999, Florence (I), Springer-Verlag. To appear.
4. J.-F. Boulicaut, P. Marcel, and C. Rigotti. A query driven knowledge discovery in multidimensional data. Research Report, INSA Lyon, LISI, July 1999, 15 p. Submitted.
5. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In: Proc. SIGMOD'97, pages 255-264, 1997. ACM Press.
6. T. Imielinski and H. Mannila. A Database Perspective on Knowledge Discovery. Communications of the ACM, 39(11):58-64, November 1996.
7. A. Virmani. Second Generation Data Mining: Concepts and Implementation. PhD thesis, Rutgers University (USA), April 1998.
8. R. Meo, G. Psaila, and S. Ceri. A new SQL-like Operator for Mining Association Rules. In: Proc. VLDB'96, pages 122-133, September 1996. Morgan Kaufmann.
9. R. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. In: Proc. SIGMOD'98, pages 13-24, 1998. ACM Press.
10. P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301-316, August 1992.
The ESPRIT Project CreditMine and Its Relevance for the Internet Market
Susanne Köhler1 and Michael Krieger2
1 Fraunhofer IAO, Information Engineering, Nobelstr. 12, D-70569 Stuttgart, Germany, [email protected]
2 MINDLAB Krieger & Partner, Ingenieure und Informatiker, Marktstr. 31, D-73207 Plochingen, Germany, [email protected]
Abstract. Although data mining technology is considered state-of-the-art, it is still an embryonic technology employed only by a group of early adopters. Until now, very few organisations have made the large investment of bringing together their corporate data in a data-warehouse-type environment for the purpose of data mining. Only large and wealthy companies have experimented with such methodologies, and the outcomes of their efforts were not disclosed, as they were considered confidential material. Thus there is currently a lack of information concerning the effectiveness of such investments. The aim of the project is to provide decision support for companies, especially in the banking sector, for investment in data mining.
Introduction
The objective of CreditMine and its internet counterpart MarketMine is to facilitate the use of data mining technology in traditional banking and in internet banking by raising awareness of the possibilities of data mining technologies in electronic business. In order for such users to adopt the new technology, examples of successful implementations are needed which prove the financial and technical feasibility of such an investment.
CreditMine
In the long run, the winners in the competitive financial industry will be those who can gain fractional improvements in understanding and reacting to their customers' needs through effective customer targeting management. The objective of the project is the evaluation of high-performance analytical and prediction software tools for the banking industry that can exploit hidden data assets by employing data mining technologies within an existing DBMS.
The CreditMine project intends to offer a benchmark study for implementing state-of-the-art data mining techniques in the banking sector by implementing pilot applications in a medium-sized bank. The idea is to offer a frame of reference for evaluating in advance the feasibility of such investment projects. The goal of the project is to facilitate the use of data mining technology by a less innovative target group of major business importance: small and medium-sized banks all over Europe.
Approach
Implementation and evaluation of an integrated set of two applications using existing data mining / data warehouse technology and tools:
CustoNeed: to conduct market segmentation by identifying the common characteristics of customers who buy the same products from the bank.
CustoChurn: to predict which customers are likely to leave the company for a competitor, and whether the return from a customer is higher than the cost of retaining him (a minimal illustration follows below).
Based on current market needs, CreditMine adapts, enhances, and integrates the latest technologies necessary to design and develop the data mining tools, and evaluates them in real situations faced by a financial institution.
CreditDeploy: a feasibility and evaluation study documents the quantitative and qualitative results of the proposed approach. The lessons learned in the project will be included in the form of guidelines and a methodology for setting up similar systems.
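As a rough illustration of the kind of churn scoring CustoChurn aims at, the following minimal Python sketch trains a decision tree on historical customer records and ranks current customers by their estimated probability of leaving. The feature names, the toy data and the choice of classifier are entirely hypothetical and are not the project's actual tooling.

  # Minimal churn-scoring sketch; data and feature names are invented for illustration.
  from sklearn.tree import DecisionTreeClassifier

  # historical customers: [account_age_months, products_held, monthly_balance], churned?
  X_train = [[6, 1, 300.0], [48, 3, 2500.0], [12, 1, 150.0],
             [60, 4, 4000.0], [3, 1, 80.0], [36, 2, 1200.0]]
  y_train = [1, 0, 1, 0, 1, 0]  # 1 = left for a competitor

  model = DecisionTreeClassifier(max_depth=3, random_state=0)
  model.fit(X_train, y_train)

  # current customers to be scored
  X_now = [[5, 1, 120.0], [40, 3, 2200.0]]
  for features, p in zip(X_now, model.predict_proba(X_now)[:, 1]):
      print(features, f"estimated churn probability: {p:.2f}")

Whether to act on such a score would then be weighed against the retention cost, as the abstract indicates.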
MarketMine
Since there is a strong need in Germany for the use of data mining technologies also on internet data (e.g. tracking and profiling data from internet banking), MINDLAB has developed a methodology which permits internet market research and internet marketing with the help of data mining technologies and online user interviews in an integrated process chain. This methodology is especially adapted to the needs of internet banks, but it will later also be applicable to other sectors such as online shops. Internet sites provide a multitude of data generated by the traceable use of these sites. User actions and the time spent by a user on a certain page can be registered and analysed in detail, which allows exact profiles of customer habits to be defined. Using this tracking data, the newly developed method makes it possible to register user behaviour and to segment users into groups with the same or similar behaviour; a minimal segmentation sketch is given below. The developed methodology respects the strict European provisions on data privacy. The objective of this research project, which strongly involves the practical side, is to integrate the data mining applications used in internet market research into the marketing process that follows. In this manner, possible synergies can be exploited.
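A minimal sketch of such behaviour-based segmentation follows; the per-user features are invented and k-means is chosen only for illustration, since the abstract does not name a specific algorithm.

  # Segment users by invented behavioural summaries derived from tracking data.
  from sklearn.cluster import KMeans

  # hypothetical per-user features: [pages per visit, avg. seconds per page, visits per month]
  users = [[3, 20, 2], [25, 90, 12], [4, 15, 1], [30, 110, 15], [5, 25, 3], [28, 95, 10]]

  kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
  for features, segment in zip(users, kmeans.labels_):
      print(features, "-> behaviour segment", segment)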
Logics and Statistics for Association Rules and Beyond
Abstract of Tutorial
Petr Hájek1 and Jan Rauch2
1 Institute of Computer Science, Academy of Sciences, Prague, [email protected]
2 Laboratory of Intelligent Systems, University of Economics, Prague, [email protected]
The aim of the tutorial is four-fold: (1) To present a very natural class of logical systems suitable for formalizing, generating and evaluating statements on dependences found in given data. The popular association rules form a particular, but by far not the only, example. Technically, our logical systems are monadic observational predicate calculi, i.e. calculi with generalized quantifiers, only finite models and effective semantics. Be not shocked by these terms; no non-trivial knowledge of predicate logic will be assumed. Logic includes deductive properties, i.e. the possibility to deduce the truth of a sentence from other sentences already found true. Transparent deduction rules are very useful for systematic data mining. Special attention will be paid to sentences formalizing expressions of the form "many objects having a certain combination A of attributes also have a combination B" and, more generally, "combinations A, B of attributes are associated (dependent, correlated etc.) in a precisely defined manner". (2) To show how suitable observational sentences are related to statistical hypothesis testing (this aspect appears sometimes unjustly neglected in data mining). A general pattern of statistical inference will be presented in logical terms (theoretical calculi and their relation to observational ones). In particular, the statistical meaning of two variants of "associational rules" as well as of some "symmetric associations" will be explained. (3) To present a short history of the GUHA method of automatic generation of hypotheses (General Unary Hypothesis Automaton). It is an original Czech method of exploratory data analysis, one of the oldest approaches to KDD or data mining. Its principle was formulated long before the advent of data mining. Its theoretical foundations (as presented in [5] and later publications) lead to the theory described in points (1), (2) above. (4) Finally, to show how modern fuzzy logic (in the narrow sense of the word, i.e. particular many-valued symbolic logic) may enter the domain of KDD and fruitfully generalize the field. This fourth point will concentrate on open problems and research directions. The tutorial will be complemented by demonstrations of two recent implementations of the GUHA method.
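To make the notion of a generalized quantifier over a four-fold table concrete, here is a small Python sketch; the data and the thresholds are our own, chosen only for exposition and not taken from the tutorial. It builds the four-fold table (a, b, c, d) for two Boolean attribute combinations and evaluates the GUHA quantifier of founded implication, which holds when a/(a+b) >= p and a >= Base.

  # Illustrative sketch: four-fold table and the founded-implication quantifier.
  # rows: objects; A, B: Boolean attribute combinations (hypothetical data)
  data = [
      (True, True), (True, True), (True, True), (True, False),
      (False, True), (False, False), (False, False), (True, True),
  ]

  a = sum(1 for A, B in data if A and B)          # A and B
  b = sum(1 for A, B in data if A and not B)      # A and not B
  c = sum(1 for A, B in data if not A and B)      # not A and B
  d = sum(1 for A, B in data if not A and not B)  # neither

  def founded_implication(a, b, p=0.7, base=3):
      """A =>(p, Base) B holds iff a/(a+b) >= p and a >= Base."""
      return a >= base and a / (a + b) >= p

  print("four-fold table:", (a, b, c, d))
  print("A =>(0.7, 3) B holds:", founded_implication(a, b))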
References
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.: Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, 1996, pp. 307–328.
2. Hájek, P.: Logics for data mining (GUHA rediviva). (Workshop of the Japanese Society for Artificial Intelligence, title in Japanese.) Tokyo, JSAI 1998, 27–34.
3. Hájek, P.: Metamathematics of Fuzzy Logic. Kluwer, 1998.
4. Hájek, P., Havel, I. and Chytil, M.: The GUHA-method of automatic hypotheses determination. Computing 1 (1966), 293–308.
5. Hájek, P. and Havránek, T.: Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, Berlin-Heidelberg-New York, 1978.
6. Hájek, P. and Holeňa, M.: Formal logics of discovery and hypothesis formation by machine. In: Proc. Discovery Science (Fukuoka 1998), S. Arikawa and H. Motoda, Eds. Springer-Verlag, LNAI 1532, pp. 291–302.
7. Hájek, P., Sochorová, A. and Zvárová, J.: GUHA for personal computers. Computational Statistics & Data Analysis 19 (1995), 149–153.
8. Honzíková, Z.: GUHA+-: User's Guide. Technical report, Institute of Computer Science AS CR, Prague, 1999.
9. Rauch, J.: Classes of Four-Fold Table Quantifiers. In: Principles of Data Mining and Knowledge Discovery, J. Zytkow and M. Quafafou, Eds. Springer-Verlag, Berlin, 1998, pp. 203–211.
10. Rauch, J.: Four-Fold Table Calculi for Discovery Science. In: Discovery Science, S. Arikawa and H. Motoda, Eds. Springer-Verlag, Berlin, pp. 405–406.
11. Rauch, J.: GUHA as a Data Mining Tool. In: Practical Aspects of Knowledge Management. Schweizer Informatiker Gesellschaft, Basel, 1996, 10 pp.
12. Rauch, J.: Logical Calculi for Knowledge Discovery in Databases. In: Principles of Data Mining and Knowledge Discovery, J. Komorowski and J. Zytkow, Eds. Springer-Verlag, Berlin, 1997, pp. 47–57.
13. Rauch, J.: Four-Fold Table Calculi and Missing Information. In: JCIS'98 Proceedings (Paul P. Wang, Ed.), Association for Intelligent Machinery, 375–378, 1998.
This work is partly supported by grant 47160008 of the Ministry of Education of the Czech Republic.
Data Mining for the Web
Myra Spiliopoulou
Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin
[email protected], http://www.wiwi.hu-berlin.de/myra
Abstract. The web is being increasingly used as a borderless marketplace for the purchase and exchange of goods, the most prominent among them being information. To remain competitive, companies must recognize the interests and demands of their users and adjust their products, services and sites accordingly. In this tutorial, we discuss data mining for the web. The relevant research can be categorized into (i) web usage mining, focussing on the analysis of the navigational activities of web users, (ii) web text mining, concentrating on the information being acquired by the users, and (iii) user modelling, which exploits all information and knowledge available about the web users to build user profiles.
In this tutorial, we discuss advances of data mining for the analysis of data related to the web and its usage. The tutorial is organized similarly to the classical procedure for knowledge discovery.
Problem Formulation
We briefly discuss the different application areas calling for web mining. They range from marketing, online shopping and information dissemination by public institutions to hypermedia teachware. We abstract the potential strategic aims in each domain into mining goals such as: prediction of the users' behaviour within the site, comparison between expected and actual web site usage, and adjustment of the web site to the interests of its users. We also review measures for evaluating whether those goals are achieved.
Data Preparation
The information sources available to mine the web encompass web usage logs, web page descriptions and textual contents, web site topology, user registries and questionnaires. We explain briefly how each of the above sources contributes to fulfilling each of our three tasks and then discuss advances in data preparation for web data mining. Web usage data are noisy and of rather poor quality. Particular emphasis is given to techniques that try to improve their quality by heuristics and statistical methods. We close this part by discussing the notion of web data warehousing and the exploitation of OLAP technology to assess information on web site usage.
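As a small illustration of the kind of preparation meant here, the following Python sketch groups raw page requests into user sessions with a simple inactivity heuristic. The record layout and the 30-minute timeout are assumptions made for illustration, not prescriptions from the tutorial.

  # Heuristic sessionization sketch over hypothetical (visitor, timestamp, URL) records.
  from datetime import datetime, timedelta

  requests = [  # assumed pre-parsed log entries, sorted by time per visitor
      ("v1", datetime(1999, 9, 15, 10, 0), "/index.html"),
      ("v1", datetime(1999, 9, 15, 10, 5), "/products.html"),
      ("v1", datetime(1999, 9, 15, 11, 30), "/index.html"),   # long gap -> new session
      ("v2", datetime(1999, 9, 15, 10, 2), "/index.html"),
  ]

  TIMEOUT = timedelta(minutes=30)
  sessions = {}   # visitor -> list of sessions, each a list of URLs
  last_seen = {}

  for visitor, ts, url in requests:
      if visitor not in sessions or ts - last_seen[visitor] > TIMEOUT:
          sessions.setdefault(visitor, []).append([])  # start a new session
      sessions[visitor][-1].append(url)
      last_seen[visitor] = ts

  for visitor, visits in sessions.items():
      print(visitor, visits)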
Web Data Mining
We first discuss advances in web usage mining, coming from the domains of sequence mining, discovery of association rules and clustering. The clustering paradigm is mostly used when the focus is on the contents of the pages being accessed together. Hence, clustering for web usage mining serves as a bridge to switch to the discussion of web text mining. We briefly present advances in analyzing text in general. Then, the focus shifts to the particularities of web text and to mining web pages and hypertext. Finally, we turn to methods for user profiling, addressing mostly clustering techniques. This subsection closes with open issues related to the combination of results from all three domains.
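A toy illustration of the sequence-mining side of web usage mining follows; the sessions and the restriction to length-2 patterns are ours, purely for exposition. It counts ordered page-to-page transitions across sessions and reports those exceeding a support threshold.

  # Count contiguous page-to-page transitions across hypothetical sessions.
  from collections import Counter

  sessions = [
      ["/index", "/products", "/order"],
      ["/index", "/products"],
      ["/index", "/news"],
      ["/products", "/order"],
  ]

  transitions = Counter()
  for session in sessions:
      for src, dst in zip(session, session[1:]):
          transitions[(src, dst)] += 1

  min_support = 2  # minimum number of observed occurrences
  for (src, dst), count in transitions.most_common():
      if count >= min_support:
          print(f"{src} -> {dst}: observed {count} times")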
Evaluation of the Results
Here, we discuss statistical methods for evaluating the results of sequence miners, text miners and clusterers. These methods are mostly general-purpose. There are also specific evaluation strategies for web mining results, e.g. from the marketing domain. We then point out issues that impede the evaluation of web mining results. These include the idiosyncrasy of web-related data, which forces some researchers to use synthetic data for testing, as well as biases occurring in all collections of user registration data. Open research issues related to the alleviation of those problems are presented next.
Exploitation of the Results
The exploitation of the results in any data mining domain is claimed to be domain-specific, thus falling beyond the well-modelled subjects of knowledge discovery. We describe potential ways of exploiting web mining results, namely by reorganizing web sites statically or dynamically and by building personalized services on the basis of user profiles. We then elaborate on the need for guidelines to transform the results of a data miner into suggestions or hints for web site reorganization or profile-based site construction. We discuss reports addressing the above issues for particular applications.
Evolution
Thus far we have spoken about the web and about web mining as if they belonged to a static world. However, (almost) nothing is less static than the web. Here, we address two issues: (i) the impact of site changes on the results of web mining and (ii) the observation of changes in web usage, caused e.g. by the increasing experience of users or by shifts of interest. The first issue relates to rule maintenance. The second issue is relevant to time series analysis and to temporal mining. However, research in those two domains is not oriented towards web usage analysis. More focussed research is needed to deal with the issue of time in the rapidly changing world of the web.
Relational Learning and Inductive Logic Programming Made Easy – Abstract of Tutorial
Luc De Raedt1 and Hendrik Blockeel2
1 Albert Ludwig University, Freiburg, [email protected]
2 Katholieke Universiteit, Leuven
The tutorial will provide answers to the basic questions about inductive logic programming. In particular:
– What is inductive logic programming and relational learning?
– What are the differences between attribute-value learning and inductive logic programming?
– For what types of applications do we need inductive logic programming?
The tutorial will also present the representation employed by inductive logic programming and the corresponding generalization and specialisation operators (focussing on theta-subsumption; a small illustration follows below). It will then be shown how these representations and operators can be combined with classical machine learning and data mining algorithms to obtain some of the best known inductive logic programming systems, such as FOIL, GOLEM, TILDE, PROGOL, RIBL, WARMR, ... Finally the tutorial shall discuss current trends and open questions, as well as applications of the field. The tutorial will also include a short demo session with some inductive logic programming tools.
Intended audience: the tutorial wants to provide a gentle and general introduction to the field of inductive logic programming to all persons familiar with data mining or machine learning but not yet with (inductive) logic programming.
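The generality test mentioned above, theta-subsumption, can be sketched in a few lines of Python; the clause encoding (literals as tuples, variables as upper-case names) and the example clauses are our own choices for illustration, not material from the tutorial. A clause C1 theta-subsumes C2 if some substitution theta makes every literal of C1*theta appear in C2.

  # A minimal, naive theta-subsumption test, only to illustrate the generality ordering used in ILP.
  def is_var(term):
      return term[0].isupper()  # convention: variables start with an upper-case letter

  def subsumes(c1, c2, theta=None):
      """True iff some substitution theta makes c1*theta a subset of c2.
      Clauses are lists of literals; a literal is (predicate, arg1, arg2, ...)."""
      theta = dict(theta or {})
      if not c1:
          return True
      pred, *args = c1[0]
      for lit in c2:
          if lit[0] != pred or len(lit) != len(c1[0]):
              continue
          trial = dict(theta)
          ok = True
          for t, u in zip(args, lit[1:]):
              if is_var(t):
                  if trial.get(t, u) != u:
                      ok = False
                      break
                  trial[t] = u
              elif t != u:
                  ok = False
                  break
          if ok and subsumes(c1[1:], c2, trial):
              return True
      return False

  # body of father(X, Y) :- parent(X, Y), male(X)   (general clause)
  general = [("parent", "X", "Y"), ("male", "X")]
  # body of a more specific, ground clause
  specific = [("parent", "john", "mary"), ("male", "john"), ("rich", "john")]
  print(subsumes(general, specific))  # True, via {X -> john, Y -> mary}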
Author Index
Ågotnes, T. 193 Alphonse, É. 271 Ananyan, S.M. 366 Arruda, L.V.R. 341 Arseniev, S.B. 366 Aumann, Y. 165, 277 Bertino, E. 41 Blockeel, H. 32, 590 Boulicaut, J.-F. 582 Breunig, M.M. 262 Brighton, H. 283 Bristol, D.W. 360 Caglio, E. 41 Catania, B. 41 Cattral, R. 289 Chen, X. 295 Clifton, C. 174 Coenen, F. 301 Cole, R. 309 Cooley, R. 174
Freitas, A.A. 341 Fresko, M. 165 Fu, Z. 348 Garc´ıa-Serrano, J.R. 354 Giannotti, F. 125 Giraud-Carrier, C. 360, 436 Grbovi´c, J. 32 G¨ uvenir, H.A. 568 H´ ajek, P. 586 Hamilton, H.J. 232 Hilderman, R.J. 232 Huber, K. 492 Iba, H. 456 Iv´anek, J. 116 Jappy, P. 223
Eichholz, M. 242 Eklund, P. 309 Elomaa, T. 89 El-Mouadib, F.A. 71 Ertel, W. 323
Kaestner, C. 341 Keane, J.A. 448 Kennedy, C.J. 360 Keogh, E.J. 1 Kiselev M.V. 366 Klemettinen, M. 372 Knobbe, A.J. 378 K¨ ohler, S. 584 Komorowski, J. 193, 462 Koronacki, J. 71 Kr¸etowski, M. 392 Kriegel, H.-P. 262 Krieger, M. 584 Kuznetsov, S.O. 384 Kwedlo, W. 392
Feelders, A. 329 Feldman, R. 165, 277 Fernandez-Baiz´an, M.C. 335 Fertig, C.S. 341 Flexer, A. 80
Datta, P. 316 De Raedt, L. 590 Deugo, D. 289 Di Palma, S. 184, 510 Dong, G. 406 Džeroski, S. 32, 98
Lallich, S. 398, 510 Landau, D. 277 Li, J. 406 Liau, C.-J. 412 Lindner, G. 418
Liphstat, O. 165, 277 Litz, L. 574 Liu, D.-R. 412 Lodi, S. 424 Løken, T. 193 Lopes, S. 430 Mac Kinney-Romero, R. 436 Manco, G. 125 Mannila, H. 372 Mart´ınez-Trinidad, J.F. 354 Massa, S. 51 Masuda, G. 61 Mayoraz, E. 442 McClean, S. 156 Mellish, C. 283 Menasalvas Ruiz, E. 335 Merkl, D. 524 Mesa, E. 335 Mikˇsovsk´ y, P. 476 Mill´an, S. 335 Moreira, M. 442 Mrazek, J. 580 Murad, U. 251 Muyeba, M.K. 448 Nikolaev, N. 456 Nock, R. 214, 223 Ng, R.T. 262 Nguyen, H.S. 107 Øhrn, A. 462 Ohsuga, S. 136 Okada, T. 468 Oppacher, F. 289 P´ airc´eir, R. 156 Pavelek, T. 498 Pazzani, M.J. 1 Pˇechouˇcek, M. 476 Pe˜ na S´ anchez, J.M. 335 Petit, J.-M. 430 Petrounias, I. 295 Pinkas, G. 251
Pizzuti, C. 484 Poddig, T. 492 Popel´ınsk´ y, L. 498 Puliafito, P.P. 51 Raedt, L. De 590 Rainsford, C.P. 504 Rakotomalala, R. 510 Ramamohanarao, K. 406 Ras, Z.W. 518 Rauber, A. 524 Rauch, J. 586 Reami, L. 424 Roddick, J.F. 504 Rosenfeld, B. 165 Rousu, J. 89 Rouveirol C. 271 Ruhland, J. 242, 530 Sakamoto, N. 61 Sander, J. 262 Sartori, C. 424 Savinov, A.A. 536 Schler, Y. 165, 277 Schramm, M. 323 Scotney, B. 156 Sebban, M. 184, 214, 223 Siebes, A. 12, 378 Skowron, A. 107, 542 ´ ezak, D. 548 Sl¸ Spiliopoulou, M. 554, 588 Stepaniuk, J. 542 Struzik, Z.R. 12 Studer, R. 418 Sugaya, S. 561 Sun, Q. 406 Suzuki, E. 561 Swinnen, G. 301 ˇ ep´ Stˇ ankov´ a, O. 476 Talia, D. 484 Todorovski, L. 98 Toumani, F. 430 Tsumoto, S. 23, 147, 561
Ushijima, K. 61 Uysal, İ. 568 Vanhoof, K. 301 Verkamo, A.I. 372 Vonella, G. 484 Wallen, D. van der 378 Wets, G. 301 Wittmann, T. 242, 530 Wróblewski, J. 548
Xiong, N. 574 Yano, R. 61 Yao, Y.Y. 136 Yehuda, Y.B. 277 Zelenko, D. 204 Zhang, X. 406 Zhong, N. 136 Zighed, D.A. 184 Żytkow, J.M. 71