Subseries of Lecture Notes in Computer Science
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Discovery Science 6th International Conference, DS 2003 Sapporo, Japan, October 17-19, 2003 Proceedings
Volume Editors Gunter Grieser Technical University Darmstadt Alexanderstr. 10, 64283 Darmstadt, Germany E-mail:
[email protected] Yuzuru Tanaka Akihiro Yamamoto Hokkaido University MemeMedia Laboratory N-13, W-8, Sapporo, 060-8628, Japan E-mail: {tanaka;yamamoto}@meme.hokudai.ac.jp
Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): H.2.8, I.2, H.3, J.1, J.2 ISSN 0302-9743 ISBN 3-540-20293-5 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH Printed on acid-free paper SPIN: 10964132 06/3142 543210
Preface
This volume contains the papers presented at the 6th International Conference on Discovery Science (DS 2003) held at Hokkaido University, Sapporo, Japan, during 17–19 October 2003. The main objective of the discovery science (DS) conference series is to provide an open forum for intensive discussions and the exchange of new information among researchers working in the area of discovery science. It has become a good custom over the years that the DS conference is held in parallel with the International Conference on Algorithmic Learning Theory (ALT). This combination of ALT and DS allows for a comprehensive treatment of the whole range, from theoretical investigations to practical applications. Continuing the good tradition, DS 2003 was co-located with the 14th ALT conference (ALT 2003). The proceedings of ALT 2003 were published as a twin volume 2842 of the LNAI series. The DS conference series has been supervised by the international steering committee chaired by Hiroshi Motoda (Osaka University, Japan). The other members are Alberto Apostolico (Univ. of Padova, Italy and Purdue University, USA), Setsuo Arikawa (Kyushu University, Japan), Achim Hoffmann (UNSW, Australia), Klaus P. Jantke (DFKI, Germany), Masahiko Sato (Kyoto University, Japan), Ayumi Shinohara (Kyushu University, Japan), Carl H. Smith (University of Maryland, USA), and Thomas Zeugmann (University of Lübeck, Germany). We received 80 submissions, out of which 18 long and 29 short papers were selected by the program committee, based on clarity, significance, and originality, as well as on relevance to the field of discovery science. The DS 2003 conference had two paper categories, long and short papers. Long papers were presented as 25-minute talks, while short papers were presented as 5-minute talks accompanied by poster presentations. Due to the limited time of the conference, some long submissions could only be accepted as short papers. Some authors of those papers decided not to submit final versions or to just publish abstracts of their papers. This volume consists of three parts. The first part contains the invited talks of ALT 2003 and DS 2003. These talks were given by Thomas Eiter (Technische Universität Wien, Austria), Genshiro Kitagawa (The Institute of Statistical Mathematics, Japan), Akihiko Takano (National Institute of Informatics, Japan), Naftali Tishby (The Hebrew University, Israel), and Thomas Zeugmann (University of Lübeck, Germany). Because the invited talks were shared by the DS 2003 and ALT 2003 conferences, this volume contains the full versions of Thomas Eiter’s, Genshiro Kitagawa’s, and Akihiko Takano’s talks, as well as abstracts of the talks by the others. The second part of this volume contains the accepted long papers, and the third part contains the accepted short papers.
We would like to express our gratitude to our program committee members and their subreferees, who did a great job in reviewing and evaluating the submissions, and who made the final decisions through intensive discussions to ensure the high quality of the conference. Furthermore, we thank everyone who led this conference to a great success: the authors for submitting papers, the invited speakers for accepting our invitations and giving stimulating talks, the steering committee and the sponsors for their support, the ALT chairpersons Klaus P. Jantke, Ricard Gavaldà, and Eiji Takimoto for their collaboration, and, last but not least, Makoto Haraguchi and Yoshiaki Okubo (both Hokkaido University, Japan) for the local arrangements of the twin conferences.
October 2003
Gunter Grieser Yuzuru Tanaka Akihiro Yamamoto
Organization
Conference Chair Yuzuru Tanaka
Hokkaido University, Japan
Program Committee
Gunter Grieser (Co-chair), Technical University Darmstadt, Germany
Akihiro Yamamoto (Co-chair), Hokkaido University, Japan
Simon Colton, Imperial College London, UK
Vincent Corruble, Université P. et M. Curie, Paris, France
Johannes Fürnkranz, Research Institute for AI, Austria
Achim Hoffmann, University of New South Wales, Australia
Naresh Iyer, GE Global Research Center, USA
John R. Josephson, Ohio State University, USA
Eamonn Keogh, University of California, USA
Mineichi Kudo, Hokkaido University, Japan
Nicolas Lachiche, University of Strasbourg, France
Steffen Lange, DFKI GmbH, Germany
Lorenzo Magnani, University of Pavia, Italy
Michael May, Fraunhofer AIS, Germany
Hiroshi Motoda, Osaka University, Japan
Nancy J. Nersessian, Georgia Institute of Technology, USA
Vladimir Pericliev, Academy of Sciences, Sofia, Bulgaria
Jan Rauch, University of Economics, Prague, Czech Republic
Henk W. de Regt, Free University Amsterdam, The Netherlands
Ken Satoh, National Institute of Informatics, Japan
Tobias Scheffer, Humboldt University, Berlin, Germany
Einoshin Suzuki, Yokohama National University, Japan
Masayuki Takeda, Kyushu University, Japan
Ljupčo Todorovski, Jožef Stefan Institute, Ljubljana, Slovenia
Gerhard Widmer, University of Vienna, Austria
Local Arrangements
Makoto Haraguchi (Chair), Hokkaido University, Japan
Yoshiaki Okubo, Hokkaido University, Japan
Subreferees
Hideo Bannai, Rens Bod, Marco Chiarandini, Nigel Collier, Pascal Divoux, Thomas Gärtner, Peter Grigoriev, Makoto Haraguchi, Katsutoshi Hirayama, Akira Ishino, Takayuki Ito, Ai Kawazoe, Nikolay Kirov, Edda Leopold, Tsuyoshi Murata, Atsuyoshi Nakamura, Luis F. Paquete, Lourdes Peña Castillo, Johann Petrak, Detlef Prescher, Ehud Reither, Alexandr Savinov, Tommaso Schiavinotto, Alexander K. Seewald, Esko Ukkonen, Serhiy Yevtushenko, Kenichi Yoshida, Sandra Zilles
Sponsors Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) The Suginome Memorial Foundation, Japan
Table of Contents
Invited Talks

Abduction and the Dualization Problem . . . . . 1
Thomas Eiter, Kazuhisa Makino
Signal Extraction and Knowledge Discovery Based on Statistical Modeling . . . . . 21
Genshiro Kitagawa
Association Computation for Information Access . . . . . 33
Akihiko Takano
Efficient Data Representations That Preserve Information . . . . . 45
Naftali Tishby
Can Learning in the Limit Be Done Efficiently? . . . . . 46
Thomas Zeugmann
Long Papers

Discovering Frequent Substructures in Large Unordered Trees . . . . . 47
Tatsuya Asai, Hiroki Arimura, Takeaki Uno, Shin-ichi Nakano
Discovering Rich Navigation Patterns on a Web Site . . . . . 62
Karine Chevalier, Cécile Bothorel, Vincent Corruble
Mining Frequent Itemsets with Category-Based Constraints . . . . . 76
Tien Dung Do, Siu Cheung Hui, Alvis Fong
Modelling Soil Radon Concentration for Earthquake Prediction . . . . . 87
Sašo Džeroski, Ljupčo Todorovski, Boris Zmazek, Janja Vaupotič, Ivan Kobal
Dialectical Evidence Assembly for Discovery . . . . . 100
Alistair Fletcher, John Davis
Performance Analysis of a Greedy Algorithm for Inferring Boolean Functions . . . . . 114
Daiji Fukagawa, Tatsuya Akutsu
Performance Evaluation of Decision Tree Graph-Based Induction . . . . . 128
Warodom Geamsakul, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, Takashi Washio
Discovering Ecosystem Models from Time-Series Data . . . . . 141
Dileep George, Kazumi Saito, Pat Langley, Stephen Bay, Kevin R. Arrigo
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets and Genetic Algorithm . . . . . 153
Xiaoshu Hang, Honghua Dai
Extraction of Coverings as Monotone DNF Formulas . . . . . 166
Kouichi Hirata, Ryosuke Nagazumi, Masateru Harao
What Kinds and Amounts of Causal Knowledge Can Be Acquired from Text by Using Connective Markers as Clues? . . . . . 180
Takashi Inui, Kentaro Inui, Yuji Matsumoto
Clustering Orders . . . . . 194
Toshihiro Kamishima, Jun Fujiki
Business Application for Sales Transaction Data by Using Genome Analysis Technology . . . . . 208
Naoki Katoh, Katsutoshi Yada, Yukinobu Hamuro
Improving Efficiency of Frequent Query Discovery by Eliminating Non-relevant Candidates . . . . . 220
Jérôme Maloberti, Einoshin Suzuki
Chaining Patterns . . . . . 233
Taneli Mielikäinen
An Algorithm for Discovery of New Families of Optimal Regular Networks . . . . . 245
Oleg Monakhov, Emilia Monakhova
Enumerating Maximal Frequent Sets Using Irredundant Dualization . . . . . 256
Ken Satoh, Takeaki Uno
Discovering Exceptional Information from Customer Inquiry by Association Rule Miner . . . . . 269
Keiko Shimazu, Atsuhito Momma, Koichi Furukawa
Short Papers

Automatic Classification for the Identification of Relationships in a Meta-data Repository . . . . . 283
Gerd Beuster, Ulrich Furbach, Margret Gross-Hardt, Bernd Thomas
Effects of Unreliable Group Profiling by Means of Data Mining . . . . . 291
Bart Custers
Using Constraints in Discovering Dynamics . . . . . 297
Sašo Džeroski, Ljupčo Todorovski, Peter Ljubič
SA-Optimized Multiple View Smooth Polyhedron Representation NN . . . . . 306
Mohamad Ivan Fanany, Itsuo Kumazawa
Elements of an Agile Discovery Environment . . . . . 311
Peter A. Grigoriev, Serhiy A. Yevtushenko
Discovery of User Preference in Personalized Design Recommender System through Combining Collaborative Filtering and Content Based Filtering . . . . . 320
Kyung-Yong Jung, Jason J. Jung, Jung-Hyun Lee
Discovery of Relationships between Interests from Bulletin Board System by Dissimilarity Reconstruction . . . . . 328
Kou Zhongbao, Ban Tao, Zhang Changshui
A Genetic Algorithm for Inferring Pseudoknotted RNA Structures from Sequence Data . . . . . 336
Dongkyu Lee, Kyungsook Han
Prediction of Molecular Bioactivity for Drug Design Using a Decision Tree Algorithm . . . . . 344
Sanghoon Lee, Jihoon Yang, Kyung-whan Oh
Mining RNA Structure Elements from the Structure Data of Protein-RNA Complexes . . . . . 352
Daeho Lim, Kyungsook Han
Discovery of Cellular Automata Rules Using Cases . . . . . 360
Ken-ichi Maeda, Chiaki Sakama
Discovery of Web Communities from Positive and Negative Examples . . . . . 369
Tsuyoshi Murata
Association Rules and Dempster-Shafer Theory of Evidence . . . . . 377
Tetsuya Murai, Yasuo Kudo, Yoshiharu Sato
Subgroup Discovery among Personal Homepages . . . . . 385
Toyohisa Nakada, Susumu Kunifuji
Collaborative Filtering Using Projective Restoration Operators . . . . . 393
Atsuyoshi Nakamura, Mineichi Kudo, Akira Tanaka, Kazuhiko Tanabe
Discovering Homographs Using N-Partite Graph Clustering . . . . . 402
Hidekazu Nakawatase, Akiko Aizawa
Discovery of Trends and States in Irregular Medical Temporal Data . . . . . 410
Trong Dung Nguyen, Saori Kawasaki, Tu Bao Ho
Creating Abstract Concepts for Classification by Finding Top-N Maximal Weighted Cliques . . . . . 418
Yoshiaki Okubo, Makoto Haraguchi
Content-Based Scene Change Detection of Video Sequence Using Hierarchical Hidden Markov Model . . . . . 426
Jong-Hyun Park, Soon-Young Park, Seong-Jun Kang, Wan-Hyun Cho
An Appraisal of UNIVAUTO – The First Discovery Program to Generate a Scientific Article . . . . . 434
Vladimir Pericliev
Scilog: A Language for Scientific Processes and Scales . . . . . 442
Joseph Phillips
Mining Multiple Clustering Data for Knowledge Discovery . . . . . 452
Thanh Tho Quan, Siu Cheung Hui, Alvis Fong
Bacterium Lingualis – The Web-Based Commonsensical Knowledge Discovery Method . . . . . 460
Rafal Rzepka, Kenji Araki, Koji Tochinai
Inducing Biological Models from Temporal Gene Expression Data . . . . . 468
Kazumi Saito, Dileep George, Stephen Bay, Jeff Shrager
Knowledge Discovery on Chemical Reactivity from Experimental Reaction Information . . . . . 470
Hiroko Satoh, Tadashi Nakata
A Method of Extracting Related Words Using Standardized Mutual Information . . . . . 478
Tomohiko Sugimachi, Akira Ishino, Masayuki Takeda, Fumihiro Matsuo
Discovering Most Classificatory Patterns for Very Expressive Pattern Classes . . . . . 486
Masayuki Takeda, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Setsuo Arikawa
Mining Interesting Patterns Using Estimated Frequencies from Subpatterns and Superpatterns . . . . . 494
Yukiko Yoshida, Yuiko Ohta, Ken’ichi Kobayashi, Nobuhiro Yugami
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
Abduction and the Dualization Problem

Thomas Eiter¹ and Kazuhisa Makino²

¹ Institut für Informationssysteme, Technische Universität Wien, Favoritenstraße 9-11, A-1040 Wien, Austria, [email protected]
² Division of Mathematical Science for Social Systems, Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka, 560-8531, Japan, [email protected]
Abstract. Computing abductive explanations is an important problem, which has been studied extensively in Artificial Intelligence (AI) and related disciplines. While computing some abductive explanation for a literal χ with respect to a set of abducibles A from a Horn propositional theory Σ is intractable under the traditional representation of Σ by a set of Horn clauses, the problem is polynomial under model-based theory representation, where Σ is represented by its characteristic models. Furthermore, computing all the (possibly exponentially) many explanations is polynomial-time equivalent to the problem of dualizing a positive CNF, which is a well-known problem whose precise complexity in terms of the theory of NP-completeness is not known yet. In this paper, we first review the monotone dualization problem and its connection to computing all abductive explanations for a query literal and some related problems in knowledge discovery. We then investigate possible generalizations of this connection to abductive queries beyond literals. Among other results, we find that the equivalence for generating all explanations for a clause query (resp., term query) χ to the monotone dualization problem holds if χ contains at most k positive (resp., negative) literals for constant k, while the problem is not solvable in polynomial total-time, i.e., in time polynomial in the combined size of the input and the output, unless P=NP for general clause resp. term queries. Our results shed new light on the computational nature of abduction and Horn theories in particular, and might be interesting also for related problems, which remains to be explored. Keywords: Abduction, monotone dualization, hypergraph transversals, Horn functions, model-based reasoning, polynomial total-time computation, NP-hardness.
1 Introduction
Abduction is a fundamental mode of reasoning, which was extensively studied by C.S. Peirce [54]. It has taken on increasing importance in Artificial Intelligence (AI) and
This work was supported in part by the Austrian Science Fund (FWF) Project Z29-N04, by a TU Wien collaboration grant, and by the Scientific Grant in Aid of the Ministry of Education, Science, Sports, Culture and Technology of Japan.
related disciplines, where it has been recognized as an important principle of commonsense reasoning (see e.g. [9]). Abduction has applications in many areas of AI and Computer Science including diagnosis, database updates, planning, natural language understanding, learning etc. (see e.g. references in [21]), where it is primarily used for generating explanations. Specialized workshops have been held in recent years, in which the nature and interrelation with other modes of reasoning, in particular induction and deduction, have been investigated. In a logic-based setting, abduction can be seen as the task to find, given a set of formulas Σ (the background theory), a formula χ (the query), and a set of formulas A (the abducibles or hypotheses), a minimal subset E of A such that Σ plus E is satisfiable and logically entails χ (i.e., an explanation). A frequent scenario is where Σ is a propositional Horn theory, χ is a single literal or a conjunction of literals, and A contains literals. For use in practice, the computation of abductive explanations in this setting is an important problem, for which well-known early systems such as Theorist [55] or ATMS solvers [13,56] have been devised. Since then, there has been a growing literature on this subject. Computing some explanation for a query literal χ from a Horn theory Σ w.r.t. assumptions A is a well-known NP-hard problem [57], even if χ and A are positive. Much effort has been spent on studying various input restrictions, cf. [29,11,27,16,15,21,57,58,59], in order to single out tractable cases of abduction. For example, the case where A comprises all literals is tractable; such explanations are assumption-free explanations. It turned out that abduction is tractable in model-based reasoning, which has been proposed as an alternative form of representing and accessing a logical knowledge base, cf. [14,34,35,36,42,43]. Model-based reasoning can be seen as an approach towards Levesque’s notion of “vivid” reasoning [44], which asks for a more straight representation of the background theory Σ from which common-sense reasoning is easier and more suitable than from the traditional formula-based representation. In model-based reasoning, Σ is represented by a subset S of its models, which are commonly called characteristic models, rather than by a set of formulas. Given a suitable query χ, the test for Σ |= χ becomes then as easy as to check whether χ is true in all models of S, which can be decided efficiently. We here mention that the formula-based and the model-based approach are orthogonal, in the sense that while a theory may have a small representation in one formalism, it has an exponentially larger representation in the other. The intertranslatability of the two approaches, in particular for Horn theories, has been addressed in [34,35,36,40,42]. Several techniques for efficient model-based representation of various fragments of propositional logic have been devised, cf. [35,42,43]. As shown by Kautz et al., an explanation for a positive literal χ = q w.r.t. assumptions A from a Horn theory Σ, represented by its set of characteristic models, char(Σ), can be computed in polynomial time [34,35,42]; this result extends to negative literal queries χ = q̄ as well, and has been generalized by Khardon and Roth [42] to other fragments of propositional logic. Hence, model-based representation is attractive from this viewpoint of efficiently finding some explanation.
While computing some explanation of a query χ has been studied extensively in the literature, computing multiple or even all explanations for χ has received less attention. However, this problem is important, since often one would like to select one out of a
set of alternative explanations according to a preference or plausibility relation; this relation may be based on subjective intuition and thus difficult to formalize. As easily seen, exponentially many explanations may exist for a query, and thus computing all explanations inevitably requires exponential time in general, even in propositional logic. However, it is of interest whether the computation is possible in polynomial total-time (or output-polynomial time), i.e., in time polynomial in the combined size of the input and the output. Furthermore, if exponential space is prohibitive, it is of interest to know whether a few explanations (e.g., polynomially many) can be generated in polynomial time, as studied by Selman and Levesque [58]. In general, computing all explanations for a literal χ (positive as well as negative) w.r.t. assumptions A from a Horn theory Σ is under formula-based representation not possible in polynomial total-time unless P=NP; this can be shown by standard arguments appealing to the NP-hardness of deciding the existence of some explanation. For generating all assumption-free explanations for a positive literal, a resolution-style procedure has been presented in [24] which works in polynomial total-time, while for a negative literal no polynomial total-time algorithm exists unless P=NP [25]. However, under model-based representation, such results are not known. It turned out that generating all explanations for a literal is polynomial-time equivalent to the problem of dualizing a monotone CNF expression (cf. [2,20,28]), as shown in [24]. Here, polynomial-time equivalence means mutual polynomial-time transformability between deterministic functions, i.e., A reduces to B, if there are polynomial-time functions f, g such that for any input I of A, f(I) is an input of B, and if O is the output for f(I), then g(O) is the output of I, cf. [52]; moreover, O is requested to have size polynomial in the size of the output for I (otherwise, trivial reductions may exist). This result, for definite Horn theories and positive literals, is implicit also in earlier work on dependency inference [49,50], and is closely related to results in [40]. The monotone dualization problem is an interesting open problem in the theory of NP-completeness (cf. [45,53]), which has a number of applications in different areas of Computer Science [2,19], including logic and AI [22]; the problem is reviewed in Section 2.2, where related problems in knowledge discovery are also briefly mentioned. In the rest of this paper, we first review the result on equivalence between monotone dualization and generating all explanations for a literal under model-based theory representation. We then consider possible generalizations of this result for queries χ beyond literals, where we consider DNF, CNF and important special cases such as a clause and a term (i.e., a conjunction of literals). Note that the explanations for single clause queries correspond to the minimal support clauses for a clause in Clause Management Systems [56,38,39]. Furthermore, we shall consider on the fly also some of these cases under formula-based theory representation. Our aim will be to elucidate the frontier of monotone dualization equivalent versus intractable instances, i.e., not solvable in polynomial total-time unless P=NP, of the problem. It turns out that indeed the results in [24] generalize to clause and term queries under certain restrictions.
In particular, the equivalence for generating all explanations for a clause query (resp., term query) χ to the monotone dualization holds if χ contains at most k positive (resp., negative) literals for constant k, while the problem is not solvable in polynomial total-time unless P=NP for general clause (resp., term) queries.
Our results shed new light on the computational nature of abduction and Horn theories in particular, and might be interesting also for related problems, which remains to be explored.
2 Notation and Concepts
We assume a propositional (Boolean) setting with atoms x1, x2, . . . , xn from a set At, where each xi takes either value 1 (true) or 0 (false). Negated atoms are denoted by x̄i, and the opposite of a literal ℓ by ℓ̄. Furthermore, we use Ā = {ℓ̄ | ℓ ∈ A} for any set of literals A, and set Lit = At ∪ Āt. A theory Σ is any finite set of formulas. A clause is a disjunction c = ⋁_{p∈P(c)} p ∨ ⋁_{p∈N(c)} p̄ of literals, where P(c) and N(c) are respectively the sets of atoms occurring positively and negatively in c, and P(c) ∩ N(c) = ∅. Dually, a term is a conjunction t = ⋀_{p∈P(t)} p ∧ ⋀_{p∈N(t)} p̄ of literals, where P(t) and N(t) are similarly defined. A conjunctive normal form (CNF) is a conjunction of clauses, and a disjunctive normal form (DNF) is a disjunction of terms. As common, we also view clauses c and terms t as the sets of literals they contain, and similarly CNFs ϕ and DNFs ψ as sets of clauses and terms, respectively, and write ℓ ∈ c, c ∈ ϕ etc. A clause c is prime w.r.t. theory Σ, if Σ |= c but Σ ⊭ c′ for every c′ ⊂ c. A CNF ϕ is prime, if each c ∈ ϕ is prime, and irredundant, if ϕ \ {c} ≢ ϕ for every c ∈ ϕ. Prime terms and irredundant prime DNFs are defined analogously. A clause c is Horn, if |P(c)| ≤ 1, and negative (resp., positive), if |P(c)| = 0 (resp., |N(c)| = 0). A CNF is Horn (resp., negative, positive), if it contains only Horn clauses (resp., negative, positive clauses). A theory Σ is Horn, if it is a set of Horn clauses. As usual, we identify Σ with ϕ = ⋀_{c∈Σ} c.

Example 1. The CNF ϕ = (x̄1 ∨ x4) ∧ (x4 ∨ x̄3) ∧ (x̄1 ∨ x2) ∧ (x̄4 ∨ x̄5 ∨ x1) ∧ (x̄2 ∨ x̄5 ∨ x3) over At = {x1, x2, . . . , x5} is Horn.

The following proposition is well-known.

Proposition 1. Given a Horn CNF ϕ and a clause c, deciding whether ϕ |= c is possible in polynomial time (in fact, in linear time, cf. [18]).

Horn theories have a well-known semantic characterization. A model is a vector v ∈ {0, 1}ⁿ, whose i-th component is denoted by vi. For B ⊆ {1, . . . , n}, we let xB be the model v such that vi = 1, if i ∈ B and vi = 0, if i ∉ B, for i ∈ {1, . . . , n}. The notions of satisfaction v |= ϕ of a formula ϕ and consequence Σ |= ϕ, ψ |= ϕ etc. are as usual; the set of models of ϕ (resp., theory Σ) is denoted by mod(ϕ) (resp., mod(Σ)). In the example above, the vector u = (0, 1, 0, 1, 0) is a model of ϕ, i.e., u |= ϕ. For models v, w, we denote by v ≤ w the usual componentwise ordering, i.e., vi ≤ wi for all i = 1, 2, . . . , n, where 0 ≤ 1; v < w means v ≠ w and v ≤ w. For any set of models M, we denote by max(M) (resp., min(M)) the set of all maximal (resp., minimal) models in M. We denote by v ∧ w the componentwise AND of vectors v, w ∈ {0, 1}ⁿ (i.e., their intersection), and by Cl∧(S) the closure of S ⊆ {0, 1}ⁿ under ∧. Then, a theory Σ is Horn representable, iff mod(Σ) = Cl∧(mod(Σ)).
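To make Proposition 1 concrete, here is a small Python sketch of the standard unit-propagation argument: ϕ |= c holds iff ϕ together with the complements of the literals of c is unsatisfiable, and for Horn formulas unit propagation decides this. The clause encoding (sets of literal strings, with a leading "-" marking negation) is our own convention, and the loop below is quadratic rather than the linear-time procedure of [18].

```python
def horn_entails(horn_clauses, clause):
    """Decide whether a Horn CNF entails a clause (cf. Proposition 1).
    Clauses are sets of literal strings; "-x" denotes the negated atom x."""
    def neg(lit):
        return lit[1:] if lit.startswith("-") else "-" + lit

    # Negating the query clause yields one unit clause per literal.
    clauses = [set(c) for c in horn_clauses] + [{neg(l)} for l in clause]
    assigned = set()              # literals forced to be true
    changed = True
    while changed:
        changed = False
        for c in clauses:
            remaining = {l for l in c if neg(l) not in assigned}
            if any(l in assigned for l in remaining):
                continue          # clause already satisfied
            if not remaining:
                return True       # empty clause derived: unsatisfiable, so entailment holds
            if len(remaining) == 1:
                assigned.add(remaining.pop())
                changed = True
    return False                  # propagation stops without conflict: no entailment

# Example 1 and 3 (as reconstructed above): phi together with {x1} entails x2
phi = [{"-x1", "x4"}, {"x4", "-x3"}, {"-x1", "x2"},
       {"-x4", "-x5", "x1"}, {"-x2", "-x5", "x3"}]
print(horn_entails(phi + [{"x1"}], {"x2"}))   # -> True
```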
Example 2. Consider M1 = {(0101), (1001), (1000)} and M2 = {(0101), (1001), (1000), (0001), (0000)}. Then, for v = (0101), w = (1000), we have w, v ∈ M1, while v ∧ w = (0000) ∉ M1; hence M1 is not the set of models of a Horn theory. On the other hand, Cl∧(M2) = M2, thus M2 = mod(Σ2) for some Horn theory Σ2.

As discussed by Kautz et al. [34], a Horn theory Σ is semantically represented by its characteristic models, where v ∈ mod(Σ) is called characteristic (or extreme [14]), if v ∉ Cl∧(mod(Σ) \ {v}). The set of all such models, the characteristic set of Σ, is denoted by char(Σ). Note that char(Σ) is unique. E.g., (0101) ∈ char(Σ2), while (0000) ∉ char(Σ2); we have char(Σ2) = M1. The following proposition is compatible with Proposition 1.

Proposition 2. Given char(Σ) and a clause c, deciding whether Σ |= c is possible in polynomial time (in fact, in linear time, cf. [34,26]).

The model-based reasoning paradigm has been further investigated e.g. in [40,42], where also theories beyond Horn have been considered [42].

2.1 Abductive Explanations

The notion of an abductive explanation can be formalized as follows.

Definition 1. Given a (Horn) theory Σ, called the background theory, a CNF χ (called query), and a set of literals A ⊆ Lit (called abducibles), an explanation of χ w.r.t. A is a minimal set of literals E over A such that (i) Σ ∪ E |= χ, and (ii) Σ ∪ E is satisfiable. If A = Lit, then we call E simply an explanation of χ.

The above definition generalizes the assumption-based explanations of [58], which emerge as A = P ∪ P̄ where P ⊆ At (i.e., A contains all literals over a subset P of the letters) and χ = q for some atom q. Furthermore, in some texts (e.g., [21]) explanations must be sets of positive literals, and χ is restricted to a special form; in [21], χ is requested to be a conjunction of atoms. The following characterization of explanations is immediate from the definition.

Proposition 3. For any theory Σ, any query χ, and any E ⊆ A (⊆ Lit), E is an explanation for χ w.r.t. A from Σ iff the following conditions hold: (i) Σ ∪ E is satisfiable, (ii) Σ ∪ E |= χ, and (iii) Σ ∪ (E \ {ℓ}) ⊭ χ, for every ℓ ∈ E.

Example 3. Reconsider the Horn CNF ϕ = (x̄1 ∨ x4) ∧ (x4 ∨ x̄3) ∧ (x̄1 ∨ x2) ∧ (x̄4 ∨ x̄5 ∨ x1) ∧ (x̄2 ∨ x̄5 ∨ x3) from above. Suppose we want to explain χ = x2 from A = {x1, x4}. Then, we find that E = {x1} is an explanation. Indeed, Σ ∪ {x1} |= x2, and Σ ∪ {x1} is satisfiable; moreover, E is minimal. On the other hand, E = {x1, x4} satisfies (i) and (ii) for χ = x2, but is not minimal.

We note that there is a close connection between the explanations of a literal and the prime clauses of a theory.
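The closure and characteristic-set notions above are easy to compute directly (though worst-case exponentially many intersections are formed); the following sketch, with models represented as bit tuples, reproduces Example 2. The function names are ours.

```python
def intersection_closure(models):
    """Close a set of models (bit tuples) under componentwise AND."""
    closed = set(models)
    frontier = set(models)
    while frontier:
        new = set()
        for v in frontier:
            for w in closed:
                u = tuple(a & b for a, b in zip(v, w))
                if u not in closed:
                    new.add(u)
        closed |= new
        frontier = new
    return closed

def characteristic_set(models):
    """char(S): the models not generated by intersecting the other models."""
    models = set(models)
    return {v for v in models
            if v not in intersection_closure(models - {v})}

# Example 2 from the text:
M1 = {(0, 1, 0, 1), (1, 0, 0, 1), (1, 0, 0, 0)}
M2 = M1 | {(0, 0, 0, 1), (0, 0, 0, 0)}
assert intersection_closure(M2) == M2      # M2 is closed, hence Horn representable
assert characteristic_set(M2) == M1        # char(Sigma2) = M1
```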
Proposition 4 (cf. [56,32]). For any theory Σ and literal χ, a set E ⊆ A (⊆ Lit) with E ≠ {χ} is an explanation of χ w.r.t. A, iff the clause c = ⋁_{ℓ∈E} ℓ̄ ∨ χ is a prime clause of Σ.

Thus, computing explanations for a literal is easily seen to be polynomial-time equivalent to computing prime clauses of a certain form. We refer here to [51] for an excellent survey of techniques for computing explanations via computing prime implicates and related problems.
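Definition 1 and Proposition 3 can be turned into a naive generator of all explanations by enumerating candidate sets over A in order of increasing size and keeping those that are consistent, entail the query, and have no smaller explanation inside them. The sketch below does this for a literal query over a Horn theory, reusing horn_entails (and phi) from the earlier sketch; it is exponential and only meant to illustrate the conditions, not the efficient procedures discussed later.

```python
from itertools import combinations

def horn_satisfiable(horn_clauses):
    """A Horn CNF is unsatisfiable iff it entails the empty clause."""
    return not horn_entails(horn_clauses, set())

def explanations(horn_clauses, query_lit, abducibles):
    """All explanations of a literal query w.r.t. abducibles A (Definition 1),
    by brute force over subsets of A in order of increasing size."""
    found = []
    for k in range(len(abducibles) + 1):
        for cand in combinations(sorted(abducibles), k):
            E = set(cand)
            if any(set(F) < E for F in found):
                continue              # a proper subset already explains: E is not minimal
            theory = list(horn_clauses) + [{l} for l in E]
            if horn_satisfiable(theory) and horn_entails(theory, {query_lit}):
                found.append(frozenset(E))
    return [set(F) for F in found]

# Example 3: the explanations of x2 w.r.t. A = {x1, x4} from phi
print(explanations(phi, "x2", {"x1", "x4"}))   # -> [{'x1'}]
```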
2.2 Dualization Problem
Dualization of Boolean functions (i.e., given a formula ϕ defining a function f, compute a formula ψ for the dual function f^d) is a well-known computational problem. The problem is trivial if ψ may be any formula, since we only need to interchange ∨ and ∧ in ϕ (and constants 0 and 1, if occurring). The problem is more difficult if ψ should be of special form. In particular, if ϕ is a CNF and ψ should be an irredundant prime CNF (to avoid redundancy); this problem is known as Dualization [22]. For example, if ϕ = (x1 ∨ x3)(x2 ∨ x̄3), then a suitable ψ would be (x2 ∨ x3)(x1 ∨ x̄3), since (x1 ∧ x3) ∨ (x2 ∧ x̄3) ≡ (x1 ∨ x2)(x2 ∨ x3)(x1 ∨ x̄3) simplifies to it. Clearly, ψ may have size exponential in the size of ϕ, and thus the issue is here whether a polynomial total-time algorithm exists (rather than one polynomial in the input size). While it is easy to see that the problem is not solvable in polynomial total-time unless P=NP, this result could neither be established for the important class of positive (monotone) Boolean functions so far, nor is a polynomial total-time algorithm known to date, cf. [23,28,37]. Note that for this subproblem, denoted Monotone Dualization, the problem looks simpler: all prime clauses of a monotone Boolean function f are positive and f has a unique prime CNF, which can be easily computed from any given CNF (just remove all negative literals and non-minimal clauses). Thus, in this case the problem boils down to converting the prime positive DNF ϕ^d constructed from ϕ into the equivalent prime (monotone) CNF. An important note is that Monotone Dualization is intimately related to its decisional variant, Monotone Dual, since Monotone Dualization is solvable in polynomial total-time iff Monotone Dual is solvable in polynomial time, cf. [2]. Monotone Dual consists of deciding, for a pair of CNFs ϕ, ψ, whether ψ is the prime CNF for the dual of the monotone function represented by ϕ (strictly speaking, this is a promise problem [33], since valid input instances are not recognized in polynomial time; for certain instances such as positive ϕ, this is ensured). A landmark result on Monotone Dual was [28], which presents an algorithm solving the problem in time n^{o(log n)}. More recently, algorithms have been exhibited [23,37] which show that the complementary problem can be solved with limited nondeterminism in polynomial time, i.e., by a nondeterministic polynomial-time algorithm that makes only a poly-logarithmic number of guesses in the size of the input. Although it is still open whether Monotone Dual is polynomially solvable, several relevant tractable classes were found by various authors (see e.g. [8,12,17,20,30,47] and references therein). A lot of research effort has been spent on Monotone Dualization and Monotone Dual (see survey papers, e.g. [45,53,22]), since a number of problems turned
out to be polynomial-time equivalent to this problem; see e.g. [2,19,20] and the more recent paper [22]. Polynomial-time equivalence of computation problems Π and Π′ is here understood in the sense that problem Π reduces to Π′ and vice versa, where Π reduces to Π′, if there are polynomial-time functions f, g such that for any input I of Π, f(I) is an input of Π′, and if O is the output for f(I), then g(O) is the output of I, cf. [52]; moreover, O is requested to have size polynomial in the size of the output for I (if not, trivial reductions may exist). Of the many problems to which Monotone Dualization is polynomially equivalent, we mention here computing the transversal hypergraph of a hypergraph (known as Transversal Enumeration (TRANS-ENUM)) [22]. A hypergraph H = (V, E) is a collection E of subsets e ⊆ V of a finite set V, where the elements of E are called hyperedges (or simply edges). A transversal of H is a set t ⊆ V that meets every e ∈ E, and is minimal, if it contains no other transversal properly. The transversal hypergraph of H is then the unique hypergraph Tr(H) = (V, T) where T are all minimal transversals of H. Problem TRANS-ENUM is then, given a hypergraph H = (V, E), to generate all the edges of Tr(H); TRANS-HYP is deciding, given H = (V, E) and a set of minimal transversals T, whether Tr(H) = (V, T). There is a simple correspondence between Monotone Dualization and TRANS-ENUM: for any positive CNF ϕ on At representing a Boolean function f, the prime CNF ψ for the dual of f consists of all clauses c such that c ∈ Tr(At, ϕ) (viewing ϕ as a set of clauses). E.g., if ϕ = (x1 ∨ x2)x3, then ψ = (x1 ∨ x3)(x2 ∨ x3). As for computational learning theory, Monotone Dual resp. Monotone Dualization are of relevance in the context of exact learning, cf. [2,31,47,48,17], which we briefly review here. Let us consider the exact learning of DNF (or CNF) formulas of monotone Boolean functions f by membership oracles only, i.e., the problem of identifying a prime DNF (or prime CNF) of an unknown monotone Boolean function f by asking queries to an oracle whether f(v) = 1 holds for some selected models v. It is known [1] that monotone DNFs (or CNFs) are not exactly learnable with membership oracles alone in time polynomial in the size of the target DNF (or CNF) formula, since information-theoretic barriers impose a |CNF(f)| + |DNF(f)| lower bound on the number of queries needed, where |CNF(f)| and |DNF(f)| denote the numbers of prime implicates and prime implicants of f, respectively. This fact raises the following question:
– Can we identify both the prime DNF and CNF of an unknown monotone function f by membership oracles alone in time polynomial in |CNF(f)| + |DNF(f)|?
Since the prime DNF (resp., prime CNF) corresponds one-to-one to the set of all minimal true models (resp., all maximal false models) of f, the above question can be restated in the following natural way [2,47]: can we compute the boundary between true and false areas of an unknown monotone function in polynomial total-time? There is a simple algorithm for the problem, as follows, which uses a DNF h and a CNF h′ consisting of some prime implicants and prime implicates of f, respectively, such that h |= ϕ and ϕ |= h′, for any formula ϕ representing f:
Step 1. Set h and h′ to be empty (i.e., falsity and truth).
Step 2. while h ≢ h′ do
    Take a counterexample x of h ≡ h′;
    if f(x) = 1 then begin
        Minimize t = ⋀_{i: xi=1} xi to a prime implicant t* of f;
        h := h ∨ t* (i.e., add t* to h);
    end
    else /* f(x) = 0 */ begin
        Minimize c = ⋁_{i: xi=0} xi to a prime implicate c* of f;
        h′ := h′ ∧ c* (i.e., add c* to h′);
    end
Step 3. Output h and h′.

This algorithm needs O(n(|CNF(f)| + |DNF(f)|)) many membership queries. If h ≡ h′ (i.e., the pair (h^d, h′) is a Yes instance of Monotone Dual) can always be decided in polynomial time, then the algorithm is polynomial in n, |CNF(f)|, and |DNF(f)|. (The converse is also known [2], i.e., if the above exact learning problem is solvable in polynomial total-time, then Monotone Dual is polynomially solvable.) Of course, other kinds of algorithms exist; for example, [31] derived an algorithm with different behavior and query bounds. Thus, for the classes C of monotone Boolean functions which enjoy the properties that (i) Monotone Dual is polynomially solvable and a counterexample is found in polynomial time in case (which is possible under mild conditions, cf. [2]), and (ii) the family of prime DNFs (or CNFs) is hereditary, i.e., if a function with the prime DNF φ = ⋁_{i∈I} ti is in C, then any function with the prime DNF φS = ⋁_{i∈S} ti, where S ⊆ I, is in C, the above is a simple polynomial-time algorithm which uses polynomially many queries within the optimal bound (assuming that |CNF(f)| + |DNF(f)| is at least of order n). For many classes of monotone Boolean functions, we thus can get the learnability results from the results of Monotone Dual, e.g., k-CNFs, k-clause CNFs, read-k CNFs, and k-degenerate CNFs [20,23,17]. In knowledge discovery, Monotone Dualization and Monotone Dual are relevant in the context of several problems. For example, they are encountered in computing maximal frequent and minimal infrequent sets [6], in dependency inference and key computation from databases [49,50,19], which we will address in Sections 3 and 4.2 below, as well as in translating between models of a theory and formula representations [36,40]. Moreover, their natural generalizations have been studied to model various interesting applications [4,5,7].
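The learning loop above can be prototyped directly. In the sketch below the equivalence test h ≡ h′ is done by exhaustive search over {0,1}ⁿ (so it is exponential), whereas the point of the discussion is that this very step is a Monotone Dual instance. The oracle interface and the representation of terms and clauses as lists of variable indices are our own choices.

```python
from itertools import product

def learn_monotone(f, n):
    """Prototype of the membership-query learning loop sketched above.
    f: membership oracle of a monotone Boolean function on n variables."""
    dnf, cnf = [], []                                   # h and h'
    eval_dnf = lambda x: any(all(x[i] for i in t) for t in dnf)
    eval_cnf = lambda x: all(any(x[i] for i in c) for c in cnf)
    while True:
        # counterexample of h == h' (brute force here; really a Monotone Dual test)
        cex = next((x for x in product((0, 1), repeat=n)
                    if eval_dnf(x) != eval_cnf(x)), None)
        if cex is None:
            return dnf, cnf                             # h == h': both are exact
        x = list(cex)
        if f(tuple(x)):
            # minimize the term over the 1-positions of x to a prime implicant
            term = [i for i in range(n) if x[i] == 1]
            for i in list(term):
                if f(tuple(0 if j == i else x[j] for j in range(n))):
                    x[i] = 0
                    term.remove(i)
            dnf.append(term)
        else:
            # minimize the clause over the 0-positions of x to a prime implicate
            clause = [i for i in range(n) if x[i] == 0]
            for i in list(clause):
                if not f(tuple(1 if j == i else x[j] for j in range(n))):
                    x[i] = 1
                    clause.remove(i)
            cnf.append(clause)

# Example: f(x) = (x0 and x1) or x2
f = lambda x: (x[0] and x[1]) or x[2]
print(learn_monotone(f, 3))    # -> ([[2], [0, 1]], [[1, 2], [0, 2]])
```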
3 Explanations and Dualization
Deciding the existence of some explanation for a literal χ = ℓ w.r.t. an assumption set A from a Horn Σ is NP-complete under formula representation (i.e., Σ is given by a
Horn CNF), for both positive and negative ℓ, cf. [57,24]; hence, generating some resp. all explanations is intractable in very elementary cases (however, under restrictions such as A = Lit for positive ℓ, the problem is tractable [24]). In the model-based setting, matters are different, and there is a close relationship between abduction and monotone dualization. If we are given char(Σ) of a Horn theory Σ and an atom q, computing an explanation E for q from Σ amounts to computing a minimal set E of letters such that (i) at least one model of Σ (and hence, a model in char(Σ)) satisfies E and q, and that (ii) each model of Σ falsifying q also falsifies E; this is because an atom q has only positive explanations E, i.e., it contains only positive literals (see e.g. [42] for a proof). Viewing models as v = xB, (ii) means that E is a minimal transversal of the hypergraph (V, M) where V corresponds to the set of the variables and M consists of all V − B such that xB ∈ char(Σ) and xB ⊭ q. This led Kautz et al. [34] to an algorithm for computing an explanation E for χ = q w.r.t. a set of atoms A ⊆ At which essentially works as follows:
1. Take a model v ∈ char(Σ) such that v |= q.
2. Let V := A ∩ B and M = {V \ B′ | xB′ ∈ char(Σ), q ∉ B′}, where v = xB.
3. If ∅ ∉ M, compute a minimal transversal E of H = (V, M), and output E; otherwise, select another v in Step 1 (if no other is left, terminate with no output).
In this way, some explanation E for q w.r.t. A can be computed in polynomial time, since computing some minimal transversal of a hypergraph is clearly polynomial. Recall that under formula-based representation, this problem is NP-hard [57,58]. The method above has been generalized to arbitrary theories represented by models using Bshouty’s Monotone Theory [10] and positive abducibles A, as well as for other settings, by Khardon and Roth [42] (cf. also Section 4.2). Also all explanations of q can be generated in the way above, by taking all models v ∈ char(Σ) and all minimal transversals of (V, M). In fact, in Step 1 v can be restricted to the maximal vectors in char(Σ). Therefore, computing all explanations reduces to solving a number of transversal computation problems (which trivially amount to monotone dualization problems) in polynomial time. As shown in [24], the latter can be polynomially reduced to a single instance. Conversely, monotone dualization can be easily reduced to explanation generation, cf. [24]. This established the following result.

Theorem 1. Given char(Σ) of a Horn theory Σ, a query q, and A ⊆ Lit, computing the set of all explanations for q from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization.

A similar result holds for negative literal queries χ = q̄ as well. Also in this case, a polynomial number of transversal computation problems can be created such that each minimal transversal corresponds to some explanation. However, matters are more complicated here since a query q̄ might also have non-positive explanations. This leads to a case analysis and a more involved construction of hypergraphs. We remark that a connection between dualization and abduction from a Horn theory represented by char(Σ) is implicit in dependency inference from relational databases. An important problem there is to infer, in database terminology, a prime cover of the
set Fr+ of all functional dependencies (FDs) X→A which hold on an instance r of a relational schema U = {A1, . . . , An}, where the Ai are the attributes. A functional dependency X→A, X ⊆ U, A ∈ U, is a constraint which states that for all tuples t1 and t2 occurring in the same relation instance r, it holds that t1[A] = t2[A] whenever t1[X] = t2[X], i.e., coincidence of t1 and t2 on all attributes in X implies that t1 and t2 also coincide on A. A prime cover is a minimal (under ⊆) set of non-redundant FDs X→A (i.e., X′→A is violated for each X′ ⊂ X) which is logically equivalent to Fr+. In our terms, a non-redundant FD X→A corresponds to a prime clause ⋁_{B∈X} B̄ ∨ A of the CNF ϕFr+, where for any set of functional dependencies F, ϕF is the CNF ϕF = ⋀_{X→A∈F} (⋁_{B∈X} B̄ ∨ A), where the attributes in U are viewed as atoms and Fr+ is the set of all FDs which hold on r. Thus, by Proposition 4, the set X is an explanation of A from ϕFr+. As shown in [49], so-called max sets for all attributes A are polynomial-time computable from r, which in totality correspond to the characteristic models of the (definite) Horn theory Σr defined by ϕFr+ [41]. Computing the explanations for A is then reducible to an instance of Trans-Enum [50], which establishes the result for generating all assumption-free explanations from definite Horn theories. We refer to [41] for an elucidating discussion which reveals the close correspondence between concepts and results in database theory on Armstrong relations and in model-based reasoning, which can be exploited to derive results about space bounds for representation and about particular abduction problems. The latter will be addressed in Section 4.2. Further investigations on computing prime implicates from model-based theory representation appeared in [36] and in [40], which exhibited further problems equivalent to Monotone Dualization. In particular, Khardon has shown, inspired by results in [49,50,19], that computing all prime implicates of a Horn theory Σ represented by its characteristic models is, under Turing reducibility (which is more liberal than the notion of reducibility we employ here in general), polynomial-time equivalent to TRANS-ENUM. Note, however, that by Proposition 4, we are here concerned with computing particular prime implicates rather than all.
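As a concrete rendering of the procedure of Kautz et al. reviewed in this section, the following sketch computes one explanation for a positive literal q w.r.t. a set of atoms A from char(Σ). Models are dicts mapping atom names to 0/1 (our own representation), and the minimal-transversal step is done by greedy shrinking, which suffices when only a single explanation is needed; the trailing usage line is hypothetical.

```python
def one_explanation(char_models, atoms, q, A):
    """Compute one explanation for a positive literal q w.r.t. positive
    abducibles A from a Horn theory given by its characteristic models."""
    falsifying = [w for w in char_models if not w[q]]       # models with q = 0
    for v in char_models:
        if not v[q]:
            continue                                        # Step 1: v must satisfy q
        B = {a for a in atoms if v[a]}                      # v = x_B
        V = set(A) & B
        M = [{a for a in V if not w[a]} for w in falsifying]   # edges V \ B'
        if any(not e for e in M):
            continue                                        # empty edge: try another v
        # Step 3: shrink V greedily to a minimal transversal E of (V, M)
        E = set(V)
        for a in sorted(V):
            if all((E - {a}) & e for e in M):
                E.remove(a)                                 # a is not needed to hit any edge
        return E
    return None                                             # no explanation exists

# Hypothetical usage, with char_models a list of dicts over atoms "x1",...,"x5":
# one_explanation(char_models, ["x1", "x2", "x3", "x4", "x5"], "x2", {"x1", "x4"})
```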
4 Possible Generalizations
The results reported above deal with queries χ which are a single literal. As already stated in the introduction, often queries will be more complex, however, and consist of a conjunction of literals, of clauses [56], etc. This raises the question about possible extensions of the above results for queries of a more general form, and in particular whether we encounter other kinds of problem instances which are equivalent to Monotone Dualization.

4.1 General Formulas and CNFs
Let us first consider how complex finding abductive explanations can grow. It is known [21] that deciding the existence of an explanation for a literal query χ w.r.t. a set A is Σ2^P-complete (i.e., complete for NP^NP), if the background theory Σ is a set of arbitrary clauses (not necessarily Horn). For Horn Σ, we get a similar result if the query χ is an arbitrary formula.
Proposition 5. Given a Horn CNF ϕ, a set A ⊆ Lit, and a query χ, deciding whether χ has some explanation w.r.t. A from ϕ is (i) Σ2^P-complete, if χ is arbitrary (even if it is a DNF), (ii) NP-complete, if χ is a CNF, and (iii) NP-complete, if A = Lit.

Intuitively, an explanation E for χ can be guessed and then, by exploiting Proposition 3, be checked in polynomial time with an oracle for propositional consequence. The Σ2^P-hardness in case (i) can be shown by an easy reduction from deciding the validity of a quantified Boolean formula (QBF) of the form F = ∃X∀Y α, where X and Y are disjoint sets of Boolean variables and α is a DNF over X ∪ Y. Indeed, just let Σ = ∅ and A = {x, x̄ | x ∈ X}. Then, χ = α has an explanation w.r.t. A iff formula F is valid. On the other hand, if χ is a CNF, then deciding consequence Σ ∪ S |= χ is polynomial for every set of literals S; hence, in case (ii) the problem has lower complexity and is in fact in NP. As for case (iii), if A = Lit, then an explanation exists iff Σ ∪ {χ} has a model, which can be decided in NP. The hardness parts for (ii) and (iii) are immediate by a simple reduction from SAT (given a CNF β, let Σ = ∅, χ = β, and A = Lit). We get a similar picture under model-based representation. Here, inferring a clause c from char(Σ) is feasible in polynomial time, and hence also inferring a CNF. On the other hand, inferring an arbitrary formula (in particular, a DNF) α is intractable, since to witness Σ ⊭ α we intuitively need to find proper models v1, . . . , vl ∈ char(Σ) such that ⋀_i vi ⊭ α.

Proposition 6. Given char(Σ), a set A ⊆ Lit, and a query χ, deciding whether χ has some explanation w.r.t. A from Σ is (i) Σ2^P-complete, if χ is arbitrary (even if it is a DNF), (ii) NP-complete, if χ is a CNF, and (iii) NP-complete, if A = Lit.

As for (iii), we can guess a model v of χ and check whether v is also a model of Σ from char(Σ) in polynomial time (indeed, check whether v = ⋀{w ∈ char(Σ) | v ≤ w} holds). The hardness parts can be shown by slight adaptations of the constructions for the formula-based case, since char(Σ) for the empty theory is easily constructed (it consists of xAt and all xAt\{i}, i ∈ {1, . . . , n}). So, like in the formula-based case, also in the model-based case we lose the immediate computational link of computing all explanations to Monotone Dualization if we generalize queries to CNFs and beyond. However, it appears that there are interesting cases between a single literal and CNFs which are equivalent to Monotone Dualization. As for the formula-based representation, recall that generating all explanations is polynomial total-time for χ being a positive literal (thus, tractable and “easier” than Monotone Dualization), while it is coNP-hard for χ being a CNF (and in fact, a negative literal); so, somewhere in the transition from tractable to intractable might pass instances which are equivalent to Monotone Dualization. Unfortunately, restricting to positive CNFs (this is the first idea) does not help, since the problem remains coNP-hard, even if all clauses have small size; this can be shown by a straightforward reduction from the well-known EXACT HITTING SET problem [25]. However, we encounter monotone dualization if Σ is empty.

Proposition 7. Given a set A ⊆ Lit and a positive CNF χ, generating all explanations of χ w.r.t. A from Σ = ∅ is polynomial-time equivalent to dualizing a positive CNF, under both model-based and formula-based representation.
This holds since, as easily seen, every explanation E must be positive, and moreover, must be a minimal transversal of the clauses in χ. Conversely, every minimal transversal T of χ after removal of all positive literals that do not belong to A (viewed as hypergraphs), is an explanation. Note that this result extends to those CNFs which are unate, i.e., convertible to a positive CNF by flipping the polarity of some variables.
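Proposition 7 is easy to make executable: intersect each clause of the positive CNF with A and enumerate the minimal transversals of the resulting hypergraph. The brute-force enumerator below also serves as an (exponential-time) stand-in for TRANS-ENUM from Section 2.2; the function names and string encoding of atoms are ours.

```python
from itertools import chain, combinations

def all_minimal_transversals(vertices, edges):
    """Brute-force TRANS-ENUM: all minimal transversals of a hypergraph."""
    edges = [frozenset(e) for e in edges]
    result = []
    for cand in chain.from_iterable(
            combinations(sorted(vertices), k) for k in range(len(vertices) + 1)):
        cand = frozenset(cand)
        if not all(cand & e for e in edges):
            continue                      # misses some edge: not a transversal
        if any(t < cand for t in result):
            continue                      # a proper subset is already a transversal
        result.append(cand)
    return [set(t) for t in result]

def explanations_from_empty_theory(cnf, A):
    """Proposition 7 sketch: with Sigma empty, the explanations of a positive
    CNF chi w.r.t. A are the minimal transversals of chi after its clauses
    have been restricted to the abducibles in A."""
    clauses = [set(c) & set(A) for c in cnf]
    if any(not c for c in clauses):
        return []                         # some clause cannot be covered from A
    vertices = set().union(*clauses) if clauses else set()
    return all_minimal_transversals(vertices, clauses)

# chi = (x1 v x2)(x2 v x3) with A = {x1, x2, x3} yields [{x2}, {x1, x3}]
print(explanations_from_empty_theory([{"x1", "x2"}, {"x2", "x3"}],
                                      {"x1", "x2", "x3"}))
```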
4.2 Clauses and Terms
Let us see what happens if we move from general CNFs to the important subcases of a single clause and a single term, respectively. As shown in [25], generating all explanations remains under formula-based representation polynomial total-time for χ being a positive clause or a positive term, but is intractable as soon as we allow a negative literal. Hence, we do not see an immediate connection to monotone dualization. More promising is model-based representation, since for a single literal query, equivalence to monotone dualization is known. It appears that we can extend this result to clauses and terms of certain forms.

Clauses. Let us first consider the clause case, which is important for Clause Management Systems [56,38,39]. Here, the problem can be reduced to the special case of a single literal query as follows. Given a clause c, introduce a fresh letter q. If we add the formula c ⇒ q̄ to Σ, then the explanations of q̄ would be, apart from the trivial explanation q̄ in case, just the explanations of c. We can rewrite c ⇒ q̄ to a Horn CNF α = ⋀_{x∈P(c)} (x̄ ∨ q̄) ∧ ⋀_{x∈N(c)} (x ∨ q̄), so adding α maintains Σ Horn and thus the reduction works in polynomial time under formula-based representation. (Note, however, that we reduce this to a case which is intractable in general.) Under model-based representation, we need to construct char(Σ ∪ {α}), however, and it is not evident that this is always feasible in polynomial time. We can establish the following relationship, though. Let P(c) = {q1, . . . , qk} and N(c) = {qk+1, . . . , qm}; thus, α is logically equivalent to q ⇒ q̄1 ∧ · · · ∧ q̄k ∧ qk+1 ∧ · · · ∧ qm.

Claim. For Σ′ = Σ ∪ {α}, we have
char(Σ′) ⊆ {v@(0) | v ∈ char(Σ)}
  ∪ {(v1 ∧ · · · ∧ vk)@(1) | vi ∈ char(Σ), vi |= q̄i ∧ ⋀_{j=k+1}^{m} qj, for 1 ≤ i ≤ k}
  ∪ {(v0 ∧ v1 ∧ · · · ∧ vk)@(1) | vi ∈ char(Σ) for 0 ≤ i ≤ k, v0 |= ⋀_{i=1}^{m} qi, vi |= q̄i ∧ ⋀_{j=k+1}^{m} qj for 1 ≤ i ≤ k},
where “@” denotes concatenation and q is assumed to be the last in the variable ordering. Indeed, each model of Σ, extended to q, is a model of Σ′ if we set q to 0; all models of Σ′ of this form can be generated by intersecting characteristic models of Σ extended in this way. On the other hand, any model v of Σ′ in which q is 1 must have also qk+1, . . . , qm set to 1 and q1, . . . , qk set to 0. We can generate such v by expanding the intersection of some characteristic models v1, . . . , vl of Σ (where l ≤ |char(Σ)|) in which qk+1, . . . , qm are set to 1, and where each of q1, q2, . . . , qk is made 0 by
intersection with at least one of these vectors. By associativity and commutativity of intersection, we can generate v then by expanding the intersection of vectors of the form given in the above equation. From the set RHS on the right-hand side, char(Σ′) can be easily computed by eliminating those vectors v which are generated by the intersection of other vectors (i.e., such that v = ⋀{w ∈ RHS | v < w}). In general, RHS will have exponential size; however, computing RHS is polynomial if k is bounded by a constant; clearly, computing char(Σ′) from RHS is feasible in time polynomial in the size of RHS, and hence in polynomial time in the size of char(Σ). Thus, the reduction from clause explanation to literal explanation is computable in polynomial time. We thus obtain the following result.

Theorem 2. Given char(Σ) ⊆ {0, 1}ⁿ of a Horn theory Σ, a clause query c, and A ⊆ Lit, computing all explanations for c from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization, if |P(c)| ≤ k for some constant k.

Note that the constraint on P(c) is necessary, in the light of the following result.

Theorem 3. Given char(Σ) ⊆ {0, 1}ⁿ of a Horn theory Σ, a positive clause query c, and some explanations E1, . . . , El for c from Σ, deciding whether there exists some further explanation is NP-complete.

The NP-hardness can be shown by a reduction from the well-known 3SAT problem. By standard arguments, we obtain from this the following result for computing multiple explanations.

Corollary 1. Given char(Σ) ⊆ {0, 1}ⁿ of a Horn theory Σ, a set A ⊆ Lit and a clause c, computing a given number resp. all explanations for c w.r.t. A from Σ is not possible in polynomial total-time unless P=NP. The hardness holds even for A = Lit, i.e., for assumption-free explanations.

Terms. Now let us finally turn to queries which are given by terms, i.e., conjunctions of literals. With a similar technique as for clause queries, explanations for a term t can be reduced to explanations for a single literal query in some cases. Indeed, introduce a fresh atom q and consider t ⇒ q̄; this formula is equivalent to a Horn clause α if |N(t)| ≤ 1 (in particular, if t is positive). Suppose t = q̄0 ∧ q1 ∧ · · · ∧ qm, and let Σ′ = Σ ∪ {α} (where α = q̄ ∨ q̄1 ∨ · · · ∨ q̄m ∨ q0). Then we have
char(Σ′) ⊆ {v@(0) | v ∈ char(Σ)}
  ∪ {v@(1) | v ∈ char(Σ), v |= q0 ∨ q̄1 ∨ · · · ∨ q̄m}
  ∪ {(v ∧ v′)@(1) | v, v′ ∈ char(Σ), v |= q̄1 ∨ · · · ∨ q̄m, v′ |= q̄0 ∧ q1 ∧ · · · ∧ qm},
from which char(Σ′) is easily computed. The explanations for t from Σ then correspond to the explanations for q̄ from Σ′ modulo the trivial explanation q̄ in case. This implies polynomial-time equivalence of generating all explanations for a term t to Monotone Dualization if t contains at most one negative literal. In particular, this holds for the case of positive terms t.
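The clause-to-literal reduction described above (as reconstructed here: the added Horn CNF encodes c ⇒ q̄, and one then asks for explanations of the negative literal q̄) can be sketched at the formula level as follows, reusing the clause encoding and the brute-force explanations generator from the earlier sketches. The function name and the usage comment are hypothetical.

```python
def clause_query_reduction(horn_clauses, clause, fresh="q"):
    """Extend the theory with a Horn CNF equivalent to c => not q for a fresh
    atom q; explanations of "-q" from the result correspond to explanations
    of the clause c, apart from the trivial explanation {"-q"}."""
    neg_q = "-" + fresh
    alpha = []
    for lit in clause:
        if lit.startswith("-"):
            alpha.append({lit[1:], neg_q})    # (x or not q) for a negative literal of c
        else:
            alpha.append({"-" + lit, neg_q})  # (not x or not q) for a positive literal of c
    return list(horn_clauses) + alpha

# e.g. explanations of the clause (x2 v x3) w.r.t. A = {x1, x4} from phi:
# explanations(clause_query_reduction(phi, {"x2", "x3"}), "-q", {"x1", "x4"})
```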
Note that the latter case is intimately related to another important problem in knowledge discovery, namely inferring the keys of a relation r, i.e., the minimal (under ⊆) sets of attributes K ⊆ U = {A1, A2, ..., An} whose values uniquely identify the rest of any tuple in the relation. The keys for a relation instance r over attributes U amount to the assumption-free explanations of the term t = A1 A2 ··· An from the Horn CNF ϕ_{F_r^+} defined as above in Section 2.2. Thus, abduction procedures can be used for generating keys. Note that char of the theory given by ϕ_{F_r^+} is computable in polynomial time from r (cf. [41]), and hence generating all keys polynomially reduces to Monotone Dualization; the converse has also been shown [19]. Hence, generating all keys is polynomially equivalent to Monotone Dualization and also to generating all explanations for a positive term from char(Σ). A similar reduction can be used to compute all keys of a generic relation scheme (U, F) of attributes U and functional dependencies F, which amount to the explanations for the term t = A1 A2 ··· An from the CNF ϕ_F. Note that Khardon et al. [41] investigate computing keys for given F and more general Boolean constraints ψ by a simple reduction to computing all nontrivial explanations E of a fresh letter q (i.e., E ≠ {q}), using ψ ∧ (A1 ∧ ··· ∧ An ⇒ q); this can thus be done in polynomial total-time by the results in [24], and for FDs (i.e., ψ = ϕ_F) it is a classic result in database theory [46]. Furthermore, [41] also shows how to compute an abductive explanation using an oracle for key computation.

We turn back to abductive explanations for a term t, and consider what happens if we have multiple negative literals in t. Without constraints on N(t), we face intractability, since deciding the existence of a nontrivial explanation for t is already difficult, where an explanation E for t is nontrivial if E ≠ P(t) ∪ {q̄ | q ∈ N(t)}.

Theorem 4. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ and a term t, deciding whether (i) there exists a nontrivial assumption-free explanation for t from Σ is NP-complete; (ii) there exists an explanation for t w.r.t. a given set A ⊆ Lit from Σ is NP-complete. In both cases, the NP-hardness holds even for negative terms t.

The hardness parts of this theorem can be shown by a reduction from 3SAT.

Corollary 2. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a set A ⊆ Lit and a term t, computing a given number resp. all explanations for t w.r.t. A from Σ is not possible in polynomial total-time, unless P=NP. The hardness holds even for A = Lit, i.e., for assumption-free explanations.

While the above reduction technique works for terms t with a single negative literal, it is not immediate how to extend it to multiple negative literals such that we can derive polynomial equivalence of generating all explanations to Monotone Dualization if |N(t)| is bounded by a constant k. However, this can be shown by a suitable extension of the method presented in [24], transforming the problem into a polynomial number of instances of Monotone Dualization, which can be polynomially reduced to a single instance [24].
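To make the connection to key inference concrete, the following brute-force Python sketch (ours; the toy relation and the function name are made up) computes the candidate keys of a small relation instance as the minimal hitting sets of the attribute sets on which pairs of tuples disagree. This is the transversal formulation that underlies the reductions discussed above; it is not meant as an efficient key-generation procedure.

```python
from itertools import combinations

def candidate_keys(rows, attributes):
    """K is a key iff no two tuples agree on all attributes of K,
    i.e. K hits every 'disagreement set' of a pair of tuples."""
    diffs = []
    for r1, r2 in combinations(rows, 2):
        diffs.append({a for a in attributes if r1[a] != r2[a]})
    keys = []
    # enumerate attribute subsets by increasing size; keep only minimal ones
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            s = set(subset)
            if any(k <= s for k in keys):
                continue                      # a smaller key is contained in s
            if all(s & d for d in diffs):     # s hits every disagreement set
                keys.append(s)
    return keys

relation = [
    {"emp": 1, "dept": "A", "room": 10},
    {"emp": 2, "dept": "A", "room": 11},
    {"emp": 3, "dept": "B", "room": 10},
]
print(candidate_keys(relation, ["emp", "dept", "room"]))
# e.g. [{'emp'}, {'dept', 'room'}]
```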
Proposition 8. For any Horn theory Σ and any set of literals E = {x̄1, ..., x̄k, x_{k+1}, ..., x_m} (⊆ Lit), char(Σ ∪ E) is computable from char(Σ) by char(Σ ∪ E) = char(M1 ∪ M2), where

M1 = {v1 ∧ ··· ∧ vk | vi ∈ char(Σ), vi ⊨ x̄i ∧ ⋀_{j=k+1}^m xj for 1 ≤ i ≤ k},
M2 = {v ∧ v0 | v ∈ M1, v0 ∈ char(Σ), v0 ⊨ ⋀_{j=1}^m xj},
and char(S) = {v ∈ S | v ∉ Cl∧(S \ {v})} for every S ⊆ {0,1}^n. This can be done in polynomial time if the number k of negative literals in E is bounded by some constant.

For any model v and any V ⊆ At, let v[V] denote the projection of v on V, and for any theory Σ and any V ⊆ At, Σ[V] denotes the projection of Σ on V, i.e., mod(Σ[V]) = {v[V] | v ∈ mod(Σ)}.

Proposition 9. For any Horn theory Σ and any V ⊆ At, char(Σ[V]) can be computed from char(Σ) in polynomial time by char(Σ[V]) = char(char(Σ)[V]).

For any model v and any set of models M, let max_v(M) denote the set of all models in M that are maximal with respect to ≤_v. Here, for models w and u, we write w ≤_v u if wi ≤ ui whenever vi = 0, and wi ≥ ui whenever vi = 1.

Proposition 10. For any Horn theory Σ and any model v, max_v(Σ) can be computed from char(Σ) by max_v({w ∈ M_S | S ⊆ {xi | vi = 1}}), where M_∅ = char(Σ) and, for S = {x1, ..., xk}, M_S = {v1 ∧ ··· ∧ vk | vi ∈ char(Σ), vi ⊨ x̄i for 1 ≤ i ≤ k}. This can be done in polynomial time if |{xi | vi = 1}| is bounded by some constant.

Theorem 5. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a term query t, and A ⊆ Lit, computing all explanations for t from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization, if |N(t)| ≤ k for some constant k.

Proof. (Sketch) We consider the following algorithm.

Algorithm Term-Explanations
Input: char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a term t, and A ⊆ Lit.
Output: All explanations for t from Σ w.r.t. A.
Step 1. Let Σ' = Σ ∪ P(t) ∪ {q̄ | q ∈ N(t)}. Compute char(Σ') from char(Σ).
Step 2. For each xi ∈ N(t), let Σ_{xi} = Σ ∪ {xi}, and for each xi ∈ P(t), let Σ_{x̄i} = Σ ∪ {x̄i}. Compute char(Σ_{xi}) from char(Σ) for xi ∈ N(t), and compute char(Σ_{x̄i}) from char(Σ) for xi ∈ P(t).
Step 3. For each B = B₋ ∪ B₊, where B₋ ⊆ A is a set of negative literals with |B₋| ≤ |N(t)| and B₊ = (A ∩ At) \ {q | q̄ ∈ B₋}, let C = B₊ ∪ {q | q̄ ∈ B₋}.
(3-1) Compute char(Σ'[C]) from char(Σ').
(3-2) Let v ∈ {0,1}^C be the model with vi = 1 if x̄i ∈ B₋, and vi = 0 if xi ∈ B₊. Compute max_v(Σ'[C]) from char(Σ'[C]).
(3-3) For each w ∈ max_v(Σ'[C]), let C_{v,w} = {xi | vi = wi} and let v* = v[C_{v,w}].
(3-3-1) Compute max_{v*}(Σ_{xi}[C_{v,w}]) and max_{v*}(Σ_{x̄i}[C_{v,w}]) from char(Σ_{xi}) and char(Σ_{x̄i}), respectively. Let

M_{v,w} = ⋃_{xi∈N(t)} max_{v*}(Σ_{xi}[C_{v,w}]) ∪ ⋃_{xi∈P(t)} max_{v*}(Σ_{x̄i}[C_{v,w}]).

(3-3-2) Dualize the CNF ϕ_{v,w} = ⋀_{u∈M_{v,w}} c_u, where

c_u = ⋁_{i: ui=vi=0} xi ∨ ⋁_{i: ui=vi=1} x̄i.
Each prime clause c of ϕ^d_{v,w} corresponds to an explanation E = P(c) ∪ {x̄j | xj ∈ N(c)}, which is output. Note that ϕ_{v,w} is unate, i.e., convertible to a positive CNF by flipping the polarity of some variables.

Informally, the algorithm works as follows. The theory Σ' is used for generating candidate sets of variables C on which explanations can be formed; this corresponds to condition (i) of an explanation, which combined with condition (ii) amounts to consistency of Σ ∪ E ∪ {t}. These sets of variables C are made concrete in Step 3 via B, where the easy fact is taken into account that any explanation of a term t can contain at most |N(t)| negative literals. The projection of char(Σ') to the relevant variables C, computed in Step 3-1, then serves as the basis for a set of variables, C_{v,w}, which is a largest subset of C on which some vector in Σ'[C] is compatible with the selected literals B; any explanation must be formed on variables included in some C_{v,w}. Here, the ordering of vectors under ≤_v is relevant, which respects negative literals. The explanations over C_{v,w} are then found by excluding every countermodel of t, i.e., all the models of Σ_{xi}, xi ∈ N(t), resp. Σ_{x̄i}, xi ∈ P(t), with a smallest set of literals. This amounts to computing minimal transversals (where only maximal models under ≤_{v*} need to be considered), or, equivalently, to dualization of the given CNF ϕ_{v,w}.

More formally, it can be shown that the algorithm above computes all explanations. Moreover, from Propositions 8, 9, and 10, we obtain that computing all explanations reduces in polynomial time to (parallel) dualization of positive CNFs if |N(t)| ≤ k, which can be polynomially reduced to dualizing a single positive CNF [24]. Since, as already known, Monotone Dualization reduces in polynomial time to computing all explanations of a positive literal, the result follows.

We remark that algorithm Term-Explanations is closely related to results by Khardon and Roth [42] about computing some abductive explanation for a query χ from a (not necessarily Horn) theory represented by its characteristic models, which are defined using Monotone Theory [10]. In fact, Khardon and Roth established that computing some abductive explanation for a Horn CNF query χ w.r.t. a set A containing at most k negative literals from a theory Σ is feasible in polynomial time, provided that Σ is represented by an appropriate type of characteristic models (for Horn Σ, the characteristic models char_{k+1}(Σ) with respect to (k+1)-quasi-Horn functions will do, which are those functions with a CNF ϕ such that |P(c)| ≤ k+1 for every c ∈ ϕ). Proposition 10 implies that char_{k+1}(Σ) can be computed in polynomial time from char(Σ). Hence, by a detour through characteristic model conversion, some explanation for a
Table 1. Complexity of computing all abductive explanations for a query from a Horn theory. (Tabular layout lost in conversion: the rows give the representation of the Horn theory Σ, namely Horn CNF ϕ and char(Σ); the columns give the query χ, namely literal (positive, negative), clause (positive, Horn, general), term (positive, negative, general), general/CNF, and DNF; the entries range over coNP, nPTT, Dual, and Π₂^P, annotated with the footnotes below.)
a: polynomial total-time for assumption-free explanations (A = Lit).
b: Dual for k-positive clauses resp. k-negative terms, k bounded by a constant.
c: nPTT for assumption-free explanations (A = Lit).
d: coNP (resp. nPTT) for assumption-free explanations (A = Lit) and general χ (resp. DNF χ).
Horn CNF w.r.t. A as above can be computed from a Horn Σ represented by char(Σ) in polynomial time using the method of [42]. This can be extended to computing all explanations for χ and, exploiting the nature of explanations for terms, to an algorithm similar to Term-Explanations. Furthermore, the results of [42] provide a basis for obtaining further classes of abduction instances Σ, A, χ polynomially equivalent to Monotone Dualization where Σ is not necessarily Horn. However, this is not easy to accomplish, since, roughly speaking, non-Horn theories lack in general the useful property that every prime implicate can be made monotone by flipping the polarity of some variables, where the admissible flipping sets induce a class of theories in Monotone Theory. Explanations corresponding to such prime implicates might not be covered by a simple generalization of the above methods.
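For completeness, the Dualize step used in algorithm Term-Explanations can be emulated on small instances by enumerating the minimal transversals of the clause hypergraph directly. The following Python sketch (ours; it is a brute-force stand-in, not one of the output-polynomial dualization algorithms cited above) returns all minimal hitting sets of a positive CNF:

```python
from itertools import combinations

def dualize(cnf, variables):
    """All prime implicants of the dual of a positive CNF, i.e. all minimal
    transversals of its clauses.  Exponential in |variables|; for illustration
    of the Dualize step only."""
    clauses = [set(c) for c in cnf]
    transversals = []
    for size in range(0, len(variables) + 1):
        for subset in combinations(variables, size):
            s = set(subset)
            if any(t <= s for t in transversals):
                continue                      # a smaller transversal exists
            if all(s & c for c in clauses):   # s hits every clause
                transversals.append(s)
    return transversals

# dual of (x1 v x2)(x2 v x3): minimal hitting sets {x2} and {x1, x3}
print(dualize([["x1", "x2"], ["x2", "x3"]], ["x1", "x2", "x3"]))
```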
5 Conclusion
In this paper, we have considered the connection between abduction and the well-known dualization problems, where we have reviewed some results from recent work and added some new ones; a summary picture is given in Table 1. In this table, "nPTT" stands for "not polynomial total-time unless P=NP," while "coNP" resp. "Π₂^P" means that deciding whether the output is empty (i.e., no explanation exists) is coNP-complete resp. Π₂^P-complete (which trivially implies nPTT); Dual denotes polynomial-time equivalence to Monotone Dualization. In order to elucidate the role of abducibles, the table also highlights results for assumption-free explanations (A = Lit) when they deviate from those for an arbitrary set A of abducibles. As can be seen from the table, there are several important classes of instances which are equivalent to Monotone Dualization. In particular, this holds for generating all explanations for a clause query (resp., term query) χ if χ contains at most k positive (resp., negative) literals for constant k. It remains to be explored how these results, via the applications of abduction, lead to improvements for problems that appear in applications. In particular, the connections to problems in knowledge discovery remain to be explored. Furthermore, an implementation of the algorithms and experiments are left for future work.
We close by pointing out that besides Monotone Dualization, there are related problems whose precise complexity status in the theory of NP-completeness is not known to date. A particularly interesting one is the dependency inference problem which we mentioned above, i.e., computing a prime cover of the set F_r^+ of all functional dependencies (FDs) X→A which hold on an instance r of a relational schema [49] (recall that a prime cover is a minimal (under ⊆) set of non-redundant FDs which is logically equivalent to F_r^+). There are other problems which are polynomial-time equivalent to this problem [40] under the more liberal notion of Turing reduction used there; for example, one of these problems is computing the set of all characteristic models of a Horn theory Σ from a given Horn CNF ϕ representing it. Dependency inference contains Monotone Dualization as a special case (cf. [20]), and is thus at least as hard, but to our knowledge there is no strong evidence that it is indeed harder, and in particular, it is yet unknown whether a polynomial total-time algorithm for this problem implies P=NP. It would be interesting to see progress on the status of this problem, as well as possible connections to abduction.
References 1. D. Angluin. Queries and Concept Learning. Machine Learning, 2:319–342, 1996. 2. C. Bioch and T. Ibaraki. Complexity of identification and dualization of positive Boolean functions. Information and Computation, 123:50–63, 1995. 3. E. Boros, Y. Crama, and P. L. Hammer. Polynomial-time inference of all valid implications for Horn and related formulae. Ann. Mathematics and Artificial Intelligence, 1:21–32, 1990. 4. E. Boros, V. Gurvich, L. Khachiyan and K. Makino. Dual-bounded generating problems: Partial and multiple transversals of a hypergraph. SIAM J. Computing, 30:2036–2050, 2001. 5. E. Boros, K. Elbassioni, V. Gurvich, L. Khachiyan and K. Makino. Dual-bounded generating problems: All minimal integer solutions for a monotone system of linear inequalities. SIAM Journal on Computing, 31:1624-1643, 2002. 6. E. Boros, V. Gurvich, L. Khachiyan, and K. Makino. On the complexity of generating maximal frequent and minimal infrequent sets. In Proc. 19th Annual Symposium on Theoretical Aspects of Computer Science (STACS-02), LNCS 2285, pp. 133–141, 2002. 7. E. Boros, K. Elbassioni, V. Gurvich, L. Khachiyan, and K. Makino. An intersection inequality for discrete distributions and related generation problems. In Proc. 30th Int’l Coll. on Automata, Languages and Programming (ICALP 2003), LNCS 2719, pp. 543-555, 2003. 8. E. Boros, V. Gurvich, and P. L. Hammer. Dual subimplicants of positive Boolean functions. Optimization Methods and Software, 10:147–156, 1998. 9. G. Brewka, J. Dix, and K. Konolige. Nonmonotonic Reasoning – An Overview. Number 73 in CSLI Lecture Notes. CSLI Publications, Stanford University, 1997. 10. N. H. Bshouty. Exact Learning Boolean Functions via the Monotone Theory. Information and Computation, 123:146–153, 1995. 11. T. Bylander. The monotonic abduction problem: A functional characterization on the edge of tractability. In Proc. 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR-91), pp. 70–77, 1991. 12. Y. Crama. Dualization of regular boolean functions. Discrete App. Math., 16:79–85, 1987. 13. J. de Kleer. An assumption-based truth maintenance system. Artif. Int., 28:127–162, 1986. 14. R. Dechter and J. Pearl. Structure identification in relational data. Artificial Intelligence, 58:237–270, 1992.
15. A. del Val. On some tractable classes in deduction and abduction. Artificial Intelligence, 116(1-2):297–313, 2000. 16. A. del Val. The complexity of restricted consequence finding and abduction. In Proc. 17th National Conference on Artificial Intelligence (AAAI-2000), pp. 337–342, 2000. 17. C. Domingo, N. Mishra, and L. Pitt. Efficient read-restricted monotone CNF/DNF dualization by learning with membership queries. Machine Learning, 37:89–110, 1999. 18. W. Dowling and J. H. Gallier. Linear-time algorithms for testing the satisfiability of propositional Horn theories. Journal of Logic Programming, 3:267–284, 1984. 19. T. Eiter and G. Gottlob. Identifying the minimal transversals of a hypergraph and related problems. Technical Report CD-TR 91/16, Christian Doppler Laboratory for Expert Systems, TU Vienna, Austria, January 1991. 20. T. Eiter and G. Gottlob. Identifying the minimal transversals of a hypergraph and related problems. SIAM Journal on Computing, 24(6):1278–1304, December 1995. 21. T. Eiter and G. Gottlob. The complexity of logic-based abduction. Journal of the ACM, 42(1):3–42, January 1995. 22. T. Eiter and G. Gottlob. Hypergraph transversal computation and related problems in logic and AI. In Proc. 8th European Conference on Logics in Artificial Intelligence (JELIA 2002), LNCS 2424, pp. 549–564. Springer, 2002. 23. T. Eiter, G. Gottlob, and K. Makino. New results on monotone dualization and generating hypergraph transversals. SIAM Journal on Computing, 32(2):514–537, 2003. Preliminary paper in Proc. ACM STOC 2002. 24. T. Eiter and K. Makino. On computing all abductive explanations. In Proc. 18th National Conference on Artificial Intelligence (AAAI ’02), pp. 62–67, 2002. Preliminary Tech. Rep. INFSYS RR-1843-02-04, Institut f¨ur Informationssysteme, TU Wien, April 2002. 25. T. Eiter and K. Makino. Generating all abductive explanations for queries on propositional Horn theories. In Proc. 12th Annual Conference of the EACSL (CSL 2003), August 25-30 2003, Vienna, Austria. LNCS, Springer, 2003. 26. T. Eiter, T. Ibaraki, and K. Makino. Computing intersections of Horn theories for reasoning with models. Artificial Intelligence, 110(1-2):57–101, 1999. 27. K. Eshghi. A tractable class of abduction problems. In Proc. 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 3–8, 1993. 28. M. Fredman and L. Khachiyan. On the complexity of dualization of monotone disjunctive normal forms. Journal of Algorithms, 21:618–628, 1996. 29. G. Friedrich, G. Gottlob, and W. Nejdl. Hypothesis classification, abductive diagnosis, and therapy. In Proc. International Workshop on Expert Systems in Engineering, LNCS/LNAI 462, pp. 69–78. Springer, 1990. 30. D.R. Gaur and R. Krishnamurti. Self-duality of bounded monotone boolean functions and related problems. In Proc. 11th International Conference on Algorithmic Learning Theory (ALT 2000), LNCS 1968, pp. 209-223, 2000. 31. D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data mining, hypergraph transversals, and machine learning. In Proc. 16th ACM Symposium on Principles of Database Systems (PODS-96), pp. 209–216, 1993. 32. K. Inoue. Linear resolution for consequence finding. Artif. Int., 56(2-3):301–354, 1992. 33. D. S. Johnson. A Catalog of Complexity Classes. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, chapter 2. Elsevier, 1990. 34. H. Kautz, M. Kearns, and B. Selman. Reasoning With Characteristic Models. In Proc. 11th National Conference on Artificial Intelligence (AAAI-93), pp. 
34–39, 1993. 35. H. Kautz, M. Kearns, and B. Selman. Horn approximations of empirical data. Artificial Intelligence, 74:129–245, 1995.
36. D. Kavvadias, C. Papadimitriou, and M. Sideri. On Horn envelopes and hypergraph transversals. In Proc. 4th International Symposium on Algorithms and Computation (ISAAC-93), LNCS 762, pp. 399–405, Springer, 1993. 37. D. J. Kavvadias and E. C. Stavropoulos. Monotone Boolean dualization is in co-NP[log2 n]. Information Processing Letters, 85:1–6, 2003. 38. A. Kean and G. Tsiknis. Assumption based reasoning and Clause Management Systems. Computational Intelligence, 8(1):1–24, 1992. 39. A. Kean and G. Tsiknis. Clause Management Systems (CMS). Computational Intelligence, 9(1):11–40, 1992. 40. R. Khardon. Translating between Horn representations and their characteristic models. Journal of Artificial Intelligence Research, 3:349–372, 1995. 41. R. Khardon, H. Mannila, and D. Roth. Reasoning with examples: Propositional formulae and database dependencies. Acta Informatica, 36(4):267–286, 1999. 42. R. Khardon and D. Roth. Reasoning with models. Artif. Int., 87(1/2):187–213, 1996. 43. R. Khardon and D. Roth. Defaults and relevance in model-based reasoning. Artificial Intelligence, 97(1/2):169–193, 1997. 44. H. Levesque. Making believers out of computers. Artificial Intelligence, 30:81–108, 1986. 45. L. Lov´asz. Combinatorial optimization: Some problems and trends. DIMACS Technical Report 92-53, RUTCOR, Rutgers University, 1992. 46. C. L. Lucchesi and S. Osborn. Candidate Keys for Relations. Journal of Computer and System Sciences, 17:270–279, 1978. 47. K. Makino and T. Ibaraki. The maximum latency and identification of positive Boolean functions. SIAM Journal on Computing, 26:1363–1383, 1997. 48. K. Makino and T. Ibaraki, A fast and simple algorithm for identifying 2-monotonic positive Boolean functions. Journal of Algorithms, 26:291–305, 1998. 49. H. Mannila and K.-J. R¨aih¨a. Design by Example: An application of Armstrong relations. Journal of Computer and System Sciences, 22(2):126–141, 1986. 50. H. Mannila and K.-J. R¨aih¨a. Algorithms for inferring functional dependencies. Technical Report A-1988-3, University of Tampere, CS Dept., Series of Publ. A, April 1988. 51. P. Marquis. Consequence FindingAlgorithms. In D. Gabbay and Ph.Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, volume V: Algorithms for Uncertainty and Defeasible Reasoning, pp. 41–145. Kluwer Academic, 2000. 52. C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. 53. C. H. Papadimitriou. NP-Completeness:A retrospective, In Proc. 24th Int’l Coll. on Automata, Languages and Programming (ICALP 1997), pp. 2–6, LNCS 1256. Springer, 1997. 54. C. S. Peirce. Abduction and Induction. In J. Buchler, editor, Philosophical Writings of Peirce, chapter 11. Dover, New York, 1955. 55. D. Poole. Explanation and prediction: An architecture for default and abductive reasoning. Computational Intelligence, 5(1):97–110, 1989. 56. R. Reiter and J. de Kleer. Foundations of assumption-based truth maintenance systems: Preliminary report. In Proc. 6th National Conference on Artificial Intelligence (AAAI-87), pp. 183–188, 1982. 57. B. Selman and H. J. Levesque. Abductive and default reasoning: A computational core. In Proc. 8th National Conference on Artificial Intelligence (AAAI-90), pp. 343–348, July 1990. 58. B. Selman and H. J. Levesque. Support set selection for abductive and default reasoning. Artificial Intelligence, 82:259–272, 1996. 59. B. Zanuttini. New polynomial classes for logic-based abduction. Journal of Artificial Intelligence Research, 2003. To appear.
Signal Extraction and Knowledge Discovery Based on Statistical Modeling Genshiro Kitagawa The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku Tokyo 106-8569 Japan [email protected] http://www.ism.ac.jp/˜kitagawa
Abstract. In the coming post IT era, the problems of signal extraction and knowledge discovery from huge data sets will become very important. For this problem, the use of good model is crucial and thus the statistical modeling will play an important role. In this paper, we show two basic tools for statistical modeling, namely the information criteria for the evaluation of the statistical models and generic state space model which provides us with a very flexible tool for modeling complex and time-varying systems. As examples of these methods we shall show some applications in seismology and macro economics.
1 Importance of Statistical Modeling in Post IT Era
Once the model is specified, various types of inferences and predictions can be deduced from the model. Therefore, the model plays a crucial role in scientific inference or in signal extraction and knowledge discovery from data. In scientific research, it is frequently assumed that there exists a known or unknown "true" model. In the statistical community as well, from the age of Fisher, statistical theories were developed for the situation where we estimate the true model, with a small number of parameters, based on a limited number of data. However, in recent years, models are rather considered as tools for extracting useful information from data. This is motivated by the information criterion AIC, which revealed that in the estimation of a model for prediction, we may obtain a good model by selecting a simple model even though it may have some bias. On the other hand, if the model is considered as just a tool for signal extraction, the model cannot be uniquely determined and there exist many possible models depending on the viewpoints of the analysts. This means that the results of the inference and the decision depend on the model used. It is obvious that a good model yields a good result and a poor model yields a poor result. Therefore, in statistical modeling, the objective is not to find the unique "true" model, but to obtain a "good" model based on the characteristics of the object and the objective of the analysis. To obtain a good model, we need a criterion to evaluate the goodness of the model. Akaike (1973) proposed to evaluate the model by the goodness of its
predictive ability and to evaluate it by the Kullback-Leibler information. As is well known, the minimization of the Kullback-Leibler information is equivalent to maximizing the expected log-likelihood of the model. Further, as a natural estimate of the expected log-likelihood, we can define the log-likelihood and are thus led to the maximum likelihood estimators. However, when comparing models with parameters estimated by the method of maximum likelihood, there arises a bias in the log-likelihood as an estimate of the expected log-likelihood. By correcting for this bias, we obtain the Akaike information criterion AIC. After the derivation of the AIC, various modifications or extensions of the AIC, such as TIC, GIC and EIC, were proposed. The information criterion suggests various things that should be taken into account in modeling. Firstly, since the data is finite, models with too large a number of free parameters may have less ability for prediction. There are two alternatives to mitigate this difficulty. One way is to restrict the number of free parameters, which is realized by minimizing the AIC criterion. The other way is to obtain a good model with a huge number of parameters by imposing a restriction on the parameters. For this purpose, we need to combine the information from the data with the information from the knowledge on the object and the objective of the analysis. Therefore, Bayes models play an important role, since the integration of information can be realized by a Bayes model with properly defined prior information and the data. With the progress of information technology, the information infrastructure in research and society is being fully equipped, and the environment of the data has changed very rapidly. For example, it has become possible to obtain huge amounts of data from moment to moment in various fields of scientific research and technology, for example CCD images of the night sky, POS data in marketing, high-frequency data in finance, and the huge observations obtained in environmental measurement or in the study of disaster prevention. In contrast with conventional, well-designed statistical data, the special feature of these data sets is that they can be obtained comprehensively. Therefore, it is one of the most important problems in the post-IT era to extract useful information or discover knowledge from not-so-well designed massive data. For the analysis of such huge amounts of data, an automatic treatment of the data is inevitable, and a new facet of difficulty in modeling arises. Namely, in the classical framework of modeling, the precision of the model increases with the amount of data. However, in actuality, the model changes with time due to the change of the structure of the object. Further, as the information criteria suggest, the complexity of the model increases with the amount of data. Therefore, for the analysis of huge data sets, it is necessary to develop flexible models that can handle various types of nonstationarity, nonlinearity and non-Gaussianity. It is also important to remember that the information criteria are relative criteria. This means that the selection by any information criterion is nothing but a selection within the pre-assigned model class. This suggests that the process of modeling is an everlasting improvement of the model based on the
increase of the data and knowledge on the object. Therefore, it is very important to prepare flexible models that can fully utilize the knowledge of the analyst. In this paper, we shall show two basic tools for statistical modeling. Namely, we shall first show various information criteria, AIC, TIC, GIC and EIC. We shall then show a general state space model as a generic time series model for signal extraction. Finally, we shall show some applications in seismology and macroeconomics.
2 Information Criteria for Statistical Modeling
Assume that the observations are generated from an unknown "true" distribution function G(x) and the model is characterized by a density function f(x). In the derivation of AIC (Akaike (1973)), the expected log-likelihood E_Y[log f(Y)] = ∫ log f(y) dG(y) is used as the basic measure to evaluate the similarity between two distributions, which is equivalent to the Kullback-Leibler information. In actual situations, G(x) is unknown and only a sample X = {X_1, ..., X_n} is given. We then use the log-likelihood ℓ = n ∫ log f(x) dĜ_n(x) = Σ_{i=1}^n log f(X_i) as a natural estimator of (n times) the expected log-likelihood. Here Ĝ_n(x) is the empirical distribution function defined by the data. When a model contains an unknown parameter θ and its density is of the form f(x|θ), this naturally leads to the maximum likelihood estimator θ̂.
2.1 AIC and TIC
For a statistical model f(x|θ̂) fitted to the data, however, the log-likelihood n^{-1} ℓ(θ̂) = n^{-1} Σ_{i=1}^n log f(X_i|θ̂) ≡ n^{-1} log f(X|θ̂) has a positive bias as an estimator of the expected log-likelihood E_G[log f(Y|θ̂)], and it cannot be directly used for model selection. By correcting the bias

b(G) = n E_X[ (1/n) log f(X|θ̂(X)) − E_Y log f(Y|θ̂(X)) ],   (1)

an unbiased estimator of the expected log-likelihood is given by

IC = −2n { (1/n) log f(X|θ̂(X)) − (1/n) b(G) } = −2 log f(X|θ̂(X)) + 2 b(G).   (2)

Since it is very difficult to obtain the bias b(G) in a closed form, it is usually approximated by an asymptotic bias. Akaike (1973) approximated b(G) by the number of parameters, b_AIC = m, and proposed the AIC criterion,

AIC = −2 log f(X|θ̂_ML) + 2m,   (3)

where θ̂_ML is the maximum likelihood estimate. On the other hand, Takeuchi (1976) showed that the asymptotic bias is given by b_TIC = tr{ Î(Ĝ) Ĵ(Ĝ)^{-1} }, where Î(Ĝ) and Ĵ(Ĝ) are the estimates of the Fisher information and expected Hessian matrices, respectively.
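As a concrete illustration of (3), the following Python sketch (ours; the data and the polynomial model family are invented for the example) computes the AIC of regression models of increasing order and lets the criterion pick the order. The error variance is counted among the parameters, since the log-likelihood is maximized over it.

```python
import numpy as np

def aic_polynomial(x, y, order):
    """AIC of a polynomial regression model with Gaussian errors (eq. (3))."""
    X = np.vander(x, order + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)              # ML estimate of the error variance
    n = len(y)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = order + 2                             # regression coefficients + variance
    return -2.0 * loglik + 2.0 * k

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.3, size=x.size)
for order in range(6):
    print(order, round(aic_polynomial(x, y, order), 1))
# the quadratic model (order 2) should attain (close to) the smallest AIC
```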
2.2 General Information Criterion, GIC
The above method of bias correction for the log-likelihood can be extended to a general model constructed by using a statistical functional such as θ̂ = T(Ĝ_n). For such a general statistical model, Konishi and Kitagawa (1996) derived the asymptotic bias

b_GIC(G) = tr E_Y[ T^{(1)}(Y; G) ∂ log f(Y|T(G)) / ∂θ ],   (4)

and proposed GIC (Generalized Information Criterion). Here T^{(1)}(Y; G) is the first derivative of the statistical functional T(Y; G), which is usually called the influence function. The information criteria obtained so far can be generally expressed as log f(X|θ̂) − b_1(Ĝ_n), where b_1(Ĝ_n) is the first-order bias correction term such as (4). The second-order bias-corrected information criterion can be defined by

GIC_2 = −2 log f(X|θ̂) + 2{ b_1(Ĝ_n) + (1/n)( b_2(Ĝ_n) − Δb_1(Ĝ_n) ) }.   (5)

Here b_2(G) is defined by the expansion

b(G) = E_X[ log f(X|θ̂) − n E_Y log f(Y|θ̂) ] = b_1(G) + (1/n) b_2(G) + O(n^{-2}),   (6)

and the bias of the first-order bias correction term, Δb_1(G), is defined by

E_X[ b_1(Ĝ) ] = b_1(G) + (1/n) Δb_1(G) + O(n^{-2}).   (7)

2.3 Bootstrap Information Criterion, EIC
The bootstrap method (Efron 1979) provides us with an alternative way of bias correction of the log-likelihood. In this method, the bias b(G) in (1) is estimated by

b_B(Ĝ_n) = E_{X*}{ log f(X*|θ̂(X*)) − log f(X|θ̂(X*)) },   (8)

and the EIC (Extended Information Criterion) is defined by using this (Ishiguro et al. (1997)). In actual computation, the bootstrap bias correction term b_B(Ĝ_n) is approximated by bootstrap resampling. The variance of the bootstrap estimate of the bias defined in (8) can be reduced automatically without any analytical arguments (Konishi and Kitagawa (1996), Ishiguro et al. (1997)). Let D(X; G) = log f(X|θ̂) − n E_Y[log f(Y|θ̂)]. Then D(X; G) can be decomposed into

D(X; G) = D_1(X; G) + D_2(X; G) + D_3(X; G),   (9)

where D_1(X; G) = log f(X|θ̂) − log f(X|T(G)), D_2(X; G) = log f(X|T(G)) − n E_Y[log f(Y|T(G))] and D_3(X; G) = n E_Y[log f(Y|T(G))] − n E_Y[log f(Y|θ̂)].
For a general estimator defined by a statistical functional θ̂ = T(Ĝ_n), it can be shown that the bootstrap estimate of E_X[D_1 + D_3] is the same as that of E_X[D], but Var{D} = O(n) while Var{D_1 + D_3} = O(1). Therefore, by estimating the bias by

b*_B(Ĝ_n) = E_{X*}[D_1 + D_3],   (10)
a significant reduction of the variance can be achieved for any estimators defined by statistical functional, especially for large n.
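A schematic implementation of the bootstrap bias estimate (8) for a simple Gaussian model might look as follows (a sketch of ours; the variance-reduced estimate (10) is omitted for brevity):

```python
import numpy as np

def gauss_loglik(data, mu, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (data - mu) ** 2 / var))

def eic_bias(x, n_boot=500, rng=None):
    """Bootstrap estimate (8) of the log-likelihood bias for a Gaussian model
    fitted by maximum likelihood."""
    rng = rng or np.random.default_rng(0)
    biases = []
    for _ in range(n_boot):
        xb = rng.choice(x, size=x.size, replace=True)   # bootstrap sample X*
        mu, var = xb.mean(), xb.var()                   # theta_hat(X*)
        biases.append(gauss_loglik(xb, mu, var) - gauss_loglik(x, mu, var))
    return np.mean(biases)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
b = eic_bias(x)
mu, var = x.mean(), x.var()
eic = -2.0 * gauss_loglik(x, mu, var) + 2.0 * b
print(round(b, 2), round(eic, 1))   # b should be close to the AIC value 2
```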
3 State Space Modeling
3.1 Smoothness Prior Modeling
A smoothing approach attributed to Whittaker [21] is as follows. Let

y_n = f_n + ε_n,   n = 1, ..., N,   (11)

denote the observations, where f_n is an unknown smooth function of n and ε_n is an i.i.d. normal random variable with zero mean and unknown variance σ². The problem is to estimate f_n, n = 1, ..., N, from the observations y_n, n = 1, ..., N, in a statistically reasonable way. However, in this problem, the number of parameters to be estimated is equal to or even greater than the number of observations. Therefore, the ordinary least squares method or the maximum likelihood method yields meaningless results. Whittaker [21] suggested that the solution f_n, n = 1, ..., N, balances a tradeoff between infidelity to the data and infidelity to a smoothness constraint. Namely, for a given tradeoff parameter λ² and difference order k, the solution satisfies

min_f { Σ_{n=1}^N (y_n − f_n)² + λ² Σ_{n=1}^N (Δ^k f_n)² }.   (12)
Whittaker left the choice of λ² to the investigator.
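For small N, the minimization (12) can also be carried out directly by solving the normal equations (I + λ² DᵀD) f = y, where D is the k-th order difference matrix. The following Python sketch (ours; the test signal is arbitrary) does exactly that:

```python
import numpy as np

def whittaker_smooth(y, lam, k=2):
    """Minimize (12) directly: f = argmin ||y - f||^2 + lam * ||D_k f||^2."""
    n = len(y)
    D = np.diff(np.eye(n), n=k, axis=0)       # (n-k) x n difference matrix
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, y)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=t.size)
f_hat = whittaker_smooth(y, lam=1e3, k=2)
print(round(float(np.mean((f_hat - np.sin(2 * np.pi * t)) ** 2)), 4))
```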
3.2 State Space Modeling
It can be seen that the minimization of the criterion (12) is equivalent to assuming the following linear Gaussian model:

y_n = f_n + w_n,
f_n = c^k_1 f_{n−1} + ··· + c^k_k f_{n−k} + v_n,   (13)

where w_n ∼ N(0, σ²), v_n ∼ N(0, τ²), λ² = σ²/τ², and c^k_j is the j-th binomial coefficient.
Therefore, the models (13) can be expressed in the special form of the state space model

x_n = F x_{n−1} + G v_n   (system model),
y_n = H x_n + w_n   (observation model),   (14)
where x_n = (t_n, ..., t_{n−k+1})' is a k-dimensional state vector, and F, G and H are k × k, k × 1 and 1 × k matrices, respectively. For example, for k = 2, they are given by

x_n = (t_n, t_{n−1})',   F = [ 2 −1 ; 1 0 ],   G = (1, 0)',   H = [1, 0].   (15)

One of the merits of using this state space representation is that we can use the computationally efficient Kalman filter for state estimation. Since the state vector contains the unknown trend component, by estimating the state vector x_n the trend is automatically estimated. Also, unknown parameters of the model, such as the variances σ² and τ², can be estimated by the maximum likelihood method. In general, the likelihood of the time series model is given by

L(θ) = p(y_1, ..., y_N | θ) = Π_{n=1}^N p(y_n | Y_{n−1}, θ),   (16)

where Y_{n−1} = {y_1, ..., y_{n−1}} and each component p(y_n | Y_{n−1}, θ) can be obtained as a byproduct of the Kalman filter [6]. It is interesting to note that the tradeoff parameter λ² in the penalized least squares method (12) can be interpreted as the ratio of the system noise variance to the observation noise variance, or the signal-to-noise ratio. The individual terms in (16) are given by, in the general p-dimensional observation case,

p(y_n | Y_{n−1}, θ) = (2π)^{−p/2} |W_{n|n−1}|^{−1/2} exp{ −(1/2) ε'_{n|n−1} W_{n|n−1}^{−1} ε_{n|n−1} },   (17)

where ε_{n|n−1} = y_n − y_{n|n−1} is the one-step-ahead prediction error of the time series, and y_{n|n−1} and W_{n|n−1} are the mean and the variance covariance matrix of the observation y_n, respectively, defined by

y_{n|n−1} = H x_{n|n−1},   W_{n|n−1} = H V_{n|n−1} H' + σ².   (18)

Here x_{n|n−1} and V_{n|n−1} are the mean and the variance covariance matrix of the state vector given the observations Y_{n−1}, and can be obtained by the Kalman filter [6]. If there are several candidate models, the goodness of fit of the models can be evaluated by the AIC criterion defined by

AIC = −2 log L(θ̂) + 2(number of parameters).   (19)
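The following Python sketch (ours; the initial conditions and test data are arbitrary) implements the Kalman filter for the second-order trend model (13)-(15) and accumulates the log-likelihood (16)-(18); maximizing the returned value over τ² and σ² would give the maximum likelihood estimates mentioned above.

```python
import numpy as np

def trend_filter_loglik(y, tau2, sigma2):
    """Kalman filter for the second-order trend model; returns the filtered
    trend and the log-likelihood of eqs. (16)-(18)."""
    F = np.array([[2.0, -1.0], [1.0, 0.0]])
    G = np.array([[1.0], [0.0]])
    H = np.array([[1.0, 0.0]])
    x = np.zeros((2, 1))
    V = np.eye(2) * 1e4                       # diffuse initial state
    loglik, trend = 0.0, []
    for yn in y:
        # one-step-ahead prediction of the state
        x = F @ x
        V = F @ V @ F.T + tau2 * (G @ G.T)
        # contribution of the prediction error to the likelihood
        e = yn - (H @ x).item()
        W = (H @ V @ H.T).item() + sigma2
        loglik += -0.5 * (np.log(2 * np.pi * W) + e * e / W)
        # filtering (update) step
        K = V @ H.T / W                       # Kalman gain
        x = x + K * e
        V = V - K @ H @ V
        trend.append(x[0, 0])
    return np.array(trend), loglik

rng = np.random.default_rng(0)
true_trend = np.cumsum(np.cumsum(rng.normal(scale=0.05, size=300)))
y = true_trend + rng.normal(size=300)
trend, ll = trend_filter_loglik(y, tau2=0.05 ** 2, sigma2=1.0)
print(round(ll, 1))
```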
3.3 General State Space Modeling
Consider a nonlinear non-Gaussian state space model for the time series y_n:

x_n = F_n(x_{n−1}, v_n),
y_n = H_n(x_n, w_n),   (20)
where x_n is an unknown state vector, and v_n and w_n are the system noise and the observation noise with densities q_n(v) and r_n(w), respectively. The first and the second model in (20) are called the system model and the observation model, respectively. The initial state x_0 is assumed to be distributed according to the density p_0(x). F_n(x, v) and H_n(x, w) are possibly nonlinear functions of the state and the noise inputs. This model is an extension of the ordinary linear Gaussian state space model (14). The above nonlinear non-Gaussian state space model specifies the conditional density of the state given the previous state, p(x_n | x_{n−1}), and that of the observation given the state, p(y_n | x_n). This is the essential feature of the state space model, and it is sometimes convenient to express the model in this general form based on conditional distributions:

x_n ∼ Q_n( · | x_{n−1}),
y_n ∼ R_n( · | x_n).   (21)
With this model, it is possible to treat discrete processes such as Poisson models.
3.4 Nonlinear Non-Gaussian Filtering
The most important problem in state space modeling is the estimation of the state vector xn from the observations, Yt ≡ {y1 , . . . , yt }, since many important problems in time series analysis can be solved by using the estimated state vector. The problem of state estimation can be formulated as the evaluation of the conditional density p(xn |Yt ). Corresponding to the three distinct cases, n > t, n = t and n < t, the conditional distribution, p(xn |Yt ), is called the predictor, the filter and the smoother, respectively. For the standard linear-Gaussian state space model, each density can be expressed by a Gaussian density and its mean vector and the variance-covariance matrix can be obtained by computationally efficient Kalman filter and smoothing algorithms [6]. For general state space models, however, the conditional distributions become non-Gaussian and their distributions cannot be completely specified by the mean vectors and the variance covariance matrices. Therefore, various types of approximations to the densities have been used to obtain recursive formulas for state estimation, e.g., the extended Kalman filter [6], the Gaussian-sum filter [5] and the dynamic generalized linear model [20]. However, the following non-Gaussian filter and smoother [11] can yield an arbitrarily precise posterior density.
[Non-Gaussian Filter]

p(x_n | Y_{n−1}) = ∫ p(x_n | x_{n−1}) p(x_{n−1} | Y_{n−1}) dx_{n−1},
p(x_n | Y_n) = p(y_n | x_n) p(x_n | Y_{n−1}) / p(y_n | Y_{n−1}),   (22)

where p(y_n | Y_{n−1}) is defined by ∫ p(y_n | x_n) p(x_n | Y_{n−1}) dx_n.

[Non-Gaussian Smoother]

p(x_n | Y_N) = p(x_n | Y_n) ∫ [ p(x_{n+1} | x_n) p(x_{n+1} | Y_N) / p(x_{n+1} | Y_n) ] dx_{n+1}.   (23)
However, the direct implementation of the formula requires computationally very costly numerical integration and can be applied only to lower dimensional state space models.
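For a one-dimensional state, the recursions (22) can nevertheless be carried out directly on a grid. The sketch below (ours; the random-walk system and Cauchy observation model are chosen only as an example) illustrates such a direct numerical implementation:

```python
import numpy as np

def grid_filter(y, grid, trans_pdf, obs_pdf, prior):
    """Direct numerical implementation of the non-Gaussian filter (22)
    on a fixed one-dimensional grid of state values."""
    dx = grid[1] - grid[0]
    p_filt = prior / np.sum(prior * dx)
    # transition kernel p(x_n | x_{n-1}) tabulated on the grid
    K = trans_pdf(grid[:, None], grid[None, :])     # K[i, j] = p(x_i | x_j)
    means = []
    for yn in y:
        p_pred = K @ (p_filt * dx)                  # prediction step
        lik = obs_pdf(yn, grid)                     # p(y_n | x_n)
        p_filt = lik * p_pred
        p_filt /= np.sum(p_filt * dx)               # normalisation
        means.append(np.sum(grid * p_filt * dx))
    return np.array(means)

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

rng = np.random.default_rng(0)
x_true = np.cumsum(rng.normal(scale=0.1, size=100))
y = x_true + rng.standard_t(df=3, size=100) * 0.3     # heavy-tailed noise
grid = np.linspace(-5, 5, 400)
m = grid_filter(
    y, grid,
    trans_pdf=lambda xn, xp: gauss(xn, xp, 0.1),
    obs_pdf=lambda yn, xn: 1.0 / (0.3 * np.pi * (1 + ((yn - xn) / 0.3) ** 2)),
    prior=gauss(grid, 0.0, 1.0),
)
print(round(float(np.mean((m - x_true) ** 2)), 4))
```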
3.5 Sequential Monte Carlo Filtering
To mitigate the computational burden, numerical methods based on Monte Carlo approximation of the distribution have been proposed [9,12]. In the Monte Carlo filtering [12], we approximate each density function by many particles that can be considered as realizations from that distribution. Specifically, assume that each distribution is expressed by using m particles as follows: {p_n^(1), ..., p_n^(m)} ∼ p(x_n | Y_{n−1}) and {f_n^(1), ..., f_n^(m)} ∼ p(x_n | Y_n). This is equivalent to approximating the distributions by the empirical distributions determined by the m particles. It can then be shown that a set of realizations expressing the one-step-ahead predictor p(x_n | Y_{n−1}) and the filter p(x_n | Y_n) can be obtained recursively as follows.

[Monte Carlo Filter]
1. Generate a random number f_0^(j) ∼ p_0(x) for j = 1, ..., m.
2. Repeat the following steps for n = 1, ..., N:
   a) Generate a random number v_n^(j) ∼ q(v) for j = 1, ..., m.
   b) Compute p_n^(j) = F(f_{n−1}^(j), v_n^(j)) for j = 1, ..., m.
   c) Compute α_n^(j) = p(y_n | p_n^(j)) for j = 1, ..., m.
   d) Generate f_n^(j), j = 1, ..., m, by resampling p_n^(1), ..., p_n^(m) with weights proportional to α_n^(1), ..., α_n^(m).

The above algorithm for Monte Carlo filtering can be extended to smoothing by a simple modification. The details of the derivation of the algorithm are given in [12].
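A compact Python rendering of the Monte Carlo filter above (ours; the random-walk system model and Cauchy observation density are chosen only for illustration) is:

```python
import numpy as np

def monte_carlo_filter(y, m, sys_scale, obs_scale, rng):
    """Monte Carlo filter for a random-walk system model x_n = x_{n-1} + v_n
    with Cauchy observation noise."""
    f = rng.normal(scale=1.0, size=m)                 # step 1: f_0^(j) ~ p_0
    means = []
    for yn in y:
        v = rng.normal(scale=sys_scale, size=m)       # 2a) system noise
        p = f + v                                     # 2b) predictive particles
        alpha = 1.0 / (np.pi * obs_scale *            # 2c) weights p(y_n | p^(j))
                       (1.0 + ((yn - p) / obs_scale) ** 2))
        idx = rng.choice(m, size=m, replace=True,     # 2d) resampling
                         p=alpha / alpha.sum())
        f = p[idx]
        means.append(f.mean())
    return np.array(means)

rng = np.random.default_rng(0)
x_true = np.cumsum(rng.normal(scale=0.1, size=200))
y = x_true + rng.standard_t(df=3, size=200) * 0.3
est = monte_carlo_filter(y, m=2000, sys_scale=0.1, obs_scale=0.3, rng=rng)
print(round(float(np.mean((est - x_true) ** 2)), 4))
```

The resampling in step 2d) is what keeps the particle set concentrated on regions of high posterior probability.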
3.6 Self-Organizing State Space Model
If the non-Gaussian filter is implemented by the Monte Carlo filter, the sampling error sometimes renders the maximum likelihood method impractical. In this case, instead of estimating the parameter θ by the maximum likelihood method, we consider a Bayesian estimation by augmenting the state vector as z_n = [x_n^T, θ^T]^T. The state space model for this augmented state vector z_n is given by

z_n = F*(z_{n−1}, v_n),
y_n = H*(z_n, w_n),   (24)
where the nonlinear functions F ∗ (z, v) and H ∗ (z, w) are defined by F ∗ (z, v) = [F (x, v), θ]T , H ∗ (z, w) = H(x, w). Assume that we obtain the posterior distribution p(zn |YN ) given the entire observations YN = {y1 , · · · , yN }. Since the original state vector xn and the parameter vector θ are included in the augmented state vector zn , it immediately yields the marginal posterior densities of the parameter and of the original state. This method of Bayesian simultaneous estimation of the parameter and the state of the state space model can be easily extended to a time-varying parameter situation where the parameter θ = θn evolves with time n. It should be noted that in this case we need a proper model for time evolution of the parameter.
4 Examples
4.1 Extraction of Seismic Waves
The earth's surface is under continuous disturbances due to a variety of natural forces and human-induced sources. Therefore, if the amplitude of the earthquake signal is very small, it will be quite difficult to distinguish it from the background noise. In this section, we consider a method of extracting small seismic signals (P-wave and S-wave) from relatively large background noise [13], [17]. For the extraction of the small seismic signal from background noise, we consider the model

y_n = r_n + s_n + ε_n,   (25)

where r_n, s_n and ε_n denote the background noise, the signal and the observation noise, respectively. To separate these three components, it is assumed that the background noise r_n is expressed by the autoregressive model

r_n = Σ_{i=1}^m c_i r_{n−i} + u_n,   (26)

where the AR order m and the AR coefficients c_i are unknown, and u_n and ε_n are white noise sequences with u_n ∼ N(0, τ_1²) and ε_n ∼ N(0, σ²).
The seismograms are actually records of seismic waves in 3-dimensional space, and the seismic signal is composed of a P-wave and an S-wave. Hereafter the East-West, North-South and Up-Down components are denoted as y_n = [x_n, y_n, z_n]^T. The P-wave is a compression wave and it moves along the wave direction. Therefore it can be approximated by a one-dimensional model,

p_n = Σ_{j=1}^m a_j p_{n−j} + u_n.   (27)

On the other hand, the S-wave moves on a plane perpendicular to the wave direction and thus can be expressed by a 2-dimensional model,

(q_n, r_n)' = Σ_{j=1}^m [ b_{j11} b_{j12} ; b_{j21} b_{j22} ] (q_{n−j}, r_{n−j})' + (v_{n1}, v_{n2})'.   (28)

Therefore, the observed three-variate time series can be expressed as

(x_n, y_n, z_n)' = [ α_{1n} β_{1n} γ_{1n} ; α_{2n} β_{2n} γ_{2n} ; α_{3n} β_{3n} γ_{3n} ] (p_n, q_n, r_n)' + (w_n^x, w_n^y, w_n^z)'.   (29)
In this approach, the crucial problem is the estimation of the time-varying wave direction, α_{jn}, β_{jn} and γ_{jn}. They can be estimated by principal component analysis of the 3D data. These models can be combined in state space model form. Note that the variances of the component models correspond to the amplitudes of the seismic signals and are actually time-varying. These variance parameters play the role of signal-to-noise ratios, and the estimation of these parameters is the key problem for the extraction of the seismic signal. A self-organizing state space model can be applied to the estimation of the time-varying variances [13].
4.2 Seasonal Adjustment
The standard model for seasonal adjustment is given by

y_n = t_n + s_n + w_n,   (30)

where t_n, s_n and w_n are the trend, seasonal and irregular components. A reasonable solution to this decomposition was given by the use of smoothness priors for both t_n and s_n [14]. The trend component t_n and the seasonal component s_n are assumed to follow

t_n = 2 t_{n−1} − t_{n−2} + v_n,
s_n = −(s_{n−1} + ··· + s_{n−11}) + u_n,   (31)
where v_n, u_n and w_n are Gaussian white noise with v_n ∼ N(0, τ_t²), u_n ∼ N(0, τ_s²) and w_n ∼ N(0, σ²). However, by using a more sophisticated model, we can extract more information from the data. For example, many economic time series related to sales or production are affected by the number of days of the week. Therefore, the sales of a department store will be strongly affected by the number of Sundays and Saturdays in each month. Such an effect is called the trading day effect. To extract the trading day effect, we consider the decomposition

y_n = t_n + s_n + td_n + w_n,   (32)
where t_n, s_n and w_n are as above, and the trading day effect component td_n is assumed to be expressed as

td_n = Σ_{j=1}^7 β_j d_{jn},   (33)
where d_{jn} is the number of occurrences of the j-th day of the week in the n-th month (e.g., j = 1 for Sunday, j = 2 for Monday, etc.) and β_j is the unknown trading day effect coefficient. To assure identifiability, it is necessary to impose the constraint β_1 + ··· + β_7 = 0. Since the numbers of days of the week are completely determined by the calendar, if we obtain good estimates of the trading day effect coefficients, they will greatly contribute to increasing the precision of the prediction.
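The regressors d_{jn} in (33) are determined by the calendar alone; a small Python sketch (ours) that builds them for monthly data is given below. In a fitting routine one would additionally impose β_1 + ··· + β_7 = 0, e.g. by expressing one coefficient through the others.

```python
import calendar
import numpy as np

def trading_day_regressors(months):
    """d_jn of (33): number of occurrences of each day of the week in month n.
    Column 0 corresponds to j = 1 (Sunday), ..., column 6 to j = 7 (Saturday)."""
    rows = []
    for year, month in months:
        counts = [0] * 7
        _, ndays = calendar.monthrange(year, month)
        for day in range(1, ndays + 1):
            w = calendar.weekday(year, month, day)   # Monday = 0, ..., Sunday = 6
            counts[(w + 1) % 7] += 1                 # shift so that Sunday comes first
        rows.append(counts)
    return np.array(rows)

D = trading_day_regressors([(2003, m) for m in range(1, 13)])
print(D[:3])           # regressors for January-March 2003
print(D.sum(axis=1))   # total number of days in each month
```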
4.3 Analysis of Exchange Rate Data
We consider the multivariate time series of exchange rates between the US dollar and other foreign currencies. By using proper smoothness prior models, we try to decompose the change of the exchange rate into two components, one expressing the effect of the US economy and the other the effect of the other country. By this decomposition, it is possible to determine, for example, whether a decrease of the Yen/USD exchange rate at a certain time is due to a weak Yen or a strong US dollar.
References 1. Akaike, H.: Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov, B.N., Csaki, F. (eds.): 2nd International Symposium in Information Theory. Akademiai Kiado, Budapest, (1973) 267–281. 2. Akaike, H.: A new look at the statistical model identification, IEEE Transactions on Automatic Control, AC-19, 716–723 (1974) 3. Akaike, H.: Likelihood and the Bayes procedure (with discussion), In Bayesian Statistics, edited by J.M. Bernardo, M.H. De Groot, D.V. Lindley and A.F.M. Smith, University press, Valencia, Spain, 143–166 (1980)
4. Akaike, H., and Kitagawa, G. eds.: The Practice of Time Series Analysis, SpringerVerlag New York (1999) 5. Alspach, D.L., Sorenson, H.W.: Nonlinear Bayesian Estimation Using Gaussian Sum Approximations. IEEE Transactions on Automatic Control, AC-17 (1972) 439–448. 6. Anderson, B.D.O., Moore, J.B.: Optimal Filtering, New Jersey, Prentice-Hall (1979). 7. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer-Verlag, NewYork (2000). 8. Efron, B.: Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, (1979) 1–26. 9. Gordon, N.J., Salmond, D.J., Smith, A.F.M., Novel approach to nonlinear /nonGaussian Bayesian state estimation, IEE Proceedings-F, 140, (2) (1993) 107–113. 10. Ishiguro, M., Sakamoto, Y.,Kitagawa, G.: Bootstrapping log-likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics, 49 (3), (1997) 411–434. 11. Kitagawa, G.: Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82 (1987) 1032–1063. 12. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5 (1996) 1–25. 13. Kitagawa, G.: Self-organizing State Space Model. Journal of the American Statistical Association. 93 (443) (1998) 1203–1215. 14. Kitagawa, G., Gersch, W.: A Smoothness Priors-State Space Approach to the Modeling of Time Series with Trend and Seasonality. Journal of the American Statistical Association, 79 (386) (1984) 378–389. 15. Kitagawa, G. and Gersch, W.: Smoothness Priors Analysis of Time Series, Lecture Notes in Statistics, No. 116, Springer-Verlag, New York (1996). 16. Kitagawa, G. and Higuchi, T.: Automatic transaction of signal via statistical modeling, The Proceedings of The First International Conference on Discovery Science, Springer-Verlag Lecture Notes in Artificial Intelligence Series, 375–386 (1998). 17. Kitagawa, G., Takanami, T., Matsumoto, N.: Signal Extraction Problems in Seismology, Intenational Statistical Review, 69 (1), (2001) 129–152. 18. Konishi, S., Kitagawa, G.: Generalised information criteria in model selection. Biometrika, 83, (4), (1996) 875–890. 19. Sakamoto, Y., Ishiguro, M. and Kitagawa, G.: Akaike Information Criterion Statistics, D-Reidel, Dordlecht, (1986) 20. West, M., Harrison, P.J., Migon,H.S.: Dynamic generalized linear models and Bayesian forecasting (with discussion). Journal of the American Statistical Association. 80 (1985) 73–97. 21. Whittaker, E.T: On a new method of graduation, Proc. Edinborough Math. Assoc., 78, (1923) 81–89.
Association Computation for Information Access Akihiko Takano National Institute of Informatics Hitotsubashi, Chiyoda, Tokyo 101-8430 Japan [email protected]
Abstract. GETA (Generic Engine for Transposable Association) is a software that provides efficient generic computation for association. It enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. Scalable implementation of GETA can handle large corpora of twenty million documents, and provides the implementation basis for the effective information access of next generation. DualNAVI is an information retrieval system which is a successful example to show the power and the flexibility of GETA-based computation for association. It provides the users with rich interaction both in document space and in word space. Its dual view interface always returns the retrieved results in two views: a list of titles for document space and “Topic Word Graph” for word space. They are tightly coupled by their cross-reference relation, and inspires the users with further interactions. The two-stage approach in the associative search, which is the key to its efficiency, also facilitates the content-based correlation among databases. In this paper we describe the basic features of GETA and DualNAVI.
1 Introduction
Since the last decade of the twentieth century, we have experienced an unusual expansion of the information space. Virtually any documents, including encyclopaedias, newspapers, and daily information within industries, have become available in digital form. The information we can access is literally exploding in amount and variation. The information space we face in our daily life is rapidly losing its coherence of any kind, and this has brought many challenges to the field of information access research. Effective access to such information is crucial to our intelligent life. The productivity of each individual in this new era can be redefined as the power to recover order from the chaos which this information flood has left. It requires the ability to collect appropriate information, the ability to analyze and discover the order within the collected information, and the ability to make proper judgements based on the analysis. This leads to the following three requirements for the effective information access we need: – Flexible methods for collecting relevant information. – Extracting mutual association (correlation) within the collected information.
– Interaction with the user's intention (in mind) and the archival system of knowledge (e.g. ontology). We need a swift and reliable method to collect relevant information from millions of documents. But the currently available methods are mostly based on simple keyword search, which suffers from low precision and low recall. We strongly believe that the important clue to attacking these challenges lies in the metrication of the information space. Once we have proper metrics for measuring similarity or correlation in the information space, it should not be difficult to recover some order through these metrics. We looked for a candidate for this metrication in the accumulation of previous research, and found that the statistical (or probabilistic) measures for document similarity are promising candidates. It is almost inspirational to realize that these measures establish a duality between document space and word space. Following this guiding principle, we have developed an information retrieval system, DualNAVI [8,10,13]. It provides the users with rich interaction both in document space and in word space. Its dual view interface always returns the retrieved results in two views: a list of titles for document space and a "Topic Word Graph" for word space. They are tightly coupled by their cross-reference relation, and inspire the users with further interactions. DualNAVI supports two kinds of search facilities, document associative search and keyword associative search, both of which are similarity-based searches from given items as a query. The dual view provides a natural interface to invoke these two search functions. The two-stage approach in the document associative search is the key to the efficiency of DualNAVI. It also facilitates the content-based correlation among databases which are maintained independently and distributively. Our experience with DualNAVI tells us that association computation based on mathematically sound metrics is the crucial part of realizing the new generation of IR (Information Retrieval) technologies [12]. DualNAVI has been commercially used in an Encyclopaedia search service over the internet since 1998, and in BioInformatics DB search [1] since 2000. Its effectiveness has been confirmed by many real users. It is also discussed as one of the promising search technologies for scientists in Nature magazine [2]. The main reason why the various proposed methods have not been used in practice is that they are not scalable in nature. Information access based on similarity between documents or words looks promising as an intuitive way to overview a large document set. But the heavy computing cost of evaluating similarity prevents such methods from being practical for large corpora of millions of documents. DualNAVI was no exception. To overcome this scalability problem, we have developed a software system called GETA (Generic Engine for Transposable Association) [3], which provides efficient generic computation for association. It enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. The scalable implementation of GETA, which works on various PC clusters, can handle large corpora of twenty million documents, and
Fig. 1. Navigation Interface of DualNAVI (Nishioka et al., 1997)
provides the implementation basis for the effective information access of the next generation [1,14]. In this paper we first overview the design principles and the basic functions of DualNAVI. Two important features of DualNAVI, the topic word graph and associative search, are discussed in detail. We also explain how it works in a distributed setting and realizes cross-database associative search. Finally, we describe the basic features of GETA.
2 DualNAVI: IR Interface Based on Duality
2.1 DualNAVI Interaction Model
DualNAVI is an information retrieval system which provides users with rich interaction methods based on two kinds of duality: dual view and dual query types. The dual view interface is composed of two views of the retrieved results: one in document space and the other in word space (see Fig. 1). Titles of the retrieved results are listed on the left-hand side of the screen (for documents), and the summarizing information is shown as a "Topic Word Graph" on the right of the screen (for words). Topic word graphs are dynamically generated by analyzing the retrieved set of documents. A set of words characterizing the retrieved results
Fig. 2. Dual view bridges Dual query types
are shown, and the links between them represent statistically meaningful co-occurrence relations among them. Connected subgraphs are expected to include good potential keywords with which to refine searches. The two views are tightly coupled based on their cross-reference relation. Select some topic words, and the related documents which include them are highlighted in the list of documents, and vice versa. On the other hand, dual query types mean that DualNAVI supports two kinds of search facilities. Document associative search finds documents related to a given set of key documents. Keyword associative search finds documents related to a given set of key words. The dual view interface provides a natural way of indicating the key objects for these search methods. Just select the relevant documents or words within the previous search result, and the user can start a new associative search. This enables easy and intuitive relevance feedback to refine searches effectively. Search by documents is especially helpful when users have some interesting documents but find it difficult to select proper keywords. The effectiveness of these two types of feedback using DualNAVI has been evaluated in [10]. The results were significantly positive for both types of interaction.
2.2 Dual View Bridges Dual Query Types
The dual view and dual query types are not just two isolated features. Dual query types can work effectively only within the dual view framework. Figure 2 illustrates how they relate to each other. We can start with either a search by keywords or by a document, and the retrieved results are shown in the dual view. If the title list includes interesting articles, we can proceed to the next associative search using these found articles as key documents. If some words in the topic word graph are interesting, we can start a new keyword search using these topic words as keys. Another advantage of the dual view interface is that the cross-checking function is naturally realized. If a user selects some articles of interest, he can easily find which topic words appear in them. Dually, it is easy to find related articles by selecting topic words. If multiple topic words are selected, the thickness of the checkmark (see Fig. 1) indicates the number of selected topic words included in each article. The user can sort the title list by this thickness, which approximates the relevance of each article to the topic suggested by the selected words.
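As a rough illustration of this cross-checking and sorting step (not the DualNAVI code), the following Python sketch counts, for each retrieved article, how many of the user-selected topic words it contains and orders the title list accordingly; the representation of articles as sets of words is an assumption made only for this example.

def sort_by_checkmark(articles, selected):
    # thickness = number of selected topic words appearing in each article
    thickness = {title: len(words & selected) for title, words in articles.items()}
    # articles covering more of the selected topic words come first
    return sorted(articles, key=lambda title: thickness[title], reverse=True)

articles = {
    "article-1": {"genome", "protein", "yeast"},
    "article-2": {"protein", "folding"},
    "article-3": {"weather", "climate"},
}
print(sort_by_checkmark(articles, {"protein", "genome"}))
# ['article-1', 'article-2', 'article-3']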
3 Association Computation for DualNAVI
3.1 Generation of Topic Word Graph
Topic word graphs summarize the search results and suggest proper words for further refining of searches. The method of generating topic word graphs is fully described in [9]. Here we give a brief summary. The process consists of three steps (see Fig. 3). The first step is the extraction of topic words based on a word frequency analysis over the retrieved set of documents. The next step is to generate links between the extracted topic words based on co-occurrence analysis. The last step assigns each topic word an xy-coordinate position in the display area. The score for selecting topic words is given by the ratio of df(w) in the retrieved documents to df(w) in the whole database, where df(w) is the document frequency of word w, i.e. the number of documents containing w. In general, it is difficult to keep the balance between high frequency words (common words) and low frequency words (specific words) by using a single score. In order to make a balanced selection, we adopted the frequency-class method, where all candidate words are first roughly classified by their frequencies, and then a proper number of topic words is picked from each frequency class. A link (an edge) between two words means that they are strongly related, that is, they co-appear in many documents in the retrieved results. In the link generation step, each topic word X is linked to the topic word Y which maximizes the co-occurrence strength df(X & Y) / df(Y) with X, among those
Fig. 3. Generating Topic Word Graph (Niwa et al., 1997)
having higher document frequency than X. Here df(X & Y) means the number of retrieved documents which contain both X and Y. The length of a link has no specific meaning, although it might be natural to expect a shorter link to mean a stronger relation. In the last step, which gives the two-dimensional arrangement of the topic word graph, the y-coordinate (vertical position) is decided according to the document frequency of each word within the retrieved set. Common words are placed in the upper part, and specific words are placed in the lower part. Therefore, the graph can be considered as a hierarchical map of the topics appearing in the retrieved set of documents. The x-coordinate (horizontal position) has no specific meaning. It is assigned just in a way that avoids overlapping of nodes and links.
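The steps above can be made concrete with a minimal Python sketch (not the DualNAVI implementation): documents are taken to be sets of words, the retrieved set is assumed to be a subset of the whole database, the frequency-class balancing is reduced to a single ranked cut-off, and the layout step is omitted.

def df(docs, w):
    # document frequency of word w in a collection of word sets
    return sum(1 for d in docs if w in d)

def topic_word_graph(retrieved, whole, n_words=20):
    vocab = set().union(*retrieved)
    # Step 1: score words by df in the retrieved set relative to df in the whole database
    score = {w: df(retrieved, w) / df(whole, w) for w in vocab}
    topics = sorted(vocab, key=lambda w: score[w], reverse=True)[:n_words]
    # Step 2: link each topic word X to the topic word Y of higher document
    # frequency (within the retrieved set) that maximizes df(X & Y) / df(Y)
    links = []
    for x in topics:
        candidates = [y for y in topics if y != x and df(retrieved, y) > df(retrieved, x)]
        if not candidates:
            continue
        def strength(y):
            return sum(1 for d in retrieved if x in d and y in d) / df(retrieved, y)
        links.append((x, max(candidates, key=strength)))
    # Step 3 (layout) is omitted: y-coordinates follow document frequency,
    # x-coordinates are chosen only to avoid overlaps
    return topics, links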
3.2 Associative Search
Associative search is a new type of information retrieval method based on the similarity between documents. It can be considered as search for documents by example. It is useful when the user's intention cannot clearly be expressed by one or a few keywords, but the user has some documents which (partly) match his intention. Associative search is also a powerful tool for relevance feedback. If you find interesting items in the search results, an associative search with these items as search keys may bring you more related items which were not previously retrieved. Associative search in DualNAVI consists of the following steps:
Fig. 4. Associative Search
– Extraction of characterizing words from the selected documents. The default number of characterizing words to be extracted is 200. For each word w which appears at least once in the selected documents, its score is calculated by score(w) = tf(w) / TF(w), where tf(w) and TF(w) are the term frequencies of w in the selected documents and in the whole database, respectively. Then the above number of words with the highest scores is selected.
– These extracted words are used as a query q, and the relevance of each document d in the target database to this query is calculated by sim(d, q), which is described in Fig. 5 [11]. Here DF(w) is the number of documents containing w, and N is the total number of documents in the database.
– The documents in the target database are sorted by this similarity score and the top-ranked documents are returned as the search results.
In theory, associative search should not limit the size of the query. The merged documents should be used as the query in an associative search from a set of documents. But if we do not reduce the number of distinctive words in the query, we end up calculating sim(d, q) for almost all the documents in the target database, even when we request just ten documents. In fact, this extensive computational cost had prevented associative search from being used in practical systems. This is why we reduce the query to a manageable size in step one. Most common words, which appear in many documents, are dropped in this step. We have to do this carefully so as not to drop important (e.g. informative) words from the query.
sim(d, q) = ρ(d) · Σ_{w∈q} σ(w) · ν(w, d) · ν(w, q)
ρ(d) = 1 / (1 + θ(length(d) − ℓ))    (ℓ: average document length, θ: slope constant)
σ(w) = log(N / DF(w))
ν(w, X) = (1 + log tf(w|X)) / (1 + log(average_{ω∈X} tf(ω|X)))
Fig. 5. Similarity measure for associative search (Singhal et al., 1996)
In the above explanation, we simply adopt tf(w) / TF(w) for measuring the importance of the words. But it is possible to use other methods for this filtering. In [5], Hisamitsu et al. discuss the representativeness of words, which is a promising candidate for this task. It is a more complex but theoretically sound measure. Another breakthrough we made in taming this computational cost is of course the development of GETA, a high-speed engine for this kind of generic association computation. In GETA, the computations realizing the above two steps are supported in their most abstract form:
– the method to extract summarizing words from a given set of documents,
– the method to collect the documents in the target database which are most relevant to a given key document (a set of words).
With these improvements, the associative search in DualNAVI has become efficient enough for many practical applications.
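To make the two steps concrete, here is a small Python sketch assuming documents are lists of words, tf, TF and DF are raw counts, and the similarity is the pivoted-length measure as reconstructed in Fig. 5; the 200-word cut-off follows the text, while the function names and the value of θ are assumptions of this example (θ must stay small enough that the length normalization remains positive).

import math
from collections import Counter

def characterizing_words(selected, database, n_words=200):
    # step 1: score(w) = tf(w) / TF(w); keep the n_words best words as the query
    tf = Counter(w for doc in selected for w in doc)
    TF = Counter(w for doc in database for w in doc)
    return sorted(tf, key=lambda w: tf[w] / TF[w], reverse=True)[:n_words]

def sim(doc, query_words, database, theta=0.05):
    # step 2: the measure of Fig. 5, written out naively (no index, no caching)
    d, q = Counter(doc), Counter(query_words)
    avg_len = sum(len(x) for x in database) / len(database)
    rho = 1.0 / (1.0 + theta * (len(doc) - avg_len))
    avg_d = sum(d.values()) / len(d)
    avg_q = sum(q.values()) / len(q)
    total = 0.0
    for w in q:
        if w not in d:
            continue
        DF = sum(1 for x in database if w in x)      # document frequency of w
        sigma = math.log(len(database) / DF)         # sigma(w) = log(N / DF(w))
        nu_d = (1 + math.log(d[w])) / (1 + math.log(avg_d))
        nu_q = (1 + math.log(q[w])) / (1 + math.log(avg_q))
        total += sigma * nu_d * nu_q
    return rho * total

def associative_search(selected, database, top_k=10):
    query = characterizing_words(selected, database)
    ranked = sorted(range(len(database)),
                    key=lambda i: sim(database[i], query, database), reverse=True)
    return ranked[:top_k]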
3.3 Cross DB Associative Search
This two-step procedure for associative search is the key to the distributed architecture of the DualNAVI system. Associative search is divided into two subtasks, summarizing and similarity evaluation. Summarizing needs only the source DB, and the target DB is used only in similarity evaluation. If these two functions are provided for each DB, it is not difficult to realize cross DB associative search. Figure 6 shows the structure for cross DB associative search between two physically distributed DualNAVI servers, one for the encyclopaedia and the other for newspapers. The user can select some articles in the encyclopaedia and search for related articles in other DBs, say the newspapers. The user can thus search across physically distributed DBs associatively and seamlessly. We call such a set of DBs a "Virtual Database" because it can be accessed as one large database.
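The division of labor can be pictured with the toy Python classes below; the class and method names are assumptions for this sketch (not the DualNAVI API), and the scoring is deliberately crude, since the point is only that summarization runs at the source server, similarity evaluation at the target server, and only a word list crosses the network.

from collections import Counter

class ToyServer:
    # stand-in for one DualNAVI server holding its own document collection
    def __init__(self, docs):
        self.docs = docs  # doc_id -> list of words

    def summarize(self, doc_ids, n_words=50):
        # source-side subtask: characterizing words of the selected documents
        tf = Counter(w for i in doc_ids for w in self.docs[i])
        TF = Counter(w for d in self.docs.values() for w in d)
        return sorted(tf, key=lambda w: tf[w] / TF[w], reverse=True)[:n_words]

    def rank(self, query_words, top_k=10):
        # target-side subtask: similarity evaluation against the incoming query
        q = set(query_words)
        score = {i: len(q & set(d)) for i, d in self.docs.items()}  # crude overlap score
        return sorted(score, key=score.get, reverse=True)[:top_k]

def cross_db_search(source, target, selected_ids):
    # e.g. source = encyclopaedia server, target = newspaper server
    return target.rank(source.summarize(selected_ids))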
Fig. 6. Cross DB associative search with DualNAVI
4 GETA: Generic Engine for Association Computation
GETA (Generic Engine for Transposable Association) is a software system that provides efficient generic computation for association [3]. It is designed as a tool for manipulating very large sparse matrices, which typically appear as index files for large-scale text retrieval. By providing the basic operations on such matrices, GETA enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. Using GETA, it is almost trivial to realize associative search functions, which accept a set of documents as a query and return a set of highly related documents in relevance order. The computation by GETA is very efficient – it is typically 50 times or more faster than a quick, ad hoc implementation of the same task. The key to the performance of GETA is its representation of the matrices. GETA compresses the matrix information to the extreme, which allows it to keep the matrices in memory almost all the time. An experimental associative search system using GETA can handle one million documents with an ordinary PC (single CPU with 1 GB memory). It has been verified that the standard response time for associative search is less than a few seconds, which is fast enough to be practical. To achieve higher scalability, we have also developed a parallel processing version of GETA. It divides and distributes the matrix information onto the nodes of a PC cluster, which collaboratively calculate the same results as the
Fig. 7. Experimental Associative Search Interface using GETA
monolithic version of GETA. Thanks to the inner product structure of most statistical measures, the speedup obtained by this parallelization is significant. With this version of GETA, it has been confirmed that real-time associative search becomes feasible for about 20 million documents using an 8- to 16-node PC cluster. The use of GETA is not limited to associative search. It can be applied to a large variety of text processing techniques, such as text categorization, text clustering, and text summarization. We believe GETA will be an essential tool for accelerating research on, and practical application of, these and other text processing techniques. GETA was released as open source software in July 2002. The major part of the design and implementation of GETA was done by Shingo Nishioka. The development of GETA has been supported by the Advanced Software Technology project under the auspices of the Information-technology Promotion Agency (IPA) in Japan.
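The "transposable" association computation can be pictured as products with a sparse document-word matrix and its transpose. The Python sketch below uses plain dictionaries as the row-wise and column-wise views of that matrix; it is an illustration of the idea only, not GETA's compressed index format or API.

def build_index(docs):
    # docs: doc_id -> dict of term frequencies; returns the matrix stored both
    # row-wise (documents) and column-wise (words, i.e. the transposed view)
    forward = {d: dict(tf) for d, tf in docs.items()}
    inverted = {}
    for d, tf in docs.items():
        for w, f in tf.items():
            inverted.setdefault(w, {})[d] = f
    return forward, inverted

def related_documents(doc_ids, forward, inverted, top_k=10):
    # one association step: fold the selected documents into a word vector,
    # then fold that vector back into document scores via the inverted view
    word_vec = {}
    for d in doc_ids:
        for w, f in forward[d].items():
            word_vec[w] = word_vec.get(w, 0) + f
    doc_scores = {}
    for w, qf in word_vec.items():
        for d, f in inverted.get(w, {}).items():
            doc_scores[d] = doc_scores.get(d, 0) + qf * f   # inner product
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]

The inner-product form of the second loop is also what makes the parallel version effective: each cluster node can hold a slice of the matrix, compute partial scores, and the partial sums are simply added.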
4.1 Basic Features of GETA
Some of the characterizing features of GETA are as follows:
– It provides Efficient and Generic computation for association using highly compressed indices.
– It is Portable — it works on various UNIX systems (e.g. FreeBSD, Linux, Solaris, etc.) on PC servers or PC clusters.
– It is Scalable — associative document search over 20 million documents can be done within a few seconds using an 8- to 16-node PC cluster.
– It is Flexible — the similarity measures among documents or words can be switched dynamically during computation. Users can easily define their own measures using macros (a sketch of this style of pluggable measure is given after this list).
– Most functions of GETA are accessible from the Perl environment through its Perl interface. This is useful for implementing experimental systems that compare various statistical measures of similarity.
To demonstrate this flexibility of GETA, we have implemented an associative search interface for document search (see Fig. 7). It can be used for quantitative comparison among various measures or methods.
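The "flexible" point can be illustrated by treating the measure as a function that is passed into the search loop; the measures below are generic textbook examples written for this sketch, not GETA's built-in macros.

import math

def dot(q, d):
    return sum(q[w] * d.get(w, 0) for w in q)

def cosine(q, d):
    n = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot(q, d) / n if n else 0.0

def search(query, docs, measure, top_k=10):
    # 'measure' is any function (query_vector, doc_vector) -> score, so the
    # similarity can be swapped without touching the search loop itself
    return sorted(docs, key=lambda i: measure(query, docs[i]), reverse=True)[:top_k]

# e.g. compare search(q, docs, dot) with search(q, docs, cosine) on the same index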
4.2 Document Analysis Methods Using GETA
Various methods for document analysis have been implemented using GETA and are included in the standard distribution of the GETA system.
– Tools for dynamic document clustering: Various methods for dynamic document clustering are implemented using GETA:
• It provides an efficient implementation of the HBC (Hierarchical Bayesian Clustering) method [6,7]. It takes a few seconds to cluster 1000 documents on an ordinary PC.
• Representative terms of each cluster are available.
• For comparative studies among different methods, most major existing clustering methods (e.g. the single-link method, complete-link method, group average method, Ward method, etc.) are also available; a small sketch of the group-average variant is given after this list.
– Tools for evaluating word representativeness [4,5]: Representativeness is a new measure for evaluating the power of words to represent some topic. It provides a quantitative criterion for selecting effective words to summarize the content of a given set of documents. A new measure for word representativeness is proposed together with an efficient implementation for evaluating it using GETA. It is also possible to apply it to the automatic selection of important compound words.
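As an illustration of the comparative clustering setting (this is a bare-bones group-average agglomeration written for this sketch, not the HBC implementation or GETA code), any pairwise document similarity, such as the sim(d, q) sketched earlier, can be plugged in as the sim argument.

def group_average_clustering(docs, sim, n_clusters):
    # docs: list of documents; sim: pairwise similarity function; greedy merging
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(a, b):
        # average pairwise similarity between two clusters of document indices
        return sum(sim(docs[i], docs[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = max(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] += clusters.pop(b)   # merge the most similar pair
    return clusters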
5 Conclusions
We have shown how association computation, such as evaluating the similarity among documents or words, is essential for new-generation technologies in
information access. We believe that DualNAVI, which is based on this principle, opens a new horizon for interactive information access, and that GETA will serve as an implementation basis for these new-generation systems.
Acknowledgements. The work reported in this paper is an outcome of joint research efforts with my former colleagues at Hitachi Advanced/Central Research Laboratories: Yoshiki Niwa, Toru Hisamitsu, Makoto Iwayama, Shingo Nishioka, Hirofumi Sakurai and Osamu Imaichi. This research is partly supported by the Advanced Software Technology Project under the auspices of the Information-technology Promotion Agency (IPA), Japan. It is also partly supported by the CREST Project of Japan Science and Technology.
References
1. BACE (Bio Association CEntral). http://bace.ims.u-tokyo.ac.jp/, August 2000.
2. D. Butler. Souped-up search engines. Nature, 405, pages 112–115, 2000.
3. GETA (Generic Engine for Transposable Association). http://geta.ex.nii.ac.jp/, July 2002.
4. T. Hisamitsu, Y. Niwa, and J. Tsujii. Measuring Representativeness of Terms. In Proceedings of IRAL'99, pages 83–90, 1999.
5. T. Hisamitsu, Y. Niwa, and J. Tsujii. A Method of Measuring Term Representativeness. In Proceedings of COLING 2000, pages 320–326, 2000.
6. M. Iwayama and T. Tokunaga. Hierarchical Bayesian Clustering for Automatic Text Classification. In Proceedings of IJCAI'95, pages 1322–1327, 1995.
7. M. Iwayama. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR 2000, pages 10–16, 2000.
8. S. Nishioka, Y. Niwa, M. Iwayama, and A. Takano. DualNAVI: An information retrieval interface. In Proceedings of JSSST WISS'97, pages 43–48, 1997. (in Japanese).
9. Y. Niwa, S. Nishioka, M. Iwayama, and A. Takano. Topic graph generation for query navigation: Use of frequency classes for topic extraction. In Proceedings of NLPRS'97, pages 95–100, 1997.
10. Y. Niwa, M. Iwayama, T. Hisamitsu, S. Nishioka, A. Takano, H. Sakurai, and O. Imaichi. Interactive Document Search with DualNAVI. In Proceedings of NTCIR'99, pages 123–130, 1999.
11. A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In Proceedings of ACM SIGIR'96, pages 21–29, 1996.
12. A. Takano, Y. Niwa, S. Nishioka, M. Iwayama, T. Hisamitsu, O. Imaichi and H. Sakurai. Information Access Based on Associative Calculation. In Proceedings of SOFSEM 2000, LNCS Vol. 1963, pages 187–201, Springer-Verlag, 2000.
13. A. Takano, Y. Niwa, S. Nishioka, T. Hisamitsu, M. Iwayama, and O. Imaichi. Associative Information Access using DualNAVI. In Proceedings of NLPRS 2001, pages 771–772, 2001.
14. Webcat Plus (Japanese Books Information Service). http://webcatplus.nii.ac.jp/, October 2002.
Efficient Data Representations That Preserve Information
Naftali Tishby
School of Computer Science and Engineering and Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel [email protected]
Abstract. A fundamental issue in computational learning theory, as well as in biological information processing, is the best possible relationship between model representation complexity and its prediction accuracy. Clearly, we expect more complex models that require longer data representation to be more accurate. Can one provide a quantitative, yet general, formulation of this trade-off? In this talk I will discuss this question from Shannon’s Information Theory perspective. I will argue that this trade-off can be traced back to the basic duality between source and channel coding and is also related to the notion of ”coding with side information”. I will review some of the theoretical achievability results for such relevant data representations and discuss our algorithms for extracting them. I will then demonstrate the application of these ideas for the analysis of natural language corpora and speculate on possibly-universal aspects of human language that they reveal. Based on joint works with Ran Bacharach, Gal Chechik, Amir Globerson, Amir Navot, and Noam Slonim.
Can Learning in the Limit Be Done Efficiently?
Thomas Zeugmann
Institut für Theoretische Informatik, Universität zu Lübeck, Wallstraße 40, 23560 Lübeck, Germany [email protected]
Abstract. Inductive inference can be considered one of the fundamental paradigms of algorithmic learning theory. We survey recently obtained results and show their impact on potential applications. Since the main focus is on the efficiency of learning, we also deal with postulates of naturalness and their impact on the efficiency of limit learners. In particular, we look at the learnability of the class of all pattern languages and ask whether or not one can design a learner within the paradigm of learning in the limit that is nevertheless efficient. For achieving this goal, we deal with iterative learning and its interplay with the hypothesis spaces allowed. This interplay also has a severe impact on the postulates of naturalness satisfiable by any learner. Finally, since a limit learner is only supposed to converge in the limit, one never knows at any particular learning stage whether or not the learner has already succeeded. The resulting uncertainty may be prohibitive in many applications. We survey results that resolve this problem by outlining a new learning model, called stochastic finite learning. Though pattern languages can neither be finitely inferred from positive data nor PAC-learned, our approach can be extended to a stochastic finite learner that exactly infers all pattern languages from positive data with high confidence.
The full version of this paper is published in the Proceedings of the 14th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence Vol. 2842
Discovering Frequent Substructures in Large Unordered Trees
Tatsuya Asai1, Hiroki Arimura1, Takeaki Uno2, and Shin-ichi Nakano3
1 Kyushu University, Fukuoka 812–8581, JAPAN {t-asai,arim}@i.kyushu-u.ac.jp
2 National Institute of Informatics, Tokyo 101–8430, JAPAN [email protected]
3 Gunma University, Kiryu-shi, Gunma 376–8515, JAPAN [email protected]
Abstract. In this paper, we study a frequent substructure discovery problem in semi-structured data. We present an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a large collection of data trees with frequency above a user-specified threshold. The keys of the algorithm are efficient enumeration of all unordered trees in canonical form and incremental computation of their occurrences. We then show that Unot discovers each frequent pattern T in O(kb^2 m) time per pattern, where k is the size of T, b is the branching factor of the data trees, and m is the total number of occurrences of T in the data trees.
1 Introduction
With the rapid progress of network and storage technologies, huge amounts of electronic data have become available in various enterprises and organizations. These weakly-structured data are well modeled by graphs or trees, where a data object is represented by a node and a connection or relationship between objects is encoded by an edge between them. There have been increasing demands for efficient methods for graph mining, the task of discovering patterns in large collections of graph and tree structures [1,3,4,7,8,9,10,13,15,17,18,19,20]. In this paper, we present an efficient algorithm for discovering frequent substructures in large graph-structured data, where both the patterns and the data are modeled by labeled unordered trees. A labeled unordered tree is a rooted directed acyclic graph, where all but the root node have exactly one parent and each node is labeled by a symbol drawn from an alphabet (Fig. 1). Such unordered trees can be seen either as a generalization of the labeled ordered trees extensively studied in semi-structured data mining [1,3,4,10,13,18,20], or as an efficient specialization of the attributed graphs studied in graph mining research [7,8,9,17,19]. They are also useful for modeling various types of unstructured or semi-structured data such as chemical compounds, dependency structures in discourse analysis, and the hyperlink structure of Web sites. On the other hand, difficulties arise in the discovery of trees and graphs, such as the combinatorial explosion of the number of possible patterns and the isomorphism problem for many semantically equivalent patterns. Also, there are other
D
A 1
3
C
B 5 4
C
6
A 1
B 2 C
3
C A
4 5
B 7 B
6
A
8
B
B 10 9
C
11
A
12
C
C 14
13
C
15
Fig. 1. A data tree D and a pattern tree T
difficulties such as the computational complexity of detecting the embeddings or occurrences in trees. We tackle these problems by introducing novel definitions of the support and the canonical form for unordered trees, and by developing techniques for efficiently enumerating all unordered trees in canonical form without duplicates and for incrementally computing the embeddings of each pattern in the data trees. Interestingly, these techniques can be seen as instances of the reverse search technique, known as a powerful design tool for combinatorial enumeration problems [6,16]. Combining these techniques, we present an efficient algorithm Unot that computes all labeled unordered trees appearing in a collection of data trees with frequency above a user-specified threshold. The algorithm Unot has a provable performance in terms of the output size, unlike the other graph mining algorithms presented so far. It enumerates each frequent pattern T in O(kb^2 m) time per pattern, where k is the size of T, b is the branching factor of the data tree, and m is the total number of occurrences of T in the data trees. Termier et al. [15] developed the algorithm TreeFinder for discovering frequent unordered trees. The major difference from Unot is that TreeFinder is not complete, i.e., it finds a subset of the actually frequent patterns. On the other hand, Unot computes all the frequent unordered trees. Another difference is that matching functions preserve the parent relationship in Unot, whereas they preserve the ancestor relationship in TreeFinder. Very recently, Nijssen et al. [14] independently proposed an algorithm for the frequent unordered tree discovery problem with an efficient enumeration technique essentially the same as ours. This paper is organized as follows. In Section 2, we prepare basic definitions on unordered trees and introduce our data mining problems. In Section 3, we define the canonical form for unordered trees. In Section 4, we present an efficient algorithm Unot for finding all the frequent unordered trees in a collection of semi-structured data. In Section 5, we conclude.
2 Preliminaries
In this section, we give basic definitions on unordered trees according to [2] and introduce our data mining problems. For a set A, |A| denotes the size of A. For a binary relation R ⊆ A^2 on A, R* denotes the reflexive transitive closure of R.
2.1 The Model of Semi-structured Data
We introduce the class of labeled unordered trees as a model of semi-structured data and patterns, according to [2,3,12]. Let L = {ℓ, ℓ1, ℓ2, . . .} be a countable set of labels with a total order ≤L on L. A labeled unordered tree (an unordered tree, for short) is a directed acyclic graph T = (V, E, r, label), with a distinguished node r called the root, satisfying the following: V is a set of nodes, E ⊆ V × V is a set of edges, and label : V → L is the labeling function for the nodes in V. If (u, v) ∈ E then we say that u is a parent of v, or v is a child of u. Each node v ∈ V except r has exactly one parent, and the depth of v is defined by dep(v) = d, where (v0 = r, v1, . . . , vd) is the unique path from the root r to v. A labeled ordered tree (an ordered tree, for short) T = (V, E, B, r, label) is defined in a similar manner as a labeled unordered tree, except that for each internal node v ∈ V its children are ordered from left to right by the sibling relation B ⊆ V × V [3]. We denote by U and T the classes of unordered and ordered trees over L, respectively. For a labeled tree T = (V, E, r, label), we write VT, ET, rT and labelT for V, E, r and label if it is clear from the context. The following notions are common to both unordered and ordered trees. Let T be an unordered or ordered tree and u, v ∈ T be its nodes. If there is a path from u to v, then we say that u is an ancestor of v, or v is a descendant of u. For a node v, we denote by T(v) the subtree of T rooted at v, the subgraph of T induced by the set of all descendants of v. The size of T, denoted by |T|, is defined by |V|. We define the special tree ⊥ of size 0, called the empty tree. Example 1. In Fig. 1, we show examples of labeled unordered trees T and D over the alphabet L = {A, B, C} with the total ordering A > B > C. In the tree T, the root is node 1, labeled with A, and the leaves are 3, 4, and 6. The subtree T(2) at node 2 consists of the nodes 2, 3, 4. The size of T is |T| = 6. Throughout this paper, we assume that for every labeled ordered tree T = (V, E, B, r, label) of size k ≥ 1, its nodes are exactly {1, . . . , k}, numbered consecutively in preorder. Thus, the root and the rightmost leaf of T are root(T) = 1 and rml(T) = k, respectively. The rightmost branch of T is the path RMB(T) = (r0, . . . , rc) (c ≥ 0) from the root r to the rightmost leaf of T.
2.2 Patterns, Tree Matching, and Occurrences
For k ≥ 0, a k-unordered pattern (k-pattern, for short) is a labeled unordered tree having exactly k nodes, that is, VT = {1, . . . , k} with root(T) = 1. An unordered database (database, for short) is a finite collection D = {D1, . . . , Dn} ⊆ U of (ordered) trees, where each Di ∈ D is called a data tree. We denote by VD the set of the nodes of D and ||D|| = |VD| = Σ_{D∈D} |VD|. The semantics of unordered and ordered tree patterns are defined through tree matching [3]. Let T and D ∈ U be labeled unordered trees over L, called the pattern tree and the data tree, respectively. Then, we say that T occurs in D as an unordered tree if there is a mapping ϕ : VT → VD satisfying the following conditions (1)–(3) for every x, y ∈ VT:
(1) ϕ is one-to-one, i.e., x ≠ y implies ϕ(x) ≠ ϕ(y).
(2) ϕ preserves the parent relation, i.e., (x, y) ∈ ET iff (ϕ(x), ϕ(y)) ∈ ED.
(3) ϕ preserves the labels, i.e., labelT(x) = labelD(ϕ(x)).
The mapping ϕ is called a matching from T into D. A matching ϕ : VT → VD into a single data tree extends in the obvious way to a matching from T into a database. MD(T) denotes the set of all matchings from T into D. Then, we define four types of occurrences of T in D as follows:
Definition 1. Let k ≥ 1, T ∈ U be a k-unordered pattern, and D be a database. For any matching ϕ : VT → VD ∈ MD(T) from T into D, we define:
1. The total occurrence of T is the k-tuple Toc(ϕ) = ⟨ϕ(1), . . . , ϕ(k)⟩ ∈ (VD)^k.
2. The embedding occurrence of T is the set Eoc(ϕ) = {ϕ(1), . . . , ϕ(k)} ⊆ VD.
3. The root occurrence of T is Roc(ϕ) = ϕ(1) ∈ VD.
4. The document occurrence of T is the index Doc(ϕ) = i such that Eoc(ϕ) ⊆ VDi for some 1 ≤ i ≤ |D|.
Example 2. In Fig. 1, we see that the pattern tree T has eight total occurrences ϕ1 = ⟨1, 2, 3, 4, 10, 11⟩, ϕ2 = ⟨1, 2, 4, 3, 10, 11⟩, ϕ3 = ⟨1, 2, 3, 4, 10, 13⟩, ϕ4 = ⟨1, 2, 4, 3, 10, 13⟩, ϕ5 = ⟨1, 10, 11, 13, 2, 3⟩, ϕ6 = ⟨1, 10, 13, 11, 2, 3⟩, ϕ7 = ⟨1, 10, 11, 13, 2, 4⟩, and ϕ8 = ⟨1, 10, 13, 11, 2, 4⟩ in the data tree D, where we identify the matching ϕi with Toc(ϕi). On the other hand, there are four embedding occurrences Eoc(ϕ1) = Eoc(ϕ2) = {1, 2, 3, 4, 10, 11}, Eoc(ϕ3) = Eoc(ϕ4) = {1, 2, 3, 4, 10, 13}, Eoc(ϕ5) = Eoc(ϕ6) = {1, 2, 3, 10, 11, 13}, and Eoc(ϕ7) = Eoc(ϕ8) = {1, 2, 4, 10, 11, 13}, and there is one root occurrence ϕ1(1) = ϕ2(1) = · · · = ϕ8(1) = 1 of T in D. Now, we analyze the relationship among the above definitions of the occurrences by introducing an ordering ≥occ on the definitions. For any types of occurrences τ, π ∈ {Toc, Eoc, Roc, Doc}, we say π is stronger than or equal to τ, denoted by π ≥occ τ, iff for all matchings ϕ1, ϕ2 ∈ MD(T) from T into D, π(ϕ1) = π(ϕ2) implies τ(ϕ1) = τ(ϕ2). For an unordered pattern T ∈ U, we denote by TOD(T), EOD(T), ROD(T), and DOD(T) the sets of the total, the embedding, the root and the document occurrences of T in D, respectively. The first lemma describes a linear ordering on the classes of occurrences, and the second lemma gives the relation between the relative sizes of the occurrences.
Lemma 1. Toc ≥occ Eoc ≥occ Roc ≥occ Doc.
Lemma 2. Let D be a database and T be a pattern. Then,
|TOD(T)| = k^Θ(k) · |EOD(T)|   and   |EOD(T)| = n^Θ(k) · |ROD(T)|
over all patterns T ∈ U and all databases D ∈ 2^U satisfying k ≤ cn for some 0 < c < 1, where k is the size of T and n is the size of the database. Proof. Omitted. For the proof, please consult the technical report [5].
Fig. 2. The depth-label sequences of labeled ordered trees
For any of the four types of occurrences τ ∈ {Toc, Eoc, Roc, Doc}, the τ-count of an unordered pattern U in a given database D is |τD(U)|. Then, the (relative) τ-frequency is the ratio freqD(T) = |τD(T)| / ||D|| for τ ∈ {Toc, Eoc, Roc}, and freqD(T) = |DocD(T)| / |D| for τ = Doc. A minimum frequency threshold is any number 0 ≤ σ ≤ 1. Then, we state our data mining problems as follows.
Frequent Unordered Tree Discovery with Occurrence Type τ: Given a database D ⊆ U and a positive number 0 ≤ σ ≤ 1, find all unordered patterns U ∈ U appearing in D with relative τ-frequency at least σ, i.e., freqD(U) ≥ σ.
In what follows, we concentrate on the frequent unordered tree discovery problem with embedding occurrences (Eoc), although Toc is the more natural choice from the viewpoint of data mining. However, we note that it is easy to extend the method and the results in this paper to the coarser occurrences Roc and Doc by simple preprocessing. The following substructure enumeration problem is a special case of the frequent unordered tree discovery problem with embedding occurrences where σ = 1/||D||.
Substructure Discovery Problem for Unordered Trees: Given a data tree D ∈ U, enumerate all the labeled unordered trees T ∈ U embedded in D, that is, occurring in D at least once.
Throughout this paper, we adopt the first-child next-sibling representation [2] as the representation of unordered and ordered trees in the implementation. For details of this representation, see a textbook, e.g., [2].
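For concreteness, here is a small Python sketch of a first-child next-sibling representation of labeled trees; it is an illustration only, not the authors' implementation, and later sketches in this section reuse the Node class and the preorder helper.

class Node:
    # first-child / next-sibling representation of a labeled (ordered) tree
    def __init__(self, label):
        self.label = label
        self.first_child = None    # leftmost child, or None for a leaf
        self.next_sibling = None   # right sibling, or None for a rightmost child

    def add_child(self, child):
        # attach 'child' as the new rightmost child of this node
        if self.first_child is None:
            self.first_child = child
        else:
            c = self.first_child
            while c.next_sibling is not None:
                c = c.next_sibling
            c.next_sibling = child
        return child

def preorder(node, depth=0):
    # (depth, label) pairs of the subtree rooted at 'node', in preorder
    yield (depth, node.label)
    child = node.first_child
    while child is not None:
        yield from preorder(child, depth + 1)
        child = child.next_sibling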
3 Canonical Representation for Unordered Trees
In this section, we give the canonical representation for unordered tree patterns according to Nakano and Uno [12].
3.1 Depth Sequence of a Labeled Unordered Tree
First, we introduce some technical definitions on ordered trees. We use labeled ordered trees in T as the representation of labeled unordered trees in U, where U ∈ U can be represented by any T ∈ T such that U is obtained from T by ignoring its sibling relation BT . Two ordered trees T1 and T2 ∈ T are equivalent
each other as unordered trees, denoted by T1 ≡ T2, if they represent the same unordered tree. We encode a labeled ordered tree of size k as follows [4,11,20]. Let T be a labeled ordered tree of size k. Then, the depth-label sequence of T is the sequence C(T) = ((dep(v1), label(v1)), . . . , (dep(vk), label(vk))) ∈ (N × L)*, where v1, . . . , vk is the list of the nodes of T ordered by the preorder traversal of T, and each (dep(vi), label(vi)) ∈ N × L is called a depth-label pair. Since the nodes of T and their depth-label pairs are in one-to-one correspondence, we identify them in what follows. See Fig. 2 for examples of depth-label sequences. Next, we introduce a total ordering ≥ over depth-label sequences as follows. For depth-label pairs (di, ℓi) ∈ N × L (i = 1, 2), we define (d1, ℓ1) > (d2, ℓ2) iff either (i) d1 > d2 or (ii) d1 = d2 and ℓ1 > ℓ2. Then, C(T1) = (x1, . . . , xm) is heavier than C(T2) = (y1, . . . , yn), denoted by C(T1) ≥lex C(T2), iff C(T1) is lexicographically larger than or equal to C(T2) as a sequence over the alphabet N × L. That is, C(T1) ≥lex C(T2) iff there exists some k such that (i) xi = yi for each i = 1, . . . , k − 1 and (ii) either xk > yk or m > k − 1 = n. By identifying ordered trees with their depth-label sequences, we may often write T1 ≥lex T2 instead of C(T1) ≥lex C(T2). Now, we give the canonical representation for labeled unordered trees as follows.
Definition 2 ([12]). A labeled ordered tree T is in the canonical form, or is a canonical representation, if its depth-label sequence C(T) is heaviest among all ordered trees over L equivalent to T, i.e., C(T) = max{ C(S) | S ∈ T, S ≡ T }. The canonical ordered tree representation (or canonical representation, for short) of an unordered tree U ∈ U, denoted by COT(U), is the labeled ordered tree T ∈ T in canonical form that represents U as an unordered tree. We denote by C the class of the canonical ordered tree representations of labeled unordered trees over L. The next lemma gives a characterization of the canonical representations of unordered trees [12].
Lemma 3 (Left-heavy condition [12]). A labeled ordered tree T is the canonical representation of some unordered tree iff T is left-heavy, that is, for any nodes v1, v2 ∈ V, (v1, v2) ∈ B implies C(T(v1)) ≥lex C(T(v2)).
Example 3. The three ordered trees T1, T2, and T3 in Fig. 2 represent the same unordered tree, but not the same ordered tree. Among them, T1 is left-heavy and thus it is the canonical representation of the labeled unordered tree under the assumption that A > B > C. On the other hand, T2 is not canonical since the depth-label sequence C(T2(2)) = (1B, 2A, 2B, 3A) is lexicographically smaller than C(T2(6)) = (1A, 2B, 2A), and this violates the left-heavy condition. Likewise, T3 is not canonical since B < A implies that C(T3(3)) = (2B, 3A) is lexicographically smaller than the depth-label sequence of the sibling subtree to its right, again violating the left-heavy condition.
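Continuing the illustrative Python sketches (reusing Node and preorder from Section 2), the depth-label sequence and the left-heavy test of Lemma 3 can be written as follows; note that label comparison here uses Python's default ordering on the label values, which stands in for the total order ≤L and need not match the A > B > C convention of the running example.

def subtree_sequence(node):
    # C(T(node)) as a list of (depth, label) pairs, depths relative to 'node'
    return list(preorder(node))

def children(node):
    out, c = [], node.first_child
    while c is not None:
        out.append(c)
        c = c.next_sibling
    return out

def is_left_heavy(root):
    # Lemma 3: canonical iff at every node the subtree sequences of the children
    # are lexicographically non-increasing from left to right
    cs = children(root)
    for left, right in zip(cs, cs[1:]):
        if subtree_sequence(left) < subtree_sequence(right):
            return False
    return all(is_left_heavy(c) for c in cs)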
Fig. 3. Notions on a canonical representation
Fig. 4. Data structure for a pattern
3.2 The Reverse Search Principle and the Rightmost Expansions
The reverse search is a general scheme for designing efficient algorithm for hard enumeration problems [6,16]. In reverse search, we define the parent-child relation P ⊆ S × S on the solution space S of the problem so that each solution X has the unique parent P (X). Since this relation forms a search tree over S, we enumerate the solutions starting from the root solutions and by computing the children for the solutions. Iterating this process, we can generate all the solutions without duplicates. Let T be a labeled ordered tree having at least two nodes. We denote by P (T ) the unique labeled ordered tree derived from T by removing the rightmost leaf rml(T ). We say P (T ) is the parent tree of T or T is a child tree of P (T ). The following lemma is crucial to our result. Lemma 4 ([12]). For any labeled ordered tree T ∈ T , if T is in canonical form then so is its parent P (T ), that is, T ∈ C implies P (T ) ∈ C. Proof. For a left-heavy tree T ∈ C, the operation to remove the rightmost leaf from T does not violate the left-heavy condition of T . It follows from Lemma 3 that the lemma holds.
Definition 3 (Rightmost expansion [3,11,20]). Let S ∈ T be a labeled ordered tree over L. Then a labeled ordered tree T ∈ T is a rightmost expansion of S if T is obtained from S by attaching a new node v as the rightmost child of a node on the rightmost branch RMB(S) of S. If (dep(v), label(v)) = (d, ℓ) then we call T the (d, ℓ)-expansion of S. We define the (0, ℓ)-expansion of ⊥ to be the single-node tree with label ℓ. Since the newly attached node v is the last node of T in preorder, we denote the (d, ℓ)-expansion of S by S·(d, ℓ). We sometimes write v = (dep(v), label(v)). If we can compute the set of child trees of a given labeled ordered tree S ∈ T, then we can enumerate all the labeled ordered trees in T. This method is called the rightmost expansion and has been independently studied in [3,11,20].
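A brute-force rendering of the resulting enumeration, again reusing the sketches above: it generates rightmost expansions and keeps only those that pass the full left-heavy test, so it spends O(k^2) per candidate rather than the O(1) incremental test developed in Section 4; it is meant only to make the reverse-search idea concrete.

def rightmost_branch(root):
    # nodes r0, r1, ..., rg from the root down to the rightmost leaf
    branch, node = [root], root
    while node.first_child is not None:
        node = node.first_child
        while node.next_sibling is not None:
            node = node.next_sibling
        branch.append(node)
    return branch

def undo_last_child(parent):
    # remove the rightmost child that was just attached to 'parent'
    c = parent.first_child
    if c.next_sibling is None:
        parent.first_child = None
        return
    while c.next_sibling.next_sibling is not None:
        c = c.next_sibling
    c.next_sibling = None

def enumerate_canonical(root, labels, max_size, size, out):
    # DFS over canonical trees: by Lemma 4 the parent of a canonical tree is
    # canonical, so recursing on canonical rightmost expansions is complete
    out.append(subtree_sequence(root))
    if size >= max_size:
        return
    for parent in rightmost_branch(root):
        for label in labels:
            parent.add_child(Node(label))
            if is_left_heavy(root):
                enumerate_canonical(root, labels, max_size, size + 1, out)
            undo_last_child(parent)   # undo the expansion before the next try

# usage sketch: all canonical trees of size <= 3 over labels {"A", "B"}
found = []
for l in ["A", "B"]:
    enumerate_canonical(Node(l), ["A", "B"], 3, 1, found)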
Algorithm Unot(D, L, σ)
Input: the database D = {D1, . . . , Dm} (m ≥ 0) of labeled unordered trees, a set L of labels, and the minimum frequency threshold 0 ≤ σ ≤ 1.
Output: the set F ⊆ C of all frequent unordered trees of size at most Nmax.
Method:
1. F := ∅; α := |D|σ; // initialization
2. For each label ℓ ∈ L, do: T := (0, ℓ); /* 1-pattern with copy depth 0 */ Expand(T, O, 0, α, F);
3. Return F; // the set of frequent patterns
Fig. 5. An algorithm for discovering all frequent unordered trees
4 Mining Frequent Unordered Tree Patterns
In this section, we present an efficient algorithm Unot for solving the frequent unordered tree pattern discovery problem w.r.t. the embedding occurrences.
4.1 Overview of the Algorithm
In Fig. 5, we show our algorithm Unot for finding all the canonical representations of frequent unordered trees in a given database D. A key of the algorithm Unot is efficient enumeration of all the canonical representations, which is implemented by the subprocedure FindAllChildren in Fig. 7 to run in O(1) time per pattern. Another key is incremental computation of their occurrences. This is implemented by the subprocedure UpdateOcc in Fig. 8 to run in O(bk^2 m) time per pattern, where b is the maximum branching factor in D, k is the maximum pattern size, and m is the number of embedding occurrences. We give detailed descriptions of these procedures in the following subsections.
4.2 Enumerating Unordered Trees
First, we prepare some notation (see Fig. 3). Let T be a labeled ordered tree with the rightmost branch RMB(T) = (r0, r1, . . . , rg). For every i = 0, 1, . . . , g, if ri has two or more children then we denote by si+1 the child of ri preceding ri+1, that is, si+1 is the second rightmost child of ri. Then, we call Li = T(si+1) and Ri = T(ri+1) the left and the right tree of ri. If ri has exactly one child ri+1 then we define Li = ∞, where ∞ is a special tree such that ∞ >lex S for any S ∈ T. For a pattern tree T ∈ T, we sometimes write Li(T) and Ri(T) for Li and Ri to indicate the pattern tree T. By Lemma 3, an ordered tree is in canonical form iff it is left-heavy. The next lemma claims that the algorithm only needs to check the left trees against the right trees to decide whether the tree is in canonical form.
Procedure Expand(S, O, c, α, F)
Input: a canonical representation S ∈ U, the embedding occurrences O = EOD(S), the copy depth c, a nonnegative integer α, and the set F of the frequent patterns.
Method:
– If |O| < α then return; else F := F ∪ {S};
– For each ⟨S·(i, ℓ), cnew⟩ ∈ FindAllChildren(S, c), do:
  • T := S·(i, ℓ);
  • P := UpdateOcc(T, O, (i, ℓ));
  • Expand(T, P, cnew, α, F);
Fig. 6. A depth-first search procedure Expand
Let T be a labeled ordered tree with the rightmost branch RM B(T ) = (r0 , r1 , . . . , rg ). We say C(Li ) and C(Ri ) have a disagreement at the position j if j ≤ min(|C(Li )|, |C(Ri )|) and the j-th components of C(Li ) and C(Ri ) are different pairs. Suppose that T is in canonical form. During a sequence of rightmost expansions to T , the i-th right tree Ri grows as follows. 1. Firstly, when a new node v is attached to ri as a rightmost child, the sequence is initialized to C(Ri ) = v = (dep(v), label(v)). 2. Whenever a new node v of depth d = dep(v) > i comes to T , the right tree Ri grows. In this case, v is attached as the rightmost child of rd−1 . There are two cases below: (i) Suppose that there exists a disagreement in C(Li ) and C(Ri ). If r dep(v) ≥ v then the rightmost expansion with v does not violate the left-heavy condition of T , where rdep(v) is the node preceding v in the new tree. (ii) Otherwise, we know that C(Ri ) is a prefix of C(Li ). In this case, we say Ri is copying Li . Let m = |C(Ri )| < |C(Li )| and w be the m-th component of C(Li ). For every new node v, T ·v is a valid expansion if w ≥ v and r dep(v) ≥ v. Otherwise, it is invalid. (iii) In cases (i) and (ii) above, if rdep(v)−1 is a leaf of the rightmost branch of T then r dep(v) is undefined. In this case, we define r dep(v) = ∞ . 3. Finally, T reaches C(Li ) = C(Ri ). Then, the further rightmost expansion to Ri is not possible. If we expand a given unordered pattern T so that all the right trees R0 , . . . , Rg satisfy the above conditions, then the resulting tree is in canonical form. Let RM B(T ) = (r0 , r1 , . . . , rg ) be the rightmost branch of T . For every i = 0, 1, . . . , g − 1, the internal node ri is said to be active at depth i if C(Ri ) is a prefix of C(Li ). The copy depth of T is the depth of the highest active node in T . To deal with special cases, we introduce the following trick: We define the leaf rg to be always active. Thus we have that if all nodes but rg are not active
Procedure FindAllChildren(T, c):
Method: Return the set Succ of all pairs ⟨S, c⟩, where S is a canonical child tree of T and c is its copy depth, generated by the following cases:
Case I: If C(Lk) = C(Rk) for the copy depth k:
– The canonical child trees of T are T·(1, ℓ1), . . . , T·(k + 1, ℓk+1), where label(ri) ≥ ℓi for every i = 1, . . . , k + 1. The trees T·(k + 2, ℓk+2), . . . , T·(g + 1, ℓg+1) are not canonical.
– The copy depth of T·(i, ℓi) is i − 1 if label(ri) = ℓi and i otherwise, for every i = 1, . . . , k + 1.
Case II: If C(Lk) ≠ C(Rk) for the copy depth k:
– Let m = |C(Rk)| + 1 and w = (d, ℓ) be the m-th component of C(Lk) (the next position to be copied). The canonical child trees of T are T·(1, ℓ1), . . . , T·(d, ℓd), where label(ri) ≥ ℓi for every i = 1, . . . , d − 1 and ℓ ≥ ℓd holds.
– The copy depth of T·(i, ℓi) is i − 1 if label(ri) = ℓi and i otherwise, for every i = 1, . . . , d − 1. The copy depth of T·(d, ℓd) is k if w = v and d otherwise.
Fig. 7. The procedure FindAllChildren
then its copy depth is g. This trick greatly simplifies the description of the update below. Now, we explain how to generate all child trees of a given canonical representation T ∈ C. In Fig. 7, we show the algorithm FindAllChildren that computes the set of all canonical child trees of a given canonical representation as well as their copy depths. The algorithm is almost the same as the algorithm for unlabeled unordered trees described in [12]. The update of the copy depth differs slightly from [12] because of the existence of labels. Let T be a labeled ordered tree with the rightmost branch RMB(T) = (r0, r1, . . . , rg) and the copy depth k ∈ {−1, 0, 1, . . . , g − 1}. Note that in the procedure FindAllChildren, the case where all but rg are inactive, including the case of chain trees, is implicitly treated in Case I. To implement the algorithm FindAllChildren so that it enumerates the canonical child trees in O(1) time per tree, we have to perform the following operations in O(1) time: updating a tree, accessing the sequences of the left and right trees, maintaining the position of the shorter prefix at the copy depth, retrieving the depth-label pair at that position, and deciding the equality C(Li) = C(Ri). To do this, we represent a pattern T by the structure shown in Fig. 4.
– An array code : [1..size] → (N × L) of depth-label pairs that stores the depth-label sequence of T, with length size ≥ 0.
– A stack RMB : [0..top] → (N × N × {=, ≠}) of triples (left, right, cmp). For each (left, right, cmp) = RMB[i], left and right are the starting positions of the subsequences of code that represent the left tree Li and the right tree Ri, and the flag cmp ∈ {=, ≠} indicates whether Li = Ri holds. The length of the rightmost branch is top ≥ 0.
It is not difficult to see that we can implement all the operations in FindAllChildren of Fig. 7 to work in O(1) time, where an entire tree is not output
Algorithm UpdateOcc(T, O, ⟨d, ℓ⟩)
Input: the rightmost expansion T of a pattern S, the embedding occurrence list O = EOD(S), and the depth d ≥ 1 and label ℓ ∈ L of the rightmost leaf of T.
Output: the new list P = EOD(T).
Method:
– P := ∅;
– For each ϕ ∈ O, do:
  + x := ϕ(rd−1); /* the image of the parent of the new node rd = (d, ℓ) */
  + For each child y of x do:
    − If labelD(y) = ℓ and y ∉ Eoc(ϕ) then ξ := ϕ·y and flag := true;
    − Else, skip the rest and continue the for-loop;
    − For each i = 1, . . . , d − 1, do: If C(Li) = C(Ri) but ξ(lefti) < ξ(righti) then flag := false, and break the inner for-loop;
    − If flag = true then P := P ∪ {ξ};
– Return P;
Fig. 8. An algorithm for updating the embedding occurrence lists of a pattern
but the difference from the previous tree. The proof of the next lemma is almost the same as in [12], except for the handling of labels.
Lemma 6 ([12]). For every canonical representation T and its copy depth c ≥ 0, FindAllChildren of Fig. 7 computes the set of all canonical child trees of T in O(1) time per tree, when only the differences from T are output.
The time complexity of O(1) per tree of the above algorithm is significantly faster than the O(k^2) time per tree of the straightforward algorithm based on Lemma 3, where k is the size of the computed tree.
4.3 Updating the Occurrence List
In this subsection, we give a method for incrementally computing the embedding occurrences EOD(T) of a child tree T from the occurrences EOD(S) of its parent canonical representation S. In Fig. 8, we show the procedure UpdateOcc that, given a canonical child tree T and the occurrences EOD(S) of the parent tree S, computes the embedding occurrences EOD(T). Let T be a canonical representation of a labeled unordered tree over L with domain {1, . . . , k}. Let ϕ ∈ MD(T) be a matching from T into D. Recall that the total and the embedding occurrences of T associated with ϕ are TO(ϕ) = ⟨ϕ(1), . . . , ϕ(k)⟩ and EO(ϕ) = {ϕ(1), . . . , ϕ(k)}, respectively. For convenience, we identify ϕ with TO(ϕ). We encode an embedding occurrence EO by one of the total occurrences ϕ with EO = EO(ϕ). Since there are many total occurrences corresponding to EO, we introduce a canonical representation for embedding occurrences, similarly to Section 3.
Fig. 9. A search tree for labeled unordered trees
Two total occurrences ϕ1 and ϕ2 are equivalent if EO(ϕ1) = EO(ϕ2). The occurrence ϕ1 is heavier than ϕ2, denoted by ϕ1 ≥lex ϕ2, if ϕ1 is lexicographically larger than ϕ2 as a sequence in N*. We give the canonical representation for embedding occurrences as follows.
Definition 4. Let T be a canonical form of a labeled unordered tree and EO ⊆ VD be one of its embedding occurrences in D. The canonical representation of EO, denoted by CR(EO), is the total occurrence ϕ ∈ MD(T) that is the heaviest tuple in the equivalence class { ϕ′ ∈ MD(T) | ϕ′ ≡ ϕ }.
Let ϕ = ⟨ϕ(1), . . . , ϕ(k)⟩ be a total occurrence of T over D. We denote by P(ϕ) the unique total occurrence of length k − 1 derived from ϕ by removing the last component ϕ(k). We say P(ϕ) is the parent occurrence of ϕ. For a node v ∈ VT of T, we denote by ϕ(T(v)) = ⟨ϕ(i), ϕ(i + 1), . . . , ϕ(i + |T(v)| − 1)⟩ the restriction of ϕ to the subtree T(v), where i, i + 1, . . . , i + |T(v)| − 1 are the nodes of T(v) in preorder. Now, we consider the incremental computation of the embedding occurrences.
Lemma 7. Let k ≥ 1 be any positive integer, S be a canonical tree, and ϕ be a canonical occurrence of S in D. For a node w ∈ VD, let T = S·v be a child tree of S with the rightmost branch (r0, . . . , rg). Then, the mapping φ = ϕ·w is a canonical total occurrence of T in D iff the following conditions (1)–(4) hold.
(1) labelD(w) = labelT(v).
(2) For every i = 1, . . . , k − 1, w ≠ ϕ(i).
(3) w is a child of ϕ(rd−1), where rd−1 ∈ RMB(S) is the node of depth d − 1 on the rightmost branch RMB(S) of S.
(4) C(Li) = C(Ri) implies φ(root(Li)) > φ(root(Ri)) for every i = 0, . . . , g − 1.
Proof. For any total occurrence ϕ ∈ MD(T), if ϕ is in canonical form then so is its parent P(ϕ). Moreover, ϕ is in canonical form iff ϕ is partially left-heavy, that is, for any nodes v1, v2 ∈ VT, both (v1, v2) ∈ B and T(v1) = T(v2) imply
ϕ(T (v1 )) ≥lex ϕ(T (v2 )). Thus the lemma holds.
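A simplified Python rendering of the incremental update covering conditions (1)–(3) of Lemma 7: data-tree nodes are assumed to carry label and children attributes, each occurrence is a list of data-tree nodes indexed by the parent pattern's preorder numbering, and parent_index is the preorder position of r_{d−1} in that numbering. The canonicality filter of condition (4) is deliberately omitted, so this sketch may return several total occurrences per embedding occurrence, unlike the real UpdateOcc.

def update_occurrences(occurrences, parent_index, new_label):
    # occurrences: total occurrences of the parent pattern S in the data tree
    result = []
    for phi in occurrences:
        x = phi[parent_index]          # image of r_{d-1}, the parent of the new pattern node
        for y in x.children:
            if y.label != new_label:   # condition (1): labels must agree
                continue
            if y in phi:               # condition (2): keep the matching one-to-one
                continue
            result.append(phi + [y])   # condition (3) holds by construction: y is a child of x
    return result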
Lemma 7 ensures the correctness of the procedure UpdateOcc of Fig. 8. For the running time, note that the test C(Li) = C(Ri) can be decided in O(1) time using the structure shown in Fig. 4. Note also that every canonical tree has at least one canonical child tree. Thus, we obtain the main theorem of this paper as follows.
Theorem 1. Let D be a database and 0 ≤ σ ≤ 1 be a threshold. Then, the algorithm Unot of Fig. 5 computes all the canonical representations of the frequent unordered trees w.r.t. embedding occurrences in O(kb^2 m) time per pattern, where b is the maximum branching factor in VD, k is the maximum size of the patterns enumerated, and m is the number of embeddings of the enumerated pattern.
Proof. Let S be a canonical tree and T be a child tree of S. Then, the procedure UpdateOcc of Fig. 8 computes the list of all canonical total occurrences of T in O(k′bm′) time, where k′ = |T| and m′ = |EO(S)|. From Lemma 6 and the fact that |EO(T)| = O(b·|EO(S)|), we have the result.
Fig. 9 illustrates the computation of the algorithm Unot enumerating a subset of the labeled unordered trees of size at most 4 over L = {A, B}. The arrows indicate the parent-child relation, and the crossed-out trees are non-canonical ones.
4.4 Comparison to a Straightforward Algorithm
We compare our algorithm Unot to the following straightforward algorithm Naive. Given a database D and a threshold σ, Naive enumerates all the labeled ordered trees over L using the rightmost expansion, and then for each tree it checks whether it is in canonical form by applying Lemma 3. Since the check takes O(k^2) time per tree, this stage takes O(|L|k^3) time. It takes O(n^k) time to compute all the embedding occurrences in a database D of size n. Thus, the overall time is O(|L|k^3 + n^k). On the other hand, Unot computes the canonical representations in O(kb^2 m) time, where the total number m of the embedding occurrences of T is m = O(n^k) in the worst case. However, m will be much smaller than n^k as the pattern size of T grows. Thus, if b is a small constant and m is much smaller than n, then our algorithm will be faster than the straightforward algorithm.
5 Conclusions
In this paper, we presented an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a collection of data trees. This algorithm has a provable performance in terms of the output size, unlike previous graph mining algorithms. It enumerates each frequent pattern T in O(kb^2 m) time per pattern, where k is the size of T, b is the branching factor of the data tree, and m is the total number of occurrences of T in the data trees. We are implementing a prototype system of the algorithm and planning computer experiments on synthetic and real-world data to give an empirical evaluation of the algorithm. The results will be included in the full paper.
Some graph mining algorithms such as AGM [8], FSG [9], and gSpan [19] use various types of canonical representations for general graphs, similar to our canonical representation for unordered trees. AGM [8] and FSG [9] employ the adjacency matrix with the lexicographically smallest row vectors under permutations of rows and columns. gSpan [19] uses as the canonical form the DFS code generated by a depth-first search over a graph. It is a future problem to study the relationship among these techniques based on canonical coding and to develop efficient coding schemes for restricted subclasses of graph patterns.
Acknowledgement. Tatsuya Asai and Hiroki Arimura would like to thank Ken Satoh, Hideaki Takeda, Tsuyoshi Murata, and Ryutaro Ichise for the fruitful discussions on Semantic Web mining, and Takashi Washio, Akihiro Inokuchi, Michihiro Kuramochi, and Ehud Gudes for the valuable discussions and comments on graph mining. Tatsuya Asai is grateful to Setsuo Arikawa for his encouragement and support of this work.
References 1. K. Abe, S. Kawasoe, T. Asai, H. Arimura, and S. Arikawa. Optimized Substructure Discovery for Semi-structured Data, In Proc. PKDD’02, 1–14, LNAI 2431, 2002. 2. Aho, A. V., Hopcroft, J. E., Ullman, J. D., Data Structures and Algorithms, Addison-Wesley, 1983. 3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, S. Arikawa, Efficient Substructure Discovery from Large Semi-structured Data, In Proc. SIAM SDM’02, 158–174, 2002. 4. T. Asai, H. Arimura, K. Abe, S. Kawasoe, S. Arikawa, Online Algorithms for Mining Semi-structured Data Stream, In Proc. IEEE ICDM’02, 27–34, 2002. 5. T. Asai, H. Arimura, T. Uno, S. Nakano, Discovering Frequent Substructures in Large Unordered Trees, DOI Technical Report DOI-TR 216, Department of Informatics, Kyushu University, June 2003. http://www.i.kyushu-u.ac.jp/doitr/trcs216.pdf 6. D. Avis, K. Fukuda, Reverse Search for Enumeration, Discrete Applied Mathematics, 65(1–3), 21–46, 1996. 7. L. B. Holder, D. J. Cook, S. Djoko, Substructure Discovery in the SUBDUE System, In Proc. KDD’94, 169–180, 1994. 8. A. Inokuchi, T. Washio, H. Motoda, An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data, In Proc. PKDD’00, 13–23, LNAI, 2000. 9. M. Kuramochi, G. Karypis, Frequent Subgraph Discovery, In Proc. IEEE ICDM’01, 2001. 10. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, H. Ueda, Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents, In Proc. PAKDD’02, 341–355, LNAI, 2002. 11. S. Nakano, Efficient generation of plane trees, Information Processing Letters, 84, 167–172, 2002. 12. S. Nakano, T. Uno, Efficient Generation of Rooted Trees, NII Technical Report NII-2003-005E, ISSN 1346-5597, Natinal Institute of Informatics, July 2003. 13. S. Nestrov, S. Abiteboul, R. Motwani, Extracting Schema from Semistructured Data, In Proc. SIGKDD’98 , 295–306, ACM, 1998.
14. S. Nijssen, J. N. Kok, Efficient Discovery of Frequent Unordered Trees, In Proc. MGTS'03, September 2003.
15. A. Termier, M. Rousset, M. Sebag, TreeFinder: a First Step towards XML Data Mining, In Proc. IEEE ICDM'02, 450–457, 2002.
16. T. Uno, A Fast Algorithm for Enumerating Bipartite Perfect Matchings, In Proc. ISAAC'01, LNCS, 367–379, 2001.
17. N. Vanetik, E. Gudes, E. Shimony, Computing Frequent Graph Patterns from Semistructured Data, In Proc. IEEE ICDM'02, 458–465, 2002.
18. K. Wang, H. Liu, Schema Discovery from Semistructured Data, In Proc. KDD'97, 271–274, 1997.
19. X. Yan, J. Han, gSpan: Graph-Based Substructure Pattern Mining, In Proc. IEEE ICDM'02, 721–724, 2002.
20. M. J. Zaki, Efficiently Mining Frequent Trees in a Forest, In Proc. SIGKDD 2002, ACM, 2002.
Discovering Rich Navigation Patterns on a Web Site

Karine Chevalier 1,2, Cécile Bothorel 1, and Vincent Corruble 2

1 France Telecom R&D (Lannion), France
{karine.chevalier, cecile.bothorel}@rd.francetelecom.com
2 LIP6, Pole IA, Université Pierre et Marie Curie (Paris VI), France
{Karine.Chevalier, Vincent.Corruble}@lip6.fr
Abstract. In this paper, we describe a method for discovering knowledge about the users of a web site from data composed of demographic descriptions and site navigations. The goal is to obtain knowledge that is useful for answering two types of questions: (1) how do site users visit a web site? (2) who are these users? Our approach is based on the following idea: the set of all site users can be divided into several coherent subgroups, each showing both distinct personal characteristics and a distinct browsing behaviour. We aim at obtaining associations between site usage patterns and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. This knowledge characterizes a precise web site usage and can be used in several applications: prediction of site navigation, recommendation, or improvement of site design.
1 Introduction
The World Wide Web is a powerful medium through which individuals or organizations can convey all sorts of information. Many attempts have been made to find ways to describe automatically web users (or more generally Internet users) and how they use the Internet. This paper focuses on the study of web users at the level of a given web site: are there several consistent groups of site users based on demographic descriptions? If this is the case, does each group show a distinct way of visiting the web site? These questions are important for site owners and advertisers, but also from a social research perspective: it is interesting to test whether there is some dependence between demographic descriptions and ways of navigating a site. Our project addresses the discovery of knowledge about users and their different site usage patterns for a given site. We aim at obtaining associations between site usage patterns (through navigation patterns) and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. These particular patterns underline, on a given site, different ways of visiting the site for specific groups of users (users who share similar personal descriptions). Our aim is to test the assumption that there are links between navigations on a site and users' characteristics, and to study the relevance of correlating these two very different types of data. If our results confirm that there are some relations between users'
personal characteristics and site navigations, this knowledge will help to describe site visitors in a rich manner and open avenues to assist site navigation. For instance, it can be helpful when we want to assist a user for whom no information (personal information or previous visits to the site) is known. Based on his current navigation on the web site and a set of previously obtained rich navigation patterns, we can infer some personal information about him and then recommend documents adapted to his inferred profile. This paper is organized in the following way: the second section presents some tools and methods to understand and describe site users, the third section describes rich navigation patterns and a method to discover them, and the fourth section shows the evaluations performed on the rich navigation patterns extracted from several web sites.
2 Knowledge Acquisition about Site Users
There are many methods to measure and describe the audience and the traffic on a site. A first way to know the users who access a given site is to use surveys produced by organizations like NetValue [13], Media Metrix [12] and Nielsen//NetRating [14]. Their method consists in analyzing the Internet activities of a panel over a long period of time and inferring knowledge about the entire population. This is the user-centric approach. Panels are built so as to represent as well as possible the current population of Internet users. Some demographic data on each user of the panel (such as age, gender or level of Internet practice) are collected, and all their activities on the Internet are recorded. The analysis of these data provides a qualification and quantification of Internet usage. We will consider here only the information related to users and web usage. This approach gives general trends: it indicates, for instance, who uses the web, what sort of sites they visit, etc., but no processing is performed to capture precisely the usage patterns on a given site. At best, one advantage of this approach is that site owners can get a description of their site users, but this holds only for sites with a large audience; other sites have little chance of having their typical users within the panel, and so of obtaining a meaningful description of their users. We can point out several interesting aspects of the user-centric approach. Firstly, those methods are based on the extrapolation of observations made on a panel of users to the entire set of users. This means that it can be sufficient to make an analysis on only a part of the users. Secondly, the approach relies on the assumption that there are links between some features of users' profiles and their Internet usage. The second way to know the users who access a given site is to perform an analysis at the site level. This is the site-centric approach. It consists in collecting all site navigations and then analysing these data in order to obtain traffic measures on the site and retrieve the statistically dominant paths or usage patterns from the set of site sessions. A session corresponds to a user's visit to the site; a session can be considered as a page sequence (in chronological order). Users' sessions are extracted from log files that contain all HTTP requests made on the site. Further information on problems and techniques to retrieve sessions from log files can be found in [5]. There are many industrial tools (WebTrends [16], Cybermétrie [7]) that implement the site-centric approach.
Here, we focus our attention on methods that automatically retrieve site usage patterns. Most of these methods are based on frequency measures: a navigation path is retrieved from the set of site sessions because it has a high probability of being followed [3][4], a sequence of pages is selected because it appears frequently in the set of site sessions (the WebSPADE algorithm [8], an adaptation of the SPADE algorithm [17]), or a site usage pattern is revealed because it is extracted from a group of sessions that were brought together by a clustering method [11]. Cooley et al. suggest filtering the frequent page sets in order to keep the most relevant ones [6]. They consider a page set as interesting if it contains pages that are not directly connected (there is no link between them and no similarity between their contents). These methods capture precise site usage patterns in terms of the site pages visited, but they capture only common site usage patterns (patterns shared by the greatest number of users). If a particular group of users shows a specific usage pattern of the web site but is not composed of enough users, their specific usage will not be highlighted. In that case, important information can be missed: particular (and significant) behaviours could be lost among all navigations. One way to overcome this limitation is to rely on some assumptions and methodologies of the user-centric approach described above. Firstly, it could be interesting to assume that there are some correlations between users' personal descriptions and the way they visit a given site. We can then build groups of users based on personal characteristics and apply site usage pattern extraction to smaller sets of sessions in order to capture navigation patterns specific to subgroups of users. This strategy reveals site usage patterns which are less frequent but associated with a specific group of users who share similar personal descriptions. We can then answer questions such as: "Do young people visit the same pages on a given site?" Secondly, in the same manner that the user-centric approach extrapolates knowledge learned on a panel of Internet users to the entire set of Internet users, we could restrict our search to data coming from a subset of site users and interpret the results obtained on this subset as valid for all the site users.
3 Discovering Rich Navigation Patterns
Our research project addresses the problem of knowledge discovery about a set of web site users and their uses of the site. We explore the possibility of correlating users' personal characteristics with their site navigation. Our objective is to provide a rich usage analysis of a site, i.e. usage patterns that are associated with personal characteristics and so offer a different, deeper understanding of the site usage. This has the following benefits:
• It provides the site manager with the means to understand his/her site users.
• It lets us envisage applications to personalization, such as navigation assistance to help new visitors.
We want to add meaning to site usage patterns, and find site usage patterns which are specific to a subgroup of site users. We explore the possibility of correlating user
descriptions and site usage patterns. Our work relies on the assumption that navigation "behaviors" and users' personal descriptions are correlated. If valid, this assumption has two consequences: (1) two users who are similar in socio-demographic terms show similar navigations on a web site; (2) two users who are similar in their navigations on a web site have similar personal descriptions. Our approach supposes the availability of data that are richer than classical site logs. They are composed of site sessions and personal descriptions of reference users. Reference users form a subset of users who have accepted to provide a list of personal characteristics (like age, job, etc.) and some navigation sessions on the web site, i.e. they are used as a reference panel for the entire population of the web site. From these data, we wish to obtain knowledge that is specific to the web site from which the data were obtained. In an application, by using this knowledge, we can infer some personal information about a new visitor and propose page recommendations based on his navigation, even if he gives no personal information. We choose to build this knowledge around two distinct elements:
- A personal user characteristic is an element that describes a user in a personal way, for instance: age is between 15 and 25 years old, gender is male.
- A navigation pattern represents a site usage pattern. Navigation patterns are sequences or sets of web pages that occur frequently in users' sessions. For instance, our data show, on the boursorama.com site (a French Stock Market site), the following frequent sequence of pages: access to a page about quoted shares and, later on, consultation of a page that contains advice on making investments.
We call the association of both elements of knowledge a 'rich navigation pattern', i.e. a navigation pattern associated with personal user characteristics. After describing our way of discovering navigation patterns in the next subsection, we detail the different rich navigation patterns that we want to learn, and finally we present a way to extract them from a set of data composed of the reference users' descriptions and their site sessions.

3.1 Discovering Navigation Patterns

Navigation patterns are sequences or sets of pages that occur frequently in session sets. We used an algorithm to retrieve frequent sets of pages that takes into account the principles of algorithms such as FreeSpan [10] (PrefixSpan [15], WebSPADE [8] and SPADE [17]) that improve on Apriori [1]. These algorithms are based on the following idea: "a frequent set is composed of frequent subsets". Here, a session is considered as a set of pages. We chose to associate to each pageset a list of session ids in which the pageset occurs, in order to avoid scanning the whole set of sessions each time the support of a pageset has to be calculated [10][15][8][17].
Table 1. Initialisation phase

for each session s in S do
    for each page pg ∈ s do
        Add s to the session set of the page pg.
L1 = { }
for each page pg do
    if numberUser(pg) > minOccurrence then L1 = L1 ∪ {pg}
return L1
An initialisation phase (Table 1) creates the frequent sets composed of one page. The session set S is scanned in order to build, for each page pg, a set that contains all sessions in which pg occurs. Then, only the web pages that appear in the navigations of more than minOccurrence users are kept in L1 (the set of large 1-pagesets).

Table 2. Building (k+1)-pagesets

// Main loop
k = 1
while (|Lk| > 1) do
    Lk+1 = BuildNext(Lk)
    k++
end_while

// BuildNext(Lk):
Lk+1 = { }
i = 1
while (i < |Lk|) do
    j = i + 1
    while ((j < |Lk|) and same_first_pages(Lk(i), Lk(j))) do
        Let page_of(n) = page_of(Lk(i)) ∪ page_of(Lk(j))
        S(n) = S(Lk(i)) ∩ S(Lk(j))
        if numberUser(n) > minOccurrence then Lk+1 = Lk+1 ∪ {n}
        j++
    end_while
    i++
end_while
return Lk+1

With:
Lk = set of large k-pagesets
S(a) = set of sessions in which pageset a occurs
page_of(a) = set of pages that compose the pageset a
same_first_pages(a, b) = returns true if the k-pagesets a and b share the same (k-1) first pages
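To make the pseudocode of Tables 1 and 2 concrete, here is a minimal Python sketch of the same pageset-mining scheme (session-id sets attached to each pageset, support counted as the number of distinct users). The data layout, a list of (user_id, set-of-pages) sessions, and all identifiers are illustrative choices of ours, not taken from the paper.

def mine_frequent_pagesets(sessions, min_occurrence):
    # sessions: list of (user_id, set_of_pages); support = number of distinct users
    page_sessions = {}                      # initialisation: session ids per page
    for sid, (_user, pages) in enumerate(sessions):
        for pg in pages:
            page_sessions.setdefault(pg, set()).add(sid)

    def n_users(session_ids):
        return len({sessions[sid][0] for sid in session_ids})

    # Large 1-pagesets, stored as sorted tuples of pages with their session-id sets.
    level = {(pg,): sids for pg, sids in page_sessions.items()
             if n_users(sids) > min_occurrence}
    all_frequent = dict(level)

    # Level-wise step: join two k-pagesets sharing their (k-1) first pages.
    while len(level) > 1:
        next_level = {}
        keys = sorted(level)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:        # same_first_pages fails: stop (keys are sorted)
                    break
                candidate = a + (b[-1],)
                sids = level[a] & level[b]  # intersection of the two session-id sets
                if n_users(sids) > min_occurrence:
                    next_level[candidate] = sids
        all_frequent.update(next_level)
        level = next_level
    return all_frequent

For frequent page sequences, the paper applies the same scheme to sessions viewed as ordered lists of pages, restricted to the pages that appear in the frequent pagesets found here.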
A session is associated with a user id, so we are able to compute, for each page, the number of different users that have consulted it. A k-pageset is a set of k pages; a large k-pageset is a frequent k-pageset (a k-pageset that occurs for more than minOccurrence users). The main phase consists in generating all frequent pagesets. New (k+1)-pagesets are built from frequent k-pagesets: if two k-pagesets share the same first (k-1) pages, then they can form a (k+1)-pageset (call it n). The session set of n is the intersection of the session sets of the two k-pagesets. Then, among the generated (k+1)-pagesets, we keep only those that occur for more than minOccurrence users, and so on. Table 2 describes this phase. We prefer calculating the support of a pageset based on the number of users (rather than the number of sessions) in order to avoid the situation where the site usage of a very active user hides the site usage of other users. In the same way, we use a similar algorithm to retrieve frequent sequences of pages, this time taking into account the order of the visited pages. Here, a session is considered as a sequence of pages. To restrict our search for frequent sequences of pages, we use the frequent sets of pages retrieved during the first step (because the set of pages that compose a frequent sequence is a frequent set).

3.2 Rich Navigation Patterns

Based on the assumption that there are some correlations between personal user descriptions and site usage, we extract different kinds of rich navigation patterns:
- A navigation pattern enriched with a user characteristic. This knowledge provides the following kind of rule: "if a user visits this page then, with a confidence of 77%, this user is about 20 years old." These enriched navigation patterns are association rules of the form "navigation pattern → user description" (NP→UD).
- A navigation pattern which is specific to a group of site users (who share a similar personal description). This knowledge provides the following kind of rule: "if a user is a woman then, with a confidence of 79%, she will visit this page of the site after this page". These specialized navigation patterns are association rules of the form "user description → navigation pattern" (UD→NP).
- More complex rules that involve, in the rule condition, both user characteristics and navigation patterns: association rules of the form (NP+UD→NP) or (NP+UD→UD). The first provides the following kind of rule: "if a user is a man and if he visits these two given site pages, then with a given confidence he will consult this other site page too". The second illustrates the following kind of rule: "if a user is between 30 and 40 years old and if he follows this navigation pattern, then the user is a man with a given confidence".
- And so on: rules of the form (NP*+UD*→NP) or (NP*+UD*→UD).
To sum up, rich navigation patterns are rules composed of a condition that involves user characteristics and/or navigation patterns, and of a conclusion that is either a user characteristic or a navigation pattern. The set of rich navigation patterns learned aims to provide a rich view of the usage of the given site, because it gives combined information about who the site users are and which site pages they visit.
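Purely as an illustration of these rule forms (this representation is ours, not the paper's), a rich navigation pattern can be seen as a condition/conclusion pair with a confidence, and checked against a visitor's current session and known profile:

from dataclasses import dataclass

@dataclass
class RichRule:
    nav_condition: frozenset          # pageset that must appear in the session
    user_condition: dict              # required user characteristics (may be empty)
    conclusion: tuple                 # e.g. ("GENDER", "woman") for an NP->UD rule
    confidence: float

def rule_applies(rule, session_pages, profile):
    # The rule fires when its navigation condition is included in the session
    # and its user condition (if any) matches the known profile.
    return (rule.nav_condition <= session_pages
            and all(profile.get(k) == v for k, v in rule.user_condition.items()))

# In the spirit of the NP->UD example given in Section 4:
rule = RichRule(frozenset({"www.anpe.fr/recherch/index.htm",
                           "www.anpe.fr/regions/accueil.htm"}),
                {}, ("GENDER", "woman"), 0.87)
print(rule_applies(rule, {"www.anpe.fr", "www.anpe.fr/recherch/index.htm",
                          "www.anpe.fr/regions/accueil.htm"}, {}))   # True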
Our goal is to use these rich navigation patterns in a real-time personalization system dedicated to a site. We wish to extract rich navigation patterns from data (personal information and site navigations) provided by a panel of users. Then, when a new unknown visitor is visiting the site, some personal information about him can be inferred based on his navigation, allowing some form of page recommendation even if he provides no personal information. The visitor gets the benefit of the reference users' experience on the site. Our study opens perspectives on personalization techniques that do not require any personal information from users. The next subsection presents a method called SurfMiner that implements our approach to extract rich navigation patterns.

3.3 SurfMiner Method: Discovering Rich Navigation Patterns

The SurfMiner method aims to extract rich navigation patterns. There are two important phases in this method: the clustering phase and the characterization phase. The Clustering phase consists in grouping reference users according to their site usage patterns (through navigation patterns) on the one hand, and according to their personal characteristics on the other hand. It is important to note that a user can appear in many groups.
- Building user clusters according to personal data. Several partitions have to be built because we do not know in advance which characteristics are the most relevant for grouping reference users with respect to the whole user description and the site usage. We choose to create a partition of users for each attribute. In the case of a nominal (or ordinal) attribute, the algorithm builds a user group for each value of this attribute and filters out all groups that do not contain enough users (the support of the cluster must be above a fixed minimum support). In the case of a continuous-valued attribute, the method consists in applying an agglomerative hierarchical clustering that uses an average similarity between clusters. At each step (after each merging of clusters), we keep all clusters that contain enough users.
- Building user clusters according to site usage patterns. This consists in discovering the navigation patterns present in the set of site sessions. For each navigation pattern, in order to form a user group, we bring together all users that have followed the navigation pattern. At the end of this phase, a user group is described by a navigation pattern: users of the group share the fact that they have followed this navigation pattern.
The Characterization phase aims, on the one hand, to characterize each group of users by personal characteristics and, on the other hand, to extract navigation patterns from the sessions of each user cluster discovered in the primary phase (we consider this as a characterization of user groups with navigation patterns).
- Characterizing user groups according to personal characteristics. For each user cluster (built during the Clustering phase), we select the personal characteristics
shared by a sufficient number of users. This number indicates our confidence in the characterization.
- Characterizing user groups according to site usage patterns. For each group of users (built during the Clustering phase), we extract navigation patterns from the set of all the sessions done by the users of the considered group. The discovered navigation patterns depict the site usage specific to the group of users from which they have been extracted.
This Characterization phase can at times produce irrelevant results, such as characterizations that could equally be observed in the entire set of users. In other words, those characterizations do not necessarily bring additional information compared to what can be inferred about all users. So it is necessary to use a relevance criterion to turn down these useless characterizations. We chose to evaluate the relevance of a characterization by comparing the proportion of users that have the personal characteristic (respectively, users that follow the navigation pattern) in the specific user group (the one that we want to characterize) and the proportion of users that share the personal characteristic (respectively, users that follow the navigation pattern) in the whole population of users. If the two proportions are significantly different, the characterization is deemed relevant. In order to determine this, we compute the z-score between the two proportions. The z-score between two proportions is considered significant when z is at least the critical value associated with the chosen risk level.

Table 3. SurfMiner method

// Initialization
Gusers = Clustering(Users)
RichPatterns = {}
nbLoop = 1
// Main loop
while nbLoop ≤ 2 do
    NewGusers = {}
    while Gusers ≠ {} do
        g = removeFirstElement(Gusers)
        RichPatterns = RichPatterns ∪ Characterization(g)
        NewGusers = NewGusers ∪ Clustering(g)
    end while
    Gusers = NewGusers
    nbLoop++
end while

With:
Users = set containing all users; each user is described by a set of personal descriptions and a set of sessions
Gusers = set of subgroups of users
g = a subgroup of users
RichPatterns = set containing the rich navigation patterns
(Figure: overview of the SurfMiner method. The set of reference users goes through the clustering of users, producing user groups; the characterisation of these user groups yields rules that conclude either on a navigation pattern or on a personal description.)
The previous table and figure sum up the SurfMiner method. The initialization step builds, from the entire set of reference users, different subgroups of users described by one element (which can be a navigation pattern or a personal user characteristic); this is done by applying the clustering phase to the data coming from the entire set of reference users. The main step then aims, on the one hand, at characterizing each discovered group of users by navigation patterns or user characteristics. This is done during the characterization phase for each discovered group of users; this phase builds the different rich navigation patterns. On the other hand, the main step applies the clustering phase to each discovered group of users in order to create new subgroups of users. These subgroups are described by several elements (navigation patterns or personal user characteristics). These new subgroups are placed again in the loop to create new rich navigation patterns. At the end of the first cycle, the SurfMiner method has extracted navigation patterns enriched with a user characteristic (NP→UD) and navigation patterns specific to a user community (UD→NP). This first cycle can also find some associations between two navigation patterns (NP→NP) and between two user characteristics (UD→UD). The second cycle reveals more complex and still richer patterns such as (NP+UD→NP) or (NP+UD→UD), and so on. In the next section, we describe the application of our method to rich data that contain users' descriptions and session sets. Then we present some evaluations done on the knowledge discovered with the method.
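As an aside on the relevance criterion used in the Characterization phase, the sketch below shows one standard way to compute a two-proportion z-score in Python. The pooled-variance formula and the 1.96 threshold (a 5% two-sided risk) are common statistical defaults, not values taken from the paper.

import math

def two_proportion_z(k_group, n_group, k_all, n_all):
    # z-score comparing the proportion observed in a user group (k_group/n_group)
    # with the proportion observed in the whole population (k_all/n_all)
    p1, p2 = k_group / n_group, k_all / n_all
    p = (k_group + k_all) / (n_group + n_all)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_group + 1 / n_all))
    return (p1 - p2) / se

def is_relevant(k_group, n_group, k_all, n_all, critical=1.96):
    return abs(two_proportion_z(k_group, n_group, k_all, n_all)) >= critical

# e.g. 40 of 50 users in a group follow a pattern, against 300 of 1000 users overall
print(is_relevant(40, 50, 300, 1000))   # True: the group differs markedly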
4 Results and Evaluation
An experiment and an evaluation of the SurfMiner method are being carried out on Internet usage data. These data come from the Internet traffic logs of a group of a thousand people extracted from a NetValue panel in 2000. They are used and enriched in the framework of a partnership between France Télécom R&D and NetValue about Internet usage [2]. 35 attributes describe the users: (1) some characteristics are directly provided by the users, such as age, professional activity, town size or the date of their first connection to the Internet; (2) other characteristics are deduced from their Internet use, such as the total number of Internet sessions during the year 2000, the number of search engine requests made in 2000, or their type according to the Internet communication services they used. The navigation data were preprocessed in order to add to each HTTP request a session id and the user id. Each HTTP request is composed of a date, the url requested, etc. A session is a sequence of urls in chronological order ([2] gives a complete description of the tools and methods used to perform these treatments). We constructed 5 sets of data that correspond to the sets of sessions done in 2000 on 5 different web sites ("anpe.fr", a job center site; "boursorama.com", a site about the Stock Market; "liberation.fr", a daily news site; "mp3.com", a site about music; and "voila.fr", a general French portal), together with a set of user descriptions corresponding to those sessions. We randomly decomposed the data into two distinct sets:
- A training set used to learn the rich navigation patterns. This set contains the descriptions and navigation tracks of around 80% of the user panel, randomly selected.
- A test set that allows us to evaluate the learned rich navigation patterns. This set contains the descriptions and navigation tracks of the remaining 20% of the user panel.
We apply the SurfMiner method on the training data. Here are some examples of the rich navigation patterns obtained. The first example shows an "NP→UD" rule with a given confidence. The rule means that users who consult the index page to look for a job offer and also the page that sorts job offers by region should be, with a confidence of 87%, women.
set{www.anpe.fr/recherch/index.htm, www.anpe.fr/regions/accueil.htm}
⇒ {GENDER=woman} with confidence=0.87
Intuitively, the first rule can be interpreted as meaning that job searches using the geographical criterion are most often performed by women. The job site might therefore automatically adapt its response based on the hypothesized gender of the user.
set{genres.mp3.com/blues}
⇒ {NB_Children=0} with confidence=0.89
set{genres.mp3.com/blues}
⇒ {AGE=[44,73]} with confidence=0.89
These two examples mean that a large majority of the users who frequently visit the area dedicated to blues music do not have children and are between 44 and 73 years old. The fourth example is a "UD→NP" rule. It means that users aged between 35 and 49 follow, in 55% of the cases, the given page sequence. This sequence corresponds to the consultation of the welcome page of the ANPE site (a job center site), described by the first two urls, and then the access to an index page of job offers.
{AGE BRACKET=35-49 years}
⇒ sequence{www.anpe.fr -> www.anpe.fr/accht.htm -> www.anpe.fr/offremp/index.htm} with confidence=0.55
The last example shows a "UD+NP→UD" rule. It says that men who visit the index area of hip-hop/rap music on the mp3.com site live, in 75% of the cases, in a household composed of 4 or 5 individuals.
page(genres.mp3.com/music/hip_hop_rap/) AND {GENDER=man}
⇒ {SIZE_HOUSEHOLD=4 or 5} with confidence=0.75
The interpretation of rich navigation patterns is sometimes difficult: the log data was recorded in the year 2000, and we can no longer replay the same navigations to check the page contents (the sites have since changed), which limits how deeply the rules can be interpreted.
order to check the page content (the sites changed their structure and their contents) and propose a deep interpretation. The SurfMiner method has two main parameters: the minimum support and the minimum confidence of a rule. The table 1 shows the number of rich navigation patterns generated for each site when the minimum support is fixed to 5% and the minimum confidence to 50%. Table 4. Number of generated rules according to a minimum support of 5% and a confidence minimum of 50%
As noticed in [9], creating navigation patterns at the url level is not always a good idea. It might be too precise: similar access patterns can emerge at a higher level, so some site usage patterns remain invisible at the url level. Following this remark, we first tried to generalize (in a simple manner) the accesses on the site, based on the syntax of an http url: http://<host>:<port>/<path>, optionally followed by a ?<searchpart> or a #<anchor> part. Secondly, we applied a stronger generalisation of the urls that considers only http://<host>:<port>/<path>. We applied SurfMiner on this transformed data in order to observe the advantages of those generalizations. The modified navigation data allowed us to obtain more rules, and the two generalizations let us observe navigations from different points of view, giving a more complete picture of the site usage patterns. A remark must be made on the more complex rules. Table 4 describes the number of complex rules obtained for the sites mp3.com and liberation.fr. A few navigation patterns that could not be inferred on the entire population of users are present among the complex rules. In other words, some navigation patterns appeared thanks to the construction of groups of users built around several characteristics. This is quite encouraging, even if the number of such specific navigation patterns is not high. We present in this subsection initial evaluations done on NP→UD rules. To evaluate the prediction quality of NP→UD rules, we observe the percentage of good predictions on the test data. The test consists in using the set of NP→UD rules on each session (and each point of the session) to predict the profile of the user responsible for the session.
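Returning to the url generalisation described above, here is a minimal illustrative Python sketch using the standard urllib.parse module. Dropping the ?searchpart and #anchor, and then optionally keeping only the first path segments, is one plausible reading of the two generalisation levels, not the authors' exact rules.

from urllib.parse import urlsplit, urlunsplit

def generalize_url(url, keep_path_segments=None):
    # Drop the optional ?searchpart and #anchor; optionally keep only the first
    # path segments (this second, stronger level is an illustrative assumption).
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    if keep_path_segments is not None:
        segments = [s for s in path.split("/") if s]
        path = "/" + "/".join(segments[:keep_path_segments])
    return urlunsplit((scheme, netloc, path, "", ""))

print(generalize_url("http://www.anpe.fr/offremp/index.htm?region=22#top"))
# -> http://www.anpe.fr/offremp/index.htm
print(generalize_url("http://www.anpe.fr/offremp/index.htm?region=22", keep_path_segments=1))
# -> http://www.anpe.fr/offremp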
A rule is activated when a navigation pattern (the condition of the rule) is observed in the considered part of the session. If the user description (the conclusion of the rule) belongs to the user profile, then the prediction is correct; otherwise, the prediction is considered incorrect. We compare this result with a "frequency-based" prediction. The "frequency-based" prediction reflects the most frequent value (in the whole population of reference users) of each characteristic. When a SurfMiner rule is activated and predicts the value of a user characteristic, a "frequency-based" prediction is generated for the same characteristic. Fig. 1 presents the percentage of correct predictions with NP→UD rules compared to the percentage of correct "frequency-based" predictions. We can note that the rules generated by SurfMiner give better predictions than the "frequency-based" ones. The knowledge provided by the NP→UD rules can be very useful in a personalization application. These rules associate a user characteristic with a followed sequence (or set) of pages, so, based on the current navigation of a user, we can infer a part of his profile. Thanks to the inferred profile, the personalisation system can then adapt, for instance, the web page to this user.
Fig. 1. Percentage of correct predictions of NP→UD rules compared with the percentage of correct frequency-based predictions, for the sites voilà, MP3, Libération, anpe and boursorama.
The primary results presented here are encouraging, but it is necessary to put them in perspective. Indeed, some NP→UD rules are useless. For instance, we obtained a rule saying that users who consult a particular sequence of pages on the ANPE site (a job center site) should be, with a confidence of 97.5%, between 21 and 55 years old! (The global population's age bracket is 9 to 63, and is concentrated between 21 and 55.) It is difficult to produce a bad prediction with this rule, and this kind of rule distorts our results. We are improving the SurfMiner method in order to filter out these obvious rules. We have many plans to further explore the possible correlation between users' personal characteristics and their site navigation, and we are in the process of modifying our protocol to perform a complete evaluation of rich navigation patterns.
5 Conclusion and Perspective
This paper suggests a new way to obtain knowledge about the users of a web site. Our approach seeks to extract navigation patterns associated with users' characteristics. It consists in introducing user descriptions (such as demographic characteristics), in addition to web navigations (sessions), into the process of navigation pattern discovery. The validity of our approach relies on the assumption that there are correlations between site usage and users' personal data. We think that it is interesting to obtain a precise picture of the usage of a given site, especially for sites with a heterogeneous audience. Through rich site usage patterns, we can better understand site users and the way they use a given site. We tested the approach on 5 rich sets of data, each containing user descriptions and navigations on a given web site. This drew our attention to a few points. Firstly, some improvements have to be made to the SurfMiner method. Secondly, some preprocessing has to be performed on each set of data (corresponding to one web site) in order to filter out users that do not visit the site regularly and would therefore distort our results. The first evaluations have been performed to assess the prediction quality of the navigation patterns enriched with users' descriptions. We are now performing a global evaluation of all rich navigation patterns. The SurfMiner method needs a rich set of data, and obtaining or building these data can represent a major difficulty; we have been working on practical ways to acquire this kind of data for any site. If our approach shows good results, the project will have direct applications to real-time personalization systems, in particular to tools dedicated to navigation assistance, service or page recommendation. From an application perspective, SurfMiner may be applied, for any particular site, to a panel of users for which the site can collect personal information and site navigations. Thanks to these reference users, SurfMiner learns rich navigation patterns dedicated to the site. Then, when a new unknown user visits the site, some personal information about him will be inferred based on his navigation. Let us insist here that the page recommendation module using our "rich navigation patterns" will be applicable to this new user even if he/she has not given any personal information. Our study opens perspectives on personalization techniques that do not require any personal information from users.
References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
2. Beaudouin, V., Assadi, H., Beauvisage, T., Lelong, B., Licoppe, C., Ziemalicki, C., Arbues, L., Lendrevie, J.: Parcours sur Internet: analyse des traces d'usage. Rapport RP/FTR&D/7495 (2002), France Telecom R&D, NetValue, HEC.
3. Borges, J., Levene, M.: Mining Association Rules in Hypertext Databases. In Proceedings of the Conference on Knowledge Discovery and Data Mining, 1998.
4. Borges, J., Levene, M.: Data Mining of User Navigation Patterns. In Proceedings of the Workshop on Web Usage Analysis and User Profiling, pages 31–36, August 15, 1999, San Diego, CA.
5. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1): 5–32, 1999.
6. Cooley, R., Tan, P., Srivastava, J.: WebSIFT: The Web Site Information Filter System. In Proceedings of the Web Usage Analysis and User Profiling Workshop, August 1999.
7. Cybermétrie: La mesure collective des sites de l'Internet en France. Source: Médiamétrie. http://www.mediametrie.fr/web/produits/cybermetrie.html
8. Demiriz, A., Zaki, M.: webSPADE: A Parallel Sequence Mining Algorithm to Analyze the Web Log Data. Submitted to KDD'02.
9. Fu, Y., Sandhu, K., Shih, M.: Clustering of Web users based on access patterns. In Proceedings of the 1999 KDD Workshop on Web Mining, San Diego, 1999.
10. Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M.: FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining. In Proceedings of the International Conference on KDD, Boston, August 2000.
11. Hay, B., Wets, G., Vanhoof, K.: Clustering navigation patterns on a website using a Sequence Alignment Method. In Proceedings of IJCAI's Workshop on Intelligent Techniques for Web Personalization, Seattle, Washington, 4–6 August 2001.
12. Media Metrix (comScore). http://www.comscore.com/products/mmetrix/mediametrix.htm
13. NetValue. http://www.netvalue.fr/
14. Nielsen//NetRating. http://www.nielsen-netratings.com/
15. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proceedings of ICDE'01, Germany, April 2001.
16. WebTrends. http://www.webtrends.com/
17. Zaki, M.: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, vol. 42, no. 1, pp. 31–60, Jan/Feb 2001.
Mining Frequent Itemsets with Category-Based Constraints

Tien Dung Do, Siu Cheung Hui, and Alvis Fong

Nanyang Technological University, School of Computer Engineering, Singapore
{PA0001852A,asschui,ascmfong}@ntu.edu.sg
Abstract. The discovery of frequent itemsets is a fundamental task of association rule mining. The challenge is the computational complexity of the itemset search space. One of the solutions for this is to use constraints to focus on some specific itemsets. In this paper, we propose a specific type of constraint, called category-based, as well as an associated algorithm for constrained rule mining based on Apriori. The Category-based Apriori algorithm reduces the computational complexity of the mining process by bypassing most of the subsets of the final itemsets. An experiment has been conducted to show the efficiency of the proposed technique.
1 Introduction
The main task of association rule mining is to discover frequent itemsets, which is generally very costly. Given a set I of m items, the number of distinct subsets is 2^m. For an average number of items of a typical market basket, say up to 100, one has 2^100 ≈ 10^30 subsets. Clearly, visiting all possible subsets to find the frequent ones is impractical. To reduce the combinatorial search space, all association rule mining algorithms are based on the property that "any subset of a frequent itemset is frequent". It means that an itemset can be frequent only if all its subsets are frequent; such an itemset is then called a candidate. Most algorithms [1,2] use this property in a "generate and test" framework, in which only candidates are generated and then counted for support. This approach has proved to be efficient in many cases. However, there are still computational problems related to the large number of frequent itemsets and the size of the frequent itemsets themselves [3]. In practice, users are often interested only in a subset of rules in which the items satisfy a given constraint. For example, a user of a market basket database may just want to find rules involving the items coffee and tea. The association rule mining then deals only with transactions containing both coffee and tea, which are much fewer than the whole set of transactions; the mining process, therefore, is simpler. Constraints were first incorporated into association rule mining in [4], where a constraint was considered as a logical expression. In [5,6,7], constraints were classified and algorithms for each constraint class were given. Ng et al. [5] introduced two classes of constraints, namely anti-monotone
and succinct. A larger class of constraints called convertible was discussed in Pei et al. [6]. Similar to the property on frequent itemsets, the anti-monotone constraint can be implied from an itemset to its subsets, so that it can be used to limit the number of candidates. The convertible constraint is a property on an order of items. The succinct constraint allows us to remove items that fail succinctness before mining commences; it gives the forms of the itemsets. For example, itemsets may be defined as a combination of items from the categories Grain and Dairy in a market basket (see Table 1), which could be {rice, milk}, {rice, yogurt} or {rice, milk, yogurt}.

Table 1. An example of items and categories.

Category   Drink          Grain   Dairy
Item       coffee, tea    rice    milk, yogurt
In this paper, we propose a specific type of constraint called category-based constraints. The constraints stipulate the category pattern of itemsets. This is similar to the succinct constraint, except that there is only one item from each category for the category-based constraint. For example, in Table 1, the constraint [Grain, Dairy] specifies that the itemsets have two items belonging to the categories Grain and Dairy, such as {rice, milk} or {rice, yogurt}. This kind of constraint can be used with any dataset in which the data is categorized. For example, in a tabular database where data is stored according to attributes, a set of attributes can be used to formulate a constraint. Category-based constraints can also be applied to speed up the mining process for succinct constraints. In addition, we propose an algorithm for mining frequent itemsets with category-based constraints based on the Apriori [1] algorithm. The proposed Category-based Apriori algorithm is fundamentally different from the previous algorithms in that, instead of using sets of items, it is based on lists of items. The valuable property of lists is that the number of sub-elements is much smaller than that of sets. For example, a set of 20 items has more than one million subsets, while a list of the same size has only 210 sub-lists¹. For a clearer comparison, let us consider the set {1, 2, 3, 4}. The 2-item subsets are {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4} and {3, 4}. In contrast, the list [1, 2, 3, 4] has only three sub-lists of two items: [1, 2], [2, 3] and [3, 4].
¹ The number of subsets of a set of 20 items (except the empty one) is 2^20 − 1 ≈ 10^6. There are C(20,1) sub-lists of one item and C(20,2) sub-lists with more than one item, each of which corresponds to a pair of distinct first and last items. Therefore, the number of sub-lists of a list of 20 items is C(20,1) + C(20,2) = 210.
The rest of the paper is organized as follows. In the next section, we give the related work on the Apriori algorithm and constraint mining. Section 3 then introduces category-based constraints and discusses the Category-based Apriori algorithm for frequent itemset mining. Section 4 describes the experiment that evaluates the proposed algorithm. In Section 5, we give a discussion on other properties of the proposed algorithm. Section 6 concludes the paper.
2 Related Work
This section gives an overview of related work: the Apriori algorithm and the use of constraints in frequent itemset mining.

2.1 Frequent Itemset Mining
Let I = {i1, ..., im} be a set of m distinct literals called items. A transaction is a set of items in I, and a transaction database consists of a set of transactions. An itemset S ⊆ I is a subset of the set of items; a k-itemset is an itemset of size k. An itemset S is contained in a transaction T if and only if S ⊆ T. Given a database D, the support sup(S) of an itemset S is the number of transactions in D containing S. Given a support threshold ξ (1 ≤ ξ ≤ |D|), an itemset S is frequent provided that sup(S) ≥ ξ. The task of frequent itemset mining is to find all frequent itemsets with a given support threshold. The property of frequent itemsets that "any subset of a frequent itemset is frequent" can be stated formally as follows [3]:

Lemma 1. For any itemset B ⊆ I with sup(B) ≥ ξ and any other itemset A ⊆ I with A ⊆ B, we have sup(A) ≥ ξ.

Figure 1 gives the Apriori algorithm. It counts the supports of candidate itemsets using a breadth-first search. In the first pass, it counts the supports of all 1-itemsets (C1 = I) and gathers the frequent 1-itemsets L1. At pass k (k ≥ 2), the candidates Ck, which are generated from the frequent itemsets collected in the last pass (Lk−1), are counted. The property of frequent itemsets in Lemma 1 guarantees that these candidates cover all frequent k-itemsets. In other words, the approach, while narrowing down the counting space, does not miss any frequent itemsets.

2.2 Frequent Itemset Mining with Constraints
A constraint C on items is a Boolean expression over the powerset of the set of items: C: 2^I → {true, false}. An itemset S satisfies a constraint C if and only if C(S) = true. Mining frequent itemsets with item constraint C becomes finding all itemsets S with sup(S) ≥ ξ and C(S) = true. A simple approach for constrained rule mining is to find all the frequent itemsets first and then discard those which do not satisfy the constraints. However, the computational cost remains the same as for the traditional mining problem.
Overall process
    // pass 1
    C1 = I;
    L1 = {s | s ∈ C1 & sup(s) ≥ ξ};
    k = 2;
    // pass k
    while (Lk−1 ≠ ∅) {
        Gen(k);
        Lk = {s | s ∈ Ck & sup(s) ≥ ξ};
        k = k + 1;
    }
    Answer = ∪k Lk

Gen(k)
    insert into Ck
        select p.item1, p.item2, ..., p.itemk−1, q.itemk−1
        from Lk−1 p, Lk−1 q
        where p.item1 = q.item1, ..., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1;
    forall itemset c ∈ Ck do
        forall (k−1)-subset s of c do
            if !(s ∈ Ck−1) then delete c from Ck;

Fig. 1. Apriori algorithm.
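As a concrete counterpart to the pseudocode in Fig. 1, here is a minimal, illustrative Python sketch of the same generate-and-test scheme; the function and variable names are ours, not the paper's, and support is counted by one scan of the transactions per pass.

from itertools import combinations

def apriori(transactions, min_support):
    # transactions: list of sets of items; returns {frequent itemset (frozenset): support}
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}        # pass 1 candidates: all 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Count the supports of the current candidates with one database scan.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Gen(k+1): join frequent k-itemsets, then prune candidates that have an
        # infrequent k-subset (Lemma 1).
        candidates = set()
        for a, b in combinations(current, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in current
                                       for s in combinations(c, k)):
                candidates.add(c)
        level = candidates
        k += 1
    return frequent

baskets = [{"coffee", "milk"}, {"coffee", "milk", "rice"}, {"tea", "rice"}]
print(apriori(baskets, 2))   # with threshold 2, {coffee, milk} is the largest frequent itemset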
A more efficient approach is to use the constraints in every pass of the mining process to prune unsatisfying itemsets. This considerably reduces the candidates for each scan pass. However, it can lead to a wrong result. Consider an example of finding rules containing the items coffee and tea. Obviously, there is no 1-itemset containing both coffee and tea. If the constraint is applied at the first pass, then L1 will be empty and the process terminates. This is because the constraint is not consistent with the Apriori condition. A constraint C can be applied in every pass of the mining process if it has a property similar to that of frequent itemsets given in Lemma 1. Such a constraint was introduced in [5] as anti-monotone [3]:

C(B) & A ⊆ B ⇒ C(A)    (1)

In [5], an interesting constraint called succinct was also defined. A succinct constraint can be expressed under an MGF (Member Generating Function) in the form {X1 ∪ ... ∪ Xn | Xi ⊆ σpi(Item), 1 ≤ i ≤ n, and ∃k ≤ n : Xj ≠ ∅, 1 ≤ j ≤ k} for some n ≥ 1 and some selection predicates p1, ..., pn. The MGF form can be decomposed into Cα = X1 ∪ ... ∪ Xk and Cβ = Xk+1 ∪ ... ∪ Xn, in which Xi ≠ ∅ for i = 1, ..., k. In [5], the first k passes are conducted to find the frequent itemsets Lk satisfying Cα. After that, Ci and Li (i = k+1, k+2, ...) are computed as normal, with a modification in candidate generation. Notice that succinct is not an anti-monotone constraint; thus the modification in the Gen procedure is
needed to avoid missing frequent itemsets in subsequent passes as shown in the example of finding rules containing coffee and tea above.
3 Category-Based Constraints
Items are categorized with the set of categories {c1, ..., cn}. Each item belongs to one or more categories. An example is given in Table 1. The categorization shows, for example, that the items coffee and tea belong to the category Drink. We represent a category as the set of items which belong to it: c = {i | i ∈ I and i belongs to c}. Then, an item i belongs to the category c if and only if i ∈ c. For the categorization shown in Table 1, we have the categories Drink = {coffee, tea}, Grain = {rice} and Dairy = {milk, yogurt}.

Definition 1. A category-based constraint C is represented as a list of categories C = [cr1, ..., crs], with 1 ≤ rj ≤ n for j = 1, ..., s. An itemset S satisfies the constraint C if there is an order of the items in S, [item1, ..., items], such that itemj ∈ crj for j = 1, ..., s.

For the example given in Table 1, the itemsets which satisfy the constraint [Drink, Grain] would be {coffee, rice} and {tea, rice}. The constraint [Dairy, Dairy] has only one satisfying set, {milk, yogurt}. Category-based constraints are normally observed in tabular databases. For example, in a table of human resource information with attributes {Name, Education, Occupation, Sex}, one may be interested in relations between some attributes, for instance Education and Occupation, or Sex, Occupation and Education. The corresponding category-based constraints can be expressed as [Education, Occupation] or [Education, Occupation, Sex]. In another situation where data is classified, one may need to find out the relations between certain classes. For example, in a mining task over textual documents of news about the Middle East, one may extract terms and classify them in groups such as Country {US, Iran, Iraq}, Organization {UN, OPEC} and Conflict {bombing, threaten, protest}. News on conflicts happening in certain countries may be found under the constraint [Country, Country, Conflict] or [Country, Organization, Conflict].

3.1 Properties of Lists in Category-Based Constraints
We define the sub-list relation, the length and set values, and the plain and frequent properties of a list. Note that the list [le, ..., lf] with e ≤ f denotes a list of elements with continuous indices from e to f (e.g. [l3, ..., l6] = [l3, l4, l5, l6]).
– Sub-list (⊆): L1 ⊆ L2 with L2 = [l1, ..., lk] if and only if ∃e, f such that 1 ≤ e ≤ f ≤ k and L1 = [le, ..., lf].
– The length of a list is the number of elements of the list. For example, let L = [l1, ..., lk]; the length of L is k, and L is called a k-list.
– A plain list is a list with no repetition of elements. If L = [l1, ..., lk] is a plain list, then ∀e, f : 1 ≤ e, f ≤ k & e ≠ f ⇒ le ≠ lf.
– The set of a list is the set of all elements of the list: if L = [l1, ..., lk], then the set of L is {l1 ∪ ... ∪ lk}. If L is a plain list, then the set of L is {l1, ..., lk} (the set of a plain k-list is a k-set).
– Let L be a list of items. L is frequent if and only if the set of it is a frequent itemset.
We define the belong-to and sub-belong-to relations on lists of items and categories as follows. Note that A and B are lists of items and M and N are lists of categories.

Definition 2. A belong-to M if and only if length(A) = length(M) = k and the j-th item of A belongs to the j-th category of M for j = 1, ..., k (i.e. A = [i1, ..., ik], M = [c1, ..., ck] and ij ∈ cj ∀j = 1, ..., k).

Definition 3. A sub-belong-to N if and only if ∃M such that A belong-to M and M ⊆ N.

For example, [tea, rice] belong-to [Drink, Grain]. Besides, [tea, rice] sub-belong-to [Drink, Grain, Dairy] because it belong-to [Drink, Grain] and [Drink, Grain] is a sub-list of [Drink, Grain, Dairy]. We can, from the definitions above, also imply that if A sub-belong-to N, and A and N have the same length, then A belong-to N.

Lemma 2. Let C be a list of categories. Sub-belong-to C is an anti-monotone constraint on item-lists with the sub-list relation: (B sub-belong-to C) and A ⊆ B ⇒ (A sub-belong-to C).

Proof. Let B = [i1, ..., ik]. A ⊆ B ⇒ ∃e, f such that 1 ≤ e ≤ f ≤ k and A = [ie, ..., if]. B sub-belong-to C ⇒ ∃N such that B belong-to N and N ⊆ C. From the definition of belong-to we have N = [c1, ..., ck] and ij ∈ cj ∀j = 1, ..., k. Let M = [ce, ..., cf]; we can imply that A belong-to M and M ⊆ N. M ⊆ N and N ⊆ C ⇒ M ⊆ C. Combining A belong-to M with M ⊆ C, we have A sub-belong-to C.
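The sketch below is an illustrative Python rendering of Definitions 2 and 3, with item-lists as Python lists and categories as sets of items; the function names are ours, not the paper's.

def belongs_to(item_list, category_list):
    # Definition 2: the j-th item must belong to the j-th category.
    return (len(item_list) == len(category_list)
            and all(item in cat for item, cat in zip(item_list, category_list)))

def sub_belongs_to(item_list, constraint):
    # Definition 3: item_list belongs to some contiguous sub-list of the constraint.
    k, s = len(item_list), len(constraint)
    return any(belongs_to(item_list, constraint[i:i + k]) for i in range(s - k + 1))

Drink, Grain, Dairy = {"coffee", "tea"}, {"rice"}, {"milk", "yogurt"}
print(belongs_to(["tea", "rice"], [Drink, Grain]))              # True
print(sub_belongs_to(["tea", "rice"], [Drink, Grain, Dairy]))   # True
print(sub_belongs_to(["tea", "milk"], [Drink, Grain, Dairy]))   # False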
3.2 Category-Based Apriori Algorithm
We can observe from Definition 1 and Definition 2 that an itemset S satisfies the constraint C = [cr1, ..., crs] if and only if there is a plain list L such that S is the set of L and L belong-to C. L belong-to C can be expressed as L sub-belong-to C and length(L) = |C|. Now, the problem of mining frequent itemsets with a category-based constraint C can be stated as finding all lists L satisfying the following conditions:
– L is frequent and plain.    (2)
– L sub-belong-to C.          (3)
– length(L) = |C|.            (4)
According to Lemma 2, condition (3) is an anti-monotone constraint. Therefore, it can be exploited in the Apriori algorithm. Condition (4) can simply be
obtained by getting only the result of pass s (Ls with s = |C|) instead of the combination of results of all passes (∪k Lk). The plain property of L is guaranteed by the Gen procedure. In the Gen procedure of the Apriori algorithm, a k-itemset is generated as a candidate if all of its k subsets of size (k − 1) are frequent. When lists of items are used instead of itemsets, a k-list L = [e1, ..., ek] has only two sub-lists of (k − 1) items (L1 = [e1, ..., ek−1] and L2 = [e2, ..., ek], called the immediate sub-lists of L), so a k-list is generated as a candidate if its two immediate sub-lists are frequent. Let Rki = [cri, ..., cr(k+i−1)] be the sub-list of length k starting at the i-th category of the constraint C = [cr1, ..., crs]. The two immediate sub-lists of Rki have length (k − 1) and start at the i-th and (i + 1)-th categories of C; therefore, they are R(k−1)i and R(k−1)(i+1). According to Definition 3, any k-list candidate, as it sub-belong-to C, belong-to a sub-list of length k of C. Let A = [e1, ..., ek] be a candidate in pass k such that A belong-to Rki. A is generated in the Gen function from its two immediate sub-lists A1 = [e1, ..., ek−1] and A2 = [e2, ..., ek]. We can imply from the definition of sub-list that A1 and A2 belong-to the two sub-lists of Rki accordingly (A1 belong-to R(k−1)i and A2 belong-to R(k−1)(i+1)). With k ≤ s, the constraint C has t = s − k + 1 sub-lists of size k: Rk1 = [cr1, ..., crk], ..., Rkt = [crt, ..., crs]. In pass k, the candidates are stored in t distinct variables Ck1, ..., Ckt such that the candidates in Cki belong-to the sub-list Rki of C, with i = 1, ..., t. The frequent k-item-lists are stored accordingly in Lk1, ..., Lkt. We can imply that the candidates in Cki, which belong-to Rki, are generated from L(k−1)i and L(k−1)(i+1), in which the item-lists belong-to R(k−1)i and R(k−1)(i+1). An example of candidate generation and frequent itemset counting is given in Figure 2 with s = 3. The main process is also modified to handle multiple lists of candidates and frequent sets in a pass. Figure 3 shows the Category-based Apriori algorithm (or AprioriCB).
Fig. 2. Candidate generation and counting with s = 3, C = [Cr1, Cr2, Cr3]:
C11 = Cr1 → (counting) L11 → (Gen) C21 → (counting) L21 → (Gen) C31 → (counting) L31 (result)
C12 = Cr2 → (counting) L12 → (Gen) C22 → (counting) L22
C13 = Cr3 → (counting) L13
4 Experiment
We have conducted an experiment on a 1.4 GHz Pentium PC with 400 MB of memory running Windows 2000 to measure the efficiency of the proposed algorithm.
Category-based Apriori with C = [cr1, ..., crs]

// pass 1
for (i = 1 to s) C1i = cri;
for (i = 1 to s) L1i = {l | l ∈ C1i && sup(l) ≥ ξ};
k = 2;
// pass k
while ((L(k−1)i ≠ ∅ ∀i = 1, ..., s−(k−1)+1) && (k ≤ s)) {
    Gen(k);
    for (i = 1 to s−k+1) Lki = {l | l ∈ Cki && sup(l) ≥ ξ};
    k = k + 1;
}
Answer = Ls1;

Gen(k)
for (i = 1 to s−k+1)
    insert into Cki
        select p.item1, p.item2, ..., p.itemk−1, q.itemk−1
        from L(k−1)i p, L(k−1)(i+1) q
        where p.item2 = q.item1, ..., p.itemk−1 = q.itemk−2 and p.item1 ≠ q.itemk−1;

Fig. 3. Category-based Apriori Algorithm.
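For concreteness, here is a small illustrative Python sketch of the algorithm in Fig. 3. It keeps a dictionary of candidate lists per starting position i and, purely as an implementation shortcut not used in the paper, computes supports by intersecting per-item transaction-id sets instead of rescanning the database.

def apriori_cb(transactions, constraint, min_support):
    # transactions: list of sets of items; constraint: list of categories (sets of items).
    # Returns the frequent plain item-lists (tuples) that belong-to the whole constraint.
    s = len(constraint)
    tids = {}                                   # transaction-id set of every item
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)

    # Pass 1: for each position i, the frequent 1-lists drawn from category cr_i.
    levels = {i: {(item,): tids.get(item, set())
                  for item in constraint[i]
                  if len(tids.get(item, set())) >= min_support}
              for i in range(s)}

    for k in range(2, s + 1):
        next_levels = {}
        for i in range(s - k + 1):
            cand = {}
            for p, p_tids in levels[i].items():          # lists belonging to R(k-1)i
                for q, q_tids in levels[i + 1].items():  # lists belonging to R(k-1)(i+1)
                    if p[1:] == q[:-1] and p[0] != q[-1]:   # join + plainness check
                        ids = p_tids & q_tids
                        if len(ids) >= min_support:
                            cand[p + (q[-1],)] = ids
            next_levels[i] = cand
        levels = next_levels
        if not any(levels.values()):
            break
    return list(levels.get(0, {}))

Drink, Grain, Dairy = {"coffee", "tea"}, {"rice"}, {"milk", "yogurt"}
baskets = [{"coffee", "rice", "milk"}, {"tea", "rice", "milk"}, {"coffee", "rice", "yogurt"}]
print(apriori_cb(baskets, [Drink, Grain, Dairy], 1))
# e.g. [('coffee', 'rice', 'milk'), ('coffee', 'rice', 'yogurt'), ('tea', 'rice', 'milk')]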
from the UCI Machine Learning Repository (URL: http://www.ics.uci.edu/mlearn/MLRepository.html). The dataset consists of 5,000 transactions of employee information with 14 attributes (6 continuous and 8 nominal). The category-based constraint C includes all 8 nominal attributes [Work-class, Education, Marital-status, Occupation, Relationship, Race, Sex, Native-country] with 102 distinct values (items). As continuous attributes need to be converted into nominal ones before the mining process, we have decided to use only the nominal attributes and discard the six continuous ones in this experiment. All the testing algorithms are written in C++. With category-based constraint C = [cr1, ..., crs], the constrained association rule mining algorithm CAP discussed in [5] can be applied to mine frequent itemsets. C was reduced optimally into a "weaker" constraint Cγ stipulating that an itemset S = {e1, ..., et} satisfies the constraint if (i) all items of S belong to IC = cr1 ∪ ... ∪ crs and (ii) t ≤ s and there is a permutation [cu1, ..., cus] of C such that ei ∈ cui with i = 1, ..., t. For example in Table 1, if the constraint is C = [Drink, Grain], then IC = Drink ∪ Grain = {coffee, tea, rice}. The 2-itemset candidates satisfying (i) would be itemsets that contain only items within IC, namely {coffee, tea}, {coffee, rice} and {tea, rice}. Condition (ii) discards {coffee, tea} because it requires the two items to follow the pattern [Drink, Grain] or [Grain, Drink].
Permutation refers to a certain order of items in a list. For example, if C = [Drink, Drink, Grain], then there are three permutations of C: [Drink, Drink, Grain], [Drink, Grain, Drink] and [Grain, Drink, Drink].
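Purely as an illustration of the weakened constraint Cγ described above, here is a small Python check of conditions (i) and (ii); the brute-force permutation matching and the function name are our own sketch, not the CAP implementation of [5].

```python
from itertools import permutations

def satisfies_c_gamma(itemset, constraint):
    """Check Cγ: (i) every item of S lies in IC = cr1 ∪ ... ∪ crs, and
    (ii) |S| <= s and the items of S can be assigned to distinct categories,
    i.e. S follows some permutation of the category pattern C."""
    items = list(itemset)
    ic = set().union(*constraint)
    if not set(items) <= ic or len(items) > len(constraint):
        return False                                   # fails (i) or |S| > s
    return any(all(e in cat for e, cat in zip(items, perm))
               for perm in permutations(constraint, len(items)))

# With Drink = {'coffee', 'tea'} and Grain = {'rice'} as in Table 1:
# satisfies_c_gamma({'coffee', 'tea'}, [Drink, Grain]) is False,
# satisfies_c_gamma({'coffee', 'rice'}, [Drink, Grain]) is True.
```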
In the tabular database, condition (ii) is a consequence of (i) because, in each transaction, items belong to distinct categories (attributes). In [5], condition (i) is enforced by replacing the 1-itemset candidate set C1 by C1C = {e | e ∈ C1 & e ∈ IC} in the Apriori algorithm. We call this implementation of CAP with constraint Cγ CAPCB. The category-based Apriori algorithm (AprioriCB) was given in Figure 3. Figure 4 shows the time taken by the two algorithms as the support threshold is decreased from 100 down to 5 (i.e. from 2% to 0.1% of the number of transactions). The figure shows that the running time of AprioriCB is much smaller than that of CAPCB and grows roughly linearly as the support threshold decreases, whereas the running time of CAPCB increases drastically at small thresholds.
(Figure 4 plots running time in seconds, from 0 to 250, against the support threshold values 100, 50, 20, 10 and 5 for the two algorithms.)
Fig. 4. Running time for CAPCB and AprioriCB.
Table 2 gives the number of candidates for each pass with a support threshold value of 10. The table shows that the number of candidates is reduced substantially in AprioriCB, by a factor of 4 to 10 at each pass (except the first and last one). This was explained earlier by the small proportion of the number of sub-elements of lists to that of sets. In this case (8-itemsets), the proportion is (C(8,1) + C(8,2))/(2^8 − 1) = 36/255 ≈ 1/7. This rate grows exponentially with the increasing size of itemsets. For example, with 20-itemsets, as we have calculated in Section 1, the proportion would be about 210/10^6 ≈ 1/5000.

Table 2. Number of candidates for each pass with support threshold = 10.

Pass        1     2     3     4     5     6     7    8
CAPCB      102  2485  5072  6368  5178  2545   659   72
AprioriCB  102   446  1396  1020   520   306   176   75
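The quoted proportions can be checked directly; a short Python calculation of our own, for illustration:

```python
from math import comb

# 8 categories: the proportion quoted in the text, (C(8,1) + C(8,2)) / (2^8 - 1)
print((comb(8, 1) + comb(8, 2)) / (2 ** 8 - 1))     # 36/255, about 1/7

# 20 categories: 210 sub-elements against roughly a million subsets
print((comb(20, 1) + comb(20, 2)) / (2 ** 20 - 1))  # about 1/5000
```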
5 Discussion
By using category patterns, AprioriCB reduces considerably the number of itemsets processed in every pass. The counting and candidate generation procedures, the two major time-consuming processes whose computational complexity is directly proportional to the number of itemsets, therefore become much faster. This advantage is especially significant when mining a large number of frequent itemsets and frequent itemsets with many items. The AprioriCB algorithm can also be partly applied to mine succinct constraints. In the MGF form Cα = X1 ∪ ... ∪ Xk (Xi ≠ ∅ with i = 1, ..., k), an itemset S following this form contains at least one item in each Xi with i = 1, ..., k. It means S has at least one subset of size k satisfying the category-based constraint [X1, ..., Xk]. The mining algorithm with category-based constraints can thus be used to efficiently find the frequent itemsets Lk satisfying the constraint. After that, the rest can be done following the algorithm in [5]. In a "typical" category-based constraint, itemsets have to contain exactly one item in each category of the category pattern. This strict condition enables AprioriCB to search for a frequent itemset while eliminating most of its subsets (checking only 210 out of about one million subsets of a 20-itemset, as mentioned earlier). This condition is satisfied in some cases, such as succinct constraints or categorized data, when the user knows exactly the "pattern" of the itemsets he or she is looking for. Generally, one cannot find itemsets consisting of items whose categories are not known. Moreover, every category of a constraint is mandatory or, in other words, it has to be present in every resulting set. For example, with the constraint [Drink, Grain, Dairy], one would not get the set {coffee, milk} because the category Grain is not present in the set. To make category-based constraints more flexible, the proposed approach should be extended. With some modifications, category-based constraints can be used to mine items of unidentified categories. In addition, the modifications also allow a category of a constraint to be optional (both extensions are sketched at the end of this section).
– The constraint can be expanded to find relations of some given categories with an "unknown" one by adding a category called Any, the set including every item or, in other words, Any = I. For example, one may want to find relations between the categories Drink and Grain and another category. The constraint can then be expressed as [Drink, Grain, Any].
– A category in a category-based constraint may be made optional by adding a "null" item to the category, with the convention that the item null is contained in every transaction. In this case, with the constraint [Drink, Grain, Dairy], by adding the null item to the category Grain, one may get the itemset {coffee, null, milk} as a result. Thus, {coffee, milk} is an itemset under the pattern [Drink, Grain, Dairy]. It means that the category Grain is optional: it may or may not be present in the resulting sets.
Notice that mining itemsets under category patterns with one or more items belonging to a category is straightforward once the category-based itemsets are
found. The process is the same as using category-based constraint mining for the succinct constraints discussed above. In addition, the definition of category-based constraints and the implementation of the algorithm are flexible with respect to the relations between categories, items and constraints: they allow an item to belong to several categories and a category to appear more than once in a constraint.
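As a sketch of how the two extensions above could be wired into the same algorithm (our illustration only; the literal 'null' item and the function name are assumptions):

```python
def extend_constraint(constraint, all_items, optional=(), add_any=False):
    """Preprocess a category-based constraint [cr1, ..., crs]:
    categories whose index is in `optional` receive a 'null' item (which must
    also be appended to every transaction before counting), and add_any=True
    appends the category Any = I to search for an unidentified category."""
    extended = [set(cat) | ({'null'} if i in optional else set())
                for i, cat in enumerate(constraint)]
    if add_any:
        extended.append(set(all_items))
    return extended

# e.g. extend_constraint([Drink, Grain, Dairy], I, optional={1}) makes Grain
# optional, so {coffee, null, milk} - i.e. {coffee, milk} - can be found.
```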
6 Conclusion
This paper has made two contributions. First, the concepts of category-based constraints have been defined formally. This type of constraint can be used when items are categorized into categories; it can also be used with tabular databases or succinct constraints, where a category-based form is available. The second contribution is the proposed Category-based Apriori algorithm for mining frequent itemsets with category-based constraints. It is based on lists of items instead of sets, as in other association rule mining algorithms. The method is fast because it reduces the search space considerably.
References
[1] Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. of the 20th Int'l Conf. on Very Large Databases (VLDB '94), Santiago, Chile, pp. 487–499, June 1994.
[2] Sarasere, A., Omiecinsky, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: 21st Int'l Conf. on Very Large Databases (VLDB), Zürich, Switzerland, pp. 432–444, Sept. 1995.
[3] Hegland, M.: Algorithms for Association Rules. In: Proc. of Advanced Lectures on Machine Learning, LNAI 2600, pp. 226–234, 2003.
[4] Srikant, R., Vu, Q., Agrawal, R.: Mining Association Rules with Item Constraints. In: Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, pp. 67–73, Aug. 1997.
[5] Ng, R., Lakshmanan, L.V.S., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained association rules. In: Proc. of SIGMOD, pp. 13–24, 1998.
[6] Pei, J., Han, J., Lakshmanan, L.V.S.: Mining frequent itemsets with convertible constraints. In: Proc. of ICDE, pp. 433–442, 2001.
[7] Lakshmanan, L.V.S., Ng, R., Han, J., Pang, A.: Optimization of constrained frequent set queries with 2-variable constraints. In: Proc. of SIGMOD, pp. 157–168, 1999.
Modelling Soil Radon Concentration for Earthquake Prediction

Sašo Džeroski, Ljupčo Todorovski, Boris Zmazek, Janja Vaupotič and Ivan Kobal

Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
{saso.dzeroski, ljupco.todorovski, boris.zmazek, janja.vaupotic, ivan.kobal}@ijs.si
Abstract. We use regression/model trees to build predictive models for radon concentration in soil gas on the basis of environmental data, i.e., barometric pressure, soil temperature, air temperature and rainfall. We build model trees (one per station) for three stations in the Krško basin, Slovenia. The trees predict radon concentration with a (cross-validated) correlation of 0.8, provided radon is influenced only by environmental parameters (and not seismic activity). In periods with seismic activity, however, this correlation is much lower. The increase in prediction error appears a week before earthquakes with local magnitude 0.8 to 3.3.
1 Introduction Radon in groundwater was monitored for the first time in Uzbekistan after the Tashkent earthquake (Ulomov and Mavashev, 1971) and it became known that seismogenic processes influence the behaviour of underground fluids (Scholz et al., 1973; Mjachkin et al., 1975). Since then, temporal variations of radon concentration in soil gas and in groundwater have been studied and related to seismic activity in many countries (Teng, 1980; Sultankhodjaev, 1984; Wakita, 1979; Wakita et al., 1988; Singh et al., 1993; Singh and Virk, 1994; Igarashi et al., 1995). The first radon measurements in Slovenia aimed at predicting earthquakes were made in 1982. In four thermal waters, radon concentrations were determined weekly, and Cl−, SO4^2−, hardness and pH monthly (Zmazek et al., 2000a). This frequency of analyses, however, was not high enough to follow seismic activity properly. So, in 1998, we extended our study from thermal waters (Zmazek et al., 2000c, 2002b, 2002c) to soil gas (Zmazek et al., 2000b, 2002a) and increased the sampling frequency up to once an hour. In both cases, meteorological parameters such as barometric pressure, air temperature, soil temperature and rainfall were also taken into account. As a general practice (Yasuoka and Shinogi, 1997; Singh et al., 1999; Virk et al., 2001), anomalies in radon concentration observed in our previous study (Zmazek et al., 2002a) were identified and an attempt was made to relate them to seismic activity. Some of these anomalies with respect to barometric pressure and air temperature can be seen in Figure 1. For small earthquakes, it is often impossible to identify an anomaly as resulting solely from seismic activity and not from meteorological or hydrological parameters. Therefore, the application of methods for data analysis appears to
be essential in relating radon concentrations to seismic activities. In practice, advanced statistical methods are most often used (Di Bello et al., 1998; Cuomo et al., 2000; Biagi et al., 2001; Belayev, 2001). More recently, neural networks have also been used to analyse data on radon concentrations and environmental parameters (Negarestani et al., 2001). In this paper, we apply data mining methods to relate radon concentrations and environmental parameters. Data mining methods have been successfully applied to many problems in environmental sciences; for an overview see (Džeroski, 2002). Data from our previous study (Zmazek et al., 2002a) are used. We attempt to separate the effects of environmental parameters and seismic activity on radon concentration in soil gas. We use regression/model trees to build predictive models for radon concentration from environmental data. We then test the hypothesis that these models perform much worse (in terms of accuracy) during seismically active periods as compared to seismically inactive periods, which paves the way for earthquake prediction. The remainder of the paper is organized as follows. Section 2 describes the data used. Section 3 describes the data mining methodology employed. The results are presented and discussed in Section 4. Section 5 concludes and discusses directions for further work.
2 The Data The data were collected at six stations in the Krško basin. This basin is located in eastern Slovenia, close to the border with Croatia. The only nuclear power plant in Slovenia is located in this basin. Stations 1 and 4 are located in the active fault zone of the Orlica fault, at a distance of about 4000 m from each other, while the other stations are at distances from 150 to 2500 m on either side of the fault zone. At each station, a borehole was made, in which radon concentration in soil gas, barometric pressure and temperature have been measured and recorded once an hour since April 1999 (Zmazek et al., 2002a). For that purpose, barasol probes have been used. The barasol probes at stations 1 and 2 are fixed at depths of about 5 m, while at the other boreholes they are located at a depth of 60-90 cm. The borehole wall is protected with a plastic tube and the top is isolated and covered with a plastic cap and soil to reduce hydro-meteorological effects on the measurements. For this paper, the data collected from station 1, in the period April 1999 - February 2002, and from stations 5 and 6, in the period June 2000 – February 2002, were taken into account. Meteorological data (including data on rainfall) were provided by the Office of Meteorology of the Environmental Agency of the Republic of Slovenia. Seismic data were provided by the Office of Seismology of the same Agency. Only earthquakes potentially responsible for strain effects in the investigated area (Dobrovolsky et al., 1979) have been considered. During the period of our measurements at station 1, 21 earthquakes occurred with local magnitude 0.8 to 3.3 and at an epicentre distance of 1 to 31 km. In the shorter period of measurements at stations 5 and 6, only six earthquakes occurred, with local magnitude 1.9 to 3.0 and at an epicentre distance 1 to 29 km.
Fig. 1. Radon concentration (24 hour averages) recorded at stations 1, 5 and 6 for the period from June 2000 to January 2002. Local seismicity, barometric pressure, soil temperature, air temperature and rainfall are shown for the same period.
All of the data used in our analyses are displayed in Figure 1. The top 3 graphs depict the radon concentration in soil gas and soil temperature for Stations 1, 5, and 6, respectively. Barometric pressure (continuous line) and earthquake magnitude (spikes) are depicted next. Finally, the bottom graph depicts air temperature (continuous line) and rainfall (spikes).
3 Data Mining Methodology Our ultimate goal in studying radon concentration and relating it to environmental parameters and seismic activity is to be able to predict earthquakes. As a step in this direction, we apply data mining methods to relate radon concentrations and environmental parameters. We attempt to separate the effects of environmental parameters and seismic activity on radon concentration in soil gas. We use regression/model trees to build predictive models for radon concentration from environmental (meteorological) data. We then test the hypothesis that these models perform much worse (in terms of accuracy) during seismically active (SA) periods as compared to seismically inactive (SI) periods, which paves the road for earthquake prediction. In the remainder of this section, we first describe how the data were prepared for analysis and what the data mining task addressed was. We then describe the data mining methods used and finally state the exact setup for the data mining experiments performed. 3.1 Data Preparation and Data Mining Task The experimental data for the period from June 2000 to February 2002 (described in Section 2 and depicted in Figure 1) was used. The data comes from three stations. Since the three stations differ from a geological viewpoint, we treat the data from each station separately and have three different datasets (with the same attributes). The data from the boreholes comes at an hourly rate, but the meteorological data is only collected once a day. We thus first aggregate the hourly collected data to daily averages and consider the daily radon concentration average, average daily barometric pressure, average daily air temperature, average daily soil temperature, and daily amount of rainfall. We add the difference between daily soil and daily air temperature, as well as the gradient (difference between the current and previous day) of daily barometric pressure, as additional attributes. The daily radon concentration is the dependent variable (class), while the remaining variables are considered as independent variables (attributes). To test the hypothesis about the predictability of radon concentration in periods with and without seismic activities, the following procedure was applied. Each dataset was split into two parts. In the first part (labelled SA), data for the periods with seismic activity were included, i.e., periods of seven days before and after an earthquake. Data for the remaining days were included in the second part, belonging to the seismically inactive periods (labelled SI). To select an appropriate data mining method, we cross-validate the candidate methods on all of the data (SA+SI). We then build predictive models with the se-
lected method on the entire SI dataset (for each of the three stations) and evaluate these on the seismically active (SA) data. We evaluate our hypothesis by comparing cross-validated performance on the SI data to the performance on the SA data. 3.2 Data Mining Methods Since radon concentration is a numeric variable, we have approached the task of predicting radon concentration from meteorological data using regression (or function approximation) methods. We used regression trees (Breiman et al., 1984), as implemented with the WEKA data mining suite (Witten and Frank, 1999). For comparison, we also took into consideration two other regression methods, more traditional statistic method of linear regression (LR) and instance based regression (IB) (Aha and Kibler, 1991), also implemented within the WEKA data mining suite. Below we give a brief introduction to regression trees. Regression trees are a representation for piece-wise constants or piece-wise linear functions. Like classical regression equations, they predict the value of a dependent variable (called class) from the values of a set of independent variables (called attributes). Data presented in the form of a table can be used to learn or automatically construct a regression tree. In that table, each row (example) has the form (x1, x2,..., xN, y), where xi are values of the N attributes (e.g., air temperature, barometric pressure, etc.) and y is the value of the class (e.g., radon concentration in soil gas). Unlike classical regression approaches, which find a single equation for a given set of data, regression trees partition the space of examples into axis-parallel rectangles and fit a model to each of these partitions. A regression tree has a test in each inner node that tests the value of a certain attribute and, in each leaf a model for predicting the class. The model can be a linear equation or just a constant. Trees having linear equations in the leaves are also called model trees (MT). Given a new example for which the value of the class should be predicted, the tree is interpreted from the root. In each inner node, the prescribed test is performed and, according to the result of the test, the corresponding left or right sub-tree is selected. When the selected node is a leaf then the value of the class for the new example is predicted according to the model in the leaf. Tree construction proceeds recursively, starting with the entire set of training examples (entire table). At each step, the most discriminating attribute is selected as the root of the subtree and the current training set is split into subsets according to the values of the selected attribute. Technically speaking, the most discriminating discrete attribute or continuous attribute test is the one that most reduces the variance of the values of the class variable. For discrete attributes, a branch of the tree is typically created for each possible value of the attribute. For continuous attributes, a threshold is selected and two branches are created, based on that threshold. The attributes that appear in the training set are considered as thresholds. For the subsets of training examples in each branch, the tree construction algorithm is called recursively. Tree construction stops when the variance of the class values of all examples in a node is small enough (or if some other stopping criterion is satisfied). These nodes are called leaves and are labelled with a model (constant or linear equation) for predicting the class value.
An important mechanism used to prevent trees from over-fitting data is tree pruning. Pruning can be employed during tree construction (pre-pruning) or after the tree has been constructed (post-pruning). Typically, a minimum number of examples in branches can be prescribed (for pre-pruning) and a confidence level for the estimates of predictive error in leaves (for post-pruning). A number of systems exist for inducing regression trees from examples, such as CART (Breiman et al., 1984) and M5 (Quinlan, 1992). M5 is one of the best known programs for regression tree induction. We used the system M5’ (Wang and Witten, 1997), a reimplementation of M5 within the WEKA data mining suite (Witten and Frank, 1999). 3.3 Experimental Setup For building predictive models, we used tools from the WEKA data mining suite (Witten and Frank, 1999): regression/model tree building with M5’, linear regression and instance-based learning. The settings for the individual methods were as described below. In M5’, both regression trees (which predict a constant value in each leaf node) and model trees (which use linear regression for prediction in each leaf node) were built. The other parameters of M5’ were set to their default values. For linear regression, four different options were possible. These are no selection of input/independent variables and three different methods to select and/or eliminate predictive variables (as available in WEKA). Finally, a single instance based regression method was used with six different settings for the number of nearest neighbours parameter: 1, 5, 10, 25, 50 and 99. The predictive performance of the regression methods was assessed using two different measures. Firstly, the correlation coefficient (r) expresses the level of correlation between the measured and predicted values of radon concentration. Higher values of the correlation coefficient denote a better correlation. Secondly, the root mean squared error (RMSE) measures the discrepancy between measured and predicted values of radon concentration. Smaller RMSE values indicate lower discrepancies. In order to estimate the performance of predictors on measurements that were not used for training the predictor, a standard 10-fold cross validation method was applied. To select among the different data mining methods, we evaluate their predictive performance on the entire datasets for each station (including both SI and SA data) by cross-validation. The results are shown in Table 1. Model trees outperform the other regression methods considered, both on average and for each individual station. They perform better in terms of the correlation coefficient and of the root mean squared error. Therefore, for the further evaluations performed to test the hypothesis, we have used model trees to predict the radon concentration.
Table 1. Correlation coefficient (R) and root mean squared error (RMSE), estimated by cross-validation, of different regression methods for predicting radon concentration in soil gas at three stations in the Krško basin. (MT - model trees; RT - regression trees; LR - linear regression, four different methods; IB - instance based regression, six different settings for the number of nearest neighbours parameter).

                    R                               RMSE
Method    station 1  station 5  station 6   station 1  station 5  station 6
MT        0.80       0.81       0.76        20063      12769      3508
RT        0.78       0.68       0.73        21100      16208      3651
LR 1      0.72       0.40       0.66        22960      20048      4019
LR 2      0.72       0.38       0.66        23008      20195      4019
LR 3      0.72       0.40       0.66        22960      20048      4019
LR 4      0.72       0.38       0.66        23008      20195      4019
IB 1      0.53       0.34       0.55        38100      22332      4903
IB 5      0.58       0.52       0.60        30693      18732      4342
IB 10     0.57       0.50       0.56        29832      19076      4489
IB 25     0.58       0.43       0.59        29298      19885      4441
IB 50     0.61       0.44       0.62        28473      20157      4453
4 Results and Discussion Having selected model trees as the data mining method of choice, we induce model trees on the entire SI datasets (one for each station). We then estimate the performance of these trees on unseen SI data (by 10-fold cross validation). We also measure their performance on the SA data in order to evaluate the practicability of predicting radon concentration in the SA periods. If our hypothesis is true (i.e. radon concentration becomes unpredictable during seismic activity periods), the first performance figures should be higher than the second. Table 2 summarises the performance (correlation coefficients and RMSE) of the model trees. The results clearly confirm our hypothesis: the correlation in the SA periods is much lower than in the SI periods, for all three stations. The drop in the correlation coefficient ranged from 17% to 73%. The RMSE in the SA periods is higher than in the SI periods. The RMSE increase ranges from 13% to 72%. The confirmation of the hypothesis allows us to predict seismic activity in the following manner. A model tree is built that predicts the concentration of radon in soil gas on the basis of data measured during the SI periods. We then follow the discrepancy between the measured values of radon concentration and the values predicted by the model tree. If the discrepancy is low, no seismic activity is anticipated; if it starts to increase, however, an increase in seismic activity may be expected. From our results, station 6 appears to be the best location for earthquake prediction. However, only six earthquakes occurred in the vicinity of this location during the period of study (June 2000 - February 2002). Before one of these earthquakes, a very large difference between the measured and the predicted radon concentrations was observed (Figure 4). During our measurements at location 1 (April 1999 - February 2002), 21 earthquakes occurred, but anomalies occurring before some of the earthquakes at larger
distances from the station are not very marked. From a geological point of view, station 6 is the only station on Triassic limestone; the other two are on Miocene limestone. On the other hand, at station 1, the hole is much deeper (5 m) than the holes at stations 5 and 6 (60-90 cm). This could be the reason that certain environmental parameters, such as barometric pressure, have more influence on radon concentration. For this reason, during some of the seismic activities, much smaller anomalies in measured radon concentration have been observed for station 1. Table 2. Comparison of predictability of radon concentration based on the assumptions that radon changes appear 7 days before an earthquake. The performance on SI periods is estimated by 10-fold cross-validation. The performance in SA periods is measured directly.
           SI periods          SA periods          Performance change %
Station    r       RMSE        r       RMSE        r         RMSE
1          0.83    18719       0.69    23536       -17.3     25.7
5          0.81    13243       0.54    14910       -33.7     12.6
6          0.80     3076       0.22     5299       -72.9     72.3
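The prediction-discrepancy monitoring described above can be sketched as follows; this is our illustration only, and the 30% threshold and 3-day persistence are merely indicative values suggested by the anomaly statistics discussed in the next paragraph, not the authors' procedure.

```python
def radon_alarm(measured, predicted, rel_threshold=0.30, min_days=3):
    """Return the indices of days on which the relative discrepancy between
    measured and model-predicted radon concentration has exceeded
    `rel_threshold` for at least `min_days` consecutive days."""
    flagged, run = [], 0
    for day, (m, p) in enumerate(zip(measured, predicted)):
        run = run + 1 if abs(m - p) / m > rel_threshold else 0
        if run >= min_days:
            flagged.append(day)
    return flagged
```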
Our results of applying model trees to relate radon data to seismic activity are encouraging. The induced trees predict anomalies before and/or during all earthquakes. The radon concentration predicted before an earthquake is mostly lower than the measured radon concentration (the predicted value is higher than the measured on only five occasions). The drop in the correlation coefficient between the two appears on average 10.2 ± 4.8 days before an earthquake. The duration of an anomaly is 13 ± 7 days. The time of appearance of an anomaly (before an earthquake), as well as its duration and magnitude, grow with the magnitude of the earthquake and become smaller as the distance to the earthquake increases. The average difference between the measured and the predicted radon concentrations during these anomalies is 54 ± 35 % of the measured values. On 14 occasions, differences between the measured and the predicted radon concentrations can be observed in the SI periods, but these differences of radon concentration, on average 30%, last for less than 3 days on average. The predicted radon concentrations are compared to the measured values for the three measuring stations in Figures 2 to 4. Figure 2 shows three measurement periods for station 1, with six earthquakes. In Figure 2(a), two days before the first earthquake, the measured radon concentration first increases, then it starts to decrease a day before the earthquake and returns to the predicted level three days after the earthquake. The situation is reversed for the second earthquake on May 20, 1999. Here, six days before the earthquake, the measured radon concentration first decreases, and then two days before the earthquake it starts to increase, to eventually match the predicted level. In Figure 2(b), the measured radon concentration increases before the earthquake. Then, at the time of the earthquake, it first suddenly decreases and then abruptly increases to match the predicted value eight days after the earthquake.
(Figure 2 plots predicted and measured radon concentration, in kBq m−3, at station 1 for the three periods (a), (b) and (c); the marked earthquakes have local magnitudes from ML = 0.8 to ML = 3.2 at epicentre distances of 3 to 11 km.)
Fig. 2. Predicted and measured radon concentration at station 1 for different periods: (a) from April 11 to May 30, 1999; (b) from July 19 to September 6, 1999; (c) from March 23 to May 11, 2000. Earthquakes with their magnitudes and distances between the station and the epicentres are also shown.
Two earthquakes in Figure 2(c) are also preceded by increases in the measured radon concentration. But here, the measured and predicted concentrations start to match before the seismic events.
Fig. 3. Predicted and measured radon concentration at Station 5: November 20, 2001 – January 8, 2002.
In Figure 3, the situation for station 5 is shown for the last third of the measurement period. No seismic activities are observed during this period. The predicted radon concentration closely matches the measured radon concentration. Observations at station 6 are presented in Figure 4. While seismic activity in Figure 4(a) and 4(c) is accompanied by an increase in the measured radon concentration, it is accompanied by a decrease in the measured radon concentration in Figure 4(b). In the above examples, a seismic event was always preceded by a run of measured radon concentrations that was not predicted well by the model trees. We were able to detect anomalies related to earthquakes as weak as ML = 0.8. A more extended analysis would be needed in order to explain why, at the same measuring station, radon concentration may either increase or decrease before or during an earthquake, or to relate the nature of an anomaly to the local magnitude or epicentre of an earthquake. This analysis is beyond the scope of this paper.
5 Conclusions and Further Work We use regression/model tree induction to predict radon concentration in soil gas from measured environmental data, i.e., barometric pressure, air temperature, soil temperature, rainfall, the difference between air and soil temperature, and daily changes of barometric pressure. During seismically inactive periods, when the variation of radon concentration with time is affected only by the environmental parameters and not by seismic activity, radon concentrations have been predicted with correlations over 0.8. If the prediction is significantly worse, an increased seismic ac-
Fig. 4. Predicted and measured radon concentration at station 6 for different periods: (a) from July 3 to August 21, 2000; (b) from April 27 to June 15, 2001; (c) from August 25 to October 13, 2001. Earthquakes with their magnitudes and distances between the station and the epicentres are also shown.
tivity can be expected. With this method, we can detect radon anomalies preceding earthquakes with local magnitude lower than 3.3 by roughly a week. Much work remains to be done. At the top of the to-do list is testing the induced model trees on new measurements at the same locations. Next on the list is the application of the same methodology to measurements collected at other locations. These include measurements of radon in soil gas, but also in thermal waters. A better understanding of the mechanisms that lead to the radon concentration anomalies and their relation to earthquakes is also needed. An evaluation of the ability of our method to predict earthquakes (in terms of probability of correct/false alarms) is necessary. Another possible direction for further work is to try to find relations between the magnitude and direction of an anomaly in the radon concentration (actual concentration is lower/higher than predicted), on one hand, and the magnitude and location (relative to the measurement station) of the corresponding earthquake, on the other. Finally, instead of model tree induction, one might consider other approaches to predicting radon concentrations, e.g., the discovery of polynomial equations (Džeroski and Todorovski, 1995).
References
Aha, D., Kibler, D., 1991. Instance based learning algorithms. Machine Learning 6, 37–66.
Belayev, A.A., 2001. Specific features of radon earthquake precursors. Geochem. Int. 12, 1245–1250.
Biagi, P.F., Ermini, A., Kingsley, S.P., Khatkevich, Y.M., Gordeev, E.I., 2001. Difficulties with interpreting changes in groundwater gas content as earthquake precursors in Kamchatka, Russia. J. Seismol. 5, 487–497.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.
Cuomo, V., Di Bello, G., Lapenna, V., Piscitelli, S., Telesca, L., Macchiato, M., Serio, C., 2000. Robust statistical methods to discriminate extreme events in geoelectrical precursory signals: implications with earthquake prediction. Nat. Hazard 21, 247–261.
Di Bello, G., Ragosta, M., Heinicke, J., Koch, U., Lapenna, V., Piscitelli, S., Macchiato, M., Martinelli, G., 1998. Time dynamics of background noise in geoelectrical and geochemical signals: An application in a seismic area of Southern Italy. Il Nuovo Cimento 6, 609–629.
Dobrovolsky, I.P., Zubkov, S.I., Miachkin, V.I., 1979. Estimation of the Size of Earthquake Preparation Zones. Pure Appl. Geophys. 117, 1025–1044.
Džeroski, S., 2002. Applications of KDD methods in environmental sciences. In: Kloesgen, W., Zytkow, J. (Eds.), Handbook of Data Mining and Knowledge Discovery. Oxford University Press.
Džeroski, S., Todorovski, L., 1995. Discovering dynamics: from inductive logic programming to machine discovery. Journal of Intelligent Information Systems 4, 89–108.
Gosar, A., 1998. Seismic reflection surveys of the Krško basin structure: implications for earthquake hazard at the Krško nuclear power plant, southeast Slovenia. J. Appl. Geophys. 39, 131–153.
Igarashi, G., Saeki, S., Takahata, N., Sumikawa, K., Tasaka, S., Sasaki, Y., Takahashi, M., Sano, Y., 1995. Ground-water radon anomaly before the Kobe earthquake in Japan. Science 269, 60–61.
Lapajne, J.K., Fajfar, P., 1997. Seismic hazard reassessment of an existing NPP in Slovenia. Nucl. Eng. Design 175, 215–226.
Mjachkin, V.I., Brace, W.E., Sobolev, G.A., Dieterich, J.H., 1975. Two models for earthquake forerunners. Pure Appl. Geophys. 113, 169–181.
Negarestani, A., Setayeshi, S., Ghannadi-Maragheh, M., Akashe, B., 2001. Layered neural networks based analysis of radon concentration and environmental parameters in earthquake prediction. J. Environ. Radioact. 62, 225–233.
Quinlan, J.R., 1992. Learning with continuous classes. In: Proceedings of the Fifth Australian Joint Conference on Artificial Intelligence. World Scientific, Singapore, 343–348.
Scholz, C.H., Sykes, L.R., Agrawal, Y.P., 1973. Earthquake prediction: A physical basis. Science 181, 803–810.
Singh, M., Ramola, R.C., Singh, B., Singh, S., Virk, H.S., 1993. Radon anomalies: correlation with seismic activities in Northern India. In: Proceedings of the Second Workshop on Radon Monitoring in Radioprotection, Environmental and/or Earth Sciences, Trieste, 25 November–6 December 1991. World Scientific Publishing, Singapore, 359–377.
Singh, M., Virk, H.S., 1994. Investigation of radon-222 in soil-gas as an earthquake precursor. Nucl. Geophys. 8, 185–193.
Singh, M., Kumar, M., Jain, R.K., Chatrath, R.P., 1999. Radon in ground water related to seismic events. Radiat. Measure. 30, 465–469.
Sultankhodajev, A.N., 1984. Hydrogeoseismic precursors to earthquakes. In: Earthquake Prediction, UNESCO, Paris, 181–191.
Teng, T.L., 1980. Some recent studies on groundwater radon content as an earthquake precursor. J. Geophys. Res. 85, 3089–3099.
Ulomov, V.I., Mavashev, B.Z., 1971. Forerunners of the Tashkent earthquake. Izv. Akad. Nauk Uzb. SSR, 188–200.
Virk, H.S., Walia, V., Kumar, N., 2001. Helium/radon precursory anomalies of Chamoli earthquake, Garhwal Himalaya, India. J. Geodyn. 31, 201–210.
Wakita, H., 1979. Earthquake Prediction by Geochemical Techniques. Rec. Progr. Nat. Sci. Jpn. 4, 67–75.
Wakita, H., Nakamura, Y., Sano, Y., 1988. Short-term and intermediate-term geochemical precursors. Pure Appl. Geophys. 126, 267–278.
Wang, Y., Witten, I.H., 1997. Induction of model trees for predicting continuous classes. In: Proceedings of the Poster Papers of the European Conference on Machine Learning. University of Economics, Faculty of Informatics and Statistics, Prague.
Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
Yasuoka, Y., Shinogi, M., 1997. Anomaly in atmospheric radon concentration: a possible precursor of the 1995 Kobe, Japan, earthquake. Health Phys. 72, 759–761.
Zmazek, B., Vaupotič, J., Živčić, M., Premru, U., Kobal, I., 2000a. Radon monitoring for earthquake prediction in Slovenia. Fizika B (Zagreb) 9, 111–118.
Zmazek, B., Vaupotič, J., Bidovec, M., Poljak, M., Živčić, M., Pineau, J.F., Kobal, I., 2000b. Radon monitoring in soil gas at tectonic faults in the Krško basin. In: Book of Abstracts: New Aspects of Radiation Measurements, Dosimetry and Spectrometry, 2nd Dresden Symposium on Radiation Protection, September 10–14, 2000.
Zmazek, B., Vaupotič, J., Živčić, M., Martinelli, G., Italiano, F., Kobal, I., 2000c. Radon, temperature, electrical conductivity and 3He/4He measurements in three thermal springs in Slovenia. In: Book of Abstracts: New Aspects of Radiation Measurements, Dosimetry and Spectrometry, 2nd Dresden Symposium on Radiation Protection, September 10–14, 2000.
Zmazek, B., Živčić, M., Vaupotič, J., Bidovec, M., Poljak, M., Kobal, I., 2002a. Soil radon monitoring in the Krško basin, Slovenia. Appl. Radiat. Isot. 56, 649–657.
Zmazek, B., Vaupotič, J., Kobal, I., 2002b. Radon, temperature and electric conductivity in Slovenian thermal waters as potential earthquake precursors. 1st Workshop on Natural Radionuclides in Hydrology and Hydrogeology, Centre Universitaire de Luxembourg.
Zmazek, B., Italiano, F., Živčić, M., Vaupotič, J., Kobal, I., Martinelli, G., 2002c. Geochemical monitoring of thermal waters in Slovenia: relationships to seismic activity. Appl. Radiat. Isot. 57, 919–930.
Dialectical Evidence Assembly for Discovery

Alistair Fletcher¹ and John Davis²

¹ CSIRO Petroleum Resources, PO Box 1130, Bentley WA 6102, Australia
[email protected]
² Department of Civil Engineering, University of Bristol, University Walk, Bristol, BS8 1TR, U.K.
[email protected]
Abstract. We propose and demonstrate a dialectical framework for the assessment and assembly of evidence in discovery processes. This paper addresses the stimulation and capture of dialectical argumentation in the context of complex (wicked) problems. Holonic representation is developed to structure processes hierarchically for the modeling of complex problems. Interval Probability Theory (IPT) is modified to produce an evidential reasoning calculus to represent dialectical argument through the separation of evidence for and evidence against a hypothesis. Support for and against any hypothesis can then be assembled through a weighting of the relevance, importance and degree of evidence. Uncertainty surrounding any hypothesis can be decomposed into randomness, vagueness, conflict, incompleteness and relevance and can be managed within the framework. The framework is illustrated with real examples of discovery from two energy related complex (wicked) problems.
1 Introduction We live in a complex, uncertain, dynamic, interdependent and evolving world. Numerous technical, financial, legal, political and social issues need to be addressed. The information available through electronic and other media is enormous and far beyond our capability of analysis and understanding. Extraction and discovery of useful and relevant knowledge is essential for equitable resolution of many complex problems. Indeed these problems require resolution rather than solution. Decision support requires the assessment and evaluation of knowledge within a meaningful context and is closely allied with discovery. Complex and intractable problems have been termed wicked by Conklin and Weil [1] and messy by Schon [2]. Wicked problems are characterized by an evolving set of interlocking issues (often political and social) and constraints. Linear approaches based on traditional science and their associated tools, which work well for tame problems (the opposite of wicked), often have very little to offer when confronted by wicked problems. We believe there are two distinct, often polarized, approaches to discovery involving wicked problems. On the one hand, a narrowly reductionist and quantitative modeling approach designed to address ‘hard’ issues, sometimes becoming a slave to numerical techniques and abstractions, resulting in ‘paralysis through analyses’. On
the other hand, a perception that some problems are too complex to analyze systematically and are best addressed experientially, sometimes resulting in fragmented and inconsistent approaches decided on ad hoc or purely personal bias. In other words, we are often trapped between approaches designed to address clearly defined, well structured, ‘technical’ problems sufficiently ‘well insulated’ from, or of limited relevance to the wider picture, and the complex, dynamic and messy real world. We believe there is a coherent middle way to discovery that combines quantitative and qualitative methods. This tackles the intellectual problems of data interpretation, analysis and synthesis involved in conceiving ideas for effective discovery, together with dialectical approaches to address the logic and context of discovery. Initial work on decision-making has been published. A software assisted methodology known as Juniper has been described [3], together with the underlying mathematics [4], [5], [6]. Applications of this methodology to a variety of engineering and scientific problems, including asset management [7], dam safety [8], earthquake vulnerability [9], carbon dioxide disposal [10], [11], hydroinformatics [12], [13], coastal defenses [14], [15], and oil exploration prospects [16] have been published. Our recent work has addressed discovery as a component of decision-support with the development of holonic representation and dialectical argument for wicked problems. In particular, in the context of hypothesis and evidence, we: • Apply the concept of holons as introduced by Koestler [17] to structure processes hierarchically for the modeling of wicked problems; • Develop Interval Probability Theory [5] as an uncertainty calculus capable of representing dialectical argumentation; • Develop the social context for the stakeholders to model their wicked problems; • Develop narrative for the interpretation and communication of wicked problems. This paper focuses on the stimulation and capture of dialectical argumentation as a primary route of discovery when addressing wicked problems.
2 Summary of the Juniper Approach to Decision-Making The development of the Juniper framework produced a descriptive rather than prescriptive framework [3] for the evaluation of hypotheses with evidence either supporting or negating the hypothesis. The descriptive framework allowed freedom to explore the issues under various uncertainties, without biasing the problem structure, or selecting potential solutions and techniques without due regard for their limitations and assumptions. The framework is ‘open world’, which allows issues of completeness, alternative hypotheses, and relevance to be addressed. Central to our approach to handling wicked problems was the management of uncertainty. We characterized uncertainty as composed of five key components, as shown in Table 1, building on the work of Smithson [18] and Krause [19].
Table 1. Categories of Uncertainty
Randomness: Lack of a specific pattern in the data.
Vagueness: Imprecision of definition.
Conflict: Equivocation, ambiguity, anomaly or inconsistency in the combination of data or evidence.
Incompleteness: That which we do not know, know we do not know, and do not know we do not know. Includes what is too complicated and/or what is too expensive to model.
Relevance: Issues and information that may or may not impact on the proposition being addressed.
We identify four types of model as summarized in Table 2. Detailed discussion of the implications, appropriate techniques and areas of application of the four models has been presented [11], [16]. Our key message, in line with researchers such as Casti [20], is that many messy or wicked problems are characterized by vagueness and incompleteness and not by any inherent randomness. Wicked problems are usually Type 3 or 4. Interval Probability Theory (IPT) has proven to be a relatively straightforward approach to evidential reasoning [5], which retains the desirable properties that have made evidence theory more attractive than conventional Bayesian approaches [3], [19], [21], [22]. In particular, the simplicity and representation of uncertainty have proven of value in communication of issues around wicked problems.
3 Dialectical Framework for Discovery In applying Juniper to real applications we realized the central role of debate and argument in discovery processes when wicked problems were addressed. We subsequently focused on development of holonic representation of processes and subprocesses together with dialectical argumentation as approaches to wicked problems. Applications within the area of decision-making [23], [24], have been of value. Applications in the area of discovery science are believed to be of equal importance as presented here.
3.1 Holonic Modeling of ‘Wicked’ Problems
Koestler [17] was the first to suggest the term holon to describe the idea that something can simultaneously be a whole and yet part of something larger. This idea has been extended so that the holons are processes. This gives richness to the description not available if the holons are treated as physical entities. We regard holons as processes and as parts of other holons. Simultaneously a holon is a whole and made up of sub-process holons. All holons have action and reaction, whilst some (social holons) have intentionality. Holons change through time and a description at a point in time is a ‘snapshot’ of the state of the process. A person is an example of a holon. A person is composed of various systems including a nervous system, a
circulatory system and a skeletal system. However, a person is also part of a family, company or university, and a nation state. A key feature of holonic modeling is the capture of emergence of properties as the hierarchy is ascended.

Table 2. Four Types of Model

Type 1. Parameters: Precisely Defined. Structure: Precisely Defined. Consequences: All the consequences of adopting a solution are known.
Type 2. Parameters: Distributions. Structure: Precisely Defined. Consequences: All consequences of adopting a solution have been precisely identified but only the probabilities of occurrence are known.
Type 3. Parameters: Relations and Ranges. Structure: Imprecise but substantially complete. Consequences: All the consequences of adopting a solution have been approximately identified so the possibilities of ill defined or fuzzy consequences are known.
Type 4. Parameters: Imprecise and Incomplete. Structure: Imprecise and Incomplete. Consequences: Where only some of the consequences (precise or fuzzy) of adopting a solution have been identified.
We define a system as a hierarchy of interacting process holons. At the top of a model there is one process holon. Each process holon consists of sub-process holons and sub-sub-process holons according to the level of precision of definition. A layer of holons is at a similar level of precision of definition. The holons interact at the same level to form a description of the whole system at that level. The layers above are more general, have greater scope and are less precisely defined. The layers below are more specific, have less scope and are more precisely defined. We can look for the processes that are necessary for the success of a higher level process along with those that are sufficient. Necessity is defined such that the failure of a sub-process to meet its objectives automatically leads to failure of the parent process. Sufficiency is defined such that the success of a sub-process meeting its objectives automatically guarantees success of the parent process. We also look for processes that are partially sufficient and partially necessary. We summarise the value of holonic modeling as: • Helping us describe complex systems simply; • Can be used for both ‘hard’ physical systems and ‘soft’ systems involving people, and to combine them; • Enables us to clarify relationships and accountability; • Useful in mapping paths of change from where we are now to where we want to get to; • Are a means of identifying added value as an emergent property in dialectical argument; • Are particularly useful in managing co-operative systems. Each of the processes has goals, objectives and / or criteria of success. We are interested in how much we can depend on the process to meet its objectives. If we
consider the success of the objective to be a hypothesis, we seek to assemble evidence to support the hypothesis. We use an interval probability approach to express the weight of this evidence and then propagate its effect through the hierarchy as described below.
3.2 Dialectics and Interval Probability Theory
Dialectic, the method of seeking knowledge by question and answer, was first practiced systematically by Zeno, the disciple of Parmenides [25]. However, it was Socrates who developed and extended the method and much of Plato’s philosophy makes use of dialectic. The dialectic method is not suitable for empirical science but large numbers of questions are probably best addressed by this approach. Those questions for which we already have enough knowledge to come to a meaningful conclusion, but have failed, through confusion of thought or lack of analysis to make the best logical use of what we know, are best addressed by dialectic. Wherever the debated topic is logical rather than factual argumentation has value – and the value lies in exposure of logical error (which often enables their perpetrators to hold the comfortable position on every subject in turn). The dialectic method tends to promote logical consistency. The German philosopher Hegel developed a particular form of dialectical argumentation. The Hegelian dialectic [25] is a triadic movement of thesis, anti-thesis and synthesis, where the synthesis then becomes the thesis as the cycle evolves. Three axioms are usually associated with dialectical process: • Axiom of transformation (where changes in quantity result in changes in quality); • Axiom of interaction between opposites (where opposing forces produce transformations of the system that includes both forces); • Axiom of the negation of the negation (where conflict between thesis and antithesis produces something different – the synthesis). Although much debate has raged over the implications and methods of the Hegelian system in particular – with claims including the system is either absurd, or trivial, or esoteric, or pointless – we believe there is substantial value in the approach. In particular, if we take the dialectical method as regulative or programmatic, and take note of internal and external conflicts and of the way in which conflicts are resolved by the adjustment of conflicting aspects of each other in new unities, important and valuable insights can be gained. This is the essence of problem resolution. Within our original framework, dialectic argument was captured in the attributes of the process [3]. We explicitly recorded reasons and justifications for beliefs. Now we employ the framework itself in a dialectical manner making use of the IPT uncertainty calculus [23], [24]. The algorithms have been developed to approximate and represent dialectical argumentation through the independent handling of evidence for and evidence against a hypothesis, and the subsequent combination of evidence. We explicitly employ the basic IPT property that evidence for a process meeting its objectives is separate from the evidence against it doing so. This differentiates IPT from the classical way of giving a single probability figure for success and assuming the evidence against it is 1 – (single figure). The graphical interpretation of this is shown in Figure 1.
Although some work has further developed IPT enabling judgments to be made through conditional probabilities [12], [15], we have developed a much simpler implementation of IPT. In wicked problems we have found the numerical outputs to be of limited value. Instead, we have linked the representational character of IPT to linguistic concepts of necessity, sufficiency and dependency as defined in holonic terms in Section 3.1. Through a process of dialectical argument issues relating to wicked problems are captured, debated and resolved. The role we assign to IPT is twofold: • The visual representation of evidence for, evidence against and either lack of evidence or conflict of evidence gives a powerful summary of the state of each sub-process; • The pooling of belief of various aspects of the complex problem propagated up the holonic structure. Evidence that A is successful
[Figure: a bar divided into 'Evidence that A is successful', 'Lack of evidence', and 'Evidence that A is not successful'.]
Sn(A) = Evidence that A is successful; 1 − Sp(A) = Evidence that A is not successful; Sp(A) − Sn(A) = Uncertainty in the evidence
Fig. 1. Interval probability representation of uncertainty
We use the concepts of sufficiency as a measure of the amount of influence a given sub-system has on the performance of its parent, and necessity as a measure of the extent to which failure of a sub-system causes failure of its parent:
• How firm is the evidence that the parent will succeed if the child succeeds – sufficiency;
• How firm is the evidence that the parent will fail if the child fails – necessity.
In this implementation, sufficiency acts only on the 'green' whilst necessity acts only on the 'red' of the representation given in Figure 1. The white represents the 'lack of evidence' or uncertainty. We allow the green and red bars to cross, resulting in an amber bar indicating the extent of the crossover. We do not normalise out this crossing of the evidence or conflict. Instead we examine the source of the conflict through dialectical argumentation.
3.2.1 Mathematical Approximations
Interval probability theory, as developed by Cui and Blockley [4], allows support for a hypothesis E to be separated from support for the negation of the hypothesis. It is this decoupling of thesis and antithesis and the specific appraisal of the antithesis which has proved so useful in examining real world problems.
p(E) ∈ [Sn(E), Sp(E)]    (1)
where Sn(E) is the lower bound, and Sp(E) is the upper bound of the probability p(E). The negation is
p(Ē) ∈ [1 − Sp(E), 1 − Sn(E)]    (2)
An interval probability can be interpreted as a measure of belief, so that Sn(E) represents the extent to which it is certainly believed that E is true or dependable, 1 − Sp(E) (= Sn(Ē)) represents the extent to which it is certainly believed that E is false or not dependable, and the value Sp(E) − Sn(E) represents the extent of uncertainty of belief in the truth or dependability of E. Three extreme cases illustrate the meaning of this interval measure of belief:
p(E) ∈ [0, 0] represents a belief that E is certainly false or not dependable, p(E) ∈ [1, 1] represents a belief that E is certainly true or dependable, and p(E) ∈ [0, 1] represents a belief that E is unknown. The degree of dependence between two propositions E1 and E2 is defined by the parameter ρ:
ρ = p(E1 ∩ E2) / min(p(E1), p(E2))    (3)
Thus ρ = 1 indicates that E1 ⊂ E2 or E2 ⊂ E1, whilst if E1 and E2 are independent then

ρ = max(p(E1), p(E2))    (4)
so that

p(E1 ∩ E2) = p(E1) · p(E2)    (5)
The minimum value of ρ is given by

ρ = max( (p(E1) + p(E2) − 1) / min(p(E1), p(E2)), 0 )    (6)

where ρ = 0 indicates that E1 and E2 are mutually exclusive.
If ρ is defined as an interval number [ρl, ρu] then
Sn(E1 ∩ E2) = ρl · min(Sn(E1), Sn(E2))    (7)

Sp(E1 ∩ E2) = ρu · min(Sp(E1), Sp(E2))    (8)

Sn(E1 ∪ E2) = Sn(E1) + Sn(E2) − ρl · min(Sn(E1), Sn(E2))    (9)

Sp(E1 ∪ E2) = Sp(E1) + Sp(E2) − ρu · min(Sp(E1), Sp(E2))    (10)
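For concreteness, the combination rules above can be coded directly. The following Python sketch is our own illustration, not the authors' Juniper implementation; the class name, the handling of ρ as an interval, and the clipping of the union bounds to 1 are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """Interval probability [Sn, Sp] for a proposition (0 <= sn <= sp <= 1)."""
    sn: float  # evidence for (lower bound)
    sp: float  # 1 - evidence against (upper bound)

    def negation(self) -> "Interval":
        # Equation (2): p(not E) lies in [1 - Sp(E), 1 - Sn(E)]
        return Interval(1.0 - self.sp, 1.0 - self.sn)

    def uncertainty(self) -> float:
        # Sp(E) - Sn(E): the 'white' part of the bar in Fig. 1
        return self.sp - self.sn

def combine_and(e1: Interval, e2: Interval, rho_l: float, rho_u: float) -> Interval:
    # Equations (7) and (8)
    return Interval(rho_l * min(e1.sn, e2.sn), rho_u * min(e1.sp, e2.sp))

def combine_or(e1: Interval, e2: Interval, rho_l: float, rho_u: float) -> Interval:
    # Equations (9) and (10); results are clipped to 1 as a safeguard
    sn = e1.sn + e2.sn - rho_l * min(e1.sn, e2.sn)
    sp = e1.sp + e2.sp - rho_u * min(e1.sp, e2.sp)
    return Interval(min(sn, 1.0), min(sp, 1.0))

# Example: two items of evidence with dependency interval [0.5, 0.5]
a = Interval(0.6, 0.9)
b = Interval(0.5, 0.8)
print(combine_and(a, b, rho_l=0.5, rho_u=0.5))  # Interval(sn=0.25, sp=0.4)
print(combine_or(a, b, rho_l=0.5, rho_u=0.5))   # Interval(sn=0.85, sp=1.0)
```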
The reasoning behind logical inference together with the approximations has been presented [3], [15], and is not reproduced here. The algorithms associated with the dialectical framework have been presented in the context of decision-making under uncertainty [16] and are summarized as follows. The theorem of total probability and its inverse:
p(H) = p(H|E) · p(E) + p(H|Ē) · p(Ē)    (11)

p(H̄) = p(H̄|E) · p(E) + p(H̄|Ē) · p(Ē)    (12)

together with the bounds introduced by Dubois and Prade [26], are employed, in conjunction with the expressions for Sn(E) and Sp(E) derived above, as the basis for the implementation. Through algebraic manipulation, as justified in [16], we obtain

Sn(H) = Sn(E) · (Sn(H|E) − Sn(H|Ē)) + Sn(H|Ē)    (13)

Sp(H) = 1 − Sn(H̄) = Sp(H|Ē) · (1 − Sp(E)) + Sp(H|E) · Sp(E)    (14)
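As an illustration of how (13) and (14) propagate evidence from a child process E to a parent hypothesis H, the sketch below is our own; the function and parameter names are assumptions, and the default arguments implement the conservative setting Sn(H|Ē) = Sp(H|Ē) = 0 discussed in the next paragraph.

```python
def propagate(sn_e: float, sp_e: float,
              sn_h_given_e: float, sp_h_given_e: float,
              sn_h_given_not_e: float = 0.0, sp_h_given_not_e: float = 0.0):
    """Propagate interval evidence on E up to H using equations (13) and (14)."""
    # Equation (13)
    sn_h = sn_e * (sn_h_given_e - sn_h_given_not_e) + sn_h_given_not_e
    # Equation (14)
    sp_h = sp_h_given_not_e * (1.0 - sp_e) + sp_h_given_e * sp_e
    return sn_h, sp_h

# Child evidence [0.6, 0.9], sufficiency interval [0.7, 0.8] for H given E
print(propagate(0.6, 0.9, 0.7, 0.8))   # -> (0.42, 0.72)
```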
We now have the 'sufficiency' p(H|E) balanced by the necessity p(H̄|Ē). The difficulty of judging Sn(H|Ē) and Sp(H|Ē) suggests setting the most conservative bounds on them, which is letting them equal zero. This simplifies the user input significantly while remaining conservative in the propagation of evidence. To further ease the burden in the case of multiple sub-processes the following approximations are introduced:
1. Rather than using interval values [ρl, ρu] of the dependency parameter ρ, a point value is used.
2. Rather than using pair-wise assignments of ρ, a single value is used to approximate the level of dependency across the whole body of evidence.
3. Rather than assigning the full 2^n conditional probability measures, the user inputs only two measures for each item of evidence, p(H|Ei) and p(H̄|Ēi) (i.e., 2n measures), which are termed the sufficiency and necessity of each item of evidence.
A key use of the implementation is the introduction of dialectic into the modeling process. All holons and sub-holons are explicitly addressed with regard to evidence for (as assessed through sufficiency) and evidence against (as assessed through necessity). Dependency here is used as a measure of commonality (or degree of overlap) of evidence at any given level of the hierarchy. At each stage of the modelling process the changes introduced are justified and the arguments recorded as part of the attributes, giving a dialectical case for and against the change.
3.3 Context, Attribute, and Value
One of the most important developments over the last two decades has been the recognition of the limits of formal methods in philosophy and science. The one-sided view that reason is exclusively identified with mathematical and logical rationality has been challenged by philosophers like Stephen Toulmin [27], [28] and others. One topic above all else captures the core difference between the rival views of Reason [28]. The analysis of rival arguments in terms of abstract concepts, and the insistence on explanations in terms of universal laws – with formal, general, timeless, context-free, and value-neutral arguments – is nowadays the business of Logic; the study of factual narratives about particular situations, in the form of substantive, timely, local, situation-dependent, and ethically loaded argumentation, is considered to be at best a matter for Rhetoric. Toulmin puts this both succinctly and dramatically: "before Galileo, Descartes and Hobbes, human adaptability and mathematical rigour were regarded as twin aspects of human reason. From the 1620s on, this balance was upset, as the prestige of mathematical proofs led philosophers to disown non-formal kinds of human argumentation". To place Toulmin's thesis in context: for the last four hundred years, the ideas of "reasonableness" and "rationality" – closely related in antiquity – have been separated, as an outcome of the emphasis that seventeenth-century natural philosophers placed on formal deductive techniques. This emphasis did a grave disservice to our commonsense ways of thought, and led to confusion about some highly important questions: above all, the relationship of the social sciences to the moral and other value-laden problems that arise in the practical professions (i.e., management and decision-making). We acknowledge Toulmin's philosophy explicitly in our approach by placing equal importance on formal logic, dialectic and rhetoric, believing that complex problems can only be adequately addressed by modeling the problem employing a full range of techniques.
4 Case Studies: Energy-Related Wicked Problems

Discovery through dialectics was crucial in the following case studies. In each case the holonic system is made up of pieces of evidence described as processes, for instance establishing the existence of a certain condition. The process has an objective: the process establishes that the condition exists. Its associated proposition is the ontological statement: this condition exists. The players can then assert the evidence for and against the process meeting its objective. The calculus enables us to ask the questions "how certain are we that the process will meet its objective?" or "how dependable is the process?"

4.1 Case Study 1: Is It Safe to Drill?
Drilling for oil is a source of many safety, technical and economic concerns. One particular problem is overpressure, where extra-high geological pressures in the sub-surface can result in serious technical and economic problems and, in extreme cases, serious safety implications. A case study employing the original Juniper technology has been published [29], addressing the technical issues encountered in assessing overpressure for oil exploration on the NW Shelf of Australia. In this paper we detail the key role dialectical argumentation played in the process of resolving the issues. In brief, the key problem was the different results obtained by the geologists and drilling engineers in assessing the potential for overpressure, as shown in Figure 2. High sufficiency is given to the assessment of overpressure by the geoscientists (S=0.8), as this is primarily their responsibility, but high necessity is given to both geoscientists and drilling engineers (N=0.8). In other words, drillers can highlight the possible failure of the overall process. Low dependency (0.2) reflects the largely independent reasoning routes followed by the geoscientists and drillers. The geologists, using local and regional geological evidence, constructed a model that indicated overpressure was unlikely. The drilling engineers, using information about nearby failures resulting from overpressure, concluded overpressure was very likely to be a problem. The result is conflict, as indicated by the amber bar appearing in the top process. By means of Juniper, a dialectical debate regarding the nature of the evidence for overpressure, along with explicit consideration of what constituted definite evidence against overpressure, was conducted. It transpired that the drillers were explicitly concerned about issues that constituted the evidence against the geological model (the antithesis). This legitimate concern arose from the training and experience of the drilling fraternity – an essentially pragmatic domain. In this case, the failure of a nearby well due to overpressure dictated caution, with planning taking account of established overpressure. The geologists, on the other hand, coming from a scientific background, tended to focus on the evidence for. Their geological model was constructed from the available data. The antithesis – in the form of any alternative hypothesis that could also be supported by the data – was not included in the conventional scientific modeling. This lack of drive for Popperian falsification is an interesting comment on current scientific practice in industry and possibly reflects time constraints and economic factors.
[Figure: holonic process diagrams. Fig. 2 shows 'Assessing overpressure' with sub-processes 'Assessing overpressure by geoscientists' and 'Assessing overpressure by drillers'; Fig. 3 shows 'Assessing resolved overpressure' with sub-processes 'Incorporating geographical information', 'Defining concept of risk', and 'Re-evaluating previous wells'.]
Fig. 2. Conflicting evidence of overpressure
Fig. 3. Dialectical resolution of overpressure
The problem was resolved by the creation of a new geological model that incorporated geological features (such as pressure-transmitting faults and carrier beds) that had been of concern to the drillers. These factors allow explicit evaluation of evidence against the geological model within the model framework. This new model incorporated the significant discovery of additional technical factors in a new hypothesis (synthesis) of the sub-surface. Figure 3 shows that the creation of this new understanding also required the development of a joint understanding of the nature of risk, together with the need to place the interpretation of this work in context with previous operations. In other words, careful consideration of meanings and definitions in light of this new synthesis was required, together with consideration of how previous experience was incorporated in the discovery process. The lesson learnt from this case study was the importance of establishing the antithesis before producing a synthesis when discussing a cross-disciplinary process. The facility within the Juniper framework to ask explicitly for the evidence against establishes the practice of asking for the antithesis as a natural way of working.

4.2 Case Study 2: Is This New Product Environmentally Friendly?
Gas-to-liquid (GTL) technology holds the promise of converting natural gas into a diesel fuel with no sulfur or particulate contaminants – but is the process environmentally friendly? GTL technology has been evaluated using our dialectical framework [24], where full details of the case study are provided. The question of the environmental impact of GTL technology followed classic Hegelian dialectics in this study [24]. Initial perceptions of GTL technology proceeded from the thesis that GTL has ultra-low sulfur and particulates and was thus a major advance in fossil fuel usage. The antithesis was developed by some environmental groups who focused on the increased carbon dioxide production associated with GTL technology, and hence the increased greenhouse gas production, as shown in Figure 4. From the polarised opposites of GTL technology offering a wonderful pollution-free fuel or an even worse fossil fuel contribution to global warming, a synthesis evolved.
[Figure: holonic process diagrams for 'Assessing environmental impact of GTL' with sub-processes 'Assessing S content', 'Assessing particulates' and 'Assessing carbon dioxide'; in Fig. 5 the carbon dioxide assessment is further decomposed into 'Carbon capture at source' and 'Carbon capture at application'.]
Fig. 4. Conflict of GTL environmental impact
Fig. 5. Dialectical resolution through capture synthesis in GTL study
The synthesis was to acknowledge the positive emissions reduction of GTL fuels but also to acknowledge the negative of increased carbon dioxide production. Debate then progressed, with the study group developing the thesis that although whole-life GTL carbon dioxide production was higher than for conventional fuels, the real issue was that of capture of carbon, as shown in Figure 5. With GTL, excess carbon production was all associated with the plant, and the options of geological or deep-ocean disposal were viable, unlike the more intractable problem of carbon dioxide associated with use in transport systems. Thus GTL was actually better for greenhouse gases than conventional fuels if the carbon was captured. The antithesis developed by environmentalists was that this was still an 'end of pipe' solution: the use of fossil fuels will still be environmentally detrimental, and GTL was only repackaging traditional oil industry products. The synthesis and environmental debate regarding the value of GTL is very much active and continues to evolve. However, the resolution of the problems begins to become more apparent when the dialectic framework is made explicit. It is important to see here that both the thesis and antithesis are valid – a truly Hegelian position. If both sides can see this in the framework then the effort turns to finding a resolution rather than trying to destroy the opponent's thesis.
5 Conclusions

We have presented a dialectical framework for the assessment and assembly of evidence in discovery processes. The stimulation and capture of dialectical argumentation in the context of complex (wicked) problems was addressed through:
1. Holonic representation developed to structure processes hierarchically for the modeling of complex problems;
2. Interval Probability Theory (IPT) developed as an evidential reasoning calculus to represent dialectical argument through the separation of evidence for and evidence against a hypothesis. Algorithms were developed that approximate the IPT calculus, allowing users to characterize problems in linguistic terms, such as sufficiency, necessity and dependency, thus avoiding judgments of conditional probabilities. Support for and against any hypothesis can then be assembled through a weighting of the relevance, importance and degree of evidence. Support for any hypothesis as well as generation of alternative hypotheses can proceed dialectically where conflicts in data and information are argued explicitly within the modeling framework. The framework was illustrated with examples of discovery from two complex (wicked) problems.
References
1. Conklin, E.J., Weil, W.: Wicked Problems: Naming the Pain in Organisations. (2002). http://www.gdss.com/wicked
2. Schon, D.: The Reflective Practitioner. Basic Books, New York (1983)
3. Davis, J.P., Hall, J.W.: A Software-Supported Process for Assembling Evidence and Handling Uncertainty in Decision-Making. Decision Support Systems 35 (2003) 415–433
4. Cui, W., Blockley, D.I.: Interval Probability for Evidential Support. Int. J. of Intelligent Systems 5 (1990) 183–192
5. Hall, J.W., Blockley, D.I., Davis, J.P.: Uncertain Inference Using Interval Probability Theory. Int. J. of Approximate Reasoning 19 (1988) 247–264
6. Hall, J.W., Blockley, D.I., Davis, J.P.: Non-Additive Probabilities for Representing Uncertain Knowledge. In: Babovic, V., Larson, L.C. (eds.): Proc. of Int. Conf. on Hydroinformatics, Copenhagen (1998), Balkema, Rotterdam
7. Davis, J.P., Fletcher, A.J.P.: Managing Assets Under Uncertainty. Paper 59433 presented at the SPE Asia Pacific Conference on Integrated Reservoir Modeling for Asset Management, Yokohama, Japan, April 25–26 (2000)
8. Taylor, C.A., Hall, J.W., Davis, J.P.: Seismic Safety Assessment of Dams and Appurtenant Works for Areas of Low to Moderate Seismicity. In: 12th World Conference on Earthquake Engineering, February (2000), Canterbury, New Zealand
9. Sanchez-Silva, M., Blockley, D.I., Taylor, C.A.: Uncertainty Modeling of Earthquake Hazards. Microcomputers in Civil Engineering 11 (1996) 99–114
10. Fletcher, A.J.P., et al.: A Framework for Greenhouse Gas Related Decision-Making with Incomplete Evidence. In: Greenhouse Gas Mitigation Technologies Conference 6, Sept 30–Oct 3 (2002), Kyoto, Japan
11. Fletcher, A.J.P., et al.: Complex Problems with Incomplete Evidence: Modeling for Decision-Making. In: Greenhouse Gas Mitigation Technologies Conference 6, Sept 30–Oct 3 (2002), Kyoto, Japan
12. Dawson, R.J., Davis, J.P., Hall, J.W.: A Decision Support Tool for Performance-Based Management of Flood Defense Systems. In: Proc. of Int. Conf. on Hydroinformatics, Cardiff (2002)
13. Davis, J.P.: Process Decomposition. In: Proc. of Int. Conf. on Hydroinformatics, Cardiff (2002)
14. Davis, J.P., Hall, J.W.: Assembling Uncertain Information for Decision-Making. In: Babovic, V., Larson, L.C. (eds.): Proc. of Int. Conf. on Hydroinformatics, Copenhagen (1998), Balkema, Rotterdam
15. Hall, J.W., et al.: A Decision-Support Methodology for Performance-Based Infrastructure Management. Natural Hazards Review, ASCE, in press
16. Fletcher, A.J.P., Davis, J.P.: Decision-Making with Incomplete Evidence. Paper SPE 77910 presented at the SPE Asia Pacific Oil and Gas Conference, Melbourne, 8–10 Oct (2002)
17. Koestler, A.: The Ghost in the Machine. Arkana Books, London (1967)
18. Smithson, M.J.: Ignorance and Uncertainty: Emerging Paradigms. Cognitive Science Series, Springer Verlag, New York (1989)
19. Krause, P., Clark, D.: Representing Uncertain Knowledge: An Artificial Intelligence Approach. Intellect Books, Oxford (1993)
20. Casti, J.L.: Reality Rules – Picturing the World in Mathematics, Vols 1 & 2. John Wiley and Sons, London (1992)
21. Dempster, A.P.: Upper and Lower Probability Inference for Families of Hypotheses with Monotone Density Ratios. The Annals of Math. Stat. 40(3) (1969) 953–969
22. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976)
23. Fletcher, A.J.P., Davis, J.P.: A Dialectical Framework for the Representation and Management of Complex Problems. Paper submitted to Complexity International, April (2003)
24. Fletcher, A.J.P.: Application of a Dialectical Framework to the Complex Problem of Gas to Liquids Technology. Paper submitted to Complexity International, April (2003)
25. Russell, B.: History of Western Philosophy. George Allen and Unwin, London (1961)
26. Dubois, D., Prade, H.: A Discussion of Uncertainty Handling in Support Logic Programming. Int. J. of Intelligent Systems 5 (1990) 15–42
27. Toulmin, S.: The Uses of Argument. Cambridge University Press (1958)
28. Toulmin, S.: Return to Reason. Harvard University Press (2001)
29. Dodds, K., et al.: An Overpressure Case History Using a Novel Risk Analysis Process. APPEA Journal, Australia (2001)
Performance Analysis of a Greedy Algorithm for Inferring Boolean Functions

Daiji Fukagawa1 and Tatsuya Akutsu2

1 Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606–8501, Japan
[email protected]
2 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611–0011, Japan
[email protected]
Abstract. A simple greedy algorithm has been known as an approximation algorithm for inference of a Boolean function from positive and negative examples, which is a fundamental problem in discovery science. It was conjectured from results of computational experiments that the greedy algorithm can find an exact (or optimal) solution with high probability if input data for each function are generated uniformly at random. This conjecture was proved only for AND/OR of literals. This paper gives a proof of the conjecture for more general Boolean functions which we call unbalanced functions. We also prove that unbalanced functions account for more than half of all Boolean functions, and that the ratio of d-input unbalanced functions to all d-input Boolean functions converges to 1 as d grows. This means that the greedy algorithm can find the exact solution with high probability for most Boolean functions if input data are generated uniformly at random. In order to improve the performance for cases of small d, we develop a variant of the greedy algorithm. The theoretical results on the greedy algorithm and the effectiveness of the variant were confirmed through computational experiments.
1 Introduction
Inference of a Boolean function from positive and negative examples is a fundamental and well-studied problem in discovery science, machine learning, data mining, bioinformatics, etc. [1,4,6,9]. A lot of theoretical studies have been done on the problem in the field of computational learning theory [9]. However, most existing studies focus on special classes of Boolean functions and many existing algorithms are complicated. On the other hand, a simple greedy algorithm (GREEDY SGL1, in short) has been known as an approximation algorithm for the problem [2,4,5,6]. In this algorithm, the inference problem is reduced to the set cover problem and then a well-known greedy algorithm for set cover is
1 SGL means single. We use this notation to distinguish it from a variant (GREEDY DBL) introduced in this paper.
applied. This algorithm tries to find the minimum set of attributes (i.e., the minimum set of input variables) which can distinguish positive examples from negative examples using a Boolean function. Once a small set of relevant attributes is selected, a Boolean function can be determined by exhaustive search or other heuristic algorithms. It was proved that GREEDY SGL can find a set of attributes whose size is at most O(log n) times larger than the minimum [2,5,6]. It should be noted that the input data set usually contains attributes that are irrelevant or are dependent on some of the relevant attributes. Therefore, the selection of a small set of relevant attributes is important. Selection of relevant attributes is also a well-studied problem in discovery science. It is also called feature extraction or feature selection under more general settings (see, e.g., [5,6]). At the 3rd International Conference on Discovery Science (DS 2000), Akutsu et al. proved that GREEDY SGL finds the minimum set of attributes with high probability (with probability > 1 − 1/n^α for any fixed constant α, where n is the number of attributes) if Boolean functions are limited to AND/OR of literals and examples are generated uniformly at random [4]. They also performed computational experiments on GREEDY SGL and conjectured that GREEDY SGL finds the minimum set of attributes with high probability for most Boolean functions if examples are generated uniformly at random. In this paper, we prove the above conjecture where the functions are limited to a large class of Boolean functions, which we call unbalanced functions. The class of unbalanced functions is so large that, for each positive integer d, we can prove that more than half of the d-input Boolean functions are included in the class. We can also prove that the fraction of d-input unbalanced functions converges to 1 as d grows. As mentioned above, GREEDY SGL has very good average-case performance if the number of relevant attributes is not small. However, the average-case performance is not satisfactory if the number is small (e.g., less than 5). Therefore, we developed a variant of GREEDY SGL (GREEDY DBL, in short). We performed computational experiments on GREEDY DBL using artificially generated data sets. The results show that the average-case performance is considerably improved when the number of relevant attributes is small, though GREEDY DBL takes much longer CPU time.
2 Preliminaries

2.1 Problem of Inferring Boolean Function
We consider the inference problem for n input variables x1, . . . , xn and one output variable y. Let ⟨x1(k), . . . , xn(k), y(k)⟩ be the kth tuple in the table, where xi(k) ∈ {0, 1} and y(k) ∈ {0, 1} for all i, k. Then, we define the problem of inferring a Boolean function in the following way.
Input: ⟨x1(k), . . . , xn(k), y(k)⟩ for k = 1, . . . , m, where xi(k), y(k) ∈ {0, 1} for all i, k.
Output: a set X = {xi1, . . . , xid} with the minimum cardinality (i.e., minimum d) for which there exists a function f(xi1, . . . , xid) such that (∀k).(y(k) = f(xi1(k), . . . , xid(k))) holds.
Clearly, this problem is an NP-optimization problem. To guarantee the existence of at least one f which satisfies the condition above, we suppose the input data are consistent (i.e., for any k1 ≠ k2, (∀i)(xi(k1) = xi(k2)) implies y(k1) = y(k2)). This can be tested in O(nm log m) time by sorting the input tuples. In this paper, what is to be inferred is not f but the d input variables X. If d is bounded by a constant, we can determine f in O(m) time after determining the d input variables.
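As a side note, the consistency test mentioned above is easy to sketch. The snippet below is our own illustration; it uses a hash table rather than the sorting described in the text, which is an equivalent check.

```python
def is_consistent(x_rows, y):
    """Check that identical attribute vectors never receive different labels.

    x_rows: list of m tuples, each with n binary attribute values.
    y:      list of m binary class labels.
    """
    seen = {}
    for row, label in zip(x_rows, y):
        key = tuple(row)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

print(is_consistent([(0, 1), (1, 1), (0, 1)], [0, 1, 0]))  # True
print(is_consistent([(0, 1), (0, 1)], [0, 1]))             # False
```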
2.2 GREEDY SGL: A Simple Greedy Algorithm
The problem of inferring a Boolean function is closely related to the problem of inferring functional dependencies, and both of them are known to be NP-hard [12]. Therefore a simple greedy algorithm, GREEDY SGL, has been proposed [2]. In GREEDY SGL, the original problem is reduced to the set cover problem and a well-known greedy algorithm (see e.g. [16]) for set cover is applied. GREEDY SGL is not only for inferring Boolean functions, but also for other variations of the problem (e.g., the domain can be multivalued, real numbers, and so on). The following is a pseudo-code for GREEDY SGL:

S ← {(k1, k2) | k1 < k2 and y(k1) ≠ y(k2)}
X ← {}
X̄ ← {x1, . . . , xn}
while S ≠ {} do
  for all xi ∈ X̄ do
    Si ← {(k1, k2) ∈ S | xi(k1) ≠ xi(k2)}
  i* ← argmax_i |Si|
  S ← S − Si*
  X̄ ← X̄ − {xi*}
  X ← X ∪ {xi*}
Output X

The approximation ratio of this algorithm is at most 2 ln m + 1 [2]. Since a lower bound of Ω(log m) on the approximation ratio has been proved [2], GREEDY SGL is optimal except for a constant factor.
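A direct, unoptimized Python transcription of this pseudo-code might look as follows. This is our own sketch (the function name and data layout are assumptions), and it is the simple O(m²ng) version rather than the efficient implementation mentioned in Sect. 6.

```python
from itertools import combinations

def greedy_sgl(x_rows, y):
    """Select attributes greedily so that every pair of examples with
    different labels is distinguished by at least one chosen attribute."""
    n = len(x_rows[0])
    # S: pairs of example indices with different class labels
    s = {(k1, k2) for k1, k2 in combinations(range(len(y)), 2) if y[k1] != y[k2]}
    chosen = []                      # X in the pseudo-code
    candidates = set(range(n))       # remaining attributes
    while s:
        # S_i: pairs in S distinguished by attribute i
        cover = {i: {(k1, k2) for (k1, k2) in s if x_rows[k1][i] != x_rows[k2][i]}
                 for i in candidates}
        best = max(cover, key=lambda i: len(cover[i]))
        s -= cover[best]
        candidates.remove(best)
        chosen.append(best)
    return chosen

# Toy data in which y is determined by attributes 0 and 2
rows = [(0, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
labels = [0, 1, 0, 0, 1]
print(greedy_sgl(rows, labels))      # -> the two relevant attributes, e.g. [0, 2]
```

The result is a minimal distinguishing set of attribute indices; the Boolean function itself can then be read off the projected tuples, as noted in Sect. 2.1.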
2.3 Notations
Let f(x1, . . . , xd) (or simply f) be a d-input Boolean function. Then |f| denotes the number of assignments to the input variables such that f = 1 holds (i.e., |f| = |{(x1, . . . , xd) | f(x1, . . . , xd) = 1}|). We use f̄ to represent the negation of f. Hence |f̄| is the number of assignments such that f = 0 holds. For an input variable xi of f, fxi and fx̄i denote the positive and negative cofactors w.r.t. xi, respectively. Namely, fxi is f in which xi is replaced by a
constant 1. Similarly, fx̄i is f in which xi is replaced by a constant 0. For a set of variables X, let Π(X) denote the set of products of literals, i.e., Π(X) = {c = li1 · · · lik | lij is either xij or x̄ij}. For a product of literals c = li1 · · · lik, the extended Shannon cofactor fc is defined as (· · · ((fli1)li2) · · · )lik. For a d-input (completely defined) Boolean function f, we call an input variable xi relevant iff there exists an assignment to x1, . . . , xi−1, xi+1, . . . , xd such that f(x1, . . . , xi−1, 0, xi+1, . . . , xd) ≠ f(x1, . . . , xi−1, 1, xi+1, . . . , xd) holds. Note that this definition of relevancy does not incorporate redundancy.
3 Unbalanced Functions
It is mathematically proved that for an instance of the inference problem generated uniformly at random, GREEDY SGL can find the optimal solution with high probability if the underlying function is restricted to AND/OR of literals [4]. On the other hand, the results of computational experiments suggest that GREEDY SGL works optimally in more general situations [4]. In this section, we will define a class of Boolean functions which we call unbalanced functions. The class of unbalanced functions includes the class of AND/OR of literals and is much larger than AND/OR of literals. We will also evaluate the size of the class. More than half of all Boolean functions are included in the class of unbalanced functions. Furthermore, for large d, almost all Boolean functions are included. Note that the definition of the term "unbalanced functions" is not common. In this paper, we define the term according to Tsai et al. [15] with some modification.

Definition 1. Let f be a d-input Boolean function (i.e., f : {0, 1}^d → {0, 1}). We say f is balanced w.r.t. a variable xi iff |fxi| = |fx̄i| holds. Otherwise f is unbalanced w.r.t. xi.

Definition 2. We say f is balanced iff f is balanced w.r.t. all the input variables. Similarly, we say f is unbalanced iff f is unbalanced w.r.t. all the input variables.

Let us give an example. For d = 2, there exist sixteen Boolean functions. Eight of them, namely {x1 ∧ x2, x̄1 ∧ x2, x1 ∧ x̄2, x̄1 ∧ x̄2, x1 ∨ x2, x̄1 ∨ x2, x1 ∨ x̄2, x̄1 ∨ x̄2}, are members of the unbalanced functions. Another four functions, namely {0, 1, x1 ⊕ x2, x̄1 ⊕ x2}, are members of the balanced functions. The remaining four functions, {x1, x2, x̄1, x̄2}, are neither balanced nor unbalanced. Next, we evaluate the size of the class of unbalanced functions and show that it is much larger than AND/OR of literals. Let Bi^(d) be the set of d-input Boolean functions for which xi is balanced. For example, in the case of d = 2, we have B1^(2) = {0, x2, x1 ⊕ x2, x̄1 ⊕ x2, x̄2, 1} and B2^(2) = {0, x1, x1 ⊕ x2, x̄1 ⊕ x2, x̄1, 1}. Then, we have the following lemmas.

Lemma 1. |B1^(d)| = · · · = |Bd^(d)| = (2^d choose 2^{d−1}).
Proof. Since it is clear that |B1^(d)| = · · · = |Bd^(d)| holds by symmetry, all we have to prove is |B1^(d)| = (2^d choose 2^{d−1}). Assume that f is balanced w.r.t. a variable x1. Let f = (x1 ∧ fx1) ∨ (x̄1 ∧ fx̄1) be f's Shannon decomposition. fx1 is a partial function of f which defines half of f's values and fx̄1 independently defines the other half. Assuming that x1 is balanced, we have |fx1| = |fx̄1| by the definition. For any (d−1)-input Boolean functions g and h, consider the d-input function f = (x1 ∧ g(x2, . . . , xd)) ∨ (x̄1 ∧ h(x2, . . . , xd)). Note there is a one-to-one correspondence between (g, h) and f. Then f is balanced w.r.t. x1 iff |g| = |h|. Hence, |B1^(d)| is equal to the number of pairs (g, h) such that |g| = |h| holds. Since the number of such pairs is Σ_{k=0}^{2^{d−1}} (2^{d−1} choose k)^2 = (2^d choose 2^{d−1}), the lemma follows.

Lemma 2. The fraction of unbalanced functions to all the Boolean functions converges to 1 as d grows. That is, |B̄1^(d) ∩ B̄2^(d) ∩ · · · ∩ B̄d^(d)| ∼ 2^{2^d}.

Proof. Using the Boole-Bonferroni inequality [13], Stirling's approximation and Lemma 1, we have

(1/2^{2^d}) · |∪_{i=1}^{d} Bi^(d)| ≤ (1/2^{2^d}) · Σ_{i=1}^{d} |Bi^(d)| = d · (2^d choose 2^{d−1}) / 2^{2^d} ∼ d / √(π · 2^{d−1}).

This converges to 0 as d grows. Using de Morgan's law,

|∩_{i=1}^{d} B̄i^(d)| = 2^{2^d} − |∪_{i=1}^{d} Bi^(d)| ∼ 2^{2^d} · (1 − d/√(π · 2^{d−1})) ∼ 2^{2^d}.
Even for small d, we have the following lemma.

Lemma 3. For any d, the number of d-input unbalanced functions is more than half the number of all the d-input Boolean functions.

Proof. It is easy to prove that a Boolean function f is unbalanced if |f| is odd. In fact, if f is not unbalanced, there exists a variable xi such that |fxi| = |fx̄i| holds, for which |f| = |fxi| + |fx̄i| must be even. Now, we can show that |{f : |f| is even}| = |{f : |f| is odd}|, and hence the lemma follows.

Fig. 1 shows the fraction of the class of unbalanced functions. The x-axis shows d (the arity of a function) and the y-axis shows the fraction (0 to 1) of unbalanced functions. The fractions (black circles) are calculated with the exact number of unbalanced functions for each d. Since the exact numbers are hard to compute for large d, we give upper (stars) and lower (crosses) bounds for them. The approximation of the lower bound (dashed line; see Lemma 2) is also drawn. The two bounds and the approximation almost join together for d > 10. The fraction of unbalanced functions is more than 90% for d > 15 and converges to 100% as d increases.
[Figure: fraction of unbalanced functions (y-axis, 0 to 1) plotted against d (x-axis, 0 to 25), with curves for the exact fraction, the lower and upper bounds, and the approximation 1 − d/√(π · 2^{d−1}).]
Fig. 1. The fraction of unbalanced functions
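To make Definitions 1 and 2 concrete, the following sketch (ours; the function names are our own) decides whether a function given as a truth table is unbalanced by comparing the sizes of its positive and negative cofactors.

```python
from itertools import product

def cofactor_sizes(truth, d, i):
    """Return (|f with x_i = 1|, |f with x_i = 0|) for f given as a dict
    mapping each assignment in {0,1}^d to 0 or 1."""
    pos = sum(truth[a] for a in product((0, 1), repeat=d) if a[i] == 1)
    neg = sum(truth[a] for a in product((0, 1), repeat=d) if a[i] == 0)
    return pos, neg

def is_unbalanced(truth, d):
    """f is unbalanced iff it is unbalanced w.r.t. every variable."""
    return all(p != n for p, n in (cofactor_sizes(truth, d, i) for i in range(d)))

# x1 AND x2 is unbalanced; x1 XOR x2 is balanced (hence not unbalanced)
and2 = {a: int(a[0] and a[1]) for a in product((0, 1), repeat=2)}
xor2 = {a: a[0] ^ a[1] for a in product((0, 1), repeat=2)}
print(is_unbalanced(and2, 2), is_unbalanced(xor2, 2))  # True False
```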
4 Analysis for a Special Case
In this section, we consider a special case in which all the possible assignments to the input variables are given, namely m = 2^n. Note that such an instance gives the complete description of the underlying function (precisely, it contains 2^n/2^d copies of the truth table of the function). We are interested in whether GREEDY SGL succeeds for such instances or not.

Lemma 4. Consider the instance of the inference problem for a Boolean function f. Assume that all the possible assignments to the input variables are given in the table. (Note that for a given f, we can uniquely determine the instance of this case.) For such an instance, GREEDY SGL can find the optimal solution if f is unbalanced.

Proof. Let us prove that GREEDY SGL chooses a proper variable in each iteration. Let X* be the set of all relevant variables (i.e., the set of input variables of f). In the first iteration, for a relevant variable xi ∈ X*, the number of tuple pairs covered by xi is
|Si| = (|fxi| · |f̄x̄i| + |fx̄i| · |f̄xi|) · (2^n/2^d)^2.
|f | + ∆, 2
|fxi | =
|f | − ∆, 2
|f xi | =
|f | − ∆, 2
|f xi | =
|f | + ∆. 2
So, |Si | is rewritten as follows: n 2 n 2 2 2 |f | · |f | |f | · |f | (|fxi | − |fxi |)2 · d + 2∆2 · d + |Si | = = . 2 2 2 2 2
(1)
120
D. Fukagawa and T. Akutsu
On the other hand, for an irrelevant variable xj ∈ X ∗ the number of tuple pairs covered by xj is |Sj | =
|f | |f | |f | |f | · + · 2 2 2 2
n 2 n 2 2 2 |f | · |f | · · = . 2d 2 2d
(2)
Subtracting (2) from (1), 2
|Si | − |Sj | =
(|fx1 | − |fx1 |) · 2
2n 2d
2 ≥ 0.
(3)
If f is unbalanced w.r.t. xi , it holds that Si > Sj for any irrelevant variable xj ∈ X ∗ . Hence, the variable chosen in the first iteration must be relevant. Next, assuming that GREEDY SGL has succeeded up to the rth iteration and has chosen a set of variables Xr = {xi1 , ..., xir }, let us prove the success in the (r + 1)th iteration. The number of tuple pairs covered by xi ∈ X ∗ \ Xr is 2n 2 |Si | = |(fc )xi | · |(f c )xi | + |(fc )xi | · |(f c )xi | · 2d−r c 2 |fc | · |f | (|fcx | − |fcx |)2 2n c i i + · = . 2 2 2n−d c
where c denotes a product of literals consist of Xr . In the summation c , c ∈ X ∗, is considered over all the possible product of variables in Xr . For xj it holds that |fcxj | = |fcxj |. Therefore the requirement for |Si | > |Sj | is that (∃c).(fc is unbalanced w.r.t. xi ). Let us prove that this requirement is satisfied if f is unbalanced w.r.t. xi . Assuming (∀c).(|fcx i | = |fcxi |), we can obtain |fxi | = |fxi | owing to |fxi | = = c |fcxi | and |fxi | = c |fcxi |. Hence, if f is unbalanced w.r.t. xi (i.e., |fxi | |fxi |), there exists at least one c for which fc is unbalanced w.r.t. xi . As a consequence, GREEDY SGL succeed in finding the optimal solution in the (r + 1)th iteration. By induction, the variable chosen by GREEDY SGL is included in X ∗ in each iteration. If all the variables in X ∗ are chosen, GREEDY SGL will output the solution (=X ∗ ) and stop, that is, GREEDY SGL will succeed to find the optimal solution for the instance.
5 Analysis for Random Instances

5.1 The Condition for Success
GREEDY SGL can find the optimal solution for some instances even if they do not conform to the special case mentioned in the previous section. First, let us see what kind of instances those are. The following lemma is an extension of the known result on the performance of GREEDY SGL for AND/OR
functions [4]. This lemma gives a characterization of the sets of examples for which GREEDY SGL succeeds. It helps to discuss the success probability of GREEDY SGL (see Sect. 5.2). Lemma 5. Consider the problem of inferring Boolean function for n input variables and one output variable. Let f be a d(< n)-input Boolean function and X be a set of {xi1 , . . . , xid }. Given an instance which has the optimal solution y = f (X), GREEDY SGL succeeds if
(∀xi ∈ X).(|fxi| ≠ |fx̄i|)    (4)

(|f| · |f̄|/2 + 1) · (BMIN)^2 > |f| · |f̄| · (2 · CMAX)^2    (5)
where BMIN and CMAX are defined as follows:

BA = {k | (xi1(k), . . . , xid(k)) = A},    BMIN = min_A |BA|,
CA(i, p) = {k | xi(k) = p ∧ k ∈ BA},    CMAX = max_{A,i,p} |CA(i, p)|.
Proof. Assuming (4) and (5), let us prove that GREEDY SGL chooses the proper variable in each iteration. For each xi ∈ X, the number of tuple pairs covered by xi in the first iteration is
|Si| = Σ_{p∈{0,1}} (Σ_{A∈f^0(i,p)} |BA|) · (Σ_{A∈f^1(i,1−p)} |BA|)
     ≥ (|fxi| · |f̄x̄i| + |fx̄i| · |f̄xi|) · (BMIN)^2
     = (|f| · |f̄|/2 + (|fxi| − |fx̄i|)^2/2) · (BMIN)^2,

where f^q(i, p) = {A | Ai = p ∧ f(A) = q}, i.e., |f^1(i,1)| = |fxi|, |f^1(i,0)| = |fx̄i|, |f^0(i,1)| = |f̄xi| and |f^0(i,0)| = |f̄x̄i|. Applying (4) to the above, we have

|Si| ≥ (|f| · |f̄|/2 + 1) · (BMIN)^2.
Similarly, the number of tuple pairs covered by xj ∉ X is

|Sj| = Σ_{p∈{0,1}} (Σ_{A∈f^0} |CA(j,p)|) · (Σ_{A∈f^1} |CA(j,1−p)|) ≤ 2 · |f| · |f̄| · (CMAX)^2,

where f^q = {A | f(A) = q}, i.e., |f^1| = |f| and |f^0| = |f̄|. Consequently, |Si| > |Sj| holds for any xi ∈ X and xj ∉ X if condition (5) is satisfied. Let i1 = argmax_i |Si|. Thus, GREEDY SGL chooses xi1, which must be included in X. Next, assume that GREEDY SGL has chosen proper variables Xr = {xi1, . . . , xir} before the (r+1)th step of the iteration, where r ≥ 1. Let us prove that GREEDY SGL succeeds in choosing a proper variable at the (r+1)th step, too.
Suppose that fc denotes an extended Shannon cofactor of f w.r.t. c, where c is a product of variables in Xr . For each xi ∈ X \ Xr , the number of tuple pairs covered by xi is
|Si| = Σ_c Σ_{p∈{0,1}} (Σ_{A∈fc^0(i,p)} |BA|) · (Σ_{A∈fc^1(i,1−p)} |BA|)
     ≥ Σ_c (|fc| · |f̄c|/2 + (|fcxi| − |fcx̄i|)^2/2) · (BMIN)^2
     ≥ ((1/2) · Σ_c |fc| · |f̄c| + 1) · (BMIN)^2,
where fc^q(i, p) = {A | Ai = p ∧ fc(A) = q}. The last inequality follows from (∃c).(|fcxi| ≠ |fcx̄i|). In fact, assuming (∀c).(|fcxi| = |fcx̄i|), we can obtain |fxi| = |fx̄i| since Σ_c |fcxi| = |fxi| (and likewise for x̄i), which contradicts condition (4). On the other hand, for each xj ∉ X, the number of tuple pairs covered by xj at the (r+1)th step is

|Sj| = Σ_c Σ_{p∈{0,1}} (Σ_{A∈fc^0} |CA(j,p)|) · (Σ_{A∈fc^1} |CA(j,1−p)|) ≤ 2 · (Σ_c |fc| · |f̄c|) · (CMAX)^2.
Hence, GREEDY SGL will succeed at the (r+1)th step if

(1/2 + 1/(Σ_c |fc| · |f̄c|)) · (BMIN)^2 > (2 · CMAX)^2,

which is obtained from (5) using Σ_c |fc| · |f̄c| ≤ (Σ_c |fc|) · (Σ_c |f̄c|) = |f| · |f̄|. Thus, GREEDY SGL succeeds in choosing a proper variable at the (r+1)th step of the iteration. By induction, GREEDY SGL can find a proper variable at each step.
Note that if the number of tuples m is sufficiently large (precisely, m = Ω(log n)), there exists at most one optimal solution with high probability [3]. Thus, the assumption of the uniqueness of the optimal solution is reasonable in this lemma. We can easily see that the conditions in Lemma 5 are sufficient but not necessary for the success of GREEDY SGL. Condition (4) can be relaxed as follows: for the optimal solution (i1, . . . , id), there exist a permutation (i1', . . . , id') of it and a list of literals li1', . . . , li(d−1)' such that |(fli1'···li(j−1)')xij'| ≠ |(fli1'···li(j−1)')x̄ij'| holds for each j = 1, . . . , d. The theoretical results shown in this paper can be modified for a simpler algorithm which chooses the d input variables with the d highest |Si|. However, the success probability of GREEDY SGL is expected to be higher [4] and GREEDY SGL can cover more cases, as mentioned just above.
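For intuition, the quantities BA, CA(i, p), BMIN and CMAX of Lemma 5 can be computed directly from a data set. The sketch below is our own illustration, not the paper's procedure; it assumes consistent data in which every assignment of the relevant variables occurs, and it reads the maximum in CMAX as ranging over the attributes outside X, matching the way CA(j, p) is used in the proof.

```python
from collections import defaultdict

def lemma5_check(x_rows, y, relevant):
    """Compute B_MIN, C_MAX and evaluate condition (5) for a candidate set
    of relevant attribute indices."""
    blocks = defaultdict(list)          # B_A: examples grouped by projection A
    for k, row in enumerate(x_rows):
        blocks[tuple(row[i] for i in relevant)].append(k)
    b_min = min(len(ks) for ks in blocks.values())
    others = [i for i in range(len(x_rows[0])) if i not in relevant]
    c_max = max((sum(1 for k in ks if x_rows[k][i] == p)
                 for ks in blocks.values() for i in others for p in (0, 1)),
                default=0)
    f_pos = sum(1 for ks in blocks.values() if y[ks[0]] == 1)   # |f|
    f_neg = len(blocks) - f_pos                                 # |not f|
    cond5 = (f_pos * f_neg / 2 + 1) * b_min ** 2 > f_pos * f_neg * (2 * c_max) ** 2
    return b_min, c_max, cond5
```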
5.2 The Success Rate for the Random Instances
In Lemma 5, we formulated two success conditions for GREEDY SGL. To estimate the success rate of GREEDY SGL for random instances, we will estimate the probability that these conditions hold. The probability that (4) holds is equal to the fraction of unbalanced functions for fixed d, which has already been presented (see Sect. 3). It is the probability for (5) that remains to be estimated. Assuming that instances are generated uniformly at random, we can prove the following lemma as in [4].

Lemma 6. For sufficiently large m (m = Ω(α log n)), suppose that an instance of the inference problem is generated uniformly at random. Then, the instance satisfies (5) with high probability (with probability > 1 − 1/n^α).

Combining Lemmas 5 and 6, we have:

Theorem 1. Suppose that functions are restricted to d-input unbalanced functions, where d is a constant. Suppose that an instance of the inference problem is generated uniformly at random. Then, for sufficiently large m (m = Ω(α log n)), GREEDY SGL outputs the correct set of input variables {x1, . . . , xd} with high probability (with probability > 1 − 1/n^α for any fixed constant α).

Note that the number of unbalanced functions is much larger than the number of AND/OR functions for each d, and the fraction to all the Boolean functions converges to 1 as d grows (see Lemma 2 and Fig. 1).
6 Computational Experiments
As proved above, GREEDY SGL has very good average-case performance if d (the number of relevant variables) is sufficiently large. However, the average-case performance is not satisfactory if d is small (e.g., less than 5). Thus, we developed a modified version of GREEDY SGL, which we call GREEDY DBL. This is less efficient, but outperforms GREEDY SGL in the success ratio. GREEDY DBL is almost the same as GREEDY SGL. Both algorithms reduce the inference problem to the set cover problem. Recall that GREEDY SGL chooses the variable xi that maximizes |Si| at each step of the iteration. GREEDY DBL finds the pair of variables (xi, xj) that maximizes |Si ∪ Sj| and then chooses the variable xi if |Si| ≥ |Sj| and xj otherwise. The two algorithms differ only in that respect. While GREEDY SGL takes O(m²ng) time, it is known that there exists an efficient implementation that works in O(mng) time [4], where g is the number of iterations (i.e., the number of variables output by GREEDY SGL) and d is assumed to be bounded by a constant. GREEDY DBL takes O(m²n²g) time since it examines pairs of variables. As in GREEDY SGL, there exists an implementation of GREEDY DBL which works in O(mn²g) time. We implemented efficient versions of these two algorithms and compared them on their success ratios for random instances, where "success" means
Table 1. The success ratio of GREEDY SGL and GREEDY DBL (%) (n = 1000, #iter = 100)

            GREEDY SGL                GREEDY DBL
            d=1    2    3    4        d=1    2    3    4
m = 100     46%   45%  43%  36%       57%   60%  75%  75%
m = 300     47%   55%  69%  79%       48%   70%  88%  98%
m = 500     61%   45%  68%  88%       47%   67%  80%  98%
m = 800     56%   51%  72%  96%       48%   62%  84%  96%
m = 1000    58%   50%  75%  96%       51%   58%  82%  99%

            GREEDY SGL                GREEDY DBL
            d=5    6    7    8        d=5    6    7    8
m = 100      7%    0%   0%   0%       26%    0%   0%   0%
m = 300     77%   40%   5%   0%       98%   81%  19%   1%
m = 500     87%   73%  34%   5%       99%   95%  73%  23%
m = 800     96%   86%  71%  17%      100%   99%  99%  56%
m = 1000    96%   96%  74%  40%      100%  100% 100%  82%
GREEDY SGL (resp. GREEDY DBL) outputs the set of variables which is the same as that for the underlying Boolean function (though the optimal number of variables may be smaller than that for the underlying Boolean function, it was proved that such a case seldom occurs if m is sufficiently large [3]). We set n = 1000 and varied m from 100 up to 1000 and d from 1 up to 8. For each n, m and d, we generated an instance uniformly at random and solved it with both GREEDY SGL and GREEDY DBL. We repeated this process and counted the successful executions over 100 trials. The random instances were generated in the same way as Akutsu et al. [4] did:
– First, we randomly selected d different variables xi1, . . . , xid from x1, . . . , xn.
– We randomly selected a d-input Boolean function (say, f). There are 2^(2^d) possible Boolean functions, each of which has the same probability.
– Then, we let xi(k) = 0 with probability 0.5 and xi(k) = 1 with probability 0.5 for all xi (i = 1, . . . , n) and k (k = 1, . . . , m). Finally, we let y(k) = f(xi1(k), . . . , xid(k)) for all k.
Table 1 shows the success ratios of GREEDY SGL and GREEDY DBL for the random instances. As expected, the success ratios of GREEDY DBL are higher than those of GREEDY SGL. It is seen that the success ratio increases as d increases for both GREEDY SGL and GREEDY DBL if m is sufficiently large. This agrees with the results on the fraction of unbalanced functions shown in Sect. 3. It is also seen that the success ratio increases as m increases. This is reasonable because Lemma 6 holds for sufficiently large m. In the case of d = 2, GREEDY SGL is expected to succeed for half of the Boolean functions, because the other half includes the degenerated functions, XOR and its negation. Since GREEDY DBL can find the optimal solution even if the
underlying function is XOR or its negation, the success ratio is expected to be 62.5%. The results of the experiments agree with these expectations. Through all the computational experiments, we used a PC with a 2.8 GHz CPU and 512 KB cache memory. For a case of n = 1000, d = 3, m = 100, approximate CPU time was 0.06 sec. for GREEDY SGL and 30 sec. for GREEDY DBL. Since cases of n > 10000 may be intractable for GREEDY DBL (in fact it is expected to take more than 3000 sec.), we need to improve its time efficiency.
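The selection step that distinguishes GREEDY DBL from GREEDY SGL can be sketched as a drop-in replacement for the argmax step of the greedy_sgl sketch above. This is our own illustration, not the authors' efficient O(mn²g) implementation.

```python
from itertools import combinations

def select_dbl(cover):
    """GREEDY_DBL selection: find the pair (i, j) maximizing |S_i ∪ S_j|,
    then return whichever of the two covers more pairs on its own.

    cover: dict mapping attribute index -> set of example pairs it distinguishes.
    """
    if len(cover) == 1:
        return next(iter(cover))
    i, j = max(combinations(cover, 2),
               key=lambda ij: len(cover[ij[0]] | cover[ij[1]]))
    return i if len(cover[i]) >= len(cover[j]) else j
```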
7 Comparison with Related Work
Since many studies have been done for feature selection and various greedy algorithms have been proposed, we compare the algorithms in this paper with related algorithms. However, it should be noted that the main result of this paper is theoretical analysis of average case behavior of a greedy algorithm and is considerably different from other results. Boros et al. proposed several greedylike algorithms [6], one of which is almost the same as GREEDY SGL. They present some theoretical results as well as experimental results using real-world data. However, they did not analyze average case behavior of the algorithms. Pagallo and Haussler proposed greedy algorithms for learning Boolean functions with a short DNF representation [14] though they used information theoretic measures instead of the number of covered pairs. A simple strategy known as the greedy set-cover algorithm [5] is almost the same as GREEDY SGL. However, these studies focus on special classes of Boolean functions (e.g., disjunctions of literals), whereas this paper studies average case behavior for most Boolean functions. The WINNOW algorithm [11] and its variants are known as powerful methods for feature selection [5]. These algorithms do not use the greedy strategy. Instead, feature-weighting methods using multiplicative updating schemes are employed. Gamberger proposed the ILLM algorithm for learning generalized CNF/DNF descriptions [7]. It also uses pairs of positive examples and negative examples though these are maintained using two tables. The ILLM algorithm consists of several steps, some of which employ greedy-like strategies. One of the important features of the ILLM algorithm is that it can output generalized CNF/DNF descriptions, whereas GREEDY SGL (or GREEDY DBL) outputs only a set of variables. The ILLM algorithm and its variants contain procedures for eliminating noisy examples [7,8], whereas our algorithms do not explicitly handle noisy examples. Another important feature of the ILLM approach is that it can be used with logic programming [10], by which it is possible to generate logic programs from examples. GREEDY SGL (or GREEDY DBL) may be less practical than the ILLM algorithm since it does not output Boolean expressions. However, it seems difficult to make theoretical analysis of average case behavior of the ILLM algorithm because it is more complex than GREEDY SGL. From a practical viewpoint, it might be useful to combine the ideas in the ILLM algorithm with GREEDY SGL.
8 Concluding Remarks
In this paper, we proved that a simple greedy algorithm (GREEDY SGL) can find the minimum set of relevant attributes with high probability for most Boolean functions if examples are generated uniformly at random. The assumption on the distribution of examples is too strong. However, it is expected that GREEDY SGL will work well if the distribution is close to uniform. Even in the worst case (i.e., when no assumption is put on the distribution of examples), it is guaranteed that GREEDY SGL outputs a set of attributes whose size is O(log n) times larger than the optimal [2]. Though a noise-free model is assumed in this paper, previous experimental results suggest that GREEDY SGL is still effective in noisy cases [4]. Experimental results on variants of GREEDY SGL [6] also suggest that the greedy-based approach is useful in practice.
Acknowledgments. We would like to thank Prof. Toshihide Ibaraki of Kyoto University for letting us know of related work [6]. This work is partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" from the Ministry of Education, Science, Sports and Culture of Japan.
References
1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining Association Rules between Sets of Items in Large Databases. In: Proc. SIGMOD Conference 1993, Washington, D.C. (1993) 207–216
2. Akutsu, T., Bao, F.: Approximating Minimum Keys and Optimal Substructure Screens. In: Proc. COCOON 1996. Lecture Notes in Computer Science, Vol. 1090. Springer-Verlag, Berlin Heidelberg New York (1996) 290–299
3. Akutsu, T., Miyano, S., Kuhara, S.: Identification of Genetic Networks from a Small Number of Gene Expression Patterns Under the Boolean Network Model. In: Proc. Pacific Symposium on Biocomputing (1999) 17–28
4. Akutsu, T., Miyano, S., Kuhara, S.: A Simple Greedy Algorithm for Finding Functional Relations: Efficient Implementation and Average Case Analysis. Theoretical Computer Science 292 (2003) 481–495. A preliminary version appeared in DS 2000 (LNCS 1967)
5. Blum, A., Langley, P.: Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence 97 (1997) 245–271
6. Boros, E., Horiyama, T., Ibaraki, T., Makino, K., Yagiura, M.: Finding Essential Attributes from Binary Data. Annals of Mathematics and Artificial Intelligence 39 (2003) 223–257
7. Gamberger, D.: A Minimization Approach to Propositional Inductive Learning. In: Proc. ECML 1995. Lecture Notes in Computer Science, Vol. 912. Springer-Verlag, Berlin Heidelberg New York (1995) 151–160
8. Gamberger, D., Lavrac, N.: Conditions for Occam's Razor Applicability and Noise Elimination. In: Proc. ECML 1997. Lecture Notes in Computer Science, Vol. 1224. Springer-Verlag, Berlin Heidelberg New York (1997) 108–123
9. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. The MIT Press (1994)
10. Lavrac, N., Gamberger, D., Jovanoski, V.: A Study of Relevance for Learning in Deductive Databases. Journal of Logic Programming 40 (1999) 215–249
11. Littlestone, N.: Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning 2 (1987) 285–318
12. Mannila, H., Raiha, K.-J.: On the Complexity of Inferring Functional Dependencies. Discrete Applied Mathematics 40 (1992) 237–243
13. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge Univ. Press (1995)
14. Pagallo, G., Haussler, D.: Boolean Feature Discovery in Empirical Learning. Machine Learning 5 (1990) 71–99
15. Tsai, C.-C., Marek-Sadowska, M.: Boolean Matching Using Generalized Reed-Muller Forms. In: Proc. Design Automation Conference (1994) 339–344
16. Vazirani, V.V.: Approximation Algorithms. Springer, Berlin (2001)
Performance Evaluation of Decision Tree Graph-Based Induction

Warodom Geamsakul, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, and Takashi Washio

Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
{warodom,matsuda,yoshida,motoda,washio}@ar.sanken.osaka-u.ac.jp
Abstract. A machine learning technique called Decision tree GraphBased Induction (DT-GBI) constructs a classifier (decision tree) for graph-structured data, which are usually not explicitly expressed with attribute-value pairs. Substructures (patterns) are extracted at each node of a decision tree by stepwise pair expansion (pairwise chunking) in GBI and they are used as attributes for testing. DT-GBI is efficient since GBI is used to extract patterns by greedy search and the obtained result (decision tree) is easy to understand. However, experiments against a DNA dataset from UCI repository revealed that the predictive accuracy of the classifier constructed by DT-GBI was not high enough compared with other approaches. Improvement is made on its predictive accuracy and the performance evaluation of the improved DT-GBI is reported against the DNA dataset. The predictive accuracy of a decision tree is affected by which attributes (patterns) are used and how it is constructed. To extract good enough discriminative patterns, search capability is enhanced by incorporating a beam search into the pairwise chunking within the greedy search framework. Pessimistic pruning is incorporated to avoid overfitting to the training data. Experiments using a DNA dataset were conducted to see the effect of the beam width, the number of chunking at each node of a decision tree, and the pruning. The results indicate that DT-GBI that does not use any prior domain knowledge can construct a decision tree that is comparable to other classifiers constructed using the domain knowledge.
1 Introduction
In recent years a lot of chemical compounds have been newly synthesized and some compounds can be harmful to human bodies. However, the evaluation of compounds by experiments requires a large amount of expenditure and time. Since the characteristics of compounds are highly correlated with their structure, we believe that predicting the characteristics of chemical compounds from their structures is worth attempting and technically feasible. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph structured data poses a general problem for
mining from structured data. Some other examples amenable to graph mining are finding typical Web browsing patterns, identifying typical substructures of chemical compounds, finding typical subsequences of DNA, and discovering diagnostic rules from patient history records. Graph-Based Induction (GBI) [10,3], on which DT-GBI is based, discovers typical patterns in general graph-structured data by recursively chunking two adjoining nodes. It can handle graph data having loops (including self-loops) with colored/uncolored nodes and links, and there can be more than one link between any two nodes. GBI is very efficient because of its greedy search. GBI does not lose any information about the graph structure after chunking, and it can use various evaluation functions as long as they are based on frequency. It is not, however, suitable for graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking; it is still effective in extracting patterns from graph-structured data where each node has a distinct label (e.g., World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g., chemical structure data containing benzene rings, etc.).

On the other hand, besides extracting patterns from data, decision tree construction [6,7] is a widely used technique for data classification and prediction. One of its advantages is that rules, which are easy to understand, can be induced. Nevertheless, to construct decision trees it is usually required that the data be represented by or transformed into attribute-value pairs. However, it is not trivial to define proper attributes for graph-structured data beforehand. We have proposed a method called Decision tree Graph-Based Induction (DT-GBI), which constructs a classifier (decision tree) for graph-structured data while constructing the attributes during the course of tree building by using GBI recursively, and reported a preliminary performance evaluation [9]. A pair extracted by GBI, consisting of nodes and the links among them (repeated chunking of pairs results in a subgraph structure), is treated as an attribute, and the existence/non-existence of the pair in a graph is treated as its value for the graph. Thus, attributes (pairs) that divide the data effectively are extracted by GBI while a decision tree is being constructed. To classify unseen graph-structured data by the constructed decision tree, the attributes that appear in the nodes of the tree are produced from the data before classification. However, experiments using a DNA dataset from the UCI repository revealed that the predictive accuracy of decision trees constructed by DT-GBI was not high compared with other approaches.

In this paper we first report the improvement made on DT-GBI to increase its predictive accuracy by incorporating 1) a beam search and 2) pessimistic pruning. After that, we report the performance evaluation of the improved DT-GBI through experiments using a DNA dataset from the UCI repository and show that the results are comparable to results obtained by using domain knowledge [8]. Section 2 briefly describes the framework of DT-GBI and Section 3 describes the improvement made on DT-GBI. Evaluation of the improved DT-GBI is reported in Section 4. Section 5 concludes the paper with a summary of the results and the planned future work.
2 Decision Tree Graph-Based Induction

2.1 Graph-Based Induction Revisited
GBI employs the idea of extracting typical patterns by stepwise pair expansion as shown in Fig. 1. In the original GBI an assumption is made that typical patterns represent some concepts/substructure and “typicality” is characterized by the pattern’s frequency or the value of some evaluation function of its frequency. We can use statistical indices as an evaluation function, such as frequency itself, Information Gain [6], Gain Ratio [7] and Gini Index [2], all of which are based on frequency. In Fig. 1 the shaded pattern consisting of nodes 1, 2, and 3 is thought typical because it occurs three times in the graph. GBI first finds the 1→3 pairs based on its frequency, chunks them into a new node 10, then in the next iteration finds the 2→10 pairs, chunks them into a new node 11. The resulting node represents the shaded pattern.
Fig. 1. The basic idea of the GBI method
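The chunking operation illustrated in Fig. 1 can be made concrete with the following sketch, which represents a labeled directed graph and rewrites every occurrence of a selected pair as a single new node. This is an illustrative sketch only; the data structures and function names (Graph, enumerate_pairs, chunk_pair) are our own and are not taken from the GBI implementation.

from collections import Counter

class Graph:
    def __init__(self, labels, edges):
        self.labels = dict(labels)   # node id -> label
        self.edges = set(edges)      # directed links (source id, target id)

def enumerate_pairs(g):
    """Count connected pairs by their (source label, target label)."""
    return Counter((g.labels[s], g.labels[d]) for (s, d) in g.edges)

def chunk_pair(g, pair, new_label):
    """Rewrite every occurrence of `pair` (a label pair) as one node with `new_label`."""
    for (s, d) in list(g.edges):
        if (s, d) not in g.edges:
            continue                                  # already rewritten by an earlier chunk
        if (g.labels.get(s), g.labels.get(d)) == pair:
            g.edges.discard((s, d))
            g.labels[s] = new_label                   # keep s as the chunked node
            if d != s:
                # redirect links that touched d to the chunked node s
                g.edges = {(s if a == d else a, s if b == d else b) for (a, b) in g.edges}
                del g.labels[d]
    return g

# Example in the spirit of Fig. 1: the pair 1 -> 3 is chunked into a new node labeled 10
g = Graph({0: 1, 1: 3, 2: 2}, {(0, 1), (2, 0)})
print(enumerate_pairs(g))          # Counter({(1, 3): 1, (2, 1): 1})
chunk_pair(g, (1, 3), 10)
print(enumerate_pairs(g))          # Counter({(2, 10): 1})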
It is possible to extract typical patterns of various sizes by repeating the above three steps. Note that the search is greedy and no backtracking is made. This means that in enumerating pairs no pattern which has been chunked into one node is restored to its original form. Because of this, not all the "typical patterns" that exist in the input graph are necessarily extracted. The problem of extracting all the isomorphic subgraphs is known to be NP-complete. Thus, GBI aims at extracting only meaningful typical patterns of certain sizes. Its objective is neither finding all the typical patterns nor finding all the frequent patterns. As described earlier, GBI can use any criterion that is based on the frequency of paired nodes. However, because of the nature of repeated chunking, for a pattern to be found to be of interest, all of its subpatterns must also be of interest.
In Fig. 1 the pattern 1→3 must be typical for the pattern 2→10 to be typical. Said differently, unless pattern 1→3 is chunked, there is no way of finding the pattern 2→10. The frequency measure satisfies this monotonicity. However, if the chosen criterion does not satisfy this monotonicity, repeated chunking may not find good patterns even though the best pair based on the criterion is selected at each iteration. To resolve this issue, GBI was improved to use two criteria: one is a frequency measure used for chunking, and the other is used for finding discriminative patterns after chunking. The latter criterion does not necessarily hold the monotonicity property. Any function that is discriminative can be used, such as Information Gain [6], Gain Ratio [7], Gini Index [2], and some others.

GBI(G)
  Enumerate all the pairs Pall in G
  Select a subset P of pairs from Pall (all the pairs in G) based on typicality criterion
  Select a pair from Pall based on chunking criterion
  Chunk the selected pair into one node c
  Gc := contracted graph of G
  while termination condition not reached
    P := P ∪ GBI(Gc)
  return P

Fig. 2. Algorithm of GBI
The improved stepwise pair expansion algorithm is summarized in Fig. 2. It repeats the following four steps until the chunking threshold is reached (normally a minimum support value is used as the stopping criterion).

Step 1. Extract all the pairs consisting of two connected nodes in the graph.
Step 2a. Select all the typical pairs based on the typicality criterion from among the pairs extracted in Step 1, rank them according to the criterion, and register them as typical patterns. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration.
Step 2b. Select the most frequent pair from among the pairs extracted in Step 1 and register it as the pattern to chunk. If either or both nodes of the selected pair have already been rewritten (chunked), they are restored to the original patterns before registration. Stop when there is no more pattern to chunk.
Step 3. Replace the pair selected in Step 2b with one node and assign a new label to it. Rewrite the graph by replacing all the occurrences of the selected pair with a node carrying the newly assigned label. Go back to Step 1.

The output of the improved GBI is a set of ranked typical patterns extracted in Step 2a. These patterns are typical in the sense that they are more discriminative than non-selected patterns in terms of the criterion used.
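For concreteness, the two criteria can be illustrated as follows: given, for each enumerated pair, its total frequency and the per-class counts of graphs containing it, the pair to chunk is the most frequent one (Step 2b), while pairs are ranked as typical patterns by a discriminative score such as information gain (Step 2a). The sketch below assumes a binary class and example counts of our own choosing; it illustrates the idea and is not the GBI code.

from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(pos, neg, pos_with, neg_with):
    """Gain of splitting (pos, neg) graphs by whether they contain a given pair."""
    total = pos + neg
    with_, without = pos_with + neg_with, total - pos_with - neg_with
    remainder = ((with_ / total) * entropy(pos_with, neg_with)
                 + (without / total) * entropy(pos - pos_with, neg - neg_with))
    return entropy(pos, neg) - remainder

# pair -> (frequency, positive graphs containing it, negative graphs containing it)
stats = {("a", "t"): (120, 40, 10), ("g", "c"): (200, 30, 28)}
pos, neg = 53, 53
to_chunk = max(stats, key=lambda p: stats[p][0])      # Step 2b: chunk by frequency
typical = sorted(stats, key=lambda p: information_gain(pos, neg, *stats[p][1:]),
                 reverse=True)                        # Step 2a: rank by discriminativeness
print(to_chunk, typical)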
DT-GBI(D)
  Create a node DT for D
  if termination condition reached
    return DT
  else
    P := GBI(D) (with the number of chunking specified)
    Select a pair p from P
    Divide D into Dy (with p) and Dn (without p)
    Chunk the pair p into one node c
    Dyc := contracted data of Dy
    for Di := Dyc, Dn
      DTi := DT-GBI(Di)
      Augment DT by attaching DTi as its child along the yes (no) branch
    return DT

Fig. 3. Algorithm of DT-GBI
2.2 Feature Construction by GBI
Since the representation of a decision tree is easy to understand, it is often used as the representation of a classifier for data which are expressed as attribute-value pairs. On the other hand, graph-structured data are usually expressed as nodes and links, and there are no obvious components which correspond to attributes and their values. Thus, it is difficult to construct a decision tree for graph-structured data in a straightforward manner. To cope with this issue we regard the existence of a subgraph in a graph as an attribute, so that graph-structured data can be represented with attribute-value pairs according to the existence of particular subgraphs. However, it is difficult to identify and selectively extract beforehand those subgraphs which are effective for the classification task. If pairs are extended in a stepwise fashion by GBI, and discriminative ones are selected and further extended while constructing a decision tree, then discriminative patterns (subgraphs) can be constructed simultaneously during the construction of the decision tree. In our approach attributes and their values are defined as follows:

- attribute: a pair in graph-structured data.
- value for an attribute: existence/non-existence of the pair in a graph.

When constructing a decision tree, all the pairs in the data are enumerated and one pair is selected. The data (graphs) are divided into two groups, namely, the one with the pair and the other without the pair. The selected pair is then chunked in the former graphs, and these graphs are rewritten by replacing all the occurrences of the selected pair with a new node. This process is recursively applied at each node of a decision tree, and a decision tree is constructed while the attributes (pairs) for the classification task are created on the fly. The algorithm of DT-GBI is summarized in Fig. 3. Since the value for an attribute is yes (contains the pair) or no (does not contain the pair), the constructed decision tree is represented as a binary tree.
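The attribute-value view described above can be made concrete as follows: each graph is summarized by the set of (chunked) pairs it contains, each candidate pair becomes a binary attribute, and the data are split into Dy/Dn by the existence of a selected pair. The sketch below is a simplified illustration under the assumption that a graph has already been reduced to the set of label pairs occurring in it; it is not the authors' implementation.

def to_attribute_values(graphs, candidate_pairs):
    """graphs: list of sets of label pairs; returns one 0/1 row per graph."""
    return [[1 if p in g else 0 for p in candidate_pairs] for g in graphs]

def split_by_pair(graphs, labels, pair):
    """Divide the data into Dy (graphs containing `pair`) and Dn (the rest)."""
    dy = [(g, c) for g, c in zip(graphs, labels) if pair in g]
    dn = [(g, c) for g, c in zip(graphs, labels) if pair not in g]
    return dy, dn

# toy example: two promoter-like graphs and one non-promoter-like graph
graphs = [{("a", "t"), ("t", "g")}, {("a", "t")}, {("g", "c")}]
labels = ["+", "+", "-"]
print(to_attribute_values(graphs, [("a", "t"), ("g", "c")]))  # [[1, 0], [1, 0], [0, 1]]
print(split_by_pair(graphs, labels, ("a", "t")))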
The proposed method has the characteristic of constructing the attributes (pairs) for the classification task online while constructing a decision tree. Each time an attribute (pair) is selected to divide the data, the pair is chunked into a larger node. Thus, although the initial pairs consist of two nodes and the link between them, attributes useful for the classification task are gradually grown into larger patterns (subgraphs) by applying chunking recursively. In this sense the proposed DT-GBI method can be conceived as a method for feature construction, since features, namely attributes (pairs) useful for the classification task, are constructed during the application of DT-GBI.
3 Enhancement of DT-GBI

3.1 Beam Search for Expanding Search Space
Since the search in GBI is greedy and no backtracking is made, which patterns are extracted by GBI depends on which pair is selected for chunking in Fig. 3. Thus, there can be many patterns which are not extracted by GBI. To relax this problem, a beam search is incorporated into GBI within the framework of greedy search [4] to extract more discriminative patterns. A certain fixed number of pairs ranked from the top are allowed to be chunked individually in parallel. To prevent each branch from growing exponentially, the total number of pairs to chunk is fixed at each level of branching. Thus, at any iteration step, there is always a fixed number of chunkings performed in parallel.

Fig. 4. An example of state transition with beam search when the beam width = 5

An example of state transition with beam search is shown in Fig. 4 for the case where the beam width is 5. The initial condition is the single state cs. All pairs in cs are enumerated and ranked according to both the frequency measure and the typicality measure. The top 5 pairs according to the frequency measure are selected, and each of them is used as a pattern to chunk, branching into 5 children c11, c12, ..., c15, each rewritten by the chunked pair. All pairs within these 5 states are enumerated and ranked according to the two measures, and again the top 5 ranked pairs according to the frequency measure are selected. The state c11 is split into two states c21 and c22 because two of its pairs are selected, but the state c12 is deleted because none of its pairs is selected. This is repeated until the stopping condition is satisfied. The increase in the search space improves the pattern extraction capability of GBI and thus that of DT-GBI.
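The following sketch shows the general shape of such a beam search over chunking states. The state representation and the expand_state and score functions are placeholders of our own (not part of the GBI code); the point is only that all children of all beam states are ranked together and a fixed number b of chunked states is kept at every level.

def beam_search(initial_state, expand_state, score, beam_width, max_levels):
    """Keep at most `beam_width` states per level; `expand_state` returns child states."""
    beam = [initial_state]
    for _ in range(max_levels):
        candidates = []
        for state in beam:
            candidates.extend(expand_state(state))   # one child per chunked pair
        if not candidates:
            break
        # rank the children of all beam states together and keep the top b overall;
        # a state with several highly ranked pairs splits, one with none is deleted
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam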
3.2 Pruning Decision Tree
Recursive partitioning of the data until each subset in the partition contains data of a single class often results in overfitting to the training data and thus degrades the predictive accuracy of decision trees. To avoid overfitting, in our previous approach [9] a very naive prepruning method was used by setting the termination condition of DT-GBI in Fig. 3 to whether the number of graphs in D is equal to or less than 10. On the other hand, a more sophisticated postpruning method (called "pessimistic pruning") is used in C4.5 [7]: an overfitted tree is grown first and then pruned to improve predictive accuracy based on a confidence interval for the binomial distribution. To improve predictive accuracy, the pessimistic pruning of C4.5 is incorporated into DT-GBI by adding a postpruning step to Fig. 3.
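As background on the error estimate behind this kind of pessimistic pruning: C4.5-style pruning replaces the observed error rate at a leaf with the upper limit of a confidence interval for a binomial proportion (25% confidence by default). The sketch below uses the commonly cited normal-approximation form of that upper limit; it is stated here for illustration and is not claimed to be the exact code used in DT-GBI.

from math import sqrt
from statistics import NormalDist

def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence limit of the error rate for `errors` mistakes out of `n` cases."""
    z = NormalDist().inv_cdf(1.0 - confidence)   # about 0.674 for the default 25% level
    f = errors / n                               # observed error rate
    numer = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return numer / (1 + z * z / n)

# a subtree is a candidate for pruning when the estimated errors of the leaf that would
# replace it (n * pessimistic_error(e, n)) do not exceed the sum over its current leaves
print(round(pessimistic_error(2, 20), 3))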
4 Performance Evaluation of DT-GBI
The proposed method was tested using a DNA dataset from the UCI Machine Learning Repository [1]. A promoter is a genetic region which initiates the first step in the expression of an adjacent gene (transcription). The promoter dataset consists of strings that represent nucleotides (one of A, G, T, or C). The input features are 57 sequential DNA nucleotides and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences). This dataset was explained and analyzed in [8]. The data are prepared so that each sequence of nucleotides is aligned at a reference point, which makes it possible to assign the n-th attribute to the n-th nucleotide in the attribute-value representation. In a sense, this dataset is encoded using domain knowledge. This is confirmed by the following experiment. Running C4.5 [7] gives a prediction error of 16.0% with leave-one-out cross-validation. Randomly shifting the sequence by 3 elements gives 21.7% and by 5 elements 44.3%. If the data are not properly aligned, standard classifiers such as C4.5 that use the attribute-value representation cannot solve this problem, as shown in Fig. 5.
Prediction error (C4.5, LVO):
  Original data: 16.0%
  Shifted randomly by ≤ 1 element: 16.0%
  Shifted randomly by ≤ 2 elements: 21.7%
  Shifted randomly by ≤ 3 elements: 26.4%
  Shifted randomly by ≤ 5 elements: 44.3%

Fig. 5. Change of error rate by shifting the sequence in the promoter dataset
One of the advantages of the graph representation is that it does not require the data to be aligned at a reference point. In our approach, each sequence is converted to a graph representation assuming that an element interacts with up to 10 elements on both sides (see Fig. 6). Each sequence thus results in a graph with 57 nodes and 515 links. Note that a sequence is represented as a directed graph since it is known from the domain knowledge that the influence between nucleotides is directed.

Fig. 6. Conversion of DNA sequence data to a graph

In the experiments, frequency was used to select a pair to chunk in GBI, and information gain [6] was used in DT-GBI as the typicality measure to select a pair from the pairs returned by GBI. A decision tree was constructed in either of the following two ways: 1) apply chunking nr times only at the root node and only once at the other nodes of a decision tree, or 2) apply chunking ne times at every node of a decision tree. Note that nr and ne are defined along the depth in Fig. 4; thus, more chunking takes place during the search when the beam width is larger. The pair (subgraph) selected for each node of the decision tree is the one which maximizes the information gain among all the pairs that are enumerated. Pruning of the decision tree was conducted either by prepruning (setting the termination condition of DT-GBI in Fig. 3 to whether the number of graphs in D is equal to or less than 10) or by postpruning (conducting the pessimistic pruning of Subsection 3.2 with the confidence level set to 25%). The beam width was changed from 1 to 15. The prediction error rate of a decision tree constructed by DT-GBI was evaluated by the average of 10 runs of 10-fold cross-validation in both experiments.

The first experiment focused on the effect of the number of chunkings at each node of a decision tree; the beam width was therefore set to 1 and prepruning was used. The parameters nr and ne were changed from 1 to 10 in accordance with 1) and 2) above, respectively. Fig. 7 shows the result of the experiments. In this figure the dotted line indicates the error rate for 1) and the solid line that for 2). The best error rate was 8.11% when nr = 5 for 1) and 7.45% when ne = 3 for 2). The corresponding decision trees induced from all 106 instances are shown in Fig. 8 (nr = 5) and Fig. 9 (ne = 3). The decrease of the error rate levels off as the number of chunkings increases for both 1) and 2). The result shows that repeated application of chunking at every node results in a decision tree with better predictive accuracy.

The second experiment focused on the effect of the beam width, changing its value from 1 to 15 using pessimistic pruning.
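Before turning to the results, the sequence-to-graph conversion described at the beginning of this section can be sketched as follows: each position becomes a node labeled by its nucleotide, and each position is linked to the positions that follow it within a window of 10, the link carrying the distance as its label. For a 57-nucleotide sequence this yields exactly 515 links. This is our own illustrative code, not the authors' implementation.

def sequence_to_graph(sequence, max_distance=10):
    """Return (node labels, directed labeled links) for a nucleotide string."""
    nodes = {i: base for i, base in enumerate(sequence)}   # node id -> 'a'/'c'/'g'/'t'
    edges = [(i, i + d, d)                                  # (source, target, distance label)
             for d in range(1, max_distance + 1)
             for i in range(len(sequence) - d)]
    return nodes, edges

nodes, edges = sequence_to_graph("a" * 57)   # dummy 57-nucleotide sequence
print(len(nodes), len(edges))                # 57 515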
Fig. 7. Result of experiment (beam width = 1, without pessimistic pruning)
Fig. 8. Example of constructed decision tree (chunking applied 5 times only at the root node, beam width = 1, with prepruning)
The number of chunkings was fixed at the best number determined by the first experiment in Fig. 7, namely, nr = 5 for 1) and ne = 3 for 2). The result is summarized in Fig. 12. The best error rate was 4.43% when the beam width = 12 for 1) (nr = 5) and 3.68% when the beam width = 11 for 2) (ne = 3). The corresponding decision trees induced from all 106 instances are shown in Fig. 10 and Fig. 11. Fig. 13 shows yet another result when prepruning was used (beam width = 8, ne = 3). The result reported in [8] is 3.8% (they also used 10-fold cross-validation), which is obtained by the M-of-N expression rules extracted from KBANN (Knowledge Based Artificial Neural Network). The obtained M-of-N rules are
Fig. 9. Example of constructed decision tree (chunking applied 3 times at every node, beam width = 1, with prepruning)
Fig. 10. Example of constructed decision tree (chunking applied 5 times only at the root node, beam width = 12, with pessimistic pruning)
too complicated and not easy to interpret. Since KBANN uses domain knowledge to configure the initial artificial neural network, it is worth mentioning
Fig. 11. Example of constructed decision tree (chunking applied 3 times at every node, beam width = 8, with pessimistic pruning)
Fig. 12. Result of experiment (with pessimistic pruning)
that DT-GBI, which does not use any domain knowledge, induced a decision tree with comparable predictive accuracy. Comparing the decision trees in Figs. 10 and 11, the trees are not stable: both give similar predictive accuracy, but the patterns in the decision nodes are not the same. According to [8], there are many pieces of domain knowledge and the rule conditions are expressed by various combinations of these pieces. Among these many pieces of knowledge, the pattern (a → a → a → a) in the second node of Fig. 10 and the pattern (a → a → t → t) in the root node of Fig. 11 match their domain knowledge, but the
Fig. 13. Result of experiment (with prepruning)
others do not match. We have assumed that two nucleotides that are more than 10 nodes apart are not directly correlated. Thus, the extracted patterns have no direct links longer than 9. It is interesting to note that the first node in Fig. 10 relates two pairs (g → a) that are 7 nodes apart as a discriminatory pattern. Indeed, all the sequences having this pattern are concluded to be non-promoters from the data. It is not clear at this stage whether DT-GBI can extract the domain knowledge or not; the data size is too small to make any strong claims. Another approach to constructing a decision tree for the promoter dataset is reported in [4,5]. The patterns (subgraphs) extracted by B-GBI, which incorporates beam search into GBI to enhance search capability, were treated as attributes, and C4.5 was used to construct a decision tree. The best reported error rate with 10-fold cross-validation is 6.3% in [5], using the patterns extracted by B-GBI (beam width = 2) and C4.5 (although 2.8% with leave-one-out (LVO) is also reported, LVO tends to reduce the error rate compared with 10-fold cross-validation, judging from the results reported in [5]). On the other hand, the best prediction error rate of DT-GBI is 3.68% (which is much better than the 6.3% above) when chunking was applied 3 times at each node, with beam width = 8 and pessimistic pruning. The result is also comparable to the 3.8% obtained by KBANN using the M-of-N expression [8].
5 Conclusion
This paper reports the improvements made on DT-GBI, which constructs a classifier (decision tree) for graph-structured data by means of GBI. To classify graph-structured data, attributes, namely substructures useful for the classification task, are constructed on the fly by applying chunking in GBI while constructing a decision tree. The predictive accuracy of DT-GBI is improved by incorporating 1) a beam search and 2) pessimistic pruning. The performance evaluation of the improved DT-GBI is reported through experiments on a classification problem of DNA promoter sequences from the UCI repository, and the results show that
DT-GBI is comparable to other methods that use domain knowledge in modeling the classifier. Immediate future work includes incorporating a more sophisticated method for determining the number of cycles in which to call GBI at each node, in order to improve prediction accuracy. Utilizing the rate of change of information gain over successive chunkings is a possible way to determine this number automatically. Another important direction is to explore how partial domain knowledge can be effectively incorporated to constrain the search space. DT-GBI is currently being applied to a much larger medical dataset.

Acknowledgment. This work was partially supported by the grant-in-aid for scientific research 1) on priority area "Active Mining" (No. 13131101, No. 13131206) and 2) No. 14780280, funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology.
References

1. C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
2. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
3. T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In Knowledge Discovery and Data Mining: Current Issues and New Applications, Springer Verlag, LNAI 1805, pages 420–431, 2000.
4. T. Matsuda, H. Motoda, T. Yoshida, and T. Washio. Knowledge discovery from structured data by beam-wise graph-based induction. In Proc. of the 7th Pacific Rim International Conference on Artificial Intelligence, Springer Verlag, LNAI 2417, pages 255–264, 2002.
5. T. Matsuda, T. Yoshida, H. Motoda, and T. Washio. Mining patterns from structured data by beam-wise graph-based induction. In Proc. of the Fifth International Conference on Discovery Science, pages 422–429, 2002.
6. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
7. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
8. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
9. G. Warodom, T. Matsuda, T. Yoshida, H. Motoda, and T. Washio. Classifier construction by graph-based induction for graph-structured data. In Advances in Knowledge Discovery and Data Mining, Springer Verlag, LNAI 2637, pages 52–62, 2003.
10. K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63–92, 1995.
Discovering Ecosystem Models from Time-Series Data

Dileep George1, Kazumi Saito2, Pat Langley1, Stephen Bay1, and Kevin R. Arrigo3

1 Computational Learning Laboratory, CSLI, Stanford University, Stanford, California 94305 USA, {dil,langley,sbay}@apres.stanford.edu
2 NTT Communication Science Laboratories, 2-4 Hikaridai, Seika, Soraku, Kyoto 619-0237 Japan, [email protected]
3 Department of Geophysics, Mitchell Building, Stanford University, Stanford, CA 94305 USA, [email protected]
Abstract. Ecosystem models are used to interpret and predict the interactions of species and their environment. In this paper, we address the task of inducing ecosystem models from background knowledge and time-series data, and we review IPM, an algorithm that addresses this problem. We demonstrate the system's ability to construct ecosystem models on two different Earth science data sets. We also compare its behavior with that produced by a more conventional autoregression method. In closing, we discuss related work on model induction and suggest directions for further research on this topic.
1 Introduction and Motivation
Ecosystem models aim to simulate the behavior of biological systems as they respond to environmental factors. Such models typically take the form of algebraic and differential equations that relate continuous variables, often through feedback loops. The qualitative relationships are typically well understood, but there is frequently ambiguity about which functional forms are appropriate and even less certainty about the precise parameters. Moreover, the space of candidate models is too large for human scientists to examine manually in any systematic way. Thus, computational methods that can construct and parameterize ecosystem models should prove useful to Earth scientists in explaining their data. Unfortunately, most existing methods for knowledge discovery and data mining cast their results as decision trees, rules, or some other notation devised by computer scientists. These techniques can often induce models with high predictive accuracy, but they are seldom interpretable by scientists, who are used to different formalisms. Methods for equation discovery produce knowledge in forms that are familiar to Earth scientists, but most generate descriptive models rather than explanatory ones, in that they contain no theoretical terms and make little contact with background knowledge.
In this paper, we present an approach to discovering dynamical ecosystem models from time-series data and background knowledge. We begin by describing IPM, an algorithm for inducing process models that, we maintain, should be interpretable by Earth scientists. After this, we demonstrate IPM’s capabilities on two modeling tasks, one involving data on a simple predator-prey ecosystem and another concerning more complex data from the Antarctic ocean. We close with a discussion of related work on model discovery in scientific domains and prospects for future research on the induction of ecosystem models.
2 An Approach to Inductive Process Modeling
As described above, we are interested in computational methods that can discover explanatory models for the observed behavior of ecosystems. In an earlier paper (Langley et al., in press), we posed the task of inducing process models from time-series data and presented an initial algorithm for addressing this problem. We defined a quantitative process model as a set of processes, each specifying one or more algebraic or differential equations that denote causal relations among variables, along with optional activation conditions. At least two of the variables must be observed, but a process model can also include unobserved theoretical terms. The IPM algorithm generates process models of this sort from training data about observable variables and background knowledge about the domain. This knowledge includes generic processes that have a form much like those in models, in that they relate variables with equations and may include conditions. The key differences are that a generic process does not commit to specific variables, although it constrains their types, and it does not commit to particular parameter values, although it limits their allowed ranges. Generic processes are the building blocks from which the system constructs its specific models. More specifically, the user provides IPM with three inputs that guide its discovery efforts:

1. A set of generic processes, including constraints on variable types and parameter values;
2. A set of specific variables that should appear in the model, including their names and types;
3. A set of observations for two or more of the variables as they vary over time.

In addition, the system requires three control parameters: the maximum number of processes allowed in a model, the minimum number of processes, and the number of times each generic process can occur. Given this information, the system first generates all instantiations of generic processes with specific variables that are consistent with the type constraints. After this, it finds all ways to combine these instantiated processes to form instantiated models that have acceptable numbers of processes. The resulting models refer to specific variables, but their parameters are still unknown. Next, IPM uses a nonlinear optimization routine to determine these parameter values. Finally, the system selects and
returns the candidate that produces the smallest squared error on the training data, modulated by a minimum description length criterion. The procedure for generating all acceptable model structures is straightforward, but the method for parameter optimization deserves some discussion. The aim is to find, for each model structure, parameters that minimize the model's squared predictive error on the observations. We have tried a number of standard optimization algorithms, including Newton's method and the Levenberg-Marquardt method, but we have found that these techniques encounter problems with convergence and local optima. In response, we designed and implemented our own parameter-fitting method, which has given us the best results to date. A nonlinear optimization algorithm attempts to find a set of parameters Θ that minimizes an objective function E(Θ). In our case, we define E as the squared error between the observed and predicted time series:

E(\Theta) = \sum_{t=1}^{T} \sum_{j=1}^{J} \bigl( \ln(x_j^o(t)) - \ln(x_j(t)) \bigr)^2 , \qquad (1)
where x_j^o and x_j represent the observed and predicted values of the J observed variables, t denotes time instants, and ln(·) is the natural logarithmic function. Standard least-squares estimation is widely recognized as relatively brittle with respect to outliers in samples that contain gross error. Instead, as shown in Equation (1), we minimize the sum of squared differences between logarithmically transformed variables, which is one approach to robust estimation proposed by Box and Cox (1964). In addition, we maintain positivity constraints on process variables by performing a logarithmic transformation on the differential equations in which they appear. Predicted values for x_j are obtained by solving finite-difference approximations of the differential equations specified in the model. The parameter vector Θ incorporates all unknowns, including any initial conditions for unobserved variables needed to solve the differential equations. In order to minimize our error function, E, defined as a sum of squared errors, we can calculate its gradient vector with respect to the parameter vector. For this purpose, we borrowed the basic idea of error backpropagation through time (Rumelhart, Hinton, & Williams, 1986), frequently used for learning in recurrent neural networks. However, the task of process model induction required us to extend this method to support the many different functional forms that can occur. Our current solution relies on hand-crafted derivatives for each generic process, but it utilizes the additive nature of process models to retain the modularity of backpropagation and its compositional character. These in turn let the method carry out gradient search to find parameters for each model structure. Given a model structure and its corresponding backpropagation equations, our parameter-fitting algorithm carries out a second-order gradient search (Saito & Nakano, 1997). By adopting a quasi-Newton framework (e.g., Luenberger, 1984), this calculates the descent direction as a partial Broyden-Fletcher-Goldfarb-Shanno update and then calculates the step length as the minimal point of a second-order approximation. In earlier experiments on a variety of data sets, this algorithm worked quite efficiently as compared to standard gradient search
methods. Of course, this approach does not eliminate all problems with local optima; thus, for each model structure, IPM runs the parameter-fitting algorithm ten times with random initial parameter values, then selects the best result. Using these techniques, IPM overcomes many of the problems with local minima and slow convergence that we encountered in our early efforts, giving reasonable performance according to the squared error criterion. However, we anticipate that solving more complex problems will require even more sophisticated algorithms for nonlinear minimization.

However, reliance on squared error as the sole optimization criterion tends to select overly complex process models that overfit the training data. Instead, IPM computes the description length of each parameterized model as the sum of its complexity and the information content of the data left unexplained by the model. We define complexity as the number of free parameters and variables in a model and the unexplained content as the number of bits needed to encode the squared error of the model. Rather than selecting the model with the lowest error, IPM prefers the candidate with the shortest description length, thus balancing model complexity against fit to the training data.
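To make the fitting step concrete, the following sketch computes the log-space squared error of Equation (1) for a model simulated by finite differences and minimizes it with a generic quasi-Newton routine, restarting from several random initial parameter vectors. The simulate function is a placeholder for the model's finite-difference equations, and scipy's BFGS optimizer stands in for the authors' own second-order gradient search; this is an illustration, not the IPM implementation.

import numpy as np
from scipy.optimize import minimize

def log_squared_error(theta, simulate, observed):
    """Equation (1): squared error between log-transformed observed and predicted series."""
    predicted = simulate(theta, len(observed))   # array of shape (T, J), strictly positive
    return np.sum((np.log(observed) - np.log(predicted)) ** 2)

def fit_parameters(simulate, observed, n_params, n_restarts=10,
                   rng=np.random.default_rng(0)):
    best = None
    for _ in range(n_restarts):                  # random restarts to escape local optima
        theta0 = rng.uniform(0.1, 1.0, size=n_params)
        res = minimize(log_squared_error, theta0, args=(simulate, observed), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best

A description-length-style score for comparing model structures can then be obtained by adding a complexity penalty (based on the number of free parameters and variables) to the resulting error, as described above.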
3 Modeling Predator-Prey Interaction
Now we are ready to consider IPM's operation on an ecosystem modeling task. Within Earth science, models of predator-prey systems are among the simplest in terms of the number of variables and parameters involved, making them good starting points for our evaluation. We focus here on the protozoan system composed of the predator P. aurelia and the prey D. nasutum, which is well known in population ecology. Jost and Adiriti (2000) present time-series data for this system, recovered from an earlier report by Veilleux (1976), that are now available on the World Wide Web. The data set includes measurements for the two species' populations at 12-hour intervals over 35 days, as shown in Figure 1. The data are fairly smooth over the entire period, with observations at regular intervals and several clear cycles. We decided to use these observations as an initial test of IPM's ability to induce an ecosystem model.

3.1 Background Knowledge about Predator-Prey Interaction
A scientist who wants IPM to construct explanatory models of his observations must first provide a set of generic processes that encode his knowledge of the domain. Table 1 presents a set of processes that we extracted from our reading of the Jost and Adiriti article. As illustrated, each generic process specifies a set of generic variables with type constraints (in braces), a set of parameters with ranges for their values (in brackets), and a set of algebraic or differential equations that encode causal relations among the variables (where d[X, t, 1] refers to the first derivative of X with respect to time). Each process can also include one or more conditions, although none appear in this example.
Table 1. A set of generic processes for predator-prey models.

generic process logistic growth;
  variables S{species};
  parameters ψ [0, 10], κ [0, 10];
  equations d[S, t, 1] = ψ ∗ S ∗ (1 − κ ∗ S);

generic process exponential growth;
  variables S{species};
  parameters β [0, 10];
  equations d[S, t, 1] = β ∗ S;

generic process predation volterra;
  variables S1{species}, S2{species};
  parameters π [0, 10], ν [0, 10];
  equations d[S1, t, 1] = −1 ∗ π ∗ S1 ∗ S2;
            d[S2, t, 1] = ν ∗ π ∗ S1 ∗ S2;

generic process exponential decay;
  variables S{species};
  parameters α [0, 1];
  equations d[S, t, 1] = −1 ∗ α ∗ S;

generic process predation holling;
  variables S1{species}, S2{species};
  parameters ρ [0, 1], γ [0, 1], η [0, 1];
  equations d[S1, t, 1] = −1 ∗ γ ∗ S1 ∗ S2/(1 + ρ ∗ γ ∗ S1);
            d[S2, t, 1] = η ∗ γ ∗ S1 ∗ S2/(1 + ρ ∗ γ ∗ S1);
The table shows five such generic processes. Two structures, predation holling and predation volterra, describe alternative forms of feeding; both cause the predator population to increase and the prey population to decrease, but they differ in their precise functional forms. Two additional processes – logistic growth and exponential growth – characterize the manner in which a species’ population increases in an environment with unlimited resources, again differing mainly in the forms of their equations. Finally, the exponential decay process refers to the decrease in a species’ population due to natural death. All five processes are generic in the sense that they do not commit to specific variables. For example, the generic variable S in exponential decay does not state which particular species dies when it is active. IPM must assign variables to these processes before it can utilize them to construct candidate models. Although the generic processes in Table 1 do not completely encode knowledge about predator-prey dynamics, they are adequate for the purpose of evaluating the IPM algorithm on the Veilleux data. If needed, a domain scientist could add more generic processes or remove ones that he considers irrelevant. The user is responsible for specifying an appropriate set of generic processes for a given modeling task. If the processes recruited for a particular task do not represent all the mechanisms that are active in that environment, the induced models may fit the data poorly. Similarly, the inclusion of unnecessary processes can increase computation time and heighten the chances of overfitting the data. Before the user can invoke IPM, he must also provide the system with the variables that the system should consider including in the model, along with their types. This information includes both observable variables, in this case predator and prey, both with type species, and unobservable variables, which do not arise in this modeling task. In addition, he must state the minimum acceptable number of processes (in this case one), the maximum number of processes (four), and the number of times each generic process can occur (two).
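A sketch of the structure-generation step described above might look as follows: each generic process lists typed variable slots, and candidate instantiations are produced by assigning type-compatible specific variables to those slots. The dictionary representation and use of itertools.permutations below are our own illustration, not the IPM code.

from itertools import permutations

generic_processes = {
    "exponential_decay":  ["species"],
    "logistic_growth":    ["species"],
    "predation_volterra": ["species", "species"],
}
specific_variables = {"Predator": "species", "Prey": "species"}

def instantiations(slot_types, variables):
    """All assignments of distinct variables to a process's typed slots."""
    names = list(variables)
    for combo in permutations(names, len(slot_types)):
        if all(variables[v] == t for v, t in zip(combo, slot_types)):
            yield combo

for name, slot_types in generic_processes.items():
    for combo in instantiations(slot_types, specific_variables):
        print(name, combo)
# e.g. predation_volterra ('Prey', 'Predator') is one instantiated process; full model
# structures are then all combinations of such processes within the allowed size bounds.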
Table 2. Process model induced for predator-prey interaction.
model Predator Prey;
  variables Predator, Prey;
  observables Predator, Prey;
  process exponential decay;
    equations d[Predator, t, 1] = −1 ∗ 1.1843 ∗ Predator;
  process logistic growth;
    equations d[Prey, t, 1] = 2.3049 ∗ Prey ∗ (1 − 0.0038 ∗ Prey);
  process predation volterra;
    equations d[Prey, t, 1] = −1 ∗ 0.0298 ∗ Prey ∗ Predator;
              d[Predator, t, 1] = 0.4256 ∗ 0.0298 ∗ Prey ∗ Predator;
3.2 Inducing Models for Predator-Prey Interaction
Given this information, IPM uses the generic processes in Table 1 to generate all possible model structures that relate the two species P. aurelia and D. nasutum, both of which are observed. In this case, the system produced 228 candidate structures, for each of which it invoked the parameter-fitting routine described earlier. Table 2 shows the parameterized model that the system selected from this set, which makes general biological sense. It states that, left in isolation, the prey (D. nasutum) population grows logistically, while the predator (P. aurelia) population decreases exponentially. Predation leads to more predators and to fewer prey, controlled by multiplicative equations that add 0.4256 predators for each prey that is consumed. Qualitatively, the model predicts that, when the predator population is high, the prey population is depleted at a faster rate. However, a reduction in the prey population lowers the rate of increase in the predator population, which should produce an oscillation in both populations. Indeed, Figure 1 shows that the model's predicted trajectories produce such an oscillation, with nearly the same period as that found in the data reported by Jost and Adiriti. The model produces a squared error of 18.62 on the training data and a minimum description length score of 286.68. The r2 between the predicted and observed values is 0.42 for the prey and 0.41 for the predator, which indicates that the model explains a substantial amount of the observed variation.
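The induced model in Table 2 can be simulated directly with a simple finite-difference (Euler) scheme, which is enough to reproduce the qualitative oscillation described above. The step size and initial populations below are arbitrary illustrative choices, not values taken from the paper.

def simulate_predator_prey(prey0, predator0, steps, dt=0.05):
    """Forward-Euler simulation of the induced model in Table 2."""
    prey, predator = prey0, predator0
    trajectory = [(prey, predator)]
    for _ in range(steps):
        d_prey = (2.3049 * prey * (1 - 0.0038 * prey)       # logistic growth
                  - 0.0298 * prey * predator)                # loss to predation
        d_predator = (-1.1843 * predator                     # exponential decay
                      + 0.4256 * 0.0298 * prey * predator)   # gain from predation
        prey += dt * d_prey
        predator += dt * d_predator
        trajectory.append((prey, predator))
    return trajectory

traj = simulate_predator_prey(prey0=120.0, predator0=30.0, steps=400)
print(traj[0], traj[-1])   # the two populations rise and fall in alternating cycles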
3.3 Experimental Comparison with Autoregression
Alternative approaches to induction from time-series data, such as multivariate autoregression, do not yield the explanatory insight of process models. However, they are widely used in practice, so naturally we were interested in how the two methods compare in their predictive abilities. To this end, we ran the Matlab package ARFit (Schneider & Neumaier, 2001) on the Veilleux data to infer the structure and parameters of an autoregressive model. This uses a stepwise least-squares procedure to estimate parameters and a Bayesian criterion to select the
Fig. 1. Predicted and observed log concentrations of protozoan prey (left) and predator (right) over a period of 36 hours.
best model. For the runs reported here, we let ARFit choose the best model order from zero to five. To test the two methods’ abilities to forecast future observations, we divided the time series into successive training and test sets while varying their relative sizes. In particular, we created 35 training sets of size n = 35 . . . 69 by selecting the first n examples of the time series, each with a corresponding test set that contained all successive observations. In addition to using these training sets to induce the IPM and autoregressive models, we also used their final values to initialize simulation with these models. Later predictions were based on predicted values from earlier in the trajectory. For example, to make predictions for t = 40, both the process model and an autoregressive model of order one would utilize their predictions for t = 39, whereas an autoregressive model of order two would draw on predictions for t = 38 and t = 39. Figure 2 plots the resulting curves for the models induced by IPM, ARFit, and a constant approximator. In every run, ARFit selected a model of order one. Both IPM and autoregression have lower error than the straw man, except late in the curve, when few training cases are available. The figure also shows that, for 13 to 21 test instances, the predictive abilities of IPM’s models are roughly equal to or better than those for the autoregressive models. Thus, IPM appears able to infer models which are as accurate as those found by an autoregressive method that is widely used, while providing interpretability that is lacking in the more traditional models.
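For readers without access to ARFit, a first-order vector autoregressive model of the kind selected in these runs can be fit with ordinary least squares and iterated forward on its own predictions, as sketched below; this is only a minimal stand-in for ARFit's stepwise least-squares procedure and order selection.

import numpy as np

def fit_var1(series):
    """Fit x(t+1) ≈ A x(t) + b by least squares; series has shape (T, variables)."""
    X = np.hstack([series[:-1], np.ones((len(series) - 1, 1))])   # predictors plus intercept
    Y = series[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A, b = coef[:-1].T, coef[-1]
    return A, b

def forecast(A, b, x0, steps):
    """Iterate the fitted model forward, feeding predictions back in."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(A @ xs[-1] + b)
    return np.array(xs)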
4 Modeling an Aquatic Ecosystem
Although the predator-prey system we used in the previous section was appropriate to demonstrate the capabilities of the IPM algorithm, rarely does one find such simple modeling tasks in Earth science. Many ecosystem models involve interactions not only among the species but also between the species and environmental factors. To further test IPM’s ability, we provided it with knowledge
Fig. 2. Predictive error for induced process models, autoregressive models, and constant models, vs. the number of projected time steps, on the predator-prey data.
and data about the aquatic ecosystem of the Ross Sea in Antarctica (Arrigo et al., in press). The data came from the ROAVERRS program, which involved three cruises in the austral spring and early summers of 1996, 1997, and 1998. The measurements included time-series data for phytoplankton and nitrate concentrations, as shown in Figure 3.

4.1 Background Knowledge about Aquatic Ecosystems
Taking into account knowledge about aquatic ecosystems, we crafted the set of generic processes shown in Table 3. In contrast to the components for predator-prey systems, the exponential decay process now involves not only reduction in a species' population, but also the generation of residue as a side effect. Formation of this residue is the mechanism by which minerals and nutrients return to the ecosystem. Knowledge about the generation of residue is also reflected in the process predation. The generic process nutrient uptake encodes knowledge that plants derive their nutrients directly from the environment and do not depend on other species for their survival. Two other processes – remineralization and constant inflow – convey information about how nutrients become available in ecosystems. Finally, the growth process posits that some species can grow in number independent of predation or nutrient uptake. As in the first domain, our approach to process model induction requires the user to specify the variables to be considered, along with their types. In this case, we knew that the Ross Sea ecosystem included two species, phytoplankton and zooplankton, with the concentration of the first being measured in our data set and the second being unobserved. We also knew that the sea contained nitrate, an observable nutrient, and detritus, an unobserved residue generated when members of a species die.
Table 3. Five generic processes for aquatic ecosystems with constraints on their variables and parameters.

generic process exponential decay;
  variables S{species}, D{detritus};
  parameters α [0, 10];
  equations d[S, t, 1] = −1 ∗ α ∗ S;
            d[D, t, 1] = α ∗ S;

generic process constant inflow;
  variables N{nutrient};
  parameters ν [0, 10];
  equations d[N, t, 1] = ν;

generic process nutrient uptake;
  variables S{species}, N{nutrient};
  parameters β [0, 10], µ [0, 10];
  conditions N > τ;
  equations d[S, t, 1] = µ ∗ S;
            d[N, t, 1] = −1 ∗ β ∗ µ ∗ S;

generic process remineralization;
  variables N{nutrient}, D{detritus};
  parameters ψ [0, 10];
  equations d[N, t, 1] = ψ ∗ D;
            d[D, t, 1] = −1 ∗ ψ ∗ D;

generic process predation;
  variables S1{species}, S2{species}, D{detritus};
  parameters ρ [0, 10], γ [0, 10];
  equations d[S1, t, 1] = γ ∗ ρ ∗ S1;
            d[D, t, 1] = (1 − γ) ∗ ρ ∗ S1;
            d[S2, t, 1] = −1 ∗ ρ ∗ S1;
4.2 Inducing Models for an Aquatic Ecosystem
Given this background knowledge about the Ross Sea ecosystem and data from the ROAVERRS cruises, we wanted IPM to find a process model that explained the variations in these data. To make the system’s search tractable, we introduced further constraints by restricting each generic process to occur no more than twice and considering models with no fewer than three processes and no more than six. Using the four variables described above – Phyto{species}, Zoo{species}, Nitrate{nutrient}, and Detritus{residue} – IPM combined these with the available generic processes to generate some 200 model structures. Since Phyto and Nitrate were observable variables, the system considered only those models that included equations with these variables on their left-hand sides. The parameter-fitting routine and the description length criterion selected the model in Table 4, which produced a mean squared error of 23.26 and a description length of 131.88. Figure 3 displays the log values this candidate predicts for phytoplankton and nitrate, along with those observed in the field. The r2 value is 0.51 for Phyto but only 0.27 for Nitrate, which indicates that the model explains substantially less of the variance than in our first domain. Note that the model includes only three processes and that it makes no reference to zooplankton. The first process states that the phytoplankton population dies away at an exponential rate and, in doing so, generates detritus. The second process involves the growth of phytoplankton, which increases its population as it absorbs the nutrient nitrate. This growth happens only when the nitrate concentration is above a threshold, and it causes a decrease in the concentration
Table 4. Induced model for the aquatic ecosystem of the Ross Sea.
model Aquatic Ecosystem;
  variables Phyto, Nitrate, Detritus, Zoo;
  observables Phyto, Nitrate;
  process exponential decay 1;
    equations d[Phyto, t, 1] = −1 ∗ 1.9724 ∗ Phyto;
              d[Detritus, t, 1] = 1.9724 ∗ Phyto;
  process nutrient uptake;
    conditions Nitrate > 3.1874;
    equations d[Phyto, t, 1] = 3.6107 ∗ Phyto;
              d[Nitrate, t, 1] = −1 ∗ 0.3251 ∗ 3.6107 ∗ Phyto;
  process remineralization;
    equations d[Nitrate, t, 1] = 0.032 ∗ Detritus;
              d[Detritus, t, 1] = −1 ∗ 0.032 ∗ Detritus;
of the nutrient. The final process states that the residue is converted to the consumable nitrate at a constant rate. In fact, the model with the lowest squared error included a predation process which stated that zooplankton feeds on phytoplankton, thereby increasing the former population, decreasing the latter, and producing detritus. However, IPM calculated that the improved fit was outweighed by the cost of including an additional process in the model. This decision may well have resulted from a small population of zooplankton, for which no measurements were available but which is consistent with other evidence about the Ross Sea ecosystem. We suspect that, given a more extended time series, IPM would rank this model as best even using its description length, but this is an empirical question that must await further data.
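Because the nutrient uptake process in Table 4 is active only when the nitrate concentration exceeds its threshold, simulating the induced model requires a conditional term; the sketch below illustrates this with forward Euler. The initial concentrations and step size are placeholder values chosen for illustration, not values from the paper.

def simulate_ross_sea(phyto0, nitrate0, detritus0, steps, dt=0.01):
    """Forward-Euler simulation of the induced Ross Sea model in Table 4."""
    phyto, nitrate, detritus = phyto0, nitrate0, detritus0
    series = [(phyto, nitrate, detritus)]
    for _ in range(steps):
        uptake_active = nitrate > 3.1874                     # activation condition
        d_phyto = -1.9724 * phyto + (3.6107 * phyto if uptake_active else 0.0)
        d_nitrate = (0.032 * detritus
                     - (0.3251 * 3.6107 * phyto if uptake_active else 0.0))
        d_detritus = 1.9724 * phyto - 0.032 * detritus
        phyto += dt * d_phyto
        nitrate += dt * d_nitrate
        detritus += dt * d_detritus
        series.append((phyto, nitrate, detritus))
    return series

series = simulate_ross_sea(phyto0=1.0, nitrate0=25.0, detritus0=0.0, steps=3000)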
5 Discussion
There is a large literature on the subject of ecosystem modeling. For example, many Earth scientists develop their models in STELLA (Richmond et al., 1987), an environment that lets one specify quantitative models and simulate their behavior over time. However, work in this and similar frameworks has focused almost entirely on the manual construction and tuning of models, which involves much trial and error. Recently, increased computing power has led a few Earth scientists to try automating this activity. For instance, Morris (1997) reports a method for fitting a predator-prey model to time-series data, whereas Jost and Adiriti (2000) use computation to determine which functional forms best model similar data. Our approach has a common goal, but IPM can handle more complex models and uses domain knowledge about generic processes to constrain search through a larger model space. On another front, our approach differs from most earlier work on equation discovery (e.g., Washio et al., 2000) by focusing on differential equation models
Fig. 3. Predicted and observed log concentrations of phytoplankton (left) and nitrate (right) in the Ross Sea over 31 days.
of dynamical systems. The most similar research comes from Todorovski and Džeroski (1997), Bradley et al. (1999), and Koza et al. (2001), who also report methods that induce differential equation models by searching for model structures and parameters that fit time-series data. Our framework extends theirs by focusing on processes, which play a central role in many sciences and provide a useful framework for encoding domain knowledge that constrains search and produces more interpretable results. Also, because IPM can construct models that include theoretical terms, it supports aspects of abduction (e.g., Josephson, 2000) as well as induction.

Still, however promising our approach to ecosystem modeling, considerable work remains before it will be ready for use by practicing scientists. Some handcrafted models contain tens or hundreds of equations, and we must find ways to constrain search further if we want our system to discover such models. The natural source of constraints is additional background knowledge. Earth scientists often know the qualitative processes that should appear in a model (e.g., that one species preys on another), even when they do not know their functional forms. Moreover, they typically organize large models into modules that are relatively independent, which should further reduce search. Future versions of IPM should take advantage of this knowledge, along with more powerful methods for parameter fitting that will increase its chances of finding the best model.

In summary, we believe that inductive process modeling provides a valuable alternative to the manual construction of ecosystem models that combines domain knowledge, heuristic search, and data in a powerful way. The resulting models are cast in a formalism recognizable to Earth scientists and they refer to processes that domain experts will find familiar. Our initial results on two ecosystem modeling tasks are encouraging, but we must still extend the framework in a number of directions before it can serve as a practical scientific aid.

Acknowledgements. This work was supported by the NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. We thank Tasha Reddy and Alessandro Tagliabue for preparing the ROAVERRS data and
for discussions about ecosystem processes. We also thank Sašo Džeroski and Ljupčo Todorovski for useful discussions about approaches to inductive process modeling.
References

Arrigo, K. R., Worthen, D. L., & Robinson, D. H. (in press). A coupled ocean-ecosystem model of the Ross Sea. Part 2: Phytoplankton taxonomic variability and primary production. Journal of Geophysical Research.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.
Bradley, E., Easley, M., & Stolle, R. (2001). Reasoning about nonlinear system identification. Artificial Intelligence, 133, 139–188.
Josephson, J. R. (2000). Smart inductive generalizations are abductions. In P. A. Flach & A. C. Kakas (Eds.), Abduction and induction. Kluwer.
Jost, C., & Adiriti, R. (2000). Identifying predator-prey processes from time-series. Theoretical Population Biology, 57, 325–337.
Koza, J., Mydlowec, W., Lanza, G., Yu, J., & Keane, M. (2001). Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing, 6, 434–445.
Langley, P., George, D., Bay, S., & Saito, K. (in press). Robust induction of process models from time-series data. Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC: AAAI Press.
Luenberger, D. G. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Morris, W. F. (1997). Disentangling effects of induced plant defenses and food quantity on herbivores by fitting nonlinear models. American Naturalist, 150, 299–327.
Richmond, B., Peterson, S., & Vescuso, P. (1987). An academic user's guide to STELLA. Lyme, NH: High Performance Systems.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing. Cambridge: MIT Press.
Saito, K., & Nakano, R. (1997). Law discovery using neural networks. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (pp. 1078–1083). Yokohama: Morgan Kaufmann.
Schneider, T., & Neumaier, A. (2001). Algorithm 808: ARFIT – A Matlab package for the estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Transactions on Mathematical Software, 27, 58–65.
Todorovski, L., & Džeroski, S. (1997). Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 376–384). San Francisco: Morgan Kaufmann.
Veilleux, B. G. (1979). An analysis of the predatory interaction between Paramecium and Didinium. Journal of Animal Ecology, 48, 787–803.
Washio, T., Motoda, H., & Niwa, Y. (2000). Enhancing the plausibility of law equation discovery. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1127–1134). Stanford, CA: Morgan Kaufmann.
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets and Genetic Algorithm Xiaoshu Hang and Honghua Dai School of Information Technology, Deakin University, Australia {xhan,hdai}@deakin.edu.au
Abstract. This paper proposes an optimal strategy for extracting probabilistic rules from databases. Two inductive learning-based statistic measures and their rough set-based definitions: accuracy and coverage are introduced. The simplicity of a rule emphasized in this paper has previously been ignored in the discovery of probabilistic rules. To avoid the high computational complexity of roughset approach, some rough-set terminologies rather than the approach itself are applied to represent the probabilistic rules. The genetic algorithm is exploited to find the optimal probabilistic rules that have the highest accuracy and coverage, and shortest length. Some heuristic genetic operators are also utilized in order to make the global searching and evolution of rules more efficiently. Experimental results have revealed that it run more efficiently and generate probabilistic classification rules of the same integrity when compared with traditional classification methods.
1 Introduction One of the main objectives of database analysis in recent years is to discover some interesting patterns hidden in databases. Over the years, much work has been done and many algorithms have been proposed. Those algorithms can be mainly classified into two categories: machine learning-based and data mining-based. The knowledge discovered is generally presented as a group of rules which are expected to have high accuracy, coverage and readability. High accuracy means a rule has high reliability, and high coverage implies a rule has strong prediction ability. It has been noted that some data mining algorithms produce large amounts of rules that are potentially useless, and much post-processing work had to be done in order to pick out the interesting patterns. Thus it is very important to have a mining approach that is capable of directly generating useful knowledge without post-process. Note that genetic algorithms (GA) have been used in some applications to acquire knowledge from databases, as typified by the GA-based classifier system for concept learning and the GA-based fuzzy rules acquisition from data set[3]. In the later application which contains information of uncertainty and vagueness to some extent, GA is used to evolve the initial fuzzy rules generated by fuzzy clustering approaches. In the last decade, GAs have been used in many diverse areas, such as function optimization, G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 153–165, 2003. © Springer-Verlag Berlin Heidelberg 2003
154
X. Hang and H. Dai
image processing, pattern recognition, knowledge discovery, etc. In this paper, GA and rough sets are combined in order to mine probabilistic rules from a database. Rough set theory was introduced by Pawlak in 1987. It has been recognized as a powerful mathematical tool for data analysis and knowledge discovery from imprecise and ambiguous data, and has been successfully applied in a wide range of application domains such as machine learning, expert system, pattern recognition, etc. It classifies all the attributes of an information table into three categories which are: core attributes, reduct attributes and dispensable attributes, according to their contribution to the decision attribute. The drawback of rough sets theory is its computational inefficiency, which restricts it from being effectively applied to knowledge discovery in database[5]. We borrow its idea and terminologies in this paper to represent rules and design the fitness function of GA, so that we can efficiently mine some probabilistic rules. The rules to be mined are called probabilistic rules since they are characterized with the two key parameters: accuracy and coverage. The optimal strategy used in this paper focuses on acquiring rules with high accuracy, coverage and short length. The rest of this paper is organized as follows: section 2 introduces some concepts on rough sets and the definition of probabilistic rules. Section 3 show the genetic algorithm-based strategy for mining probabilistic rules and section 4 gives the experimental results. Section 5 is the conclusion.
2 Probabilistic Rules in Rough Sets 2.1 Rough Sets Let I= be an information table, where U is the definite set of instances denoting the universe and A is the attribution set. R is an equivalent relation on U and AR=(U,R) is called an approximation space. For any two instances x and y in U, they are said to be equivalent if they are indiscernible with respect to relation R. In general, [x]R is used to represent the equivalent class of the instance x with respect to the relation R in U. [x]R= {y| y∈U and ∀r ⊆R, r(x) = r(y)} Let X be a certain subset of U, the lower approximation of X donated by R_(X), also known as the positive region of X donated by POSR(X), is the greatest collection of equivalent classes in which each instance in R_(X) can be fully classified into X. − the upper approximation of X donated by R (X), also known as the negative region of X donated by NEGR(X), is the smallest collection of equivalent classes that contains some instances that possibly belong to X. In this paper, R is not an ordinary relation but an extended relation represented by a formula, which is normally composed of the conjunction or disjunction of attributevalue pairs. Thus, the equivalent class of the instance x in U with respect to a conjunctive formula like [temperature=high] ∧ [humidity=low] is described as follows:
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
155
[x][temperature=high]∧[Humidity=low] ={y ∈ U | (temperature(x) =temperature(y)=high)
∧ (humidity(x)=humidity(y)=low)}
The equivalent class of the instance x in U with respect to a disjunctive formula like [temperature=high] ∨ [humidity=low] is presented by : [x][temperature=high]∨[Humidity=low] ={ y ∈ U| (temperature(x) =temperature(y)=high) ∨ ( humidity(x)=humidity(y)=low) } An attribute-value pair, e.g.[temperature=high], corresponds to the atom formula in concept learning based on predicate logic programming. The conjunction of attributevalue pairs e.g. [temperature=high] ∧ [Humidity=low] is equivalent to a complex formula in AQ terminology. The length of a complex formula R is defined as the number of attribute-value pairs it contains, and is donated by |R|. Accuracy and coverage are two important statistic measures in concept learning, and are used to evaluate the reliability and prediction ability of a complex formula acquired by a concept learning program. In this paper, their definitions based on rough sets are given as follows: Definition 1. Let R be a conjunction of attribute-value pairs and D be the set of objects belonging to the target class d. The accuracy of rule R→d is defined as : α R (D ) =
| [ x]R D | | [ x]R |
(1)
Definition 2. Let R be a conjunction of attribute-value pairs and D be the set of objects belonging to the target class d. The coverage of rule R→d is defined as: κ
R
(D ) =
| [x]R D | | D |
(2)
where [x]R is the set of objects that are indiscernible with respect to R. For example, αR(D)=0.8 indicates that 80% of the instances in the equivalent class [x]R probabilistically belong to class d and κR(D)=0.5 refers to that 50% of instances belonging to class d are probabilistically covered by the rule R. αR(D)=1 implies that all the instances in [x]R are fully classified into class d and κR(D)=1 means to that all the instances in target class d are covered by the rule R. In this case the rule R is deterministic, otherwise the rule R is probabilistic. 2.2 Probabilistic Rules We now formally give the definition of a probabilistic rule. Definition 3. Let U be the definite set of training instances, D be the set whose instances belong to the concept d and R be a formula of the conjunction of attributevalue pairs, a rule R → d is said to be probabilistic if it satisfies: αR(D)>δα and κR(D)> δκ.
156
X. Hang and H. Dai
Where δα and δκ are the two user-specified positive real values. The rule is then represented as: ,κ R α→ d
For example, a probabilistic rule induced in table1 is presented as: [prodrome=0] ∧ [ nausea=0] ∧ [ M1= 1] → [class = m.c.h.] Since[x]prodrome=0∧ nausea=0∧M1=1 ={1,2,5} and [x]class=m.c.h.={1,2,5,6}, therefore:
αR(D)=1 and κR(D)=0.75 It is well known that in concept learning, consistency and completeness are two fundamental concepts. If a concept description covers all the positive examples, it is called complete. If it covers no negative examples, it is called consistent. If both completeness and consistency are retained by a concept description then the concept description is correct on all examples. In rough sets theory, consistency and completeness are also defined in a similar way. A composed set is said to be consistent with respect to a concept if and only if it is a subset of the lower approximation of the concept. A compose set is said to be complete with respect to a concept if and only if it is a superset of the upper approximation of the concept[7]. Based on this idea we give the following two definitions that describe the consistency and completeness of a probabilistic rule. Definition 4. A probabilistic rule is said to be consistent if and only if itsαR(D)=1. Definition 5. A probabilistic rule is said to be complete if and only if itsκR(D)=1. Now we consider the problem on the simplicity of a concept description. It is an empirical observation rather than a theoretical proof that there is an inherent tradeoff between concept simplicity and completeness. In general, we do not want to trade accuracy for simplicity or efficiency, but still strive to maintain the prediction ability of the resulting concept description. In most applications of concept learning, due to noise in the training data rules that are overfitted tend to be long, in order for the induced rules to be consistent with all the training data, a relax requirement is usually set up so that short rules can be acquired. The simplicity of a concept description is considered and therefore is one of the goals we pursue in the process of knowledge acquisition. The simpler a concept description is, the lower the cost for storage and utilization. The simplicity of a probabilistic rule R in this paper is defined as follows: η
R
(D ) = 1 −
| R | | A |
(3)
where |R| represents the number of attribute-value pairs the probabilistic rule contains, and |A| donates the number of attributes the information table contains. Note that a probabilistic rule may still not be the optimum one even if it has both α=1and κ=1. The reason for this is that it may still contain dispensable attribute-value pairs when it is of full accuracy and coverage. In this paper, these rules will be further
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
157
refined through mutation in the process of evolution. Thus, taking into account the constraint of simplicity, we can further represent a probabilistic rule as follows: ,κ ,η R α → d
s.t. R= ∧ j[xj=vk] , αR(d)>δ1 , κ R(d) >δ2 and η R(d) >δ3
Table 1. A small database on headache diagnosis[1]
No. 1 2 3 4 5 6
Age 50-59 40-49 40-49 40-49 40-49 50-59
Loc ocular whole lateral whole whole whole
Nature pers. pers. throb. throb. radia. pers.
Prod 0 0 1 1 0 0
Nau 0 0 1 1 0 1
M1 1 1 0 0 1 1
Class m.c.h m.c.h migra migra m.c.h m.c.h
M1: tenderness of M1, m.c.h: muscle contraction headache,
The following two probabilistic rules derived from table1 are used to illustrate the importance of simplicity of a rule. Both rule1 and rule2 have full accuracy and coverage but rule2 is much shorter than rule1, therefore rule2 is preferred. Rule1: [Age=40-49] [nature=throbbing] ∧ [prodrome=1] ∧ [nausea=1] ∧ [M1=0] → [class=migraine], α=1, κ =1 and η=0.143. Rule2: [nature=throbbing] ∧ [ prodrome=1] ∧ [ M1=0] → [ class=migraine] α=1, κ =1 and η=0.429
3 A GA-Based Acquisition of Probabilistic Rules It is well known that a rough sets based approach is a multi-concept inductive learning approach, and is widely applied to the uncertain or incomplete information systems. However, due to its high computational complexity, it is ineffective to be applied to acquiring knowledge from a database of large size. To sufficiently overcome this problem and to exploit rough sets theory in knowledge discovery with uncertainty, we use rough set technology combined with genetic algorithm to acquire optimal probabilistic rules from databases. 3.1 Coding Genetic algorithms, as defined by Holland, work on fixed-length bit chains. Thus, a coding function is needed to present every solution to the problem with a chain in a one-to-one way. Since we do not know at the phase of coding how many rules an
158
X. Hang and H. Dai
information table contains, we have the individuals be initially coded to contain enough rules which will be refined during the evolution phase. Note that binary coded or classical GAs are less efficient when applied to multidimensional, large sized databases. We consider real-coded GAs in which variables appear directly in the chromosome and modified by genetic operators. First, we define a list of attribute-value pairs for the information table of discourse. Its length L is determined by each attribute’s domain. Let Dom(Ai) donate the domain of the attribute Ai, then we have: L=|Dom(A1)| + |Dom(A2)| + ⋅⋅⋅ + |Dom(Am)| = ∑i=1,m |Dom(Ai)| where |Dom(Ai)| represents the number of the different discrete values that attribute Ai contains. Assume the information table contains m attributes and a rule, in the worst case, contains one attribute-value pair for each attribute, then the rule has the maximum length with m attribute- value pairs. And we also assume an individual contains n rules and each rule is designed to be the fixed length m, therefore the total length of an individual is n×m attribute-value pairs. In this paper, an individual is not coded as a string of binary bits but an array of structure variables. Each structure variable has four fields(see figure1): a structure variable corresponding to the body of a rule, and a flag variable indicating whether the rule is superfluous, and two real variables representing the accuracy and the coverage, respectively. A rule, which is a structured variable, has three integer fields: an index
struc.var1
structure variable
flag
struc.var2
…
struc. var n
rule2 accuracy coverage attri-val 1 attri-val 2
consequence
attr-val 3 . .
set of premises
. attri-val n Fig. 1. A real coding scheme
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
159
pointing to the list of attribute-value pairs, an upper bounder and a lower bounder of the index, and two sub-structured variables. The first sub-structured variable is the consequence of the rule and the rest are the premise items of the rule. During the process of evolution, the index changes between the upper and lower bounder and possibly become zero, meaning the corresponding premise item is indispensable in the rule. 3.2 Fitness Assume an individual contains n rules and each rule Ri (i=1,…,n) has an accuracy αi and a coverage κi. The fitness function is designed to be the average of all the rules’ fitness. F (α , κ ) =
= =
n
1 n
∑
1 n
∑
1 n
i =1
n
i =1
ω1 + ω 2 1 = ω1 ω2 n + αi κi
n
∑
i =1
(ω 1 + ω 2 )α i κ ω 2 α i + ω 1κ i
i
(ω 1 + ω 2 ) [ x ] R i D i
ω1 [x]R
i
+ ω
2
Di
n
∑
i =1
f (Ri, Di)
where ω1,ω2 are the weights for αi and κi, respectively, and ω1+ω2=1. Therefore: 0
Where R1 and R2 are the conjunctive form of attribute-value pairs and d is the target concept whose equivalent class is D, thus we achieve the following properties of the fitness function: if R1 ⊃ R2 then
f (R2 , D) ≥ f (R1 , D)
(ω1 + ω2 ) [ x]R2 D [ x]R1 D
if R1 ⊃ R2 and αR1(D)= 1 then f( R2,D) ≤ f(R1,D); if R1 ⊃ R2 and κR2(D)=1
then f( R2,D)
≤ f(R1,D);
if R1 ⊃ R2 and αR1(D)= 0 then f( R2,D) =0; if R1 ⊃ R2 and αR2(D)= 0 then f(R1,D)
≥ f( R2,D) =0;
if f(R1,T), f(R2,T) are given and f ( R , D ) ≤ 2 f(R1,D)≥f( R2,D);
ω 2 f ( R1 , D ) then ω 2 + ω 1 (1 − f ( R 1 , D ))
160
X. Hang and H. Dai
The fitness function is intentionally designed without considering the simplicity measure for a rule, to make it as simple as possible. The simplicity is used as an important parameter for mutation of chromosomes, which will be discussed in section 3.4. 3.3 Crossover Crossover and mutation are the main operators in GA and are applied with varying probabilities. Crossover occurs between two individuals which exchange the corresponding-positioned attribute-value pairs of corresponding rules. In doing so, it breaks up both the structures of the individuals and the rules. The result of crossover is that the rules with high accuracy and coverage come together to form an individual which is potentially the optimal resolution to the problem. In this paper, heuristic crossover operators are designed to make the individuals evolve efficiently. Let P1 and P2 be the two corresponding-positioned rules in two parent individuals respectively, and O1,O2 are the two rules in their offspring respectively.
P1 = (av1’ , av2’ ,...,avm’ ) , O1 = (av11 , av12 ,...,av1m ) ,
P2 = (av1" , av2" ,...,avm" ) O2 = (av12 , av22 ,...,avm2 )
where avi refers to an attribute-value pair. There are totally four cases to be considered with the parent rules, and in each case we design two heuristic operators: µ,ν for crossover as follows: (1) O1 = µ P1 + ν P2
and
O 2 = µ P2 + ν P1 , if P1 P2 = φ
( 2 ) O1 = µ P1 + ν ( P2 − ( P1 ∩ P2 )) and O 2 = µ P2 + ν ( P1 − ( P1 ∩ P2 )), if P1 P2 ≠ φ (3) O1 = P1 and
O 2 = µ P2 + ν ( P1 − P2 ) , if P1 ⊃ P2
( 4 ) O1 = P1 and
O 2 = P2 , if P1 = P2
In case 4, µ ,ν both equal to 1 and in other cases, µ ,ν are determined is as follows:
µ= f1 / (f1+f2) and ν = f2 /(f1+f2) µP indicates that µ|P| attribute-value pairs are selected from P and be put into the offspring rules. The two heuristic operators are characterised that the offspring rules are inclined to inherit more attribute-value pairs from the parents that have higher fitness. For example: P1 and P2 in figure 2 are two corresponding-positioned rules in the two parent individuals. The numbers in each rule are the indices to the related attribute-value pairs in the list. Assume that f(P1)=0.654 and f(P2)=0.216, so u=0.738 and v=0.263. Since P1 ∩ P2={20,23,24,29} ≠ φ , O1=0.738*P1 ∪ 0.263*(P1−{20,23,24,29
})={20,22,23,24,25,28,29},O2=0.738*P2 ∪ 0.263*(P2−{20,23,24,29})={20,21,23,24,2 6,27, 29}.
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
P1
20 22 23 24 25 27 29
P2
20 21 23 24 25 28 29
O1: O2:
161
20 22 23 24 25 28 29 20 21 23 24 26 27 29
Fig. 2. Crossover between parents
3.4 Mutation In the classical genetic algorithm, mutation is brought about by changing the bits or characters of the offspring chromosome if it passes the probability test. Mutation in our case takes place in the following three manners: 1) Mutate an index between its upper bounder and lower bounder; Let Oi={ index1, index2,…, indexm} be the offspring rule where indexi is chosen to mutate. Its new value is determined as follows: index
mut i
= index ( index
l _ bounder i
u _ bounder i
+ round (ζ ∗
− index
l _ bounder i
))
(4)
where ζ is a random float-point number in the interval [0, 1] and function round(·) refers to getting an integer. This mutation plays a key role in maintaining the diversity of the offspring population. 2) Mutate an index into zero; This mutation only happens when the length of the rule Oi is over δ, and at once mutation only one index in the rule has the chance to mutate to zero.
index i , if O i <= δ index imut = if O i > δ 0,
(5 )
where δ is a given threshold integer. This type of mutation acts actually as a kind of rule generalization. In contrast to fuzzy rules models in which all the rules are set up to the same length, the probabilistic rules generated in our model are not in equivalent length. 3) Mutate the flag of a rule into zero Mutation is designed to occur at different levels. Mutations in the first two manners refer to changing the index values so that a better or a shorter rule may appear. Mutation in this manner is designed to delete a rule that has both a low accuracy and a low coverage so that the total number of rule is as less as possible without reducing the performance of the model.
0, if α i < δ α and κ α < δ κ flag imut = flag i , otherwise
(6)
162
X. Hang and H. Dai
4 Experimental Results We apply our method to a real-life database of ticket machines which contains 12 attributes and 3067 records of break down and maintenance(see table2). The attributes C1~C11 are the conditional ones, and D is the decisional attribute which has seven different values denoting the maintenance priorities. C5 and C7 are numeric attributes and are to be discreted into 7 and 4 different values, respectively. The total number of attribute-value pairs and the total searching space are calculated as follows: Total attribute-value pairs =Dim(C1)+Dim(C2)+…+Dim(C11)+Dim(D) = 4+6+2+5+7+5+4+4+5+3+4+7 =56 Total space=Dim(C1)×Dim(C2) ×…×Dim(C11) ×Dim(D) = 4×6×2×5×7×5×4×4×5×3×4×7 =56448000 Table 2. A database on ticket machines’ breakdown failure and maintenance No
C1
C2
C3
C4
1
2
1
1
4
C5
C6
C7
C8
C9
C10
C11
D
13117
1
67
3
3
24
3
7
2
2
1
1
4
162
2
5
4
3
24
3
7
3
2
1
1
4
13748
0
68
3
5
24
1
2
4
2
1
1
4
9105
0
58
3
3
24
0
1
5
2
1
1
4
14604
0
77
4
3
24
0
2
6
2
1
1
4
162
2
5
4
3
24
3
7
7 …
2 …
1 …
1 …
4 …
13748 …
0 …
68 …
3 …
5 …
24 …
1 …
2 …
3067 4 2 2 4 2891 0 45 2 5 48 0 2 C1: Station Type, C2: Machine Model, C3: Day Type, C4: Level of Busy, C5: Daily Usage, C6: Soft Failure Count, C7: Ticket Rejection, C8:Humidity, C9:Temperature,C10:Maintenance Interval, C11: Failure Rate, D: Maintenance Priority
4.1 Number of Rules An ideal model should be able to fit the data accurately by using a minimum number of representative rules. A large number of rules increase the complexity of the model without providing additional information. Moreover, it is even worse to mislead us in understanding the properties of the system, and thus reduces the generalizing capability of the model. The final number of rules generated using our method is mainly depended on two factors. The first is the initial number we set in chromosome, which
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
163
is generally over-estimated. The second is the two thresholds used, that determine whether a rule is deleted from the chromosome. The figure2 shows the evolution of the rule number under the different combinations of the three parameters: the initial rule number, δα and δκ. The final number of rules is about 30 and it takes more time to reach this number when the initial number of rules is bigger.
Fig. 2. Final number of rule with different δαand δκ
4.2
Evolution of the Population
Figure 3 shows the performance of the best individual in each generation with the parameters: 100 rules in a chromosome, the probability of crossover pc=0.9, and the probability of mutation pm=0.1. The fitness of the best individual reaches about 0.8 after 100 generations, and gradually converges at 0.91. The total coverage of the best individual achieved similar results.
4.3 Comparison with Other Methods We have compared our method with others in order to verify its effectiveness and efficiency. We examined the total rules that each model produces, the maximum length, the minimum length, the average length of the mined probabilistic rules, the total coverage and the time taken for each algorithm to execute. We get a minimum number of rules and a shorter average length by our approach without losing high coverage or efficiency.
164
X. Hang and H. Dai
Fig. 3. Fitness and total coverage of the best individual
Table 3. Comparison with other methods Approaches Total rules Max length
Min length Average Length Total coverage Time(minutes)
C4.5 47 12 3 6.54 96.76 1.50
Rough set 65 12 3 6.98 100 5.31
FP-growth 232 12 2 4.32 100 2.22
GA-Rough set 31 10 3 5.78 92.54 1.88
FP-Growth: the mining process is constrained by minimum support=2.0% and the result is selected with attribute D as the consequence of association rule.
5
Conclusions
In this paper, we introduced an optimal strategy for extracting probabilistic rules from databases. Due to the drawback of high computational complexity of rough set, it is unpractical to acquire probabilistic rules from a large database by directly applying rough set method, because it is a completely NP-hard problem to find the best reduct of an attribute set. The probabilistic rules are defined on two statistic measures: accuracy and coverage based on rough sets theory. The advantage of the combination of genetic algorithm(GA) with rough sets is to find the optimal probabilistic rules that have the highest accuracy and coverage, as well as having the shortest length. Some heuristic genetic operators are also utilized in order to make the global searching and evolution of rules more efficient. Experimental results reveal that our approach significantly reduced the time complexity compared with pure rough set approach. It generates probabilistic classification rules of the same integrity when compared with traditional classification methods, however with the advantage of being able to dealing with noise and missing value as well.
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets
165
References 1. 2.
3. 4. 5.
6. 7. 8. 9. 10.
11. 12. 13. 14. 15.
16. 17.
18.
Tsumoto, S. Knowledge discovery in clinic databases and evaluation of discovered knowledge in outpatient clinic, information science 124(2000)125–137 Wogulis, J., Iba.W., and Langley, P. Trading off simplicity and coverage in incremental concept learning. In Proceedings of the Fifth International Conference in Machine Learning, pages 73–79, San Mateo, 1992. Morgan Kaufmann. 11 Papadakis, S.E., Theocharis, J.B. A GA-based modeling approach for generating TSK models Fuzzy sets and system 131(2002)121–152. Tsumoto, S. Automated extraction of medical expert system rules from clinic databases based on rough set theory information science 112(1998)67–84 Xiaohua Hu. Using Rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications. In the proceeding of IEEE International conference on data mining. 29 Nov. –2 Dec. 2001, San Jose, California ,USA. Chow, K.M. and Rad, A.B. On-line fuzzy identification using genetic algorithms, fuzzy sets and systems 132(2002)147–171 Arul Siromoney and Inoue, K. Consistency and Completeness in rough sets Journal of Intelligent and information system, 15(2000),207–220 Zdzilaw Piasta and Andrej Lenarcik. Rule induction with Probabilistic rough classifications Flochkhart and Radcliffe, N., A genetic algorithm-based approach to data mining. International conference on KDD,1996. M. Li, J. Kou and J. Zhou. Programming Model for concept learning and its solution based on genetic algorithm. Proceeding of the 3rd world congress on intelligent control and automation, June 28–July, 2 2000, Hefei,P.R.China. Magne Stenes and Hans Roubos, GA_Fuzzy modelling and classification: complexity and performance. IEEE transactions on fuzzy systems. Vol.8.No.5. October 2000. Marzena Kryszkiewicz, Rough set approach to incomplete information systems. Information sciences 112(1998)39–49. Romon Lopezde Mantaras and Eva Armentgol, Machine learning from examples: inductive and lazy methods. Data & Knowledge engineering 25(1998)99–123 J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. IEEE Trans. on Intelligent Systems, 13(2):pp.44–49, 1998. Daijin Kim, Sung-Yang Bang, A Handwritten Numeral Character Classification Using Tolerant Rough Set, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2000) pp. 923–937. Staal Vinterbo, A genetic algorithm for a family of a set cover problems, http://www.idi.ntnu.no/~staal/setc/setc.pdf H.Dai & X. Hang. A Rough set Theory Based Optimal Attribute Reduction using Genetic Algorithm. In proceedings of Computational Intelligence for Modelling Control and Automation(CIMCA),2001, Las Vegas,Vevada, USA, pp.140–148. Xiaoshu Hang and Honghua Dai, Rough computation of extension matrix for learning from examples, In proceedings of Computational Intelligence for Modelling Control and Automation(CIMCA),2001, Las Vegas,Vevada,USA, pp.161–171
Extraction of Coverings as Monotone DNF Formulas Kouichi Hirata1 , Ryosuke Nagazumi2 , and Masateru Harao1 1
2
Department of Artificial Intelligence {hirata,harao}@ai.kyutech.ac.jp Graduate School of Computer Science and Systems Engineering Kyushu Institute of Technology Kawazu 680-4, Iizuka 820-8502, Japan [email protected]
Abstract. In this paper, we extend monotone monomials as large itemsets in association rule mining to monotone DNF formulas. First, we introduce not only the minimum support but also the maximum overlap, which is a new measure how much all pairs of two monomials in a monotone DNF formula commonly cover data. Next, we design the algorithm dnf cover to extract coverings as monotone DNF formulas satisfying both the minimum support and the maximum overlap. In the algorithm dnf cover , first we collect the monomials of which support value is not only more than the minimum support but also less than the minimum support as seeds. Secondly we construct the coverings as monotone DNF formulas, by combining monomials in seeds under the minimum support and the maximum overlap. Finally, we apply the algorithm dnf cover to bacterial culture data.
1
Introduction
The purpose of data mining is to extract hypotheses to explain a database. An association rule is one of the most famous forms of hypotheses in data mining or association rule mining [1,6,7,12]. In order to extract association rules from a transaction database, the algorithm Apriori, introduced by Agrawal et al. [2,3], extracts large itemsets as sets of variables satisfying the minimum support for the transaction database. Then, by combining variables in each large itemset, we can extract association rules satisfying both the minimum support and the minimum confidence for the transaction database. The disadvantage of Apriori, however, is that, if we extract association rules that explain a transaction database nearly overall, then the extracted large itemsets only reflect the data with very high frequency and they are not interesting in general. Furthermore, if we deal with a transaction database with one
This work is partially supported by Japan Society for the Promotion of Science, Grants-in-Aid for Encouragement of Young Scientists (B) 15700137 and for Scientific Research (B) 13558036. Current address: Zenrin Co., Ltd.
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 166–179, 2003. c Springer-Verlag Berlin Heidelberg 2003
Extraction of Coverings as Monotone DNF Formulas
167
class class, it is natural to extract an association rule x1 ∧ · · · ∧ xn → class with the consequence class rather than the association rules constructed from the rule generation in [2,3], for example, x2 ∧ · · · ∧ xn → x1 , from a large itemset {x1 , . . . , xn }. In order to extract the association rules with the consequence class that explain the transaction database nearly overall, in this paper, we regard a large itemset as a monotone monomial and extend it to a monotone DNF formula as a disjunction of monotone monomials. It is a problem for the above purpose that there exist extremely many monotone DNF formulas satisfying only the minimum support. Then, in this paper, we assume that each monotone monomial in a monotone DNF formula should cover data in a transaction database without overlapping preferably. Hence, we introduce a new measure overlap as the cardinality of an overlap set. Here, the overlap set of two monomials t and s is the set of ID’s of data that t and s commonly cover, and the overlap set of a monotone DNF formula f = t1 ∨ · · · ∨ tm is the union of overlap sets of each two terms ti and tj in f (1 ≤ i = j ≤ m). We call a monotone DNF formula satisfying the minimum support and the maximum overlap a covering as monotone DNF formulas. Based on the above two measures, we design the algorithm dnf cover to extract coverings as monotone DNF formulas from a transaction database, by extending the algorithm Apriori. In this paper, we adopt the minimum support more than 70%. In the algorithm dnf cover , first we collect monotone monomials of which frequency is not only more than the minimum support, which are coverings as monotone monomials, but also less than the minimum support as seeds. Secondly we construct monotone DNF formulas by combining monomials in seeds under the minimum support and the maximum overlap. Here, we use the monotonicity of overlap sets in order to reduce a search space. Finally, we give the empirical results by applying the algorithm dnf cover to bacterial culture data, which are full version of data in [9,10,11]. We use two kinds of such data, one is MRSA (methicillin-resistant Staphylococcus aureus) data with 118 records and another is Anaerobes data with 1064 records. Both of them consist of data with 93 attributes. Then, we evaluate the number and the length of extracted coverings and investigate the extracted coverings from a medical viewpoint.
2
Coverings as Monotone DNF Formulas
For a set S, |S| denotes the cardinality of S. Let X be a finite set and we call an element of X a variable. A monotone monomial in X is a finite conjunction of variables in X. For a monotone monomial t = x1 ∧ · · · ∧ xn , we denote the number n of variables in t by |t|. Sometimes we identify a monotone monomial x1 ∧ · · · ∧ xn with a set {x1 , . . . , xn } of variables in X. A monotone DNF formula in X is a finite disjunction of monotone monomials in X. A transaction database in X is a set D of pairs of a natural number and a finite set of variables in X, that is, D = {(tid , Ttid ) | Ttid ⊆ X} [2,3]. Here,
168
K. Hirata, R. Nagazumi, and M. Harao
the natural number tid and the set Ttid of variables are called a transaction ID and a transaction, respectively. For an attribute-value database, the form of “attribute = attribute value” is regarded as a variable in X. Sometimes we omit the statement “in X” in monotone monomials, monotone DNF formulas and transaction databases. We introduce two measures, a usual measure frequency defined as the cardinality of a cover set and a new measure overlap defined as the cardinality of an overlap set. Definition 1. Let D be a transaction database. Then, the cover set cvs D (t) of a monomial t for D is defined in the following way. cvs D (t) = {tid | (tid , Ttid ) ∈ D, t ⊆ Ttid }. Furthermore, the cover set cvs D (f ) of a monotone DNF formula f = t1 ∨· · ·∨tm for D is defined as the union of cover sets of all monomials in f . cvs D (f ) =
m
cvs D (ti ).
i=1
The frequency of f in D is defined as |cvs D (f )| and denoted by freq D (f ).
Definition 2. Let D be a transaction database. Then, the overlap set ols D (t) of monomials t and s for D is defined in the following way. ols D (t, s) = cvs D (t) ∩ cvs D (s). Furthermore, the overlap set ols D (f ) of a monotone DNF formula f = t1 ∨· · ·∨tm in D is defined as the union of all overlap sets of two distinct monomials in f . ols D (f ) =
ols D (ti , tj ).
1≤i =j≤m
The overlap of f in D is defined as |ols D (f )| and denoted by ol D (f ). For δ (0 < δ ≤ 1) and η (0 < η ≤ 1), we say that a monotone DNF formula f is a covering as monotone DNF formulas of D under the minimum support δ and the maximum overlap η if f satisfies that freq D (f ) ≥ δ|D| and ol D (f ) ≤ η|D|. In particular, we say that a monotone monomial t is a covering as monotone monomials of D under the minimum support δ if t satisfies that freq D (f ) ≥ δ|D|. Furthermore, a covering as either monotone monomials or monotone DNF formulas of D is called a covering of D simply.
Extraction of Coverings as Monotone DNF Formulas
3
169
Extraction of Coverings
In this section, we design the algorithm dnf cover to extract coverings of transaction databases under the minimum support δ and the maximum overlap η, by extending Apriori [2,3]. First note that the minimum support is very small (less than 2%) in Apriori [2,3]. On the other hand, our purpose is to find coverings reflecting much data in a transaction database than large itemsets of Apriori, so our minimum support is much larger than one of Apriori. In this paper, we set the minimum support to more than 70% in the empirical results in Section 4. In order to extract coverings under the minimum support δ and the maximum overlap η, we design the algorithm dnf cover described as Figure 1. This algorithm consists of two phases, a conjunction phase to construct Lk by Apriori and a disjunction phase to monotone DNF formulas from monotone monomials. The difference between a conjunction phase in dnf cover and Apriori is that we collect monomials not satisfying the minimum support as seeds and set them to a variable SEED in a conjunction phase. A monotone monomial t ∧ x is collected in SEED if freq D (t) ≥ δ|D| and freq D (t ∧ x) < δ|D|. Furthermore, in order to avoid to collect monotone monomials with low frequency in seeds, in this paper, we introduce the minimum monomial support σ. In dnf cover , we collect the monomial t in SEED satisfying that σ|D| ≤ freq D (t) < δ|D|. If σ = 0, then R coincides with R in a conjunction phase of dnf cover . Theorem 1 (The monotonicity of ol D ). For monotone DNF formulas f and g, it holds that ol D (f ) ≤ ol D (f ∨ g). Proof. By the definition, it holds that ol D (f ∨ g) = |ols D (f ∨ g)|. Suppose that f and g be monotone DNF formulas t1 ∨ · · · ∨ tm and tm+1 ∨ · · · ∨ tn (1 ≤ m < n). Then, f ∨ g = t1 ∨ · · · ∨ tm ∨ tm+1 ∨ · · · ∨ tn . By the definition, the following statement holds. ols D (f ∨ g) = ols D (ti , tj ) 1≤i =j≤n
= ols D (f ) ∪
ols D (ti , tj ).
1≤i≤n,m+1≤j≤n
Hence, it holds that ols D (f ) ⊆ ols D (f ∨ g), that is, ol D (f ) ≤ ol D (f ∨ g).
Theorem 2. Let ti (1 ≤ i ≤ m) be monomials, f a monotone DNF formula t1 ∨ · · · ∨ tm , and g a monotone DNF formula f ∨ t. Then, the following equation holds. ols D (g) = ols D (f ) ∪
m
ols D (ti , t).
i=1
Proof. By regarding t as tm+1 , the following statement holds.
170
K. Hirata, R. Nagazumi, and M. Harao
procedure dnf cover (D, δ, η) /* D: a transaction database, X: a set of variables, δ: minimum support, η: maximum overlap, σ: minimum monomial support */ L0 ← ∅; L1 ← X; k ← 0; SEED ← ∅; = Lk+1 do begin /* conjunction phase */ while Lk k ← k + 1; Ck+1 ← ∅; forall t ∈ Lk such that |t| = k do begin forall (tid , Ttid ) ∈ D do if t ⊆ Ttid then forall lexicographically larger variables x ∈ Lk than all variables in t do Ck+1 ← Ck+1 ∪ {t ∧ x}; end /* forall*/ R ← {t ∈ Ck+1 | freq D (t) < δ|D|}; R ← {t ∈ Ck+1 | σ|D| ≤ freq D (t) < δ|D|}; Lk+1 ← Lk ∪ (Ck+1 − R); SEED ← SEED ∪ R ; end /* while */ DNF 1 ← Lk ; l ← 0; S0 ← ∅; S1 ← SEED; while S l+1 = ∅ do begin /* disjunction phase */ l ← l + 1; DNF l+1 ← ∅; Sl+1 ← ∅; forall f ∈ S l do begin forall lexicographically larger elements t ∈ SEED than all monomials in f do if ol D (f ∨ t) ≤ η|D| then Sl+1 ← Sl+1 ∪ {f ∨ t}; if freq D (f ∨ t) ≥ δ|D| then DNF l+1 ← DNF l+1 ∪ {f ∨ t}; end /* forall */ end /* while */ return
l
DNF i ;
i=1
Fig. 1. The algorithm dnf cover to extract coverings from D
ols D (f ∨ t) =
ols D (ti , tj )
1≤i =j≤m+1
= ols D (f ) ∪
m
ols D (ti , tm+1 ).
i=1
Hence, the statement holds.
By Theorem 1, once a monotone DNF formula does not satisfy the maximum overlap, no monotone DNF formula obtained by adding monotone monomials to it satisfies the maximum overlap. Then, in a disjunction phase, we construct monotone DNF formulas in the following way.
Extraction of Coverings as Monotone DNF Formulas
171
While a monotone DNF formula f satisfies the maximum overlap, we connect t ∈ SEED to f by a disjunction ∨. If f ∨ t satisfies the minimum support, then add f ∨ t to coverings. Note that, in the construction of S2 , we can obtain the overlap sets ols D (t∨s) for each t, s ∈ SEED such that t ∨ s satisfying the maximum overlap. Then, by Theorem 2, the overlap ol D (f ) for f ∈ Sl (l ≥ 3) can be obtained by the overlap sets of each pairs of elements in SEED in S2 . Example 1. Consider the transaction database D in Figure 2. Furthermore, assume that the minimum support and the maximum overlap are 80% and 25%, respectively. Also the minimum monomial support σ is set to 0%. tid 1 2 3 4 5
Ttid a, c, e, f b, c, e c, e, f a, b, c, f d, e
Fig. 2. A transaction database D
A conjunction phase in dnf cover constructs DNF 1 and SEED as Figure 3. We fix the order in SEED from top-down. Next, in a disjunction phase, Sl and DNF l (l ≥ 2) as Figure 4 are constructed from the SEED. Here, the value freq and ol in the column “determ.” mean that the formula does not satisfy the minimum support and the maximum overlap, respectively, and the value • means that the formula satisfies both. Thus, the algorithm dnf cover adds the formula with freq and • to Sl and the formula with • to DNF l . Hence, all of the extracted coverings of D by the algorithm dnf cover is described as follows. DNF 1 : c, e DNF 2 : a ∨ (c ∧ e), b ∨ f, b ∨ (c ∧ e), d ∨ f, d ∨ (c ∧ e), DNF 3 : a ∨ b ∨ d, a ∨ d ∨ (c ∧ e), b ∨ d ∨ f, b ∨ d ∨ (c ∧ e).
4
Empirical Results from Bacterial Culture Data
In this section, we give the empirical results by applying the algorithm dnf cover to bacterial culture data, which are full version in [11]. We use two kinds of such data, one is MRSA (methicillin-resistant Staphylococcus aureus) data with 118
172
K. Hirata, R. Nagazumi, and M. Harao
monotone monomial t cvs D (t) c 1, 2, 3, 4 e 1, 2, 3, 5
monotone monomial t cvs D (t) a 1, 4 b 2, 4 d 5 f 1, 3, 4 c∧e 1, 2, 3
Fig. 3. DNF 1 (left) and SEED (right) for D formula f a∨b a∨d a∨f a ∨ (c ∧ e) b∨d b∨f b ∨ (c ∧ e) d∨f d ∨ (c ∧ e) f ∨ (c ∧ e) DNF 3 a∨b∨d a∨b∨f a ∨ b ∨ (c ∧ e) a∨d∨f a ∨ d ∨ (c ∧ e) a ∨ f ∨ (c ∧ e) b∨d∨f b ∨ d ∨ (c ∧ e) b ∨ f ∨ (c ∧ e) d ∨ f ∨ (c ∧ e) DNF 4 a∨b∨d∨f a ∨ b ∨ d ∨ (c ∧ e) a ∨ d ∨ f ∨ (c ∧ e) DNF 2
cvs D (f ) 1, 2, 4 1, 4, 5 1, 3, 4 1, 2, 3, 4 2, 4, 5 1, 2, 3, 4 1, 2, 3, 4 1, 3, 4, 5 1, 2, 3, 5 1, 2, 3, 4 1, 2, 4, 5 1, 2, 3, 4 1, 2, 3, 4 1, 3, 4, 5 1, 2, 3, 4, 5 1, 2, 3, 4 1, 3, 4, 5 1, 2, 3, 4, 5 1, 2, 3, 4 1, 2, 3, 4, 5 1, 2, 3, 4, 5 1, 2, 3, 4, 5 1, 2, 3, 4, 5
ols D (f ) determ. 4 freq ∅ freq 1 freq 1 • ∅ freq 4 • 2 • ∅ • ∅ • 1, 3 ol 4 • 1, 4 ol 1, 2, 4 ol 1, 4 ol 1 • 1, 3 ol 4 • 2 • 2, 4 ol 1, 3 ol 1, 4 ol 1, 2, 4 ol 1, 2, 3, 4 ol
Fig. 4. Sl and DNF l for D
records and another is Anaerobes data with 1082 records. Both of them consist of data between four years (from 1995 to 1998) with 93 attributes. In this paper, we transform them to transaction databases to applying dnf cover . In the following tables, δ and η denote the maximum support and the minimum overlap, respectively. Also max vars denotes the maximum number of variables in monotone monomials for each extracted covering. Furthermore, we identify DNF i with its cardinality |DNF i |. In this section, the minimum monomial support is fixed to 10%.
Extraction of Coverings as Monotone DNF Formulas
4.1
173
MRSA Data
The number of extracted coverings from MRSA data by dnf cover is described as Figure 5.
δ 70% η 5% 10% 15% 5% DNF 1 150 150 150 7 DNF 2 121 1243 2293 4 DNF 3 9 1108 9325 3 DNF 4 14 133 2464 2 DNF 5 7 200 2734 4 DNF 6 6 178 3486 − DNF 7 − 54 2662 − DNF 8 − − 89 − total 307 3066 23203 20 max vars 7 7 7 2
80% 10% 15% 7 7 93 227 46 356 42 182 9 189 2 140 − 73 − 9 199 1183 3 3
5% 1 2 1 1 − − − − 5 1
90% 10% 15% 1 1 3 5 7 51 8 27 − 25 − 8 − − − − 19 117 1 1
Fig. 5. The number of extracted coverings from MRSA data.
In general, by decreasing the minimum support and by increasing the maximum overlap, the number and the length of extracted coverings are increasing, because decreasing the minimum support is corresponding to increasing the number of elements in SEED and DNF i and increasing the maximum overlap is to decreasing the number of elements in DNF i . Furthermore, for δ = 70% or 80%, i = 1, 2 and 3 are corresponding to the largest cardinality of DNF i under η = 5%, 10% and 15%, respectively.
DNF frequency ((Cep1 = R) ∧ (PcS = R))78.8 ∨ (dis = 13)11.0 81.4% ((PcS = R) ∧ (VCM = S) ∧ (beta = 0))79.7 ∨ (dis = 13)11.0 80.5% (PcB = R)79.7 ∨ (dis = 34)11.9 83.1% (LCM = R)79.7 ∨ (ward = 3)10.2 80.5% (CBP = R)23.7 ∨ (year = 95)50.0 ∨ (year = 96)13.6 80.5% (CBP = R)23.7 ∨ (year = 95)50.0 ∨ (year = 97)16.1 80.5% Fig. 6. A part of extracted coverings from MRSA data under the minimum support 80% and the maximum overlap 10%.
Figure 6 describes a part of extracted coverings from MRSA data under the minimum support 80% and the maximum overlap 10%, where the subscript number of each monotone monomial denotes its frequency (%). Then, dnf cover
174
K. Hirata, R. Nagazumi, and M. Harao
extracts the drug-resistant for MRSA, that is, the resistant to benzilpenicillins (PcB = R), synthetic penicillins (PcS = R) and 1st generation cephems (Cep1 = R). Furthermore, dnf cover extracts not only the disease information that is cranial nerve (dis = 13) or nephrostomy (dis = 34) but also the information that the department is an internal medicine (dept = 1) or the ward is 3 (ward = 3). They are possible to be a key of emerging infection. As another sensitivity of antibiotics, dnf cover extracts the coverings containing the resistant to lincomycins (LCM = R), which implies the fact that lincomycins take no effect for MRSA. Also the last two coverings in Figure 6 contain the resistant to carbapenems (CBP = R). In particular, we can extract the covering (CBP = S)55.1 ∨ (year = 98)20.3 under the minimum support 70% and the maximum overlap 10%, so these coverings are possible to imply the drug-resistant change for four years. 4.2
Reducing a Search Space
In dnf cover , we have already introduced the minimum monomial support σ as a threshold to reduce a search space. In this section, we also introduce another threshold called the maximum monomial support τ . Then, we replace R ← {t ∈ Ck+1 | σ|D| ≤ freq D (t) < δ|D|} in dnf cover with R ← {t ∈ Ck+1 | σ|D| ≤ freq D (t) ≤ τ |D|}. Hence, dnf cover outputs the coverings as monotone DNF formulas of which monotone monomial is uniformly frequent, that is, of which monotone monomial t always satisfies that σ|D| ≤ freq D (t) ≤ τ |D|. Figure 7 describes the number of extracted coverings from MRSA data by dnf cover with the maximum monomial support τ . Here, τ = δ means the results without τ , which are the same results described by Figure 5. Note that |DNF 1 | is the same result without τ , because the construction of DNF 1 is independent from the introduction of τ to dnf cover . On the other hand, if τ = 50%, then |DNF i | for i ≥ 3 (η = 5%), i ≥ 4 (η = 10%) and i ≥ 5 (η = 15%) are the same results without τ , respectively. If τ = 30%, then |DNF i | for i ≥ 7 is the same result without τ . Hence, dnf cover with the maximum monomial support reduces the number of extracted coverings consisting of a few monotone monomials from MRSA data. 4.3
Anaerobes Data
In this section, we apply dnf cover to Anaerobes data with 1082 records1 larger than MRSA data with 118 records. Figure 8 describes the number of extracted coverings from Anaerobes data by dnf cover . Note that, under the minimum support 70% or the maximum overlap 15%, we cannot extract coverings from Anaerobes data. 1
The number of data is different from [8], because we simply extract data of which bacterium is Anaerobes from the original data, while data in [8] has been obtained by cleaning our data.
Extraction of Coverings as Monotone DNF Formulas δ η τ DNF 1 DNF 2 DNF 3 DNF 4 DNF 5 DNF 6 DNF 7 DNF 8 total max vars δ η τ DNF 1 DNF 2 DNF 3 DNF 4 DNF 5 DNF 6 DNF 7 DNF 8 total max vars δ η τ
δ 150 121 9 14 7 6 − − 307 7
5% 50% 150 3 9 14 7 − − − 189 6
δ 7 4 3 2 4 − − − 20 2
5% 50% 7 1 3 2 4 − − − 17 2
30% 150 0 0 3 6 − − − 162 6
70% 10% δ 50% 30% 150 150 150 1243 4 0 1108 80 0 133 133 32 200 200 109 178 178 177 54 54 54 − − − 3066 799 522 7 6 6
30% 7 − − − − − − − 7 2
δ 7 92 46 42 9 2 − − 199 3
80% 10% 50% 30% 7 7 1 0 24 0 42 0 9 1 2 1 − − − − 85 9 2 2
δ
90% 10% 50% 30%
5% δ 50% 30%
DNF 1 1 DNF 2 2 DNF 3 1 DNF 4 1 DNF 5 − DNF 6 − total 5 max vars 1
1 0 1 1 − − 3 1
1 − − − − − 1 1
1 2 7 8 − − 19 1
1 0 1 8 − − 10 1
1 − − − − − 1 1
δ 150 2293 9325 2464 2734 3486 2662 89 23203 7
15% 50% 150 4 223 1089 2734 3486 2662 89 10437 6
175
30% 150 0 1 108 1304 3017 2662 89 7331 6
15% δ 50% 30% 7 7 7 227 1 0 356 55 0 182 145 0 189 189 17 140 140 25 73 73 73 9 9 9 1183 619 131 3 2 2
δ 1 5 51 27 25 8 117 1
15% 50% 30% 1 0 1 24 25 8 59 1
1 − − − − − 1 1
Fig. 7. The number of extracted coverings from MRSA data by introducing the maximum monomial support τ .
Under the minimum support 80% and the maximum overlap 10%, 40 coverings contain the resistant to benzilpenicillins, anti-pseudomonas penicillins (PcAP = R), 1st generation cephems, 2nd generation cephems (Cep2 = R), 3rd
176
K. Hirata, R. Nagazumi, and M. Harao δ 80% η 5% 10% DNF 1 7 7 DNF 2 9 15 DNF 3 6 44 DNF 4 0 9 DNF 5 1 3 DNF 6 − 2 total 23 78 max vars 3 3
90% 5% 10% 1 1 1 3 2 5 − − − − − − 4 9 1 1
Fig. 8. The number of extracted coverings from Anaerobes data.
generation cephems (Cep3 = R), lincomycins and macrolides (ML = R). In particular, 29 coverings are redundant for the sensitivity of antibiotics such as (PcB = R) ∨ (PcB = S) and 11 coverings are nonredundant. A part of extracted nonredundant coverings is described in Figure 9.
DNF frequency (Cep1 = S)45.2 ∨ (PcB = R)52.0 87.5% (Cep2 = R)10.6 ∨ (year = 95)42.4 ∨ (year = 96)36.2 80.9% (Cep2 = S)73.9 ∨ (Cep3 = R)14.7 81.5% (LCM = R)31.0 ∨ (ML = S)59.9 84.1% (LCM = S)57.6 ∨ (ML = R)29.2 81.7% Fig. 9. A part of extracted nonredundant coverings from Anaerobes data under the minimum support 80% and the maximum overlap 10%.
On the other hand, Anaerobes consist of 13 species. In particular, we pay our attention to Bacteroides spp. data with 524 records, Fusobacterium spp. data with 154 records, Prevotella spp. data with 165 records and Streptococcus spp. data with 157 records, and refer them to Bact, Fuso, Prev and Stre, respectively. Then, the number and examples of extracted coverings from them by dnf cover are described as Figure 10 and 11, respectively. Here, the maximum overlap is fixed to 10%. With increasing the size of database under the same minimum support and maximum overlap, the number of extracted coverings by the algorithm dnf cover tends to decrease by Figure 5 and 8, while its tendency is not correct exactly by Figure 10. Furthermore, we pay our attention to the sensitivity of antibiotics in the extracted coverings (cf. Figure 11). 1. In extracted coverings from Bact, the resistant to benzilpenicillins, 1st generation cephems, 2nd generation cephems and 3rd generation cephems occurs
Extraction of Coverings as Monotone DNF Formulas data records δ DNF 1 DNF 2 DNF 3 DNF 4 DNF 5 DNF 6 total max vars
Bact 524 70% 80% 90% 24 7 7 116 22 5 287 43 13 39 4 − 44 2 − 2 2 − 512 80 25 4 3 3
Fuso 154 70% 80% 90% 239 44 8 789 283 10 1494 159 19 396 28 0 230 9 0 28 17 2 3176 540 39 7 5 3
Prev 165 70% 80% 90% 87 15 3 525 33 5 887 57 19 156 16 4 134 4 − 2 − − 1791 125 31 6 4 2
177
Stre 157 70% 80% 90% 5 1 1 34 2 1 202 25 2 118 25 2 83 12 1 2 2 2 444 67 9 2 1 1
Fig. 10. The number of extracted coverings from Bact, Fuso, Prev and Stre under the maximum overlap 10%. data DNF frequency Bact ((CBP = S) ∧ (CP = S) ∧ (PcB = R) ∧ (TC = S))74.6 ∨ (ctr = 2)12.8 77.7% (LCM = R)42.3 ∨ (ML = S)56.9 90.9% (LCM = S)50.2 ∨ (ML = R)36.3 83.9% Fuso ((CBP = S) ∧ (CP = S) ∧ (Cep1 = S) ∧ (Cep2 = S) ∧ (PcAP = S) 84.4% ∧ (PcB = S))64.3 ∨ (PcB = R)16.2 ∨ (age = 10s)12.3 (LCM = R)26.0 ∨ (ML = S)53.9 73.4% Prev ((CBP = S) ∧ (CP = S) ∧ (Cep2 = S) ∧ (Cep3 = S) ∧ (LCM = S))67.3 75.8% ∨(ML = R)10.9 (LCM = R)10.8 ∨ ((ML = S) ∧ (PcAP = S))64.3 72.6% Stre (LCM = R)10.3 ∨ (ML = S)58.1 ∨ (age = 10s)12.3 71.0% Fig. 11. A part of extracted coverings from Bact, Fuso, Prev and Stre under the minimum support 70% and the maximum overlap 10%.
frequently. Also the resistant to anti-pseudomonas penicillins occurs in 4 redundant coverings. Furthermore, the resistant to lincomycins and macrolides occurs in 4 coverings, in which 2 coverings are redundant and 2 coverings are described in Figure 11. 2. In extracted coverings from Fuso, the resistant to benzilpenicillins, 1st generation cephems and 3rd generation cephems occurs frequently. Also the resistant to macrolides occurs in 5 redundant coverings. 3. In extracted coverings from Prev, the resistant to lincomycins and macrolides occurs frequently. Also the resistant to 1st generation cephems occurs in 34 coverings and the resistant to benzilpenicillins occurs in 5 redundant coverings. Other sensitivity of antibiotics is susceptibility. 4. In extracted coverings from Stre, all of the sensitivity of antibiotics are susceptibility, except lincomycins. Note here that the occurrence of the susceptible to carbapenems (CBP = S) in Figure 11 is implied by the fact that carbapenems take effect for Anaerobes. Also
178
K. Hirata, R. Nagazumi, and M. Harao
the above statement for Stre is implied by the fact that any drug takes effect for Streptococcus spp. Furthermore, the information that the age of patients is 10’s is interesting, because our bacterial culture data mainly consist of information for older patients.
5
Conclusion
In this paper, we have extended monotone monomials as large itemsets in Apriori to monotone DNF formulas, and formulated the coverings as monotone DNF formulas by introducing two measure, the minimum support and the maximum overlap. Then, we have designed the algorithm dnf cover to extract the coverings as monotone DNF formulas. Finally, we have given the empirical results by applying the dnf cover to MRSA data and Anaerobes data. In particular, we have succeeded to extract some valuable coverings from a medical viewpoint. It is one of the advantage of the algorithm dnf cover that we give no upperbound of both the number of monotone monomials and the number of variables in monotone monomials in monotone DNF formulas. For example, by using k minimal multiple generalizations by Arimura et al. [5], we can design the algorithm to extract coverings as monotone k-term DNF formulas, where k is the upperbound of the number of monomials. Instead of them, we give the upperbound of the overlap (and the minimum and the maximum monomial supports), and reduce a search space for the extraction of monotone DNF formulas. In this paper, we only implement the prototype to the algorithm dnf cover , so it is a future work to improve the efficiency of the implementation. In particular, Agrawal et al. [2,3] have designed AprioriTid in order to decrease the number of accesses to a transaction database, so it is necessary to improve our dnf cover according to AprioriTid. Figure 5, 8 and 10 in Section 4 claim that the appropriate number of our bacterial culture data from which dnf cover extracts many nonredundant coverings is less than about 500. This claim is concerned with 93 attributes of our data. It is a future work to clear the relationship between the number of attributes and such appropriate number of data for various databases. From a medical viewpoint, it is interesting how coverings can be extracted from Staphylococci and Enterococci data, in particular, by the difference of samples between the blood and others. It is an important future work. From a viewpoint of Algorithmic/Computational Learning Theory, it is well known that monotone DNF formulas are learnable with equivalence and membership queries [4]. It is a future work to analyze the relationship between the learnability and the extraction of monotone DNF formulas, and to incorporate the learning algorithm with dnf cover . Acknowledgment. The authors would thank to Kimiko Matsuoka in Osaka Prefectural General Hospital and Shigeki Yokoyama in Koden Industry Co., Ltd. for the valuable comments from a medical viewpoints in Section 4 and 5.
Extraction of Coverings as Monotone DNF Formulas
179
References 1. J.-M. Adamo: Data mining for association rules and sequential patterns: Sequential and parallel algorithms, Springer, 2001. 2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo: Fast discovery of association rules, in [7], 307–328. 3. R. Agrawal, R. Srikant: Fast algorithms for mining association rules in large databases, Proc. of 20th VLDB, 487–499, 1994. 4. D. Angluin: Queries and concept learning, Machine Learning 2, 319–342, 1988. 5. H. Arimura, T. Shinohara, S. Otsuki: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data, Proc. 11th STACS, LNCS 775, 649–660, 1994. 6. S. Dˇzeroski, N. Lavraˇc (eds.): Relational data mining, Springer, 2001. 7. U. M. Fayyed, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.): Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996. 8. K. Matsuoka, M. Fukunami, S. Yokoyama, S. Ichiyama, M. Harao, T. Yamakawa, S. Tsumoto, K. Sugawara: Study on the relationship of patients’ diseases and the occurrence of Anaerobes by using data mining techniques, Proc. International Congress of the Confederation of Anaerobes Societies 186 (1Xa-P2), 2000. 9. E. Suzuki: Mining bacterial test data with scheduled discovery of exception rules, in [10], 34–40. 10. E. Suzuki (ed.): Proc. International Workshop of KDD Challenge on Real-World Data (KDD Challenge 2000), 2000. 11. S. Tsumoto: Guide to the bacteriological examination data set, in [10], 8–12. Also available at http://www.slab.dnj.ynu.ac.jp/challenge2000. 12. C. Zhang, S. Zhang: Association rule mining, LNAI 2308, 2002.
What Kinds and Amounts of Causal Knowledge Can Be Acquired from Text by Using Connective Markers as Clues?

Takashi Inui, Kentaro Inui, and Yuji Matsumoto

Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, 630-0192, Japan
{takash-i,inui,matsu}@is.aist-nara.ac.jp
Abstract. This paper reports the results of our ongoing research into the automatic acquisition of causal knowledge. We created a new typology for expressing the causal relations — cause, effect, precond(ition) and means — based mainly on the volitionality of the related events. From our experiments using the Japanese resultative connective “tame”, we achieved 80% recall with over 95% precision for the cause, precond and means relations, and 30% recall with 90% precision for the effect relation. The results indicate that over 27,000 instances of causal relations can be acquired from one year of Japanese newspaper articles.
1 Introduction
In many fields including psychology and philosophy, the general notion of causality has been a research subject since the age of ancient Greek philosophy. From the early stages of research into artificial intelligence, many researchers have been concerned with common-sense knowledge, particularly cause-effect knowledge, as a source of intelligence. Relating to this field, ways of designing and using a knowledge base of causality information to realize natural language understanding have also been actively studied [14,3]. For example, knowledge about the preconditions and effects of actions is commonly used for discourse understanding based on plan recognition. Figure 1-(a) gives a typical example of this sort of knowledge about actions, which consists of precondition and effect slots of an action labeled by the header. This knowledge-intensive approach to language understanding results in a bottleneck due to the prohibitively high cost of building and managing a comprehensive knowledge base. Despite the considerable efforts put into the CYC [8] and OpenMind [16] projects, it is still unclear how feasible it is to try to build such a knowledge base manually. Very recently, on the other hand, several research groups have reported on attempts to automatically extract causal knowledge from a huge body of electronic documents [1,7,2,13]. While these corpus-based approaches to the acquisition of causal knowledge have considerable potential, they are still at a very preliminary stage in the sense that it is not yet clear what kinds and what amount of causal knowledge they might extract,
Fig. 1. The example of plan operator and causal relations
how accurate the process could be, and how effectively extracted knowledge could be used for language understanding. Motivated by this background, we are reporting the early results of our approach to automatic acquisition of causal knowledge from a document collection. In this work, we consider the use of resultative connective markers such as "because" or "so" as linguistic clues for knowledge acquisition. For example, given the following sentences (1), we may be able to acquire the causal knowledge given in Figure 1-(a), which can be decomposed into two finer-grained causal relations as given in Figure 1-(b):

(1) a. Because it was a sunny day today, the laundry dried well.
    b. It was not sunny today, so John couldn't dry the laundry in the sun.

The idea of using these sorts of connective markers to acquire causal knowledge is not novel in itself. In this paper, however, we address the following subset of the above-mentioned unexplored issues, focusing on knowledge acquisition from Japanese texts:
– What classification typology should be given to causal relations that can be acquired using clues provided by connective markers (in Section 5),
– How accurately can acquired relation instances be classified (in Section 6 and Section 7), and
– How many relation instances can be acquired from currently available document collections (in Section 7).
2 Causal Knowledge
We regard causal knowledge instances as binary relations such as those in Figure 1-(b): the headings indicate causal relations, and the arguments indicate the related events held in the causal relation with each other. Given text segments like (1), the process of acquiring causal knowledge comprises two independent phases: argument identification and causal relation estimation.
Table 1. Typology of causal relations
2.1 A Typology of Causal Relations
One of the main goals of discourse understanding is the recognition of the intention behind each volitional action appearing in a given discourse. In intention recognition, therefore, it is important to distinguish volitional actions (e.g., the action of "drying laundry") from all the other sorts of non-volitional states of affairs (e.g., the event of "laundry drying"). For convenience, in this paper, we refer to the former simply as actions (Act) and the latter as states of affairs (SOA), except where a more precise specification is needed. We need to classify causal relations with respect to the volitionality of their arguments. Given the distinction between actions and SOAs, the causal knowledge base needed for intention recognition can be considered as consisting of:
– the causal relation between SOAs,
– the precondition relation between SOAs and actions,
– the effect relation between actions and SOAs, and
– the means relation between actions.
These relations should not be confused. For example, confusing precondition with effect may lead to a fatally wrong inference: hanging laundry causes it to become dry, but never causes a sunny day. Based on the distinction between these relations, we have created a typology of causal relations, summarized in Table 1. In the table, Acti denotes a volitional action and SOAi denotes a non-volitional state of affairs. The first column of the table gives the necessary condition for each relation class. For example, effect(Act1, SOA2) denotes that, if the effect relation holds between two arguments, the first argument must be a volitional action and the second must be a non-volitional state of affairs. On the other hand, it is not easy to provide rigorous sufficient conditions for each relation class. To avoid addressing unnecessary philosophical issues, we provide each relation class with a set of linguistic tests that loosely specify the sufficient condition. Several examples of the linguistic tests we use are also presented in Table 1.
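The necessary conditions of Table 1 can be made concrete with a small sketch. The following Python fragment is our illustration only, not part of the authors' system; the event strings are hypothetical.

```python
from dataclasses import dataclass

# Volitionality of an argument: a volitional action (Act) or a
# non-volitional state of affairs (SOA).
ACT, SOA = "Act", "SOA"

# Necessary conditions of Table 1: relation name -> (first arg, second arg).
RELATION_SIGNATURES = {
    "cause":   (SOA, SOA),   # cause(SOA1, SOA2)
    "precond": (SOA, ACT),   # precond(SOA1, Act2)
    "effect":  (ACT, SOA),   # effect(Act1, SOA2)
    "means":   (ACT, ACT),   # means(Act1, Act2)
}

@dataclass
class CausalInstance:
    relation: str
    arg1: str    # natural-language expression of the first event
    arg2: str    # natural-language expression of the second event
    vol1: str    # volitionality of arg1 (ACT or SOA)
    vol2: str    # volitionality of arg2 (ACT or SOA)

    def satisfies_necessary_condition(self) -> bool:
        # True iff the argument volitionalities match the relation's signature.
        return RELATION_SIGNATURES[self.relation] == (self.vol1, self.vol2)

# Illustrative instance in the spirit of Figure 1-(b).
inst = CausalInstance("effect", "dry the laundry in the sun",
                      "the laundry is dry", ACT, SOA)
print(inst.satisfies_necessary_condition())   # True
```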
2.2 Arguments of Causal Relations
Our proposed collection of causal relations should be at a higher level of abstraction than mere rhetorical relations. When a causal relation is estimated from text, we must therefore abstract away subjective information, including the tense, aspect and modality of the arguments. Similarly, it is desirable that some propositional elements of the arguments also be abstracted to conceptual categories, like Asia → LOCATION NAME. Thus, for the acquisition of causal knowledge, we also need to automate this abstraction process. In this paper, however, we focus on the relation estimation problem. We process the arguments as follows (see example (2)):
– maintaining all propositional information, and
– discarding all subjective and modal information.

(2) I am familiar with Asia because I traveled around Asia.
    → effect(travel around Asia, be familiar with Asia)

Representation of Arguments. We represent the arguments of causal relation instances by natural language expressions, as in Figure 1-(b) and (2), instead of by a formal semantic representation language, for the following two reasons. First, it has proven difficult to design a formal language that can fully represent the diverse meanings of natural language expressions. Second, as discussed in [6], there has been a shift towards viewing natural language as the best means for knowledge representation. In fact, for example, all the knowledge in the Open Mind Commonsense knowledge base is represented by English sentences [16], and Liu et al. [9] reported that it could be successfully used for textual affect sensing.
3 The Source of Knowledge

3.1 Causal Relations and Connective Markers
Let us consider the following examples, from which one can obtain several observations about the potential sources of causal knowledge.

(3) a. The laundry dried well today because it was sunny.
    b. The laundry dried well, though it was not sunny.
    c. If it was sunny, the laundry could dry well.
    d. The laundry dried well because of the sunny weather.
    → e. cause(it is sunny, laundry dries well)

(4) a. Mary used a tumble dryer because she had to dry the laundry quickly.
    b. Mary could have dried the laundry quickly if she had used a tumble dryer.
    c. Mary used a tumble dryer to dry the laundry quickly.
Table 2. Frequency distribution of connective markers
Table 3. Frequency distribution of tame in the intra-sentential contexts
    d. Mary could have dried the laundry more quickly with a tumble dryer.
    → e. means(use a tumble dryer, dry laundry quickly)

First, causal knowledge can be acquired from sentences with various connective markers. (3e) is a cause relation instance that can be acquired from subordinate constructions with various connective markers, as in (3a)–(3d). Likewise, the other classes of relations can also be acquired from sentences with various connective markers, as in (4). The use of several markers is advantageous for improving the recall of the acquired knowledge.

Second, it is also interesting to see that the source of knowledge could be extended to sentences with an adverbial minor clause or even a prepositional phrase, as exemplified by (3d), (4c) and (4d). Note, however, that the acquisition of causal relation instances from such incomplete clues may require additional effort to infer elliptical constituents. To acquire the means relation instance (4e) from (4d), for example, one might need the capability to paraphrase the prepositional phrase "with a tumble dryer" into a subordinate clause, say, "if she had used a tumble dryer".

Third, different kinds of instances can be acquired with the same connective marker. For example, the knowledge acquired from sentence (3a) is a cause relation, but that acquired from (4a) is a means relation. Thus, one needs to create a computational model that is able to classify the samples according to the causal relation implicit in each sentence. This is the issue we address in the following sections.
3.2 Japanese Connective Markers
The discussion of English in Section 3.1 applies equally to Japanese. One could acquire the same causal relation instances from sentences with various connective
markers such as tame (because, in order to), ga (but) and (re-)ba (if) . On the other hand, different kinds of causal relation instances could be acquired from the same connective marker. Table 2 shows the frequency distribution of connective markers in the collection of Nihon Keizai Shimbun newspaper articles from 1990. Observing this distribution, we selected tame as our target for exploration because (1) the word tame is used relatively frequently in our corpus, and (2) the word tame is typically used to express causal relations more explicitly than other markers. Next, Table 3 shows the frequency distribution of the intra-sentential contexts in which tame appears in the same newspaper article corpus. The word tame is most frequently used as an adverbial connective marker accompanying a verb phrase that constitutes an adverbial subordinate clause (see Table 3-(a)). Hereafter, sentences including such clauses will be referred to as tame-complex sentences. We were pleased to observe this tendency because, as argued above, the acquisition from complex sentences with adverbial subordinate clauses is expected to be easier than from sentences with other types of clues such as nominal phrases (see Table 3-(b)). Based on this preliminary survey, we restrict our attention to the tame-complex sentences.
4 Related Work
There have been several studies aiming at the acquisition of causal knowledge from text. Garcia [1] used verbs as causal indicators for causal knowledge acquisition in French. Khoo et al. [7] acquired causal knowledge with manually created syntactic patterns specifically for the MEDLINE text database. Girju et al. [2] and Satou et al. [13] tried to acquire causal knowledge by using connective markers in the same way as we do. However, the classification of causal relations that we described in this paper is not taken into consideration in their methods. It is important to note that our typology of causal relations is not just a simple subset of common rhetorical relations as proposed in Rhetorical Structure Theory [10]. For example, (3) shows that a cause relation instance could be acquired not only from a Reason rhetorical relation (exemplified by (3a)), but also from Contrast and Condition relations ((3b) and (3c), respectively). A collection of causal relations should be considered as representing knowledge of a higher level of abstraction rather than as a collection of rhetorical relations. In other words, causal relation instances are knowledge that is needed to explain why rhetorical relations are coherent. For example, it is because you know the causal relation (3e) that you can understand (3a) to be coherent but (5) to be incoherent. (5) ∗ The laundry dried well today though it was sunny.
Table 4. Distribution of causal relations held by tame-complex sentences in S1 . SC denotes the subordinate clause and MC denotes the matrix clause. Acts and SOAs denote an event referred to by the SC, and Actm and SOAm denote an event referred to by the MC.
5 Causal Relations in Tame-Complex Sentences
Before moving on to the classification of tame-complex sentences, in this section we describe the causal relations implicit in tame-complex sentences. We examined their distribution as follows:

Step 1. First, we took a random sample of 1000 sentences that were automatically categorized as tame-complex sentences from a newspaper article corpus. Removing interrogative sentences and sentences from which a subordinate-matrix clause pair was not properly extracted due to preprocessing (morphological analyzer) errors, we had 994 remaining sentences. We refer to this set of sentences as S1.

Step 2. Next, we manually divided the 994 sentences composing S1 into four classes depending on the combination of the volitionality of the subordinate and matrix clauses. The frequency distribution of the four classes (A–D) is shown on the left-hand side of Table 4.

Step 3. We then examined the distribution of the causal relations we could acquire from the samples of each class, using the linguistic tests exemplified in Table 1¹. The right-hand side of Table 4 shows the most abundant relation and its ratio for each class A–D. For example, given a tame-complex sentence, if the subordinate clause refers to a volitional action and the matrix clause refers to a non-volitional SOA (namely, class B), they are likely to hold the relation effect(Acts, SOAm) with a probability of 0.93 (149/161).

The following are examples of cases where the most abundant relation holds.
(6) tai-de manguroubu-wo hakaisi-ta-tame daisuigai-ga hasseisi-ta.
    in Thailand / mangrove-acc / destroy-past-tame / flooding-nom / occur-past
¹ The clausal volitionality and the causal relations were judged using the linguistic tests. To estimate the reliability of the judgments, two subjects majoring in computational linguistics are currently annotating the texts with both volitionality and causal relations. We calculated the κ statistic on 200 previously annotated samples. The κ value was 0.93 for volitionality and 0.88 for causal relations.
Table 5. Feature set used for volitionality estimation
    Serious flooding occurred because mangrove swamps were destroyed in Thailand.
    Acts: (someone) destroy mangrove swamps in Thailand
    SOAm: serious flooding occur
    → effect(destroy mangrove swamps in Thailand, serious flooding occur)

(7) pekin-eno kippu-wo kau-tame kippuuriba-ni i-tta.
    for Beijing / ticket-acc / buy-tame / to ticket office / go-past
    (I) went to the ticket office in order to buy a ticket for Beijing.
    Acts: (I) buy a ticket for Beijing
    Actm: (I) go to the ticket office
    → means(go to the ticket office, buy a ticket for Beijing)
The distribution shown in Table 4 is quite suggestive. As far as tame-complex sentences are concerned, if one can determine the volitionality of the subordinate and matrix clauses, one can classify tame-complex sentences into the four relations (cause, effect, precond and means) with a precision of 85% or more. Motivated by this observation, in the next section we first address the issue of automatic estimation of clausal volitionality before moving on to the issue of automatic classification of causal relations.
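This observation suggests a trivially simple baseline, sketched below. The mapping is our reading of Table 4 combined with the argument-order conventions of the typology; only the ratio for class B (149/161) is quoted explicitly in the text, so treat the remaining entries and argument orders as assumptions.

```python
# Most abundant relation for each volitionality combination of the
# subordinate clause (SC) and matrix clause (MC), following Table 4.
#   (SC volitionality, MC volitionality) -> relation with its argument order
BASELINE_RULE = {
    ("SOA", "SOA"): "cause(SOA_s, SOA_m)",    # class A
    ("Act", "SOA"): "effect(Act_s, SOA_m)",   # class B (149/161 = 0.93)
    ("SOA", "Act"): "precond(SOA_s, Act_m)",  # class C
    ("Act", "Act"): "means(Act_m, Act_s)",    # class D, as in example (7)
}

def classify_tame_sentence(sc_volitionality: str, mc_volitionality: str) -> str:
    """Return the most likely causal relation for a tame-complex sentence."""
    return BASELINE_RULE[(sc_volitionality, mc_volitionality)]

# Example (7): go to the ticket office (MC) in order to buy a ticket (SC).
print(classify_tame_sentence("Act", "Act"))   # means(Act_m, Act_s)
```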
6 Estimation of Volitionality
In this section, we present our approach to estimating clausal volitionality.
Table 6. Ratio of volitionality of each clause
6.1 Preliminary Analysis
In our previous work, we found that clausal volitionality depends mostly on the verb of the clause: if clauses contain the same verb, their volitionality values also tend to be the same. Nevertheless, there are some counterexamples. For example, both the subordinate clause of (8a) and the matrix clause of (8b) contain the same verb kakudaisuru (expand); however, (8a) refers to a volitional action while (8b) refers to a non-volitional SOA.

(8) a. seisannouryoku-wo kakudaisuru-tame setubitousisuru.
       production ability-acc / expand-tame / make plant investment
       (A company) will make plant investments to expand production ability.
    b. kanrihi-ga sakugensi-ta-tame eigyourieki-ga kakudaisi-ta.
       cost-nom / reduce-past-tame / profit-nom / expand-past
       Business profit expanded as a result of management costs being reduced.

Still, there must be factors in addition to the verb that help determine clausal volitionality. As a result of analyzing the tame-complex sentences in S1, we found the following new characteristics of volitionality:
– The volitionality value of a clause tends to be a non-volitional SOA when the subject is not a person or an organization.
– The volitionality value of a clause tends to change depending on whether it appears as a subordinate clause or a matrix clause.
– The volitionality value of a clause tends to change based on modality, such as tense.
6.2 Estimation of Volitionality by SVMs
We investigated experimentally how accurately the volitionality value (volitional action or non-volitional SOA) of each clause can be estimated by using Support Vector Machines (SVMs) [17], an accurate binary classification algorithm.

Experimental Conditions. Table 5 shows the features we used to represent the sentences. While almost all of the features can be extracted automatically, it is not so easy to extract the "Subject" feature, because subject phrases usually
do not appear overtly in Japanese complex sentences. In this experiment, we implemented a simple subject feature extractor with about 60% precision. We used all the sentences in S1 as training samples and another, new tame-complex sentence set S2 as test samples. The set S2 includes 985 tame-complex sentences sampled from newspaper articles issued in a different year than S1. The frequency distributions of clausal volitionality for both S1 and S2 are shown in Table 6. In addition to the characteristics of clausal volitionality mentioned before, we found little evidence of a correlation between the volitionality values of matrix and subordinate clauses, so in this experiment we created a separate classifier for each clause. We used the quadratic polynomial kernel as the kernel function.

Results. The accuracy is 0.885 for the subordinate clauses and 0.888 for the matrix clauses. The baseline accuracy is 0.853. Here, the baseline denotes the accuracy achieved by a simple classification strategy in which (a) if the verb of the input clause appeared in the training set, the clause was classified by a majority vote, and (b) if the vote was tied or the verb was not included in the training set, the clause was classified as a volitional action by default. The results obtained with the SVMs outperform this baseline accuracy.

Next, we introduced a reliability metric to obtain higher accuracy. When the reliability of the estimated volitionality value is known, the accuracy of the automatic classification of causal relations can be improved by removing samples for which this reliability is low. As the reliability, we used the absolute value of the discriminant function (the distance from the hyperplane) output by the SVMs. We set a reliability threshold value α, and then assumed that a judgment would only be made for a sample when its reliability was greater than α. By varying α, we obtained the coverage-accuracy curves of Figure 2². These results confirm that the problem of clausal volitionality estimation can be solved with very high confidence.
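As a rough sketch of this procedure (a separate classifier per clause, a quadratic polynomial kernel, and the absolute decision value as the reliability), the following code is ours; scikit-learn and the synthetic data are assumptions, since the paper does not specify its SVM implementation or feature extraction.

```python
import numpy as np
from sklearn.svm import SVC

def train_volitionality_classifier(X_train, y_train):
    """One classifier per clause type; quadratic polynomial kernel."""
    clf = SVC(kernel="poly", degree=2)
    clf.fit(X_train, y_train)            # y: 1 = volitional action, 0 = SOA
    return clf

def predict_with_reliability(clf, X_test, alpha):
    """Judge a sample only when |decision function| (the reliability) exceeds alpha."""
    scores = clf.decision_function(X_test)
    reliable = np.abs(scores) > alpha
    labels = (scores > 0).astype(int)
    coverage = reliable.mean()            # fraction of samples actually judged
    return labels, reliable, coverage

if __name__ == "__main__":
    # Synthetic stand-in for the real feature vectors of Table 5.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = train_volitionality_classifier(X[:150], y[:150])
    labels, reliable, cov = predict_with_reliability(clf, X[150:], alpha=0.5)
    acc = (labels[reliable] == y[150:][reliable]).mean() if reliable.any() else float("nan")
    # Sweeping alpha and plotting (coverage, accuracy) gives a curve like Figure 2.
    print(f"coverage={cov:.2f}, accuracy on judged samples={acc:.2f}")
```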
7 Automatic Classification of Causal Relations
We investigated how accurately we could classify the causal relation instances contained in tame-complex sentences. For this purpose, we again used SVMs as the classifier.
7.1 Experimental Conditions
We set up four classes: cause, effect, precond and means. The features we used to represent the sentences are as follows:
i. All the features shown in Table 5,
ii. The volitionality value estimated by the technique described in the previous section, and
iii. Whether the subjects of the two clauses in the sentence are the same.

The third, subject-agreement feature can be extracted automatically with a high level of precision by using the technique described in Nakaiwa et al. [12]. However, in this experiment we were unable to implement this method; instead, a simple rule-based extractor was used. The data are the same as those in Section 6.2: we used the sentences in S1 as training samples and S2 as test samples. We first estimated the volitionality value and its reliability for all the data, and then removed about 20% of the samples by applying the reliability metric. The one-versus-rest method was used so that we could apply SVMs to this multi-class classification. When the discriminant function values acquired from two or more classifiers were positive, the classifier with the maximum function value was ultimately selected.

² Coverage = # of samples output by the model / # of samples. Accuracy = # of samples correctly output by the model / # of samples output by the model.

Fig. 2. Coverage-accuracy curves (clausal volitionality estimation)
Fig. 3. Recall-precision curves (causal relation classification)

7.2 Results
We refer to the maximum discriminant function value obtained through the one-versus-rest method as s1, and to the second highest one as s2. We then obtained the results shown in Table 8 and Figure 3³ through the same procedure as described for reliability in Section 6.2, where the classification reliability was defined as s1 + (s1 − s2). The 3-point averaged precision in Table 8 summarizes the recall-precision curves: it is the average of the precision at the three points recall = 0.25, 0.50, 0.75.
³ For each relation R: Recall = # of samples correctly classified as R / # of samples holding the target relation R. Precision = # of samples correctly classified as R / # of samples output as being R.
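A minimal sketch of the one-versus-rest scheme with the reliability score s1 + (s1 − s2) is given below; as before, scikit-learn and the synthetic data are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

RELATIONS = ["cause", "effect", "precond", "means"]

def train_one_vs_rest(X, y):
    """One binary SVM per relation class (one-versus-rest)."""
    return {r: SVC(kernel="poly", degree=2).fit(X, (y == r).astype(int))
            for r in RELATIONS}

def classify_with_reliability(classifiers, X):
    """Pick the class with the largest decision value; reliability = s1 + (s1 - s2)."""
    scores = np.column_stack([classifiers[r].decision_function(X) for r in RELATIONS])
    order = np.argsort(-scores, axis=1)
    s1 = scores[np.arange(len(X)), order[:, 0]]
    s2 = scores[np.arange(len(X)), order[:, 1]]
    labels = [RELATIONS[i] for i in order[:, 0]]
    return labels, s1 + (s1 - s2)

# Synthetic demonstration; sorting by reliability and sweeping a threshold
# produces recall-precision curves like Figure 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))
y = np.array(RELATIONS)[rng.integers(0, 4, size=400)]
clfs = train_one_vs_rest(X[:300], y[:300])
labels, reliability = classify_with_reliability(clfs, X[300:])
```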
Table 7. Distribution of causal relations held by tame-complex sentences in S2
Table 8. Accuracy of causal relation classification
The first row of Table 8 shows that our causal relation classifier performed with high precision: all relations except the effect relation class achieved a precision of over 0.95. The second row shows the current upper bound of causal relation classification; these are the results obtained when the classifiers were trained with the two primitive features, the subject feature and the subject-agreement feature, provided by a human judge instead of our simple feature extractors, in an effort to avoid machine-induced errors in the input data. The third row shows the results obtained when the classifiers were trained without the volitionality values. It is clear that clausal volitionality plays an important role in classifying causal relations.
7.3 Discussion
Let us estimate the amount of knowledge one can acquire from the tame-complex sentences in a collection of one year of newspaper articles, with approximately 1,500,000 sentences in total. Suppose that we want to acquire causal relations with a precision of, say, 99% for the cause relation, 95% for the precond and means relations, and 90% for the effect relation. First, it can be seen from Figure 3 that we then achieve 79% recall (REC) for the cause relation, 30% for effect, 82% for precond, and 83% for means. Second, assume that the frequency ratios (FR) of these relations among all the tame-complex sentences are as given in Table 7; for example, the frequency ratio of the cause relation class was 193/1000 = 19%. From these figures, it can be seen that we achieve an overall recall of about 64%:

FR(cause) × REC(cause) + FR(effect) × REC(effect) + FR(precond) × REC(precond) + FR(means) × REC(means) = 0.19 × 0.79 + 0.11 × 0.30 + 0.17 × 0.82 + 0.38 × 0.83 ≈ 0.64.

Finally, since we collected about 42,500 tame-complex sentences from one year of newspaper articles (see Table 3), we expect to acquire over 27,000 instances of causal relations (42,500 × 0.64). This number accounts for 1.8% of
all the sentences (1,500,000 sentences), and is not small in comparison to the number of causal instances included in the Open Mind Commonsense knowledge base [15] or in Marcu's results [11].
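The estimate can be reproduced with a few lines of arithmetic using the figures quoted above.

```python
# Frequency ratio (FR) and recall (REC) per relation, as quoted in the text.
FR = {"cause": 0.19, "effect": 0.11, "precond": 0.17, "means": 0.38}
REC = {"cause": 0.79, "effect": 0.30, "precond": 0.82, "means": 0.83}

overall_recall = sum(FR[r] * REC[r] for r in FR)   # about 0.64
instances = 42_500 * overall_recall                # tame-complex sentences per year
print(round(overall_recall, 2), int(instances))    # 0.64 and a little over 27,000
```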
8 Conclusion
Through our approach to acquiring causal knowledge from text, we made the following findings:
– If one can determine the volitionality of the subordinate and matrix clauses of a Japanese tame-complex sentence, the causal relation can be classified as cause, effect, precond or means with a precision of over 85% on average (Table 4 and Table 7).
– By using SVMs, we achieved 80% recall with over 95% precision for the cause, precond and means relations, and 30% recall with 90% precision for the effect relation (Figure 3).
– The classification results indicate that over 27,000 instances of causal relations can be acquired from one year of Japanese newspaper articles.

In future work, we will extend the coverage of connective markers to other frequent markers such as ga (but) and re-ba (if). More importantly, what we have discussed in this paper is not specific to Japanese, so we also want to investigate its application to English connectives. We also plan to design a computational model for applying the acquired knowledge to natural language understanding and discourse understanding.

Acknowledgements. We would like to express our special thanks to the creators of Nihongo-Goi-Taikei and of several of the dictionaries used in the ALT-J/E translation system at NTT Communication Science Laboratories, and of the EDR electronic dictionaries produced by the Japan Electronic Dictionary Research Institute. We would also like to thank Nihon Keizai Shimbun, Inc. for allowing us to use their newspaper articles. We are grateful to the reviewers for their suggestive comments, to Taku Kudo for providing us with his dependency analyzer and SVM tools, and to Eric Nichols and Campbell Hore for proofreading.
References
1. D. Garcia. COATIS, an NLP system to locate expressions of actions connected by causality links. In Proc. of the 10th European Knowledge Acquisition Workshop, pages 347–352, 1997.
2. R. Girju and D. Moldovan. Mining answers for causation questions. In Proc. of the AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002.
3. J. R. Hobbs, M. Stickel, D. Appelt, and P. Martin. Interpretation as abduction. Artificial Intelligence, 63:69–142, 1993.
4. S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K. Ogura, Y. Ooyama, and Y. Hayashi. Goi-Taikei – A Japanese Lexicon. Iwanami Shoten, 1997.
5. S. Ikehara, S. Shirai, A. Yokoo, and H. Nakaiwa. Toward an MT system without pre-editing – effects of new methods in ALT-J/E. In Third Machine Translation Summit: MT Summit III, pages 101–106, Washington DC, 1991.
6. L. M. Iwanska and S. C. Shapiro. Natural Language Processing and Knowledge Representation – Language for Knowledge and Knowledge for Language. The MIT Press, 2000.
7. C. S. G. Khoo, S. Chan, and Y. Niu. Extracting causal knowledge from a medical database using graphical patterns. In Proc. of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), pages 336–343, 2000.
8. D. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 1995.
9. H. Liu, H. Lieberman, and T. Selker. A model of textual affect sensing using real-world knowledge. In Proc. of the International Conference on Intelligent User Interfaces, pages 125–132, 2003.
10. W. C. Mann and S. A. Thompson. Rhetorical structure theory: A theory of text organization. Technical Report ISI/RS-87-190, USC Information Sciences Institute, 1987.
11. D. Marcu. An unsupervised approach to recognizing discourse relations. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 368–375, 2002.
12. H. Nakaiwa and S. Ikehara. Intrasentential resolution of Japanese zero pronouns in a machine translation system using semantic and pragmatic constraints. In Proc. of the 6th TMI, pages 96–105, 1995.
13. H. Satou, K. Kasahara, and K. Matsuzawa. Retrieval of simplified causal knowledge in text and its application. In Proc. of the IEICE, Thought and Language, 1998. (In Japanese).
14. R. Schank and R. Abelson. Scripts, Plans, Goals and Understanding. Lawrence Erlbaum Associates, 1977.
15. P. Singh. The public acquisition of commonsense knowledge. In Proc. of the AAAI Spring Symposium on Acquiring Linguistic Knowledge for Information Access, 2002.
16. D. G. Stork. Character and document research in the Open Mind Initiative. In Proc. of the Int. Conf. on Document Analysis and Recognition, pages 1–12, 1999.
17. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
18. T. Yokoi. The EDR electronic dictionary. Communications of the ACM, 38(11):42–44, 1995.
Clustering Orders

Toshihiro Kamishima and Jun Fujiki

AIST Tsukuba Central 2, Umezono 1–1–1, Tsukuba, Ibaraki, 305-8568 Japan
[email protected], http://www.kamishima.net/
Abstract. We propose a method of using clustering techniques to partition a set of orders. We define the term order as a sequence of objects sorted according to some property, such as size, preference, or price. Such orders are useful, for example, for carrying out sensory surveys. We propose a method called the k-o'means method, a modified version of the k-means method adjusted to handle orders. We compared our method with traditional clustering methods and analyzed its characteristics. We also applied our method to questionnaire survey data on people's preferences among types of sushi (a Japanese food).
1 Introduction
Clustering is the task of partitioning a sample set into clusters that have the properties of internal cohesion and external isolation [1], and is a basic tool for exploratory data analysis. Almost all traditional clustering methods are designed to deal with sample sets represented by attribute vectors or similarity matrices [2]; these methods are therefore not suited for handling other types of data. In this paper, we propose a clustering technique for partitioning one such type of data, namely orders. We define the term order as a sequence of objects that are sorted according to some property. For example, given three objects, x1, x2, and x3, one example of an order is the sequence x3 ≻ x1 ≻ x2, sorted according to an individual's preference.

An example in which clustering orders would be useful is a questionnaire survey on preferences in foods. The surveyor presents some kinds of foods to each respondent and requests that he/she sort the foods according to his/her preference. By clustering these preference data, the surveyor would be able to find groups that have the same preference tendency. For such a sensory survey, it is typical to adopt the Semantic Differential (SD) method [3]. In this method, the respondents' preferences are measured on a scale whose extremes are symbolized by antonymous adjectives. For example:

like  5  4  3  2  1  dislike

A proper interpretation of these responses should be, for example, that respondents prefer the objects categorized as "5" to the objects categorized as "4". However, due to a lack of analysis techniques, it is common to assume that all respondents share an understanding of this scale's range, intervals, and extremes [4]. Such an unrealistic assumption can be avoided by introducing order responses. We therefore developed a clustering technique for orders.
We formalize this clustering task in Section 2. Our clustering methods are presented in Section 3. The experimental results are shown in Section 4. Section 5 summarizes our conclusions.
1.1 Related Work
Clustering techniques for partitioning time-series data have been proposed previously [5,6,7]. Though both orders and time-series data are sequences of observations, there is an important difference between them: the same observation can appear repeatedly in a time series, but cannot appear twice in an order. Therefore, these clustering techniques are not suited for dealing with orders. The pioneering work on handling orders is Thurstone's law of comparative judgment [8]. Thurstone proposed a method of constructing a real-valued scale from a given set of pairwise precedence information, which indicates which of two objects precedes the other. Recently, there has been active research on the processing of orders. Cohen et al. [9] and Joachims [10] proposed methods to sort attributed objects associated with pairwise precedence information. Kamishima and Akaho [11] and Kazawa et al. [12] studied the problem of learning from sets of ordered objects. Mannila and Meek [13] tried to establish the structure expressed by partial orders among a given set of orders. Sai et al. [14] proposed association rules between order variables. However, since we do not know of an existing clustering method for orders, we advocate this new technique in this paper.
2 Clustering Orders
In this section, we formalize the task of clustering orders. An order is defined as a sequence of objects that are sorted according to some property, such as size, preference, or price. An object xa corresponds to an object, entity, or substance to be sorted. The universal object set, X∗, consists of all possible objects. An order is denoted by O = x1 ≻ x2 ≻ · · · ≻ x3. Transitivity is preserved within an order, i.e., if x1 ≻ x2 and x2 ≻ x3 then x1 ≻ x3. To express the order of two objects, x1 ≻ x2, we use the sentence "x1 precedes x2." Xi ⊆ X∗ denotes the object set composed of all the objects that appear in the order Oi. Let |A| be the size of the set A; then |Xi| is equal to the length of the order Oi. The order Oi is called a full-order if Xi = X∗, and a sub-order if Xi ⊂ X∗.

The task of clustering orders is as follows. A set of sample orders, S = {O1, O2, . . . , O|S|}, is given. Note that Xi ≠ Xj (i ≠ j) is allowed. In addition, even if x1 ≻ x2 in the order Oi, it may be that x2 ≻ x1 in the order Oj. The aim of clustering is to divide S into a partition. The partition, π = {C1, C2, . . . , C|π|}, is the set of all clusters. Clusters are mutually disjoint and exhaustive, i.e., Ci ∩ Cj = ∅, ∀i, j, i ≠ j, and S = C1 ∪ C2 ∪ · · · ∪ C|π|. Partitions are generated such that the orders in the same cluster are similar (internal cohesion) and those in different clusters are dissimilar (external isolation). The similarity measure for orders and our clustering method are presented in the next section.
3 Methods
We modified a well-known clustering algorithm, k-means, so as to be able to deal with orders. We named this modified method the k-o'means.
3.1 Similarity between Two Orders
To measure the similarity between two orders, we adopted Spearman's rank correlation, which is denoted by ρ [15]. The ρ is the correlation between the ranks of objects. The rank, r(O, x), is the cardinal number that indicates the position of the object x in the order O. For example, for the order O = x1 ≻ x3 ≻ x2, r(O, x1) = 1 and r(O, x2) = 3. The ρ between two orders, O1 and O2, consisting of the same objects (i.e., X1 = X2) is defined as

\rho = \frac{\sum_{x \in X_1} \bigl(r(O_1,x) - \bar{r}_1\bigr)\bigl(r(O_2,x) - \bar{r}_2\bigr)}{\sqrt{\sum_{x \in X_1} \bigl(r(O_1,x) - \bar{r}_1\bigr)^2 \sum_{x \in X_1} \bigl(r(O_2,x) - \bar{r}_2\bigr)^2}} ,

where \bar{r}_i = (1/|X_1|) \sum_{x \in X_i} r(O_i, x). If no tie in rank is allowed, this can be calculated by the simpler formula

\rho = 1 - \frac{6 \sum_{x \in X_1} \bigl(r(O_1,x) - r(O_2,x)\bigr)^2}{|X_1|^3 - |X_1|} .
The ρ becomes 1 if two orders are coincident, and −1 if one order is the reverse of the other. The ρ is designed for two orders consisting of the same objects. In the clustering task of Section 2, an object in one order may not appear in another order. We thus derived the rank correlation from the objects included in both orders, and ignored the rest of the objects. For example, two orders were given:

O1 = x1 ≻ x3 ≻ x4 ≻ x6,    O2 = x5 ≻ x4 ≻ x3 ≻ x2 ≻ x6.

From these orders, all the objects that were not included in the other order were eliminated. The generated orders were:

O1 = x3 ≻ x4 ≻ x6,    O2 = x4 ≻ x3 ≻ x6.
The ranks of objects in these orders were:

r(O1, x3) = 1,  r(O1, x4) = 2,  r(O1, x6) = 3;
r(O2, x3) = 2,  r(O2, x4) = 1,  r(O2, x6) = 3.

Consequently, the ρ was

\rho = 1 - \frac{6\bigl((1-2)^2 + (2-1)^2 + (3-3)^2\bigr)}{3^3 - 3} = 0.5 .
Note that if no common objects existed between the two orders, the ρ = 0 (i.e., no correlation). Over or under estimations in similarities will be caused by this heuristic
Clustering Orders
197
of ignoring objects. However, if the Xi are randomly sampled from X∗, the expectation of the observed similarities will be equivalent to the true similarity. Therefore, such over- or under-estimation can be treated in the same way as other types of noise. For the clustering task, a distance or dissimilarity is more suitable than a similarity. We define the dissimilarity between two orders based on ρ as

d(O1, O2) = 1 − ρ.   (1)
Since the range of ρ is [−1, 1], this dissimilarity ranges over [0, 2]. The dissimilarity becomes 0 if the two orders are equivalent. We comment on the reason for adopting Spearman's ρ as the similarity measure. Similarities between orders are based on one of two quantities. The first is the difference between the ranks of objects; Spearman's ρ is an example of this type. The second is the number of discordant object pairs among all object pairs. Formally, an object pair, xa and xb, is discordant if r(O1, xa) < r(O1, xb) and r(O2, xa) > r(O2, xb), or vice versa. Kendall's τ is an example of this type. While the computational complexity of the former quantity is O(|X|), that of the latter is O(|X|²). Spearman's ρ can be calculated faster, so we adopt it as the similarity.
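The dissimilarity of Equation (1), restricted to the objects shared by two orders as just described, can be sketched as follows; orders are represented here simply as Python lists of object labels (our convention, not the authors').

```python
def rank(order):
    """Map each object to its 1-based rank in the order (a list of objects)."""
    return {x: i + 1 for i, x in enumerate(order)}

def dissimilarity(o1, o2):
    """d(O1, O2) = 1 - rho, computed over the objects common to both orders."""
    common = set(o1) & set(o2)
    if len(common) < 2:
        return 1.0            # rho treated as 0 when there are no common objects
    r1 = rank([x for x in o1 if x in common])
    r2 = rank([x for x in o2 if x in common])
    n = len(common)
    sq = sum((r1[x] - r2[x]) ** 2 for x in common)
    rho = 1.0 - 6.0 * sq / (n ** 3 - n)
    return 1.0 - rho

# The worked example above: O1 = x1>x3>x4>x6, O2 = x5>x4>x3>x2>x6.
o1 = ["x1", "x3", "x4", "x6"]
o2 = ["x5", "x4", "x3", "x2", "x6"]
print(dissimilarity(o1, o2))   # 0.5 (i.e. rho = 0.5)
```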
3.2 Order Means
Before describing the algorithm of k-o'means, we give the definition of an order mean. In the case of k-means, the mean of a cluster C is derived as

\bar{x} = \arg\min_{x} \sum_{x_i \in C} \| x - x_i \|^2 ,

where the xi are data points, C is a cluster, and ‖·‖ is the L2 norm. We extend this notion so as to fit orders. That is, by employing Equation (1) as the loss function, we define an order mean, Ō, as follows:

\bar{O} = \arg\min_{O_j} \sum_{O_i \in C} d(O_i, O_j) .   (2)

Note that the order mean consists of all objects in all the orders in C, that is, X̄ = ∪_{Oi ∈ C} Xi. If every order consists of all the objects in X∗ (i.e., Xi = X∗, ∀i), the order mean can be derived by the Borda rule, which dates from the 18th century. The rule is equivalent to the following algorithm:
1) For each object xa in X∗, calculate the value

\tilde{r}^*(x^a) = \frac{1}{|C|} \sum_{O_i \in C} r(O_i, x^a) .   (3)

2) By sorting the objects according to r̃∗(xa) in ascending order, the order mean of C is derived. Note that if r̃∗(xa) = r̃∗(xb), xa ≠ xb, either xa ≻ xb or xb ≻ xa is allowed.
The proof of the optimality of this algorithm is as follows. First, we relax the constraints on ranks. While the strict ranks take values in 1, . . . , |X|, the relaxed ranks, r̃(x), are real numbers that satisfy the condition

\sum_{x \in X} \tilde{r}(x) = \sum_{i=1}^{|X|} i .

Clearly, strict ranks satisfy this condition. Since all the |Xi| are equal, Equation (1) is proportional to the sum of the squared differences between the ranks of the two orders. Therefore, the optimal relaxed ranks can be found by minimizing

\sum_{O_i \in C} \sum_{x \in X^*} \bigl(\tilde{r}(x) - r(O_i, x)\bigr)^2 .

This is minimized at r̃(x) = r̃∗(x), ∀x, in Equation (3). We then have to find the strict order, Oj, that minimizes the error. Minimizing Equation (2) is equivalent to minimizing

\sum_{O_i \in C} \sum_{x \in X^*} \bigl(r(O_j, x) - r(O_i, x)\bigr)^2
= \sum_{O_i \in C} \sum_{x \in X^*} \bigl(r(O_i, x) - \tilde{r}^*(x)\bigr)^2
+ |C| \sum_{x \in X^*} \bigl(\tilde{r}^*(x) - r(O_j, x)\bigr)^2 .   (4)

Since the first term is already minimized, the strict order mean Ō corresponds to the strict order Oj that minimizes the second term. We next show that the order mean Ō is equivalent to Õ∗, that is, the order sorted according to r̃∗(x). Assume that there is at least one discordant pair of objects between Ō and Õ∗. Formally, there exists an object pair, xa and xb, such that

\bigl(\tilde{r}^*(x^a) - \tilde{r}^*(x^b)\bigr)\bigl(r(\bar{O}, x^a) - r(\bar{O}, x^b)\bigr) < 0 .

Let d1 be the second term of Equation (4) in this case. By swapping these two objects only in the order mean, this error becomes d2. Then,

d_1 - d_2 = -2|C|\bigl(\tilde{r}^*(x^a) - \tilde{r}^*(x^b)\bigr)\bigl(r(\bar{O}, x^a) - r(\bar{O}, x^b)\bigr) > 0 .

The fact that the error decreases by swapping objects contradicts the assumption that Ō is an order mean. Therefore, the order mean must not be discordant with Õ∗. Consequently, the above algorithm yields the order mean.

Unfortunately, it is in practice a rare situation in which all orders consist of the same object set. If this condition is not satisfied, the above algorithm cannot be applied. The calculation of the order mean is difficult in this case, since it is a discrete optimization problem. Instead of deriving a strictly optimal solution, we investigated several ad hoc methods. We tried the methods in [11], such as the one compatible with Cohen's greedy method [9]. As a result of empirical comparisons, we found that the following method based on Thurstone's model achieved less error in Equation (2) than the others.
Thurstone's law of comparative judgment (case V) [8] is a generative model of orders. This model assumes that a score R(xa) is assigned to each object xa, and orders are derived by sorting according to the scores. The scores follow a normal distribution, i.e., R(xa) ∼ N(µa, σ), where µa is the mean score of the object xa and σ is a common constant standard deviation. Based on this model, the probability that object xa precedes xb is

\Pr[x^a \succ x^b]
= \int_{-\infty}^{\infty} \phi\Bigl(\frac{t - \mu_a}{\sigma}\Bigr)
  \int_{-\infty}^{t} \phi\Bigl(\frac{u - \mu_b}{\sigma}\Bigr)\, du\, dt
= \Phi\Bigl(\frac{\mu_a - \mu_b}{\sqrt{2}\,\sigma}\Bigr) ,   (5)

where φ(·) is the normal density function and Φ(·) is the normal distribution function. Accordingly,

\mu_a - \mu_b = \sqrt{2}\,\sigma\, \Phi^{-1}\bigl(\Pr[x^a \succ x^b]\bigr) .

The µa are rescaled into µ̄a by dividing by √2σ, and the origin is set to the mean of µ̄1, . . . , µ̄|X̄|, i.e., Σ_{xa ∈ X̄} µ̄a = 0. To estimate these means, Thurstone's paired comparison method minimizes the function

Q = \sum_{x^a \in \bar{X}} \sum_{x^b \in \bar{X}}
    \Bigl( \Phi^{-1}\bigl(\Pr[x^a \succ x^b]\bigr) - (\bar{\mu}_a - \bar{\mu}_b) \Bigr)^2 .

To minimize Q, it is differentiated with respect to µ̄a:

\frac{\partial Q}{\partial \bar{\mu}_a}
= -2 \sum_{x^b \in \bar{X}}
  \Bigl( \Phi^{-1}\bigl(\Pr[x^a \succ x^b]\bigr) - (\bar{\mu}_a - \bar{\mu}_b) \Bigr) = 0 .

This formula is derived for each xa = x1, . . . , x|X̄|. By solving these linear equations, we obtain

\bar{\mu}_a = \frac{1}{|\bar{X}|} \sum_{x^b \in \bar{X}} \Phi^{-1}\bigl(\Pr[x^a \succ x^b]\bigr) .   (6)

The order mean can be derived by sorting the objects according to these µ̄a. To derive these values, we have to estimate Pr[xa ≻ xb] from the orders in the cluster C. This probability is estimated by the following process. From each order O in the cluster, all the object pairs (xa, xb) such that xa precedes xb in the order are extracted. For example, from the order O = x3 ≻ x1 ≻ x2, three object pairs, (x3, x1), (x3, x2), and (x1, x2), are extracted. Such pairs are extracted from all |C| orders in the cluster, and are collected into the set PC. As the probability Pr[xa ≻ xb], we adopted the following Bayesian estimator with a Dirichlet prior distribution, so that the probability remains non-zero:

\Pr[x^a \succ x^b] = \frac{|x^a, x^b| + 0.5}{|x^a, x^b| + |x^b, x^a| + 1} ,

where |xa, xb| is the number of object pairs (xa, xb) in PC.
Note that the calculation time for one order mean is O(|X̄|²|C|), since the counting-up time for deriving Pr[xa ≻ xb] is O(Σi |Xi|²) ≤ O(|X̄|²|C|), the estimation time of µ̄a for all xa ∈ X̄ is O(|X̄|²), and the sorting time is O(|X̄| log |X̄|).

Finally, we comment on methods for estimating µ̄a based on the maximum likelihood principle (e.g., [16]). Like the original k-means, our k-o'means depends on the initial partition. To cancel this unstable factor, one has to select the best result among a number of trials. A maximum-likelihood-based method is a kind of gradient descent, so the resulting order also depends on an initial state; more trials would be required to cancel this unstable factor. Moreover, since such an iterative method is time-consuming, we adopted the above non-iterative method.
Algorithm k-o'means(S, k, maxIter)
  S = {O1, . . . , O|S|}: a set of sample orders
  k: the number of clusters
  maxIter: the limit on the number of iterations
  1) initial partition: S is randomly partitioned into π = {C1, . . . , Ck}; π′ := π; t := 0
  2) t := t + 1; if t > maxIter then goto step 6
  3) for each cluster Cj ∈ π, derive the order mean Ōj by the procedure in Section 3.2
  4) for each order Oi in S, assign it to the cluster arg min_{Cj} d(Ōj, Oi)
  5) if π′ = π then goto step 6, else π′ := π and goto step 2
  6) output π

Fig. 1. The k-o'means algorithm
3.3 k-o'means
The k-o'means algorithm is the same as the well-known k-means algorithm, except for the notions of a mean and of dissimilarity. The algorithm is shown in Figure 1. First, initial clusters are generated by randomly partitioning S. These clusters are improved by iteratively performing two steps: deriving the order mean of each cluster, and assigning each order to the nearest cluster. If the number of iterations exceeds the threshold or the partition does not change, the algorithm stops and outputs the current partition. Note that, like the original k-means algorithm, the k-o'means cannot find the globally optimal solution. Therefore, multiple partitions are derived by starting from different initial partitions, and we then select the π minimizing

\sum_{C_i \in \pi} \sum_{O_j \in C_i} d(\bar{O}_i, O_j) .   (7)
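A compact sketch of the loop in Figure 1 is given below; it reuses dissimilarity and order_mean helpers like those sketched earlier in this section, and simplifications such as the handling of empty clusters are ours.

```python
import random

def k_o_means(samples, k, max_iter=100, seed=0):
    """Partition a list of orders into k clusters (a simplified k-o'means)."""
    rng = random.Random(seed)
    # Step 1: random initial partition, encoded as a cluster index per order.
    assignment = [rng.randrange(k) for _ in samples]
    for _ in range(max_iter):
        clusters = [[o for o, a in zip(samples, assignment) if a == j]
                    for j in range(k)]
        # Step 3: order mean of each non-empty cluster.
        means = [order_mean(c) if c else [] for c in clusters]
        # Step 4: reassign each order to the nearest order mean.
        new_assignment = [min(range(k), key=lambda j: dissimilarity(means[j], o))
                          for o in samples]
        # Step 5: stop when the partition no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Several runs with different seeds can be compared by the total
# within-cluster dissimilarity of Equation (7), keeping the best partition.
```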
We now comment on the time complexity of this algorithm. First, the calculation time for the k order means is O(|X∗|²|S|), since one mean is calculated in O(|X̄|²|C|) time (see Section 3.2), |C| ≈ |S|/k, and |X̄| ≤ |X∗|. Second, the time for the assignment of one order is O(|Xi|k), because a dissimilarity is derived for each of the k clusters and one dissimilarity is calculated in O(|Xi|) time. Thus the total assignment time is O(Σi |Xi| k) ≤ O(|X∗||S|k). Since the number of iterations is constant, the time complexity of one iteration is equivalent to the total complexity. Consequently, the total complexity becomes O(|X∗|²|S| + |X∗||S|k). In terms of |S| and k, this is equivalent to that of the original k-means. However, in terms of |X∗|, the complexity is quadratic.
4 Experiments
We applied our k-o'means algorithm to two types of data: artificially generated data and real preference survey data. In the former experiment, we compared our k-o'means with traditional hierarchical clustering methods using the dissimilarity of Equation (1). In addition, by applying the k-o'means to data with various properties, we revealed the characteristics of the algorithm. In the latter experiment, we analyzed questionnaire survey data on preferences in sushi (a kind of Japanese food), using the k-o'means algorithm as an exploratory analysis tool. Note that, in the experiments described below, Thurstone's method is used for deriving the order means, since the Xi are always proper subsets of X∗.
4.1 Evaluation Criteria
Before reporting the experimental results, we describe the evaluation criteria for partitions. In this section, the same object set is divided into two different partitions, π∗ and π̂. We present two criteria to measure the difference of π̂ from π∗.

The first measure is called purity, and is widely used (e.g., in [17]). Assume that the objects in cluster Ci∗ ∈ π∗ are classified into the true class labeled i. If all the objects in each Ĉi ∈ π̂ are classified into the majority true class, the purity corresponds to the classification accuracy. Formally, the purity is defined as

\mathrm{purity} = \frac{1}{|S|} \sum_{\hat{C}_i \in \hat{\pi}} \max_{C_j^* \in \pi^*} \bigl| \hat{C}_i \cap C_j^* \bigr| .   (8)
The range of the purity is [0, 1], and it becomes 1 if the two partitions are identical. Though purity is widely used, its lower bound changes according to π∗, and the resulting scale-normalization problem makes it difficult to use it as the basis for averaging. Therefore, we introduce a second criterion, the ratio of information loss (RIL) [18], which is also called the uncertainty coefficient in the numerical taxonomy literature. The RIL is the ratio of the information that is not acquired to the total information required for estimating a correct partition. This criterion is defined based on the contingency table of indicator functions [2]. The indicator function I((xi, xj), π) is 1 if an object pair (xi, xj) is in the same cluster, and 0 if they are in different clusters. The contingency table is a 2 × 2 matrix whose elements ast count the object pairs satisfying the conditions I((xi, xj), π∗) = s and I((xi, xj), π̂) = t, among all possible object pairs. RIL is defined as

\mathrm{RIL} = \frac{\sum_{s=0}^{1}\sum_{t=0}^{1} \frac{a_{st}}{a_{\cdot\cdot}} \log_2 \frac{a_{\cdot t}}{a_{st}}}{\sum_{s=0}^{1} \frac{a_{s\cdot}}{a_{\cdot\cdot}} \log_2 \frac{a_{\cdot\cdot}}{a_{s\cdot}}} ,   (9)

where a_{\cdot t} = \sum_{s=0}^{1} a_{st}, a_{s\cdot} = \sum_{t=0}^{1} a_{st}, and a_{\cdot\cdot} = \sum_{s=0}^{1}\sum_{t=0}^{1} a_{st}. The range of the RIL is [0, 1], and it becomes 0 if the two partitions are identical.
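Both criteria are straightforward to compute from cluster labels; the sketch below follows Equations (8) and (9), with partitions given as lists of cluster indices over the same samples (our convention).

```python
from collections import Counter
from itertools import combinations
from math import log2

def purity(true_labels, pred_labels):
    """Equation (8): fraction of samples in the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)

def ril(true_labels, pred_labels):
    """Equation (9): ratio of information loss over the pairwise indicator variables."""
    a = Counter()                       # a[(s, t)] counts object pairs
    for i, j in combinations(range(len(true_labels)), 2):
        s = int(true_labels[i] == true_labels[j])
        t = int(pred_labels[i] == pred_labels[j])
        a[(s, t)] += 1
    tot = sum(a.values())
    col = {t: a[(0, t)] + a[(1, t)] for t in (0, 1)}
    row = {s: a[(s, 0)] + a[(s, 1)] for s in (0, 1)}
    num = sum(a[(s, t)] / tot * log2(col[t] / a[(s, t)])
              for s in (0, 1) for t in (0, 1) if a[(s, t)] > 0)
    den = sum(row[s] / tot * log2(tot / row[s]) for s in (0, 1) if row[s] > 0)
    return num / den if den > 0 else 0.0

true = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(purity(true, pred), round(ril(true, pred), 3))
```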
4.2 Experiments on Artificial Data
Data Generation Process. We applied the k-o'means to artificial data in order to compare it with traditional clustering methods and to analyze the method. Test data were generated in the following two steps. In the first step, we generated the k order means. One random permutation (we call it a pivot) consisting of all objects in X∗ was generated. The other k−1 means were generated by transforming this pivot: two adjacent objects in the pivot were randomly selected and swapped, and this swapping was repeated a specified number of times. By changing the number of swapping times, the inter-cluster closeness could be controlled. In the second step, for each of the clusters, its constituent orders were generated. From the order mean, |Xi| objects were randomly selected. These objects were sorted so as to be concordant with the order mean; namely, if xa precedes xb in the order mean, this should also be the case in the generated orders. Again, adjacent object pairs were randomly swapped; by changing the number of swapping times, the intra-cluster closeness could be controlled.

Table 1. Parameters of experimental data

1) the total number of objects: |X∗| = 100
2) the number of sample orders: |S| = 1000
3) the length of the orders: |Xi| = 10
4) the number of clusters: |π| = {2, 5, 10, 50}
5) the swapping times for the order means: {a: ∞, b: 230000, c: 120000}
6) the ratio of the minimum cluster size to the maximum: {1/1, 1/2, 1/5, 1/10}
7) the swapping times for the sample orders: {a: 0, b: 30, c: 72}
The parameters of the data generator are summarized in Table 1. Parameters 1–3 are common to all the data. |X∗| and |S| were set so as to be roughly the same as those of the survey data in Section 4.3. If |Xi| is too short, the differences between orders cannot be tested; however, it is hard for respondents to sort too many objects. Considering these factors, we set the order length to 10. Parameter 4 is the number of clusters. It is difficult to partition the data if this number is large, since the sizes of the clusters then decrease. Parameter 5 is the number of swaps in the first step of the data generation process; three cases were examined. According to the simulation results, the mean ρ between the pivot and the other order means was 0.0, 0.1, and 0.3 in cases a, b, and c, respectively. Since case a is the most separated, it is the easiest to partition. Parameter 6 controls the deviation of the cluster sizes. If the cluster sizes diverge, relatively small clusters tend to be ignored; thus, 1/10 is the hardest to cluster. The last parameter is the number of swaps in the second step of the data generation process. The mean ρ between a sample order and its order mean was 1.0, 0.715, and 0.442 in cases a, b, and c, respectively. The ρ between two random orders
becomes larger than these values with probabilities 0.0, 0.01, and 0.10, respectively. Since case a is the tightest, it is the easiest to partition. The total number of parameter combinations was 4 × 3 × 4 × 3 = 144. For each setting, we generated 100 sample sets. Below, we show the means of purity and RIL over these sets.
Table 2. The means of purities and of RIL on artificial data

         KOM              AVE              MIN              MAX
purity   0.561 (0.3629)   0.466 (0.2966)   0.315 (0.2663)   0.371 (0.2430)
RIL      0.705 (0.4105)   0.910 (0.1679)   0.999 (0.0014)   0.994 (0.0112)
Comparison with Traditional Clustering Methods. We compared our k-o'means algorithm with the traditional hierarchical clustering methods: the minimum distance, maximum distance, and group average methods. Since the data are not represented by attribute vectors, the original k-means cannot partition a set of sample orders. However, hierarchical methods can partition a set of sample orders by adopting Equation (1) as the dissimilarity between each pair of orders. We applied our k-o'means and the three traditional algorithms to all 144 types of artificial data. The correct number of clusters was given as a parameter to each algorithm. The means of purities and RIL are shown in Table 2. The symbols KOM, AVE, MIN, and MAX indicate the means derived by the k-o'means, the group average, the minimum distance, and the maximum distance method, respectively; standard deviations are shown in parentheses. Clearly, our k-o'means is superior to the other three methods. According to a paired t-test between the k-o'means and each of the other three methods, the difference is significant, even at the significance level of 0.1%. We think that this advantage of the k-o'means is due to the following reasons. The dissimilarity between an order pair tends to be 1.0, since common objects are found infrequently. However, our k-o'means adopts the notion of an order mean. Since order means are derived from far more than two orders, common objects can be found far more frequently. In other words, while the traditional methods are based only on local information, the k-o'means can capture more global features of clusters. In addition, it is well known that the minimum and maximum distances are not robust against outliers, due to effects such as chaining [1].

Effects of Data Parameters. We next show the characteristics of the k-o'means as the parameters 4–7 in Table 1 are varied. Table 3 shows the means of the purities and of the RIL on each of the data groups that share a specific parameter value. For example, the column labeled "2" in Table 3(a) shows the means of purities and of RIL over the 36 types of sample sets whose parameter 4 is 2, i.e., |π| = 2. Overall, parameters 4 and 7 affected the partitioning, but the others did not. We next comment on each of these parameters.
Table 3. The means of purities and of RIL on each of the order sets with the same parameters.

(a) parameter 4: the number of clusters
         2       5       10      50
purity   0.909   0.705   0.492   0.139
RIL      0.525   0.598   0.697   0.999

(b) parameter 5: inter-cluster closeness
         a:∞ (separated)   b:230000   c:120000 (close)
purity   0.566             0.566      0.552
RIL      0.695             0.698      0.723

(c) parameter 6: deviation of cluster sizes
         1/1 (equal)   1/2     1/5     1/10 (deviated)
purity   0.543         0.547   0.569   0.586
RIL      0.684         0.689   0.710   0.738

(d) parameter 7: intra-cluster tightness
         a:0 (tight)   b:30    c:72 (scattered)
purity   0.782         0.531   0.370
RIL      0.278         0.843   0.994
Parameter 4: As the number of clusters increases, the performance becomes poorer. If the number becomes 50, it is almost impossible to recover the original partition, even in the case of the tightest intra-cluster closeness (i.e., parameter 7 is 0). This can be explained as follows. The number of possible order means is bounded by |X ∗ |!. To choose one from these means, log2 (|X ∗ |!) ≈ 525 bits of information are required. Roughly speaking, since the number of permutations of |Xi | objects is |Xi |!, one order provides log2 (|Xi |!) bits of information. In total, the orders in one cluster provide |C| log2 (|Xi |!) ≈ (|S|/|π|) log2 (|Xi |!) ≈ 436 bits of information. Consequently, due to this shortage of information, fully precise order means could not be derived. Parameter 5: Closeness between order means does not affect partitioning much. We do not know the exact reason, but one possible explanation is that order means can easily be distinguished from each other since they are very long compared to sample orders. Parameter 6: The original k-means tends to produce poor partitions if the sizes of the clusters vary, but the k-o’means does not. We think that this is also due to the high distinguishability between order means noted above. Parameter 7: It is hard to find partitions consisting of clusters in which intra-cluster similarities are low. The sample orders are relatively short, so even a low level of noise affects the partitioning performance a great deal.
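The information-theoretic argument can be checked numerically. The sketch below assumes |X∗| = 100, |Xi| = 10, |S| = 1000, and |π| = 50, values that reproduce the 525- and 436-bit figures quoted above; these concrete numbers are our reading of the experimental setting, not taken verbatim from the paper.

```python
import math

def log2_factorial(n):
    # log2(n!) = sum of log2(k) for k = 2..n; exact enough for this rough bound
    return sum(math.log2(k) for k in range(2, n + 1))

X_star, X_i, S, pi = 100, 10, 1000, 50   # |X*|, |Xi|, |S|, |π| (assumed values)

bits_needed      = log2_factorial(X_star)        # ≈ 525 bits to pin down one order mean
bits_per_order   = log2_factorial(X_i)           # ≈ 21.8 bits provided by one order
bits_per_cluster = (S / pi) * bits_per_order     # ≈ 436 bits provided by one cluster

print(round(bits_needed), round(bits_per_cluster))   # 525 436: a shortage of information
```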
4.3 Experiments on Preference Survey Data
We applied our k-o’means to questionnaire survey data on preferences in sushi. Since the notion of true clusters is not appropriate for such real data, we use the k-o’means as an exploratory analysis tool. We asked each respondent to sort 10 objects (i.e., sushi) according to his/her preference. Such a sensory survey is a very suitable area for analysis based on orders. The objects were randomly selected from 100 objects according to
Table 4. Summaries of partition on sushi data

Attributes of Clusters                             C1        C2
|C|: the numbers of respondents                    607       418
A1 : preference to heavy tasting sushi             0.4016    −0.1352
A2 : preference to sushi users infrequently eat    −0.6429   −0.6008
A3 : preference to expensive sushi                 −0.4653   −0.0463
A4 : preference to sushi fewer shops supply        −0.4488   −0.2532
their probability distribution, based on menu data from 25 sushi shops found on the WWW. For each respondent, the objects were randomly permuted to cancel the effect of the display order on their responses. The total number of respondents was 1039. We eliminated the data obtained within a response time that was either too short (shorter than 2.5 minutes) or too long (longer than 20 minutes). Consequently, 1025 responses were retained. We used the k-o’means as an exploratory tool and divided the data into two clusters. A summary of each cluster is shown in Table 4. The results given in the table were the best in terms of Equation (7) among 20 trials. The first row of the table shows the number of respondents grouped into each of the clusters; C1 is the major cluster. The subsequent four rows show the rank correlation between each order mean and the object list sorted according to the specific object attribute. For example, the fourth row presents the ρ between the order mean and the object sequence sorted according to price. Based on these correlations, we were able to learn what kind of object attributes affected the preferences of the respondents in each cluster. We next comment on each of the object attributes. Note that attributes A1 and A2 were derived from the questionnaire survey by the SD method, and the others from the menu data. Attribute A1 (the second row) shows whether the object tasted heavy (i.e., high in fat) or light (i.e., low in fat). A positive correlation indicates a preference for heavy tasting. The C1 respondents preferred heavy-tasting objects much more than the C2 respondents. Attribute A2 (the third row) shows how frequently the respondent eats the object. A positive correlation indicates a preference for objects that the respondent infrequently eats. Respondents in both C1 and C2 prefer the objects they usually eat; no clear difference was observed between the clusters. Attribute A3 (the fourth row) is the price of the objects. Prices were regularized so as to cancel the effects of sushi styles (hand-shaped or rolled) and differences in price between shops. A positive correlation indicates a preference for cheap objects. While the C1 respondents preferred expensive objects, the C2 respondents did not. Attribute A4 (the fifth row) shows how frequently the objects are supplied at sushi shops. A positive correlation indicates a preference for objects that fewer shops supply. Though the correlation of C2 is rather larger than that of C1 , the difference is not statistically significant. Roughly speaking, the members of the major group C1 prefer heavier-tasting and more expensive sushi than the members of the minor group C2 . Selecting the number of clusters is worth a mention in passing. Since the notion of the optimal number of clusters depends on the application of the clustering result, the number cannot be decided in general. However, it can be considered that dividing uniformly
distributed data is invalid. To check this, we tested whether the nearest pair of order means could be distinguished or not. We denote the order means of the clusters by Ō1 , . . . , Ō|π| , and that of the entire sample set S by Ō∗ . Let ρab be the rank correlation between Ōa and Ōb , and ρ∗a be the rank correlation between Ō∗ and Ōa . First, we found the Ōα and Ōβ such that ραβ was the maximum among all pairs of order means. To test whether the closest (i.e., the most correlated) pair of clusters, Cα and Cβ , should be merged or not, we performed a statistical test of the difference between the two correlation coefficients, ρ∗α and ρ∗β . In this case, the following statistic follows the Student t-distribution with |X ∗ |−3 degrees of freedom:

t = (ρ∗α − ρ∗β) √[ ((|X ∗ | − 3)(1 + ραβ)) / (2(1 − ρ∗α² − ρ∗β² − ραβ² + 2ρ∗α ρ∗β ραβ)) ]
If the hypothesis ρ∗α = ρ∗β is not rejected, these two clusters should be merged and the number of clusters should be decreased. When partitioning the survey data into two clusters (i.e., k = 2), t = 9.039; at the significance level of 1%, it could be concluded that these clusters should not be merged. However, since t = 1.695 when k = 3, that partition is invalid and the closest pair of clusters should be merged. Note that no criterion for selecting the number of clusters is universally valid, as pointed out in [19], since the optimality depends on the clustered data and the aim of clustering. For example, when adopting the k-o’means for the purpose of caching [20], the best estimation accuracy was achieved when k was larger than two.
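The statistic above is straightforward to compute; a minimal sketch in which rho_a_star, rho_b_star, and rho_ab are the three rank correlations defined above and n stands for |X∗| (function and variable names are ours).

```python
import math

def merge_test_t(rho_a_star, rho_b_star, rho_ab, n):
    """t statistic for the difference between the dependent correlations rho*_alpha
    and rho*_beta; it follows Student's t with n - 3 degrees of freedom."""
    num = (n - 3) * (1 + rho_ab)
    den = 2 * (1 - rho_a_star**2 - rho_b_star**2 - rho_ab**2
               + 2 * rho_a_star * rho_b_star * rho_ab)
    return (rho_a_star - rho_b_star) * math.sqrt(num / den), n - 3

# Hypothetical usage with |X*| = 100 objects:
# t, df = merge_test_t(rho_a_star, rho_b_star, rho_ab, 100)
# If the hypothesis rho*_alpha = rho*_beta is not rejected, the two clusters are merged.
```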
5
Conclusions
We developed a clustering technique for partitioning a set of sample orders. We showed that this method outperforms the traditional methods, and presented the characteristics of our method. Using this method, we analyzed questionnaire survey data on preferences in sushi. The time complexity of the current algorithm is quadratic in |X ∗ |, so it is difficult for the algorithm to deal with thousands of objects. We plan to extend the method to accommodate a much larger universal object set. It is possible to extend our methods to hierarchical ones. We simply embedded the dissimilarity of Equation (1) and the order means into the traditional Ward method. However, the computation is very slow, since the new order mean cannot be derived from the two order means of the merged clusters. In addition, the performance was poor: the mean RIL on the artificial data over 10 trials is 0.894 (compare with Table 2). This is because fully precise order means cannot be derived if the sizes of the clusters are small, and the cluster sizes are very small in the early stages of the Ward method. A more elaborate method would be required.
Acknowledgments. A part of this work is supported by the Grant-in-Aid for Exploratory Research (14658106) of the Japan Society for the Promotion of Science.
References

1. Everitt, B.S.: Cluster Analysis. Third edn. Edward Arnold (1993)
2. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
3. Osgood, C.E., Suci, G.J., Tannenbaum, P.H.: The Measurement of Meaning. University of Illinois Press (1957)
4. Nakamori, Y.: Kansei data Kaiseki. Morikita Shuppan Co., Ltd. (2000) (in Japanese)
5. Cadez, I.V., Gaffney, S., Smyth, P.: A general probabilistic framework for clustering individuals and objects. In: Proc. of The 6th Int'l Conf. on Knowledge Discovery and Data Mining (2000) 140–149
6. Ramoni, M., Sebastiani, P., Cohen, P.: Bayesian clustering by dynamics. Machine Learning 47 (2002) 91–121
7. Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proc. of The 4th Int'l Conf. on Knowledge Discovery and Data Mining (1998) 239–243
8. Thurstone, L.L.: A law of comparative judgment. Psychological Review 34 (1927) 273–286
9. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. Journal of Artificial Intelligence Research 10 (1999) 243–270
10. Joachims, T.: Optimizing search engines using clickthrough data. In: Proc. of The 8th Int'l Conf. on Knowledge Discovery and Data Mining (2002) 133–142
11. Kamishima, T., Akaho, S.: Learning from order examples. In: Proc. of The IEEE Int'l Conf. on Data Mining (2002) 645–648
12. Kazawa, H., Hirao, T., Maeda, E.: Order SVM: A kernel method for order learning based on generalized order statistic. The IEICE Trans. on Information and Systems, pt. 2, J86-D-II (2003) 926–933 (in Japanese)
13. Mannila, H., Meek, C.: Global partial orders from sequential data. In: Proc. of The 6th Int'l Conf. on Knowledge Discovery and Data Mining (2000) 161–168
14. Sai, Y., Yao, Y.Y., Zhong, N.: Data analysis and mining in ordered information tables. In: Proc. of The IEEE Int'l Conf. on Data Mining (2001) 497–504
15. Kendall, M., Gibbons, J.D.: Rank Correlation Methods. Fifth edn. Oxford University Press (1990)
16. Hohle, R.H.: An empirical evaluation and comparison of two models for discriminability scales. Journal of Mathematical Psychology 3 (1966) 173–183
17. Huang, Z.: Extensions to the k-means algorithm for clustering large data with categorical values. Journal of Data Mining and Knowledge Discovery 2 (1998) 283–304
18. Kamishima, T., Motoyoshi, F.: Learning from cluster examples. Machine Learning (2003) (in press)
19. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 (1985) 159–179
20. Kamishima, T.: Nantonac collaborative filtering: Recommendation based on order responses. In: Proc. of The 9th Int'l Conf. on Knowledge Discovery and Data Mining (2003)
Business Application for Sales Transaction Data by Using Genome Analysis Technology

Naoki Katoh1, Katsutoshi Yada2, and Yukinobu Hamuro3
1 Department of Architecture and Architectural Systems, Kyoto University, Kyoto 606-8501, Japan, [email protected]
2 Faculty of Commerce, Kansai University, Suita, Osaka 564-8680, Japan, [email protected]
3 Faculty of Business Administration, Osaka Sangyo University, Osaka 574-8530, Japan, [email protected]

Abstract. We have recently developed E-BONSAI (Extended BONSAI) for discovering useful knowledge from time-series purchase transaction data. It was developed by improving and adding new features to BONSAI, a machine learning algorithm for analyzing string patterns such as amino acid sequences, proposed by Shimozono et al. in 1994. E-BONSAI can create a good decision tree to classify positive and negative data for records whose attributes are numerical, categorical, or string patterns, whereas other methods such as C5.0 and CART cannot deal with string patterns directly. We demonstrate the advantages of E-BONSAI over existing methods for forecasting future demand by applying the methods to real business data. To demonstrate the advantage of E-BONSAI for business applications, it is important to evaluate it from two perspectives. The first is the objective and technical perspective, such as prediction accuracy. The second is the management perspective, such as the interpretability needed to create new business actions. Applying E-BONSAI to forecast how long new products survive in the instant noodle market in Japan, we have succeeded in attaining high prediction ability and in discovering useful knowledge for domain experts.
1
Introduction
1.1
Motivations and Background
Due to the diversification of consumer needs, a lot of new products have been developed and introduced into the market [6]. It has become extremely difficult for them to survive as regular assortment items, since product lifecycles have been shortened by price reductions and a highly competitive market. It is said that many of them (nearly 75%) fail at the time of market introduction [8]. It is therefore important for manufacturers and retailers to predict with high confidence, and at an early stage, whether a new product will survive or not.
Research of this paper is partly supported by the Grant-in-Aid for Scientific Research on Priority Areas (2) and RCSS research fund by the Ministry of Education, Science, Sports and Culture of Japan and the Kansai University Special Research fund, 2002.
In this paper, we discuss the demand forecast problem for instant noodles by using purchase data from a supermarket in Japan. In the Japanese instant noodle market, there are many new entries and most of them disappear quickly. Thus, the market is extremely competitive and it is very difficult to predict with high confidence whether a new product will die or survive. On the other hand, knowledge discovery in databases, or data mining, has become an active research area in recent years, and several successful cases of data mining systems using neural networks have been reported for demand forecasting. Due to the diffusion of frequent shopper programs (FSP) in many enterprises, a very large amount of data about consumer behavior has been accumulated, and it has therefore become very important to use it in order to predict sales and consumer behavior. However, few firms in Japan can make use of the large amount of time-series purchase data accumulated in their databases to form useful business strategies. This paper is concerned with the discovery of useful and deep knowledge about customer purchase behavior from time-series purchase data. For this purpose, it is necessary for the discovery system to be able to deal with not only quantitative data such as sales volume but also sequential categorical data such as brand purchase patterns.
1.2
Previous Work
There have been many research papers on demand forecasting. For example, in the marketing research area, the ”Diffusion Model” was proposed by Bass [2] and Kalish [7] for sales forecasts of consumer durables with low purchase frequency. The model expresses the diffusion process of new products in the market: once ”innovators” with great sensitivity to the new product have purchased it for the first time, other consumers follow them. For products which are purchased more frequently, the ”Repeat Purchase Model” was proposed by Fourt & Woodlock [3], Parfitt & Collins [11], and Nakanishi [10], which focuses on the repeat purchase pattern of consumers in the market of new products. In the marketing research area, most studies have dealt with small data sets such as summarized daily sales data. Due to the recent development of new technologies in the area of data mining and knowledge discovery in databases for the purpose of semi-automatically extracting meaningful knowledge from a huge amount of data, many researchers have tried to forecast the demand for various goods by using these technologies and methods. In Japan, a few papers have dealt with this issue, using neural networks or Bayesian inference to predict future sales volumes [9][15] or stock prices [16]. In their models, the evaluation criterion was forecast precision. However, these approaches have serious drawbacks from a practical viewpoint. The first is that the existing models, which usually generate trees too large for experts to interpret, do not lead to any effective business action, because domain experts can neither interpret the rules or patterns extracted from the data nor infer rules that clarify the mechanism governing demand change. On
the other hand, most research on demand forecasting is concerned with building models which precisely predict future demand, while less attention is paid to the interpretability of the derived rules. However, no manager accepts a demand forecast created by a statistical or data mining method as it is, without executing any action, even if the method correctly forecasts future demand; rather, he/she carries out various strategies to increase the demand and expand profit. Generally speaking, it is believed that active and dynamic action, rather than passive action, leads to competitive advantage. Therefore there is a large gap between research and practice concerning demand forecasting, and thus, in order to make a forecast model useful for practitioners, we need to develop a system which provides information that helps to produce useful business actions. The second is that the existing research usually uses aggregated POS (point of sales) data such as sales volume or profitability. In recent years, many retail stores have started to introduce frequent shopper programs (FSP), by which sales data with customer IDs are becoming available for use. However, the existing research has not made use of such detailed sequential data [4][17]. If we can build a forecasting model which takes into account the past purchase behavior of individual customers, it will not only become possible to construct a forecasting model with higher precision but also to discover knowledge that helps in the creation of effective business actions.
1.3
The Purpose and Contribution in This Paper
In this paper we demonstrate a new business application based on E-BONSAI (Extended BONSAI) to predict with high accuracy whether a new product will survive, by using data from the two or three weeks after the market entry of the product. The E-BONSAI system was recently developed by improving the BONSAI system, which had been developed for genome analysis, so as to adapt it to the above purpose [4]. The E-BONSAI system has several advantages over other systems. First, E-BONSAI can deal with categorical sequential data as well as numeric or categorical data to construct a prediction model with high confidence. Second, it is easy to interpret the rules and patterns extracted by E-BONSAI, as observed in Hamuro et al. [4], and hence marketing experts can obtain meaningful knowledge from them to create effective business actions, as we will see in this paper. We observe that the results of E-BONSAI exhibit not only high prediction accuracy but also high interpretability. Although several successful data mining systems have been reported, it is difficult to apply them to business data in the real world because many of them have used clean and small data sets. In this paper we not only present a new prediction model with high accuracy based on E-BONSAI, but also discover knowledge useful for domain experts by using a large amount of business sales data. The organization of this paper is as follows. We first explain the algorithm of the original BONSAI and that of E-BONSAI. Next, we explain the framework of the business case of instant noodles and apply E-BONSAI to the historical purchase
data of customers who purchased instant noodles in a supermarket. We then discuss the prediction accuracy of E-BONSAI as well as that of other methods such as C5.0 and CART, and interpret the rules derived from E-BONSAI. We conclude the paper with remarks concerning future research work.
2
Algorithm – from “BONSAI” to “E-BONSAI” –
2.1
BONSAI; Algorithm for Strings Pattern Analysis
First, we explain the original BONSAI, developed for string pattern analysis. Given a positive set of strings pos and a negative set of strings neg, the original BONSAI creates a good decision tree that classifies pos and neg as correctly as possible [1][14]. Let P be the positive data set, N be the negative data set, and |P | and |N | be the numbers of records in P and N, respectively. Given a substring α, let pT and nT be the numbers of records containing α in P and N, respectively, and let pF and nF be the numbers of records not containing α in P and N, respectively. Defining the entropy function ENT(x, y) as

    ENT(x, y) = 0                            if x = 0 or y = 0
    ENT(x, y) = −x log x − y log y           otherwise,

we define the entropy after classifying the original data into two subsets, depending on whether a record contains α as a substring or not, by the following expression:

    ((pT + nT) / (|P| + |N|)) · ENT(pT / (pT + nT), nT / (pT + nT))
      + ((pF + nF) / (|P| + |N|)) · ENT(pF / (pF + nF), nF / (pF + nF)).

We compute the α which minimizes this value; namely, we choose the α by which the information gain is maximized. After partitioning the original data based on α, BONSAI proceeds further in a recursive manner. BONSAI enhances the classification ability by introducing a mechanism called alphabet indexing. This is one of the major characteristics of BONSAI. The alphabet indexing maps the original alphabet into a new one of (much) smaller size, on which the decision tree is constructed. BONSAI searches for the alphabet indexing which maximizes the classification ability based on local search. It was observed that the use of an appropriate alphabet indexing increases accuracy and simplifies the hypothesis [14]. The BONSAI system has not been used for purposes other than genome analysis. However, there are many similarities between genome analysis and purchase pattern analysis in that both deal with string patterns. In view of this, we have developed E-BONSAI by adapting the original BONSAI to the analysis of customer purchase behavior.
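As an illustration of the split-selection step just described, the sketch below computes the weighted entropy of the two-way partition induced by a candidate substring α and picks the minimizing α. The enumeration of candidate substrings and the alphabet-indexing search are omitted, and all names are ours rather than BONSAI's.

```python
import math

def ent(x, y):
    # ENT(x, y) = 0 if x = 0 or y = 0, otherwise -x log x - y log y
    return 0.0 if x == 0 or y == 0 else -x * math.log2(x) - y * math.log2(y)

def split_entropy(alpha, pos, neg):
    """Weighted entropy after splitting pos/neg by 'contains alpha as a substring'."""
    pT = sum(1 for s in pos if alpha in s)
    nT = sum(1 for s in neg if alpha in s)
    pF, nF = len(pos) - pT, len(neg) - nT
    total = len(pos) + len(neg)
    e = 0.0
    if pT + nT > 0:
        e += (pT + nT) / total * ent(pT / (pT + nT), nT / (pT + nT))
    if pF + nF > 0:
        e += (pF + nF) / total * ent(pF / (pF + nF), nF / (pF + nF))
    return e

def best_substring(candidates, pos, neg):
    # Choosing the alpha that minimizes this entropy maximizes the information gain.
    return min(candidates, key=lambda a: split_entropy(a, pos, neg))
```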
2.2
From “BONSAI” to “E-BONSAI”
We explain the characteristics of E-BONSAI as follows.

1) While the original BONSAI generates a decision tree over regular patterns which are limited to substrings, we extend it to subsequences, based on the work of Hirao et al. [5]. Here, a string v is called a substring of a string w if w = xvy for some strings x, y ∈ Σ ∗ (which denotes the set of all strings over the underlying alphabet Σ). A string v is called a subsequence of w if v can be obtained by removing zero or more characters from w.

2) We have to deal with various attributes simultaneously in constructing a prediction model to explain the causality among interrelated and complicated factors. Therefore BONSAI is extended so that it allows us to deal not only with a single sequential pattern but with more than one sequential pattern, as well as the numerical and categorical attributes conventionally used in decision trees such as C5.0 and CART.

3) It is usually believed that the most recently purchased brand is closely related to the next purchase behavior. Thus, we extend the regular expressions so as to take into account the position where a certain symbol is contained in the whole string.

4) We improve BONSAI so that it can deal with character strings transformed from numeric time-series data. We explain in detail how such a transformation is made (a sketch of this transformation is given after the summary below). For instance, consider the case where the transition of the monthly sales volume of article A is transformed into a character string. For this, the range of the monthly sales amount is discretized, i.e., the range is partitioned into a fixed number of subintervals (buckets). There are two methods to partition the range. The first is to partition the range so that the length of each subinterval is equal; this is called the equal-length partition method. The second is to partition the range so that the number of records classified into each subinterval is almost equal; this is called the equal-size partition method. In either method, each subinterval is associated with a distinct character symbol, and the numeric time-series data is transformed into a character string obtained by concatenating the corresponding character symbols. Which method should be employed depends upon the purpose of the analysis and the characteristics of the data. In the case of demand forecasting in the instant noodle market, we use the equal-length partition method. When finding an optimal alphabet indexing for the character string obtained from numeric time-series data, in order to eliminate meaningless indexings, we limit the alphabet indexing search so that the subintervals encoded into the same symbol constitute contiguous regions in the range of the original numeric attribute.

In summary, E-BONSAI has several advantages over existing typical data mining techniques:
1) E-BONSAI can deal with categorical sequential data.
2) Therefore the rules extracted by E-BONSAI have high predictive accuracy in practice, in particular for marketing applications.
3) Experts in marketing can interpret these rules and make use of them to implement strategy.
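The transformation referred to in item 4) can be sketched as follows: a numeric time series is discretized into a fixed number of buckets by either the equal-length or the equal-size method, and each bucket is mapped to a distinct symbol. The symbol alphabet and helper names are illustrative assumptions, not the authors' implementation.

```python
import string

def to_string(series, all_values, n_classes, method="equal-length"):
    """Turn a numeric time series into a character string; all_values (the values
    observed over all records) fixes the bucket boundaries."""
    lo, hi = min(all_values), max(all_values)
    if method == "equal-length":
        # subintervals of equal width over [lo, hi]
        bounds = [lo + (hi - lo) * k / n_classes for k in range(1, n_classes)]
    else:
        # "equal-size": roughly the same number of records in each subinterval
        ordered = sorted(all_values)
        bounds = [ordered[len(ordered) * k // n_classes] for k in range(1, n_classes)]
    symbols = string.ascii_lowercase[:n_classes]
    return "".join(symbols[sum(1 for b in bounds if v > b)] for v in series)

# e.g. to_string([3, 12, 9], all_values=[0, 3, 5, 9, 12, 20], n_classes=3) -> "abb"
```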
3
Framework of Our Experiment
In this section we explain how we apply the E-BONSAI system to demand forecasts of new products in the instant noodle market. The data set has been obtained from the purchase history accumulated by the FSP system of seven retail stores in a supermarket chain in Japan from August 2000 to October 2001.
3.1
New Products and the Market of Instant Noodle
In recent years, many new products have been introduced into the instant noodle market of Japan. This is an oligopolistic market in which six major manufacturers compete with each other and dominate nearly 95% of the market share. Each manufacturer maintains several brands that have several variations of taste, such as soybean, soy sauce, salt, etc. In our experiment, we use 14945 customers who bought instant noodles in the target period. Through discussions with persons in charge of product development at a few manufacturers and with buyers of retail stores, we learned that they are not interested in precise forecasting of demand, but in the discovery of outliers as well as in understanding why they happened. From this survey, we concluded that it is important to discover the factors that affect sales demand, and we determined the aim of our analysis as follows. Instead of predicting the sales demand of new products, we try to predict whether a new product sells well for a long time period and becomes a regularly selling product, or whether it disappears within a month after release, hoping that this will help marketers take effective business actions. We have targeted our analysis at the new products released in the market from August 2000 till March 2001 which have been sold at the seven stores of a supermarket chain. New products are grouped into four classes: Class 1, those which disappear within one month; Class 2, those which disappear within two to four months; Class 3, those which disappear within five to eight months; and Class 4, those which survive for more than nine months. In order to find a rule that clearly distinguishes products which survive for a long time period from those which disappear quickly from the market, we have adopted Classes 1, 2, and 4 as the target variable which we want to predict. We then construct the prediction model by using the purchase data for the first three weeks after the entry of the new products into the market. We have adopted the products released from August 2000 to March 2001 as the training data set and, for validating the forecasting model constructed on the training data, those released from April to June 2001 as the test data set. Table 1 shows the characteristics of the target variables.
Table 1. The characteristics of target variables

                      Class 1   Class 2   Class 3
Life span in months   0-1       2-4       9-
# of training data    108       102       160
# of test data        36        50        59

3.2
Explanatory Variables
Explanatory variables have been constructed using the sales transaction data that occurred before the release date and during the first three weeks after the release date. They are categorized into the following three groups.

1) Quantitative attributes concerning products: weekly sales volume, the decrease of the sales price during the first three weeks after the release date compared to the regular price, and a 0-1 variable indicating whether the product was sold at a discount price in the corresponding week after the release.

2) Qualitative attributes concerning products: the name of the manufacturer and the taste of the product, such as soybean, pork bone, or salt.

3) Attributes concerning customers: the rate of repeat purchase (the ratio of customers who purchased the product also in the previous week among those who bought it in the corresponding week), and the ratio of heavy users of instant noodles among those who bought the product in the corresponding week. Here a customer is called a heavy user of instant noodles if his/her spending on instant noodles before the release date falls into the uppermost one third (a sketch follows the list).
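As an illustration of group 3), the two customer attributes could be derived from ID-attached purchase records roughly as follows; the record layout and all names are assumptions made for the sketch, not the authors' implementation.

```python
def repeat_purchase_rate(buyers_by_week, week):
    """Ratio of customers who bought the product in `week` and also in `week - 1`,
    among those who bought it in `week` (buyers_by_week maps week -> set of IDs)."""
    cur, prev = buyers_by_week.get(week, set()), buyers_by_week.get(week - 1, set())
    return len(cur & prev) / len(cur) if cur else 0.0

def heavy_user_ratio(buyers, spend_before_release):
    """Ratio of heavy users among `buyers`; a heavy user is a customer whose spending
    on instant noodles before the release falls into the uppermost one third."""
    ranked = sorted(spend_before_release, key=spend_before_release.get, reverse=True)
    heavy = set(ranked[: len(ranked) // 3])
    return len(buyers & heavy) / len(buyers) if buyers else 0.0
```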
3.3
Transformation into Categorical Sequential Data
In this case we have transformed the weekly sales volume, the weekly ratio of repeat purchase, and the weekly ratio of heavy users of instant noodles into separate character strings. We explain this procedure in detail as follows. There are no criteria to decide how many classes should be employed to discretize numerical data. Therefore, as a preprocessing step, we performed comparative experiments by trying two possible choices for the number of classes prepared for discretizing each of the three numeric attributes. Thus, we prepared eight data sets, as shown in Table 2. Next, in order to apply E-BONSAI, we further transform the sequence of classes of weekly sales volume into a single character string. Similarly, the sequences of classes corresponding to the ratio of repeat purchase and that of heavy users are transformed into other character strings.
Table 2. Dataset name and the number of classes prepared for each numeric attribute

Weekly sales volume   Ratio of repeat purchase   Ratio of heavy users   Data set name
3                     3                          3                      333
3                     3                          5                      335
3                     5                          3                      353
3                     5                          5                      355
5                     3                          3                      533
5                     3                          5                      535
5                     5                          3                      553
5                     5                          5                      555
4 Experimental Results

4.1 Prediction Accuracy and Over-Fitting of Existing Methods
In this section we compare E-BONSAI with typical existing methods: C5.0 [12][13], a neural network (in which the number of hidden layers is one and the standard back propagation algorithm is used), and CART. Figure 1 shows the prediction accuracy of the three existing methods for the eight categorized data sets explained in Section 3.3, using twelve attributes: the sales volumes of the first, second, and third weeks, the maximum discount ratio, the first discount day, the manufacturer, the taste, the repeat purchase rates in the second and third weeks, and the ratios of heavy users in the first, second, and third weeks. As seen from Figure 1-a, the prediction accuracy obtained for the training data set does not depend much on the choice of data patterns prepared in Table 2 or on the algorithm used. As can be seen by comparing Figures 1-a and 1-b, there is too much overfitting for any of the prepared datasets and algorithms, since the prediction accuracy on the test data is much lower than that on the training data. However, the prediction accuracy of C5.0 for test data sets 533, 535, 553, and 555 is higher than those for the other data sets and for the other methods.
Fig. 1. Prediction accuracy of existing methods.
4.2
Prediction Accuracy of E-BONSAI
Figure 2 illustrates the prediction accuracy of E-BONSAI. In these experiments we used eight attributes: the maximum discount ratio, the first discount day, the manufacturer, the taste, and the three sequences transformed from the weekly sales volume over three weeks, the repeat purchase ratio over two weeks, and the ratio of heavy users over three weeks. Compared with Figure 1, we observe that the prediction accuracy of E-BONSAI for the training data (Figure 2-a) is similar to those of the three existing methods, while the prediction accuracy of E-BONSAI for the test data (Figure 2-b) is much higher than those of the three existing methods. This implies that E-BONSAI exhibits less overfitting than the existing methods and constructs a prediction model with high and reliable confidence.
Fig. 2. Prediction accuracy of E-BONSAI.
4.3
Pruning
In order to see how the prediction accuracy of these methods can be improved by varying pruning parameters, we performed comparison experiments using data set 333. For E-BONSAI and C5.0, we tested the same set of pruning parameters by varying the pruning confidence, computed from the binomial probability of misclassifications within the set of cases represented at a node, while for CART we varied the minimum allowable number of cases falling into a node. Figure 3 illustrates the change of the prediction accuracy of E-BONSAI, C5.0, and CART on the test data set with respect to the change of the pruning parameters. Here we used dataset 335 for this experiment. The y-axis indicates the overall accuracy and the x-axis indicates the pruning parameters; min# and Bon-# are those used in C5.0 and E-BONSAI, respectively, where # denotes the cut-off point for pruning. cart#-# used in CART indicates the minimum number of records that should be contained in each leaf, while cart00 and cart05 require that the maximum depth of a tree is 7 and that each leaf contains at least 5% of the whole records, respectively.
As seen from the figure, the accuracy of C5.0 can be improved to the same level as that of E-BONSAI, while that of CART cannot. In other words, E-BONSAI can attain high prediction accuracy without spending extra effort on choosing an appropriate pruning parameter.
Fig. 3. Change of prediction accuracy of each method by varying pruning parameters.
4.4
The Interpretability of the Extracted Rules by E-BONSAI
Finally, the experimental results and the rules extracted in the previous section were reviewed and interpreted by domain experts at manufacturers and retailers. The rules obtained by E-BONSAI are summarized as follows.

Rule 1: If the weekly sales volume of the new product is less than eight for at least one of the first three weeks from the release date, the product will disappear from the instant noodle market within one month of the release date. However, there is one exception: even if the above condition holds, the product survives beyond nine months from the release date if the price is reduced by 10 percent in the second week and, as a result, the ratio of repeat purchase increases to more than 25 percent in the third week.

Rule 2: If the weekly sales volume of the new product is greater than or equal to eight for at least two of the first three weeks from the release date, the product will survive for more than nine months. However, if the ratio of heavy users is low (less than 75%), the new product will disappear within four months.

These rules are interpretable and give useful implications to experts in marketing at manufacturers and to buyers in retail shops. In fact, the former rule is closely related to the daily routine operation of the supermarket stores: a salesperson in a supermarket decides to remove new products based on sales volume. The second rule raised the interest of domain experts. Manufacturers usually sell new products at a discount price so as to prevent their removal at an early stage due to a small sales volume. The second rule means
that even if there is a large sales volume, the new product will die at an early stage if the ratio of heavy users is low. Therefore it is implied that a typical sales promotion targeted at all customers, such as the discount sales promotions for new products commonly adopted in Japan, is not so effective for increasing future sales volume. In the case of the instant noodle market, our results suggest that domain experts should create marketing strategies focusing on heavy users.
5
Conclusion
In this paper we have demonstrated how the E-BONSAI system is used to predict correctly whether a new product will survive in the instant noodle market. The rules extracted by E-BONSAI are interpretable and exhibited practically useful implications for domain experts in marketing and retail stores. The results in this paper have shown that the E-BONSAI system has distinctive advantages over other typical methods when applied to the business field. We are planning to apply E-BONSAI to the development of effective sales promotion planning for other products.
References

1. Arikawa, S., S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi and T. Shinohara, A Machine Discovery from Amino Acid Sequences by Decision Trees over Regular Patterns, New Generation Computing, 11:361–375, 1993.
2. Bass, F. M., A New Product Growth for Consumer Durables, Management Science, 15:215–227, 1969.
3. Fourt, L. A. and J. W. Woodlock, Early Prediction of Market Success for New Grocery Products, Journal of Marketing, 25(2):30–38, 1960.
4. Hamuro, Y., H. Kawata, N. Katoh and K. Yada, A Machine Learning Algorithm for Analyzing String Patterns Helps to Discover Simple and Interpretable Business Rules from Purchase History, Progresses in Discovery Science, State-of-the-Art Surveys, LNCS:565–575, 2002.
5. Hirao, M., H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa, A practical algorithm to find the best subsequence patterns, Theoretical Computer Science, 292(2):465–479, January 2003.
6. Kahn, B. E. and L. McAlister, Grocery Revolution: The New Focus on the Consumer, Addison Wesley, 1997.
7. Kalish, S., A New Product Adoption Model with Pricing, Advertising and Uncertainty, Management Science, 31:1569–1585, 1985.
8. Kotler, P., Marketing Management, Prentice Hall, 2000.
9. Nakamura, H., Marketing of New Products, Chuokeizai-sha, 2001.
10. Nakanishi, M., Advertising and Promotion Effects on Consumer Response to New Products, Journal of Marketing Research, 10:242–249, 1973.
11. Parfitt, J. H. and J. K. Collins, Use of Consumer Panels for Brand Share Prediction, Journal of Marketing Research, 5:131–249, 1968.
12. Quinlan, J. R., Induction of Decision Trees, Machine Learning, 1:81–106, 1986.
13. Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
14. Shimozono, S., A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara and S. Arikawa, Knowledge Acquisition from Amino Acid Sequences by Machine Learning System BONSAI, Trans. Information Processing Society of Japan, 35:2009–2018, 1994.
15. Toyota, H., Introduction to Data Mining, Kodansha, 2001.
16. Tsukimoto, H., Practical Data Mining, Ohmsha, 1999.
17. Yada, K., The Future Direction of Active Mining in the Business World, Frontiers in Artificial Intelligence and Applications, 79:239–245, 2002.
Improving Efficiency of Frequent Query Discovery by Eliminating Non-relevant Candidates

Jérôme Maloberti1,2 and Einoshin Suzuki2

1 Université Paris-Sud, Laboratoire de Recherche en Informatique (LRI), Bât 490, F-91405 Orsay Cedex, France, [email protected]
2 Electrical and Computer Engineering, Yokohama National University, 79-5 Tokiwadai, Hodogaya, Yokohama 240-8501, Japan, [email protected]
Abstract. This paper presents, for Frequent Query Discovery (FQD), an algorithm which employs a novel relation of equivalence in order to remove redundant queries in the output. An FQD algorithm returns a set of frequent queries from a data base of query transactions in Datalog formalism. A Datalog data base can represent complex structures, such as hyper graphs, and allows the use of background knowledge. Thus, it is useful in complex domains such as chemistry and bio-informatics. A conventional FQD algorithm, such as WARMR, checks the redundancy of the queries with a relation of equivalence based on the θ-subsumption, which results in discovering a large set of frequent queries. In this work, we reduce the set of frequent queries using another relation of equivalence based on relevance of a query with respect to a data base. The experiments with both real and artificial data sets show that our algorithm is faster than WARMR and the test of relevance can remove up to 92% of the frequent queries.
1
Introduction
The objective of mining frequent structures is to discover frequent substructures in a data base which consists of complex structures such as graphs. Since typical real-world data include various structures such as chemical compounds and web links, this research topic is gaining increasing attention in the data mining community. Each structure is represented by a set of nodes and a set of edges, and each node and each edge is associated with a label. For example, a node and an edge represent an atom and a bond of a compound, respectively, and a label is associated with a node and an edge in order to specify the symbol of an atom and the type of a bond, respectively. Recently, [6,7,13] have proposed fast algorithms to mine graph substructures in a data base. These algorithms share some limitations on the representation of the frequent patterns. The patterns must share the same associated labels on the edges and the nodes with their corresponding structures in the data base.
Therefore, a pattern such as “a carbon atom connected by a double bond to another carbon atom” can be found, but not a pattern such as “a carbon atom connected to any kind of atom”. Furthermore, only one label can be associated with a node or an edge. Consequently, in order to represent more than one label, for example, the symbol, the type and the charge of an atom must be concatenated in a label. Inductive Logic Programming (ILP) provides, with the Frequent Query Discovery (FQD), a more general mining framework that circumvents these limitations. The Datalog [12] formalism used in FQD can model the schema of any relational data base and can also add background knowledge in order to improve the understandability and the expressiveness of the patterns. Therefore FQD generalizes various discovery tasks of frequent structures such as sequence mining, tree mining or graph mining [4,6], and is expected to be particularly useful in complex domains such as chemistry, bio-informatics, materials science, etc. The main drawback of the FQD lies in its high computational complexity due to the large number of frequent queries and to the NP-completeness of the test of generality, the θ-subsumption, in ILP. This test is heavily used to compute the frequency of a pattern in a data base, and to eliminate redundant patterns. WARMR [5], which represents a recently proposed FQD algorithm, can limit the size of the hypothesis space by adding constraints to the pattern language. For example, these constraints can restrict the discovered patterns to trees. A highly restricted pattern language is likely to lead to an incomplete search and overlooking of interesting patterns. In this paper, we present Jimi, a new algorithm of FQD which improves WARMR in two ways. First, Jimi employs a fast θ-subsumption algorithm based on constraint satisfaction techniques [8]. Second, a new test of equivalence between candidates allows elimination of irrelevant candidates which can be deduced from others. Though Jimi never deletes a pattern which cannot be deduced from others, it has been experimentally confirmed that this test significantly reduces the number of frequent patterns. The rest of this paper is organized as follows. In section 2, we define the task of frequent query discovery. Section 3 introduces the theoretical foundation of relevant frequent query discovery and our algorithm. The system is compared with WARMR in section 4, and evaluated experimentally in section 5. The last section summarizes our work and discusses further improvements.
2
Frequent Query Discovery
We use the Datalog formalism to represent data and concepts. Datalog is a restriction of the first-order logic which does not allow functions in clauses. In Datalog, a term is a constant or a variable. In order to distinguish non-numeric constants from variables, the first letter of a variable is written in uppercase. Let p be an n-ary predicate symbol and each of t1 , t2 , · · · , tn be a term, then an atom is represented by p(t1 , t2 , · · · , tn ). We say that the position of ti is i. Let B and Ai be atoms then a formula in the form B ← {A1 , A2 , · · · , An } is called
a clause. In this paper, we assume that a query is represented by a clause, and a query transaction represents a clause which contains no variables. For example, molecule(m1) ← atom element(m1, a1, na), atom type(m1, a1, 81) represents a query transaction.

Definition 1. Query Data base. Let Q be a set of query transactions {Q1 , Q2 , · · · , Qn } and K be a set of clauses {Cl1 , Cl2 , · · · , Cln }, then a Datalog data base D is represented by the pair (Q, K), where Q and K represent a relational data base and background knowledge, respectively.

Since the generality relation (|=) in first-order logic is undecidable, we use the θ-subsumption (≽) [10] as its restricted form.

Definition 2. Substitution. Let {V1 , V2 , · · · , Vn } be a set of variables and {t1 , t2 , · · · , tn } be a set of terms, then a substitution θ represents a set of bindings {V1 /t1 , V2 /t2 , · · · , Vn /tn }.

Definition 3. θ-subsumption. Let C and D be clauses, then C θ-subsumes D, denoted by C ≽ D, if and only if (iff) there exists a substitution θ such that Cθ ⊆ D.

For example, a clause C1 : H(X) ← p(X, Y ), p(X, b) θ-subsumes another clause C2 : H(a) ← p(a, b) with a substitution θ = {X/a, Y /b}. Intuitively, “C1 θ-subsumes C2” represents that C1 is no more specific than C2.

Definition 4. Frequency of a Query. Let D be a Datalog data base, H be a query, G be a query transaction such that G ∈ D, and occur be a function such that occur(H, G) = 1 if H ≽ G, and 0 otherwise. Then, the frequency of H with respect to (w.r.t.) D is defined by the function Freq(H, D) = Σ_{G∈D} occur(H, G).

In other words, the frequency of a query is the number of query transactions in D that are subsumed by this query. [5] has stated that if a clause C1 is θ-subsumed by a clause C2 , i.e. C2 ≽ C1 , then its frequency is no greater than the frequency of C2 . Therefore the relation ≽ is monotone w.r.t. the frequency.

Definition 5. Logical Equivalence. Let C1 and C2 be clauses, then C1 is logically equivalent to C2 , denoted by C1 ∼ C2 , iff C1 ≽ C2 and C2 ≽ C1 .

For example, a clause C3 : H(X) ← p(X, Y ), p(X, Z) is logically equivalent to another clause C4 : H(U ) ← p(U, V ) since C3 ≽ C4 with θ = {X/U, Y /V, Z/V }, and C4 ≽ C3 with θ = {U/X, V /Y }. The advantage of this representation is described in [10]. The negation of ∼ will be denoted by ≁.

Let D be a Datalog data base, C be a set of queries, and t be a frequency threshold specified by the user. Given D and t, the result of FQD is the set C of queries such that ∀Ci ∈ C, Freq(Ci , D) ≥ t and ∀Cj ∈ C with i ≠ j, Ci ≁ Cj . A hyper graph, which represents a graph where an edge can connect more than two nodes, can be modeled by a set of queries. Therefore, FQD generalizes graph mining. Hyper graphs are used in various applications including circuit design, chemical reaction design, and image processing.
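A brute-force reading of Definitions 2–4 may help: a clause is encoded as a set of (predicate, arguments) literals, and C θ-subsumes D if some substitution of C's variables by D's terms maps every literal of C into D; the frequency of a query is then the number of transactions it subsumes. This naive sketch ignores the clause head and enumerates all substitutions (the test is NP-complete; Jimi instead relies on a constraint-satisfaction-based algorithm [8]); all names are ours.

```python
from itertools import product

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()   # variables start with an uppercase letter

def subsumes(c, d):
    """True iff clause c theta-subsumes clause d (both given as sets of (pred, args))."""
    variables = sorted({t for _, args in c for t in args if is_var(t)})
    terms = sorted({t for _, args in d for t in args})
    for binding in product(terms, repeat=len(variables)):          # try every substitution
        theta = dict(zip(variables, binding))
        image = {(p, tuple(theta.get(t, t) for t in args)) for p, args in c}
        if image <= d:                                              # c·theta is a subset of d
            return True
    return False

def frequency(query, transactions):
    return sum(1 for g in transactions if subsumes(query, g))

# Example from the text: H1 theta-subsumes the transaction Q below.
H1 = {("p", ("X", "Y")), ("p", ("Y", "Z"))}
Q  = {("p", ("a", "b")), ("p", ("b", "a")), ("q", ("b", "c"))}
print(subsumes(H1, Q), frequency(H1, [Q]))   # True 1
```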
3 Jimi for Relevant Frequent Query Discovery

3.1 Strict Frequent Query Discovery
As mentioned in section 1, the only genuine method for reducing the hypothesis space available in WARMR is the restriction of the pattern language. For example, the user can restrict the generated queries to sequences or trees instead of structures with cycles [4]. Obviously, using a highly restricted language can lead to overlooking of interesting patterns. However, if we do not use such a restriction, the set of frequent queries will be prohibitively large. This problem is a consequence of the relation of logical equivalence, defined in section 2, which is used to eliminate redundant candidates. Let H be a query and G be a query transaction such that H ≽ G with a substitution θ, then an instantiation I of H w.r.t. θ is I = Hθ. Now we show that two queries can have the same set of instantiations in a data base even if they are not logically equivalent. The queries H1 : H(X) ← p(X, Y ), p(Y, Z) and H2 : H(U ) ← p(U, V ), p(V, U ) are not equivalent, since H2 ⋡ H1 although H1 ≽ H2 with θ = {X/U, Y /V, Z/U }. However, if we consider a data base D with a query transaction Q: H(a) ← p(a, b), p(b, a), q(b, c), both H1 and H2 θ-subsume Q, with their respective substitutions θ1 = {X/a, Y /b, Z/a} and θ2 = {U/a, V /b}, and have the same instantiation I: H(a) ← p(a, b), p(b, a). Since we can generalize a query by changing the name of a shared variable, i.e. a variable which appears in more than one literal, H1 can be deduced from H2 and thus H1 ought to be removed from the set of frequent queries. In order to realize this, we define the strictness property of a set of queries. Let D be a Datalog data base and C be a set of queries, then C is strict iff ∀Hi , Hj ∈ C such that i ≠ j, their respective sets of instantiations Ii and Ij w.r.t. D satisfy Ii ≠ Ij . We define Strict Frequent Query Discovery as the task of finding a strict set of frequent queries in a data base.
3.2
Relevant Frequent Query Discovery
The number of instantiations of the queries in a data base is usually huge, thus the storage and the effort needed to check the strictness of a set of queries are prohibitively expensive. Therefore we restrict ourselves to checking this property on pairs of literals, and we propose an efficient algorithm. We define a pairwise occurrence of two literals as follows: Let C be a query, l and m be literals such that l, m ∈ C, and pl and pm be the predicate symbols of l and m, respectively. Let tk be a term which occurs both in l and m, i.e. a shared term, π(k,l) and π(k,m) be the positions of tk in l and m, and P be the set of position pairs of all the shared terms in C, then a pairwise occurrence is defined by (pl , pm , P = {(π(1,l) , π(1,m) ), (π(2,l) , π(2,m) ), · · · , (π(n,l) , π(n,m) )}). Intuitively, a pairwise occurrence represents a relation between two literals in the same query from the viewpoint of shared terms. For example, the pairwise occurrence of the literals p(U, V ), p(V, U ) of H2 : H(U ) ← p(U, V ), p(V, U ) is
(p, p, P = {(1, 2), (2, 1)}). Let po1 = (pl , pm , P ) and po2 = (p′l , p′m , P ′ ) be pairwise occurrences, then po1 is more specific (respectively more general) than po2 , denoted by po1 ⊃ po2 (respectively po1 ⊂ po2 ), if pl = p′l , pm = p′m and P ⊃ P ′ (respectively P ⊂ P ′ ). Thus, the pairwise occurrence of the two literals p(X, Y ), p(Y, Z) of H1 : H(X) ← p(X, Y ), p(Y, Z) is (p, p, P = {(1, 2)}) and is more general than the pairwise occurrence of H2 . Let PO be the set of pairwise occurrences of the query transactions in a data base D, po be a pairwise occurrence, and S be the set of pairwise occurrences such that ∀ po′ ∈ PO, if po′ ⊃ po then po′ ∈ S, then IsRelevant is a function such that IsRelevant(po, PO) returns true if po ∈ PO or if |S| > 1, and false otherwise. We define the relevance property for a query as follows: let Q be a query, then Q is relevant iff for every po of Q, IsRelevant(po, PO) returns true. Obviously, a pair of literals of a query can only be instantiated in a pairwise occurrence that corresponds to at least the same shared terms. Thus, if a pairwise occurrence po of a query is not in PO, po can be instantiated in a set S of pairwise occurrences of PO which are more specific than po. If S is empty, po cannot be instantiated. But if |S| = 1, po will have the same set of instantiations as the pairwise occurrence in S, and the query corresponding to po does not satisfy the strictness property. Consequently, a query which contains po is not relevant. If |S| > 1, po can be instantiated in more than one pairwise occurrence, so its set of instantiations cannot be equivalent to another one and po is relevant. We define Relevant Frequent Query Discovery as the task of finding the set of relevant frequent queries in a data base.
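The pairwise occurrence and the relevance test can be read as follows, under the same clause encoding as in the earlier sketch; the function names are ours, and PO is assumed to store occurrences in the same (pl, pm, P) form.

```python
def pairwise_occurrence(lit1, lit2):
    """(p_l, p_m, P) for two literals (pred, args): P collects the position pairs of
    every term shared by both literals (positions start at 1)."""
    (pl, a1), (pm, a2) = lit1, lit2
    shared = set(a1) & set(a2)
    P = frozenset((i, j) for t in shared
                  for i, x in enumerate(a1, 1) if x == t
                  for j, y in enumerate(a2, 1) if y == t)
    return (pl, pm, P)

def is_relevant(po, PO):
    """po is relevant if it occurs in PO, or if more than one occurrence in PO is
    strictly more specific than po (same predicate symbols, strictly larger P)."""
    if po in PO:
        return True
    pl, pm, P = po
    more_specific = [q for q in PO if q[0] == pl and q[1] == pm and q[2] > P]
    return len(more_specific) > 1

# The occurrence of p(U,V), p(V,U) in H2 is (p, p, {(1, 2), (2, 1)}), as in the text:
print(pairwise_occurrence(("p", ("U", "V")), ("p", ("V", "U"))))
```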
3.3 Algorithm of Jimi
Jimi is a relevant frequent query discovery algorithm similar to WARMR in the sense that both algorithms use a top-down breadth-first search like Apriori [1]. Since we deal with the Datalog representation, the search space is a lattice whose nodes and edges represent queries and the θ-subsumption (≽) relation, respectively. The relation ∼ is used to test the equivalence of the queries and to eliminate redundancy. Below we show the procedure Jimi, which performs a breadth-first search downward in the lattice and calls the procedure GenerateCandidates. The procedure GenerateCandidates returns the relevant candidates of the current level CandLevel using the frequent queries of the previous level CandLevel−1 , the frequent candidates of size 1 Cand1 , the infrequent queries I, the frequent queries F, and the set PO of pairwise occurrences in D. Then, the procedure EvaluateCandidates returns the frequent candidates in CandLevel , according to D and minfreq, and updates F and I. Each of F, CandLevel−1 and CandLevel is represented as a hash table in order to speed up the search for an existing equivalent query. Since two equivalent queries have the same set of predicate symbols, the hash code is computed using a function for a string which concatenates all predicate symbols in the query (the predicate symbols are ordered lexicographically). Since θ-subsumption tests for the inclusion of a query Q in another query G, the
set of predicates of Q is a subset of the set of predicates of G. Thus there is no means to compute an equivalent hash code for Q and G, and I is represented as a list. In our experiments, F, CandLevel−1 and CandLevel typically contained hundreds of thousands of queries while I typically contained about ten thousand queries. The huge size of F, CandLevel−1 and CandLevel is due to the large number of combinations when a literal is added to a frequent query. Moreover, the much smaller size of I is a consequence of the θ-subsumption test which allows a query to θ-subsume another query with a smaller number of variables, as shown in section 3.1. Therefore, we believe that our choice of the data structures for I, F, CandLevel−1 and CandLevel is appropriate.

Jimi(D, minfreq)
Input: Data base D, threshold minfreq
 1. Initialize(Cand1)                // Cand1: Set of frequent candidates of size 1
 2. Initialize(PO)                   // PO: Set of pairwise occurrences
 3. AnalyzeDB(D, minfreq, Cand1, PO)
 4. Initialize(F)                    // F: Hash table of frequent candidates
 5. Initialize(I)                    // I: List of infrequent candidates
 6. Level ← 2
 7. CandLevel−1 ← Cand1
 8. do
 9.   CandLevel ← GenerateCandidates(CandLevel−1, Cand1, F, I, PO)
10.   CandLevel ← EvaluateCandidates(D, minfreq, CandLevel, F, I)
11.   Level ← Level + 1
12. while CandLevel−1 ≠ ∅
13. return F

AnalyzeDB performs a scan of the data base D to compute the set Cand1 of frequent queries of size 1 and the set PO of all pairwise occurrences in D. It uses a procedure ComputePairwiseOccurrences which returns the set of all pairwise occurrences in a given query Q, and adds them to PO. Additionally, if the same pairwise occurrence always occurs for the literals with the same predicate symbols, for example if there is a pairwise occurrence (p, q, P = {(1, 2), (2, 1)}) for all literals with the symbols p and q in D, this pairwise occurrence should also occur in every candidate that contains a p or a q. Since such a pairwise occurrence is implied by any of its two predicate symbols, it is added to the background knowledge. Furthermore, if all terms in such a pairwise occurrence are shared, then the pairwise occurrence does not introduce a new variable to a query. Obviously, such a pairwise occurrence does not need to appear in the candidate and it is removed from PO. Every pairwise occurrence is tested in this way with a procedure IsBackgroundKnowledge at lines 12–14.

AnalyzeDB(D, minfreq, Cand1, PO)
Input: Data base D, threshold minfreq, Sets Cand1 and PO
 1. ∀ Q ∈ D
2.    ∀ pred ∈ Predicates(Q)
3.      if pred ∉ Cand1
4.        pred.freq ← 1   // freq: frequency of a predicate in D
5.        Cand1 ← Cand1 ∪ {pred}
6.      else
7.        pred.freq ← pred.freq + 1
8.    PO ← PO ∪ ComputePairwiseOccurrences(Q)
9.  ∀ pred ∈ Cand1
10.   if pred.freq < minfreq
11.     Cand1 ← Cand1 \ {pred}
12. ∀ po ∈ PO
13.   if IsBackgroundKnowledge(po)
14.     PO ← PO \ {po}

The function GenerateCandidates computes all candidates of size k from the frequent queries of size k − 1, CandLevel−1, and the frequent candidates of size 1, Cand1. Each generated candidate must be connected, i.e. it does not contain a set of literals which share no variables with the other literals of the candidate. Since a disconnected candidate C can be regarded as a conjunction of two smaller queries Q1 and Q2, the frequency of C can be computed from the query transactions which are θ-subsumed by Q1 and Q2. Candidates are generated by calling the function AddLiteral which adds a literal pred to a frequent query Q. The new literal must satisfy the following conditions:
– the new literal must share at least one variable with another literal in order to generate a connected candidate.
– a variable cannot appear more than once in the same literal. This restriction forbids constructions such as p(X, X) that are rarely relevant.
– the variables introduced by a new literal must be ordered. If the new literal introduces n (> 1) new variables to a candidate, there are n! combinations of these variables in the new literal. For example, if q(X1, X2, X3) is added to a candidate C : p(X1), then q(X1, X3, X2) corresponds to the same literal with another combination of the two new variables X2 and X3. Obviously, q(X1, X3, X2) should not be added.
Then, the procedure ComputeNewPairwiseOccurrences returns all pairwise occurrences between the new literal and the literals of Q, and a candidate is eliminated if one of its pairwise occurrences is not relevant according to the function IsRelevant defined in section 3.2. A candidate is also removed if it is equivalent, according to the ∼ relation, to an existing frequent query, or if an infrequent query θ-subsumes it.

GenerateCandidates(CandLevel−1, Cand1, F, I, PO)
Input: Hash tables of CandLevel−1 and F, Sets Cand1, I and PO
1.  Initialize(Cands)   // Cands: Hash table of candidates
2.  ∀ Q ∈ CandLevel−1
3.    ∀ pred ∈ Cand1
4.      cands ← AddLiteral(Q, pred)
5.      ∀ cand ∈ cands
6.        POQ ← ComputeNewPairwiseOccurrences(cand)
7.        ∀ po ∈ POQ
8.          if IsRelevant(po, PO)
9.            if cand ∉ Cands and cand ∉ F and no query in I θ-subsumes cand
10.             Cands ← Cands ∪ {cand}
11. return Cands

EvaluateCandidates computes the frequency of the candidates in D using an outer loop. Each query transaction is tested against all candidates. This method can deal with a very large data base as long as the candidates fit in memory.

EvaluateCandidates(D, minfreq, CandLevel, F, I)
Input: Data base D, threshold minfreq, Hash tables of CandLevel and F, Set I
1.  ∀ Q ∈ D
2.    ∀ cand ∈ CandLevel
3.      if cand θ-subsumes Q
4.        cand.freq ← cand.freq + 1
5.  ∀ cand ∈ CandLevel
6.    if cand.freq ≥ minfreq
7.      F ← F ∪ {cand}
8.    else
9.      I ← I ∪ {cand}
10.     CandLevel ← CandLevel \ {cand}
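To make the hashing scheme described above concrete, the following Python sketch shows how frequent queries can be bucketed by a key built from their lexicographically ordered predicate symbols, so that the expensive equivalence test only runs against queries in the same bucket. The query encoding and the toy_equivalent placeholder are our own illustrative assumptions; a real implementation would plug in the θ-subsumption-based ∼ test (e.g., via Django).

```python
from collections import defaultdict

def predicate_key(query):
    """Hash key: the lexicographically ordered predicate symbols of the query,
    concatenated into one string."""
    return "|".join(sorted(pred for pred, _ in query))

class QueryTable:
    """Hash table of queries bucketed by their predicate-symbol key,
    as used for F, CandLevel-1 and CandLevel."""
    def __init__(self, equivalent):
        self.buckets = defaultdict(list)
        self.equivalent = equivalent   # stand-in for the ~ (equivalence) test

    def contains_equivalent(self, query):
        return any(self.equivalent(query, q)
                   for q in self.buckets[predicate_key(query)])

    def add(self, query):
        self.buckets[predicate_key(query)].append(query)

def toy_equivalent(q1, q2):
    # Placeholder only: compares the sorted predicate/arity shapes of the queries.
    # It is NOT a theta-subsumption test.
    shape = lambda q: sorted((p, len(args)) for p, args in q)
    return shape(q1) == shape(q2)

if __name__ == "__main__":
    F = QueryTable(toy_equivalent)
    F.add((("atomel", ("A", "B", "c")), ("bond", ("A", "B", "C", "D"))))
    candidate = (("bond", ("X", "Y", "Z", "W")), ("atomel", ("X", "Y", "c")))
    print(F.contains_equivalent(candidate))   # True: same key, same shape
```

As the text notes, this trick does not apply to the list I of infrequent queries, because a query stored in I may θ-subsume a candidate while having a strictly smaller set of predicate symbols, and hence a different hash key.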
4 Related Work
There are two existing algorithms for frequent query discovery: WARMR [4] and FARMER [9]. FARMER uses a restricted version of the θ-subsumption test to compute the frequency and does not eliminate redundant candidates, thus its results are different from those of WARMR. There are several differences between WARMR [5] and Jimi. The main difference is that Jimi employs the relevance test, which significantly reduces the set of frequent queries. This test allows the elimination of candidates which can be deduced from relevant ones. WARMR needs rules to generate the candidates from frequent queries. These rules are a flexible way to specify how one or more predicates must be added to a query. The basic form of a rule is p(+X, −Y), which means that a predicate p must be added with an existing variable (+X) as its first argument and a new variable (−Y) as its second argument. Then, by applying this rule to the query C : H(V1) ← a(V1), b(V2), we obtain two queries C5 : H(V1) ← a(V1), b(V2), p(V1, V3) and C6 : H(V1) ← a(V1), b(V2), p(V2, V3). On the other hand, the user must specify these rules for each problem, and a highly restricted set of rules can lead to overlooking interesting patterns.
Jimi generates all connected candidates by adding one literal at a time under the constraints of the procedure GenerateCandidates in the previous section. Since we allow no other way to control the generation of candidates, the data representation must be chosen carefully. However, this method is complete and all candidates have the same size at each level. Thus, the equivalence test over the hash table of candidates is optimized for these conditions. Each generated candidate is tested against the set of infrequent queries and the set of frequent queries. WARMR uses lists to store these sets, and thus we believe that it performs a linear number of θ-subsumption tests in order to eliminate each redundant candidate. Jimi stores all frequent queries in a hash table and the infrequent ones in a list. The θ-subsumption algorithm in WARMR is implemented in PROLOG, which is much slower than Django [8]. This drawback has been recently reduced with query packs [2], which evaluate a set of queries against a query transaction efficiently. The query packs drastically improve the performance of WARMR during the data base scan, but are not used during the elimination of redundant candidates [4,2]. It should be noted that, as we will see in the next section, Jimi is still faster than WARMR in all experiments even though we do not use query packs.
5 Experiments
The performance of Jimi has been compared with that of WARMR on an application in chemical carcinogenicity analysis, which is a challenge proposed by [11], and on a graph mining task [6] with an artificial data set. In both experiments, two versions of Jimi are tested: Jimi represents the version with the elimination of the non-relevant candidates, while Jimi-EX represents the version without it. Similarly, WARMR is tested with the query packs (WARMR) and without them (WARMR-PL). All experiments were performed on a PC with a 3 GHz CPU and 1.5 GB memory.

5.1 Predictive Toxicology Evaluation Challenge
This data set contains 340 chemical compounds with an average size of 27.0 atoms and 27.4 bonds [11]. Each compound is described by a set of atoms and the bond connectivity of these atoms. Each atom is associated with an element name (c, h, o, cl, etc.) and an integer which represents an element type. Each bond is associated with a type which is also an integer (1 for a single bond, 2 for a double bond, etc.). For WARMR, we use a representation similar to the one defined for this problem in [4]. For example, the compound number 158 is represented by a query transaction: molecule(d158) ← atomel(d158, d158_1, na), atomty(d158, d158_1, 81), atomel(d158, d158_2, f), atomty(d158, d158_2, 92), bondtyp(d158, b_d158_0, 1), bond(d158, d158_1, d158_2, b_d158_0), bond(d158, d158_2, d158_1, b_d158_0). This means that the compound contains two atoms: d158_1 represents an atom of element na and of type 81, and d158_2 represents an atom of element f and of type 92.
The predicate bond is used to create a directed link between both atoms with an associated bond identifier b_d158_0. Since the bonds are not oriented in a compound, each bond needs two bond literals to create an undirected link. The type of the bond is set to 1 with the predicate bondtyp, which means that it is a single bond. The rules of generation described in [4] have been adapted to this representation and to Jimi's method of candidate generation. Since WARMR can constrain the generated candidates, we forbid a candidate that contains pairs of literals such as bond(A, B, C, D), bond(A, B, C, E) or bond(A, B, C, D), bond(A, C, B, _). The pair bond(A, B, C, D), bond(A, B, C, E) is irrelevant because D and E will always match the same constants in all the query transactions. Since there is a pair of bond literals with the second and the third arguments permuted in our representation, we only need one of the two bond literals in a candidate. Consequently, a candidate with a pair of literals bond(A, B, C, D), bond(A, C, B, _) must be eliminated. The set of frequent candidates of Jimi-EX contains candidates that do not satisfy these constraints because there is no way to specify such constraints; however, the elimination of non-relevant candidates in the version Jimi implies these constraints. Table 1 shows the results of this test with supports of 10% and 20%, where a minus represents that the execution time exceeded 24 hours.

Table 1. Comparison of WARMR and Jimi on the Carcinogenesis data set. Time is in seconds and FC represents frequent candidates.

                 WARMR-PL         WARMR            Jimi-EX          Jimi
Support  Level   # of FC  Time    # of FC  Time    # of FC  Time    # of FC  Time
10%      2       49       7       49       6       67       1       63       4
         3       362      26      362      10      414      3       285      5
         4       2167     178     2167     61      2854     17      1106     10
         5       17298    5888    17298    1403    30377    293     4307     31
         6       -        -       -        -       443972   25021   17615    155
         7       -        -       -        -       -        -       75528    934
         8       -        -       -        -       -        -       338451   7197
20%      2       33       7       33       7       44       1       40       4
         3       217      13      217      8       253      3       164      5
         4       1248     64      1248     26      1698     13      581      8
         5       9932     1758    9932     483     18167    212     2118     22
         6       -        -       103939   34512   268140   15143   8308     82
         7       -        -       -        -       -        -       34416    411
         8       -        -       -        -       -        -       149356   2891

Both versions of Jimi outperform WARMR, and the gain increases with the number of frequent candidates. Jimi-EX is faster than WARMR because the latter uses query packs not in the elimination of the redundant candidates but only during the data base scans, while Jimi uses Django for all θ-subsumption tests. Consequently, when the number of candidates is large, the proportion of the execution time for the data base scans decreases, and the gain due to the query packs also decreases.
Additionally, the property of relevance reduces the number of frequent queries and the execution time on this test by an order of magnitude.

5.2 Artificial Data Sets
In order to evaluate the scalability of Jimi, we used data sets similar to those defined in [6]. These data sets consist of undirected graphs generated using the following parameters with the corresponding values:
– |D|, the number of graph transactions in the data set: 100, 1000 and 10000
– |E_G|, the number of edges of the graphs in D: 25
– |N_G|, the number of nodes of the graphs in D: 15
– |S|, the number of potentially frequent patterns: 10
– |E_P|, the number of edges of the patterns: 7
– |N_P|, the number of nodes of the patterns: 5
– |L_E|, the number of possible edge labels: 5
– |L_N|, the number of possible node labels: 5
Each graph transaction in D and each pattern in S represents a connected graph. The graphs and the patterns all have the same size and each graph contains exactly one pattern. The representation of these graphs as query transactions is similar to that in the previous experiments. An edge is represented by a pair of edge literals, and a label is associated with each edge via a literal edgetyp and with each node via a literal nodelab (a sketch of this encoding is given at the end of this subsection). Table 2 shows the results with data bases of 100, 1000 and 10000 transactions, respectively. These results show that WARMR has better scalability than Jimi w.r.t. the size of the data base. In these tests, the data bases are not large enough, and the execution time spent on the elimination of the redundant candidates still exceeds the execution time of Jimi. However, while the reduction of the number of frequent candidates achieved by Jimi over WARMR is always around 84%, the execution time of Jimi as a percentage of that of WARMR increases from 0.1% for 100 transactions to 1.3% for 1000 and 11.3% for 10000. Furthermore, WARMR always discovers around 81000 candidates at level 6, but its execution time increases by 1% when the size of the data base increases from 100 to 1000, and by 19% when it increases from 1000 to 10000. The better scalability of WARMR is due to the use of query packs, which significantly improve the θ-subsumption tests against a data base. Jimi is still faster than WARMR in these tests, and WARMR can only outperform it on problems with a large data base or at low levels.
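As a concrete illustration of the representation sketched above, the following Python fragment encodes an undirected labelled graph transaction as edge/edgetyp/nodelab facts, with two edge literals per undirected edge. The identifier naming scheme (g0_n1, g0_e0, ...) is our own convention and not taken from the paper.

```python
def graph_to_transaction(gid, node_labels, edges):
    """Encode an undirected labelled graph as Datalog-style facts: one nodelab
    literal per node label, one edgetyp literal per edge label and two edge
    literals per undirected edge."""
    facts = []
    for node, label in node_labels.items():
        facts.append(("nodelab", (gid, f"{gid}_n{node}", label)))
    for k, (u, v, label) in enumerate(edges):
        e = f"{gid}_e{k}"
        facts.append(("edgetyp", (gid, e, label)))
        facts.append(("edge", (gid, f"{gid}_n{u}", f"{gid}_n{v}", e)))
        facts.append(("edge", (gid, f"{gid}_n{v}", f"{gid}_n{u}", e)))
    return facts

if __name__ == "__main__":
    # Toy example: a triangle with node labels a, b, a and edge labels 1, 1, 2.
    facts = graph_to_transaction("g0", {1: "a", 2: "b", 3: "a"},
                                 [(1, 2, 1), (2, 3, 1), (1, 3, 2)])
    for f in facts:
        print(f)
```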
6 Conclusion
We have presented an algorithm which addresses two problems of frequent query discovery, the NP-completeness of a θ-subsumption test and the large size of the set of the frequent queries. The first problem has been remedied by integrating
Django, a fast θ-subsumption algorithm based on constraint satisfaction techniques [8], and the second problem has been remedied by defining the strictness property on a set of queries using a novel relation of equivalence. Using this property can reduce the set of the frequent queries, with no restriction on the pattern language, by removing only irrelevant patterns, but this relation cannot be checked efficiently. In order to circumvent this problem, we have defined a weaker but sound property, the relevance. The experiments show that the reduction of the number of frequent patterns is significant, especially at high levels, where it can remove up to 92% of the frequent queries.

Table 2. Comparison of WARMR and Jimi on graph data sets of size 100, 1000 and 10000 with a support of 10%. Time is in seconds and FC represents frequent candidates.

               WARMR-PL          WARMR            Jimi-EX           Jimi
|D|    Level   # of FC  Time     # of FC  Time    # of FC  Time     # of FC  Time
100    2       17       1        17       1       19       0.1      16       0.2
       3       127      2        127      2       149      0.4      101      1
       4       782      8        782      4       1083     3        419      2
       5       7087     531      7087     157     12511    76       2350     6
       6       81433    82430    81433    25318   193236   13067    13086    42
       7       -        -        -        -       -        -        78684    430
       8       -        -        -        -       -        -        505595   8724
1000   2       17       15       17       15      19       1        16       3
       3       128      16       128      16      150      4        102      4
       4       784      27       784      19      1087     26       419      11
       5       7099     624      7099     180     12533    382      2341     54
       6       81490    84228    81490    25580   193496   18452    12963    354
       7       -        -        -        -       -        -        77925    2587
       8       -        -        -        -       -        -        497676   24479
10000  2       17       132      17       131     19       7        16       27
       3       127      143      127      135     149      37       101      37
       4       780      207      780      154     1081     256      417      104
       5       7057     1667     7057     529     12478    3432     2324     538
       6       81059    101140   81059    30445   192663   71561    12902    3441
       7       -        -        -        -       -        -        77589    24169
Future work includes the integration of query packs [2] and the extension of the relevance property. According to our experience on θ-subsumption [8], we believe that implementing query packs using constraint satisfaction techniques will significantly improve their time-efficiency. These improvements will allow us to apply our method to real-world applications in order to obtain significant results.
Acknowledgments. This work was partially supported by the grant-in-aid for scientific research on priority area “Active Mining” from the Japanese Ministry
of Education, Culture, Sports, Science and Technology. Jérôme Maloberti is supported by a Lavoisier grant of the French Foreign Ministry.
References
1. R. Agrawal et al. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307–328. AAAI/MIT Press, Menlo Park, Calif., 1996.
2. H. Blockeel et al. Executing query packs in ILP. In Proc. Tenth International Conference on Inductive Logic Programming, LNAI 1866, pages 60–77. Springer-Verlag, Berlin, 2000.
3. R. Dechter. Constraint networks. In Encyclopedia of Artificial Intelligence, volume 1. John Wiley & Sons, New York, 1992.
4. L. Dehaspe. Frequent pattern discovery in first-order logic. PhD thesis, K. U. Leuven, Dept. of Computer Science, 1998.
5. L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In Proc. Seventh International Workshop on Inductive Logic Programming, LNCS 1297, pages 125–132. Springer-Verlag, Berlin, 1997.
6. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery, LNCS 1910, pages 13–23. Springer-Verlag, Berlin, 2000.
7. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. International Conference on Data Mining, pages 313–320. IEEE Computer Society, 2001.
8. J. Maloberti and M. Sebag. Theta-subsumption in a constraint satisfaction perspective. In Proc. 11th International Conference on Inductive Logic Programming, LNCS 2157, pages 164–178. Springer-Verlag, Berlin, 2001.
9. S. Nijssen and J. N. Kok. Faster association rules for multiple relations. In Proc. of the Seventeenth International Joint Conference on Artificial Intelligence, volume 2, pages 891–896. Morgan Kaufmann, San Francisco, 2001.
10. G. D. Plotkin. A note on inductive generalization. In Machine Intelligence, volume 5, pages 153–163. Edinburgh University Press, Edinburgh, 1970.
11. A. Srinivasan et al. The predictive toxicology evaluation challenge. In Proc. Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pages 1–6. Morgan Kaufmann, San Francisco, 1997.
12. J. D. Ullman. Principles of Database and Knowledge-Base Systems, volume I. Computer Science Press, Rockville, Maryland, 1988.
13. X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. Technical Report UIUCDCS-R-2002-2296, Department of Computer Science, University of Illinois at Urbana-Champaign, 2002.
Chaining Patterns

Taneli Mielikäinen

HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland
[email protected]
Abstract. Finding condensed representations for pattern collections has been an active research topic in data mining recently and several representations have been proposed. In this paper we introduce chain partitions of partially ordered pattern collections as high-level condensed representations that can be applied to a wide variety of pattern collections including most known condensed representations and databases. We analyze the goodness of the approach, study the computational challenges and algorithms for finding the optimal chain partitions, and show empirically that this approach can simplify the pattern collections significantly.
1 Introduction
The goal of pattern discovery is to find interesting patterns from data sets [1,2,3]. There exist output-efficient algorithms for finding the interesting patterns from a wide variety of different pattern classes [4,5,6]. The most prominent examples of interesting patterns are frequent sets and association rules [7]. For the frequent sets and the association rules a data set is a finite sequence d = d1 . . . dn of subsets of some finite set R. A set X ⊆ R is interesting if it is σ-frequent in d, i.e.,

  fr(X, d) = |{i : X ⊆ di, 1 ≤ i ≤ n}| / n ≥ σ ∈ [0, 1].

An association rule X ⇒ Y is interesting if it is both σ-frequent and δ-accurate in d, i.e., fr(X ∪ Y, d) ≥ σ and

  acc(X ⇒ Y, d) = fr(X ∪ Y, d) / fr(X, d) ≥ δ ∈ [0, 1].
(If fr(X, d) = 0 then acc(X ⇒ Y, d) is not defined.) Traditionally, the interestingness of a pattern has been a local property of the pattern. For example, the interestingness of an association rule X ⇒ Y depends on the frequencies of the sets X and X ∪ Y. However, the collection of the interesting patterns can also have more global structure that is interesting. Some of the structural properties of pattern collections have been exploited in condensed representations of pattern collections, i.e., pattern collections that
are irredundant w.r.t. some inference method. The condensed representations of pattern collections have been studied extensively and several condensed representations have been suggested [4,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. They can be used, for example, to compress pattern collections, for efficient querying, and to gain insight into the pattern collection (and into the data set). In this paper we propose a new condensed representation of pattern collections, pattern chains, that depends only on a partial order of patterns and thus can be applied to condense a wide variety of pattern collections, including many other condensed representations. The rest of the paper is organized as follows. The new condensed representation is described in Section 2. Algorithmic issues of the representation are discussed in Section 3. In Section 4 we show experimentally that the approach can be applied to condense pattern collections in practice. The work is concluded in Section 5.
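For readers who prefer code to formulas, here is a minimal Python sketch of the frequency and accuracy measures defined above, with the data set represented simply as a list of transactions (subsets of R).

```python
def fr(X, d):
    """Frequency of itemset X in the data set d (a list of transactions)."""
    X = set(X)
    return sum(1 for di in d if X <= set(di)) / len(d)

def acc(X, Y, d):
    """Accuracy of the association rule X => Y; not defined when fr(X, d) = 0."""
    denom = fr(X, d)
    if denom == 0:
        raise ValueError("acc is not defined when fr(X, d) = 0")
    return fr(set(X) | set(Y), d) / denom

if __name__ == "__main__":
    d = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
    print(fr({1, 2}, d))      # 0.75
    print(acc({1}, {2}, d))   # 1.0: every transaction containing 1 also contains 2
```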
2 Exploiting the Structure
A collection of interesting patterns, and a whole pattern class, too, usually has some structure. For example, the collection of subsets of R can be structured by the frequency of the subsets in a given data set: every subset of a frequent set is frequent and every superset of an infrequent set is infrequent. Pattern collections can also have some data-independent structure. A typical data-independent structure of a pattern collection is a partial order. The definitions we use related to partial orders are the following:
– A partial order ≺ for a finite set P is a transitive (p ≺ q ∧ q ≺ r ⇒ p ≺ r) and irreflexive (p ≺ q ⇒ p ≠ q) binary relation ≺ ⊆ P × P. For example, any collection of sets is partially ordered by the set inclusion ⊂.
– Elements p and q of a partially ordered set P are called comparable iff p ≺ q, q ≺ p or p = q.
– A total order < for a finite set P is a partial order such that all pairs of elements in P are comparable. Frequencies of subsets of R determine a total order for the subsets.
– An element p ∈ P is maximal (minimal) in P if for no element q ∈ P holds: p ≺ q (q ≺ p). An example of maximal patterns are the maximal frequent sets, i.e., the frequent sets that have no frequent supersets. There is only one minimal pattern in the collection of frequent sets, the empty set, because all the other sets contain the empty set and all subsets of frequent sets are frequent.
– A subset C (a subset A) of a partially ordered set P is called a chain (an antichain) iff any two elements in C (no two distinct elements in A) are comparable.
– A chain partition (an antichain partition) of a partially ordered set P is a partition of the set P into disjoint chains C1, . . . , Cm (antichains A1, . . . , Am). A chain partition is minimum iff there is no chain partition of smaller cardinality, and minimal iff there are no two chains in the chain partition such that their union is a chain.
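The following small Python sketch instantiates these definitions for itemset patterns ordered by strict set inclusion: it checks comparability, chains and antichains, and lists the maximal elements of a collection. It is only an illustration of the definitions, not part of the chaining algorithm.

```python
from itertools import combinations

def precedes(p, q):
    """Strict set inclusion as the partial order on itemset patterns."""
    return p < q

def comparable(p, q):
    return precedes(p, q) or precedes(q, p) or p == q

def is_chain(patterns):
    return all(comparable(p, q) for p, q in combinations(patterns, 2))

def is_antichain(patterns):
    return all(not comparable(p, q) for p, q in combinations(patterns, 2))

def maximal_elements(collection):
    return [p for p in collection
            if not any(precedes(p, q) for q in collection)]

if __name__ == "__main__":
    P = [frozenset(s) for s in [{1}, {2}, {1, 3}, {2, 4}, {1, 2, 3}, {1, 2, 4}]]
    print(is_chain([frozenset({1}), frozenset({1, 3}), frozenset({1, 2, 3})]))  # True
    print(is_antichain([frozenset({1, 3}), frozenset({2, 4})]))                 # True
    print(maximal_elements(P))   # the two maximal patterns {1, 2, 3} and {1, 2, 4}
```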
Well-studied examples of partially ordered pattern collections, besides frequent and maximal patterns, are the collections of closed patterns [8,20,23,24,25,26]. Let ≺ be a partial order for the pattern collection P and let < be a total order determined by the frequencies fr(p, d) of patterns p ∈ P w.r.t. a data set d. A pattern p ∈ P is closed iff p ≺ q, q ∈ P ⇒ fr(p, d) > fr(q, d). The collection of closed patterns in P is denoted by Cl(P). The closed patterns exploit the data set d more extensively than, e.g., maximal or frequent patterns: the maximal patterns depend only on the pattern collection (which, of course, can depend on the data set) and the frequent patterns depend on the data set only through the maximal frequent patterns. Usually each closed pattern p ∈ Cl(P) determines an equivalence class that contains p and all its subpatterns in P with frequencies equal to the frequency of p, i.e., all patterns q ∈ P such that q ≺ p and fr(p, d) = fr(q, d). Unfortunately this does not hold in general, since there are pattern collections P (and data sets d) which contain closed patterns p, q ∈ Cl(P) such that {r ∈ P : r ≺ p, fr(r, d) = fr(p, d)} = {r ∈ P : r ≺ q, fr(r, d) = fr(q, d)} but p ≠ q. This can be the case, e.g., when the pattern collection consists of approximate patterns. Besides detecting structure in a pattern collection, the found structure can sometimes be further exploited. For example, frequent sets can be stored into an itemset tree by defining a total order for R: each frequent set is a path from the root to some node. (Itemset trees are known also by several other names, see e.g. [7,26,27,28].) The itemset tree can save space and allow efficient frequency queries. Unfortunately, itemset trees require an explicit order for the elements of R. The order might be an artificial structure that hides the "true" structure of the pattern collection. The idea of exploiting the structure of the pattern collection, and especially simplifying the partial order structure of the pattern collection, might still be useful, although it is not clear whether e.g. the itemset tree makes the partial order of the set collection more comprehensible or even more obscure from the human point of view. To exploit the partial order structure of a pattern collection P, we propose the minimum chain partition C1, . . . , Cm of the partially ordered set P as a condensed representation for P. There exists a chain partition for any partially ordered set P. Thus nothing else has to be assumed about the structure of P in order to be able to find this kind of condensed representation for P. The chain partition can be interpreted as a clustering of the pattern collection. Each chain Ci, 1 ≤ i ≤ m, as a totally ordered set, can have a much simpler structure than the original partially ordered set P. A partition of a partially ordered set P into the minimum number of chains C1, . . . , Cm corresponds to a structural clustering of P that consists of the minimum number of clusters C1, . . . , Cm, and in each cluster Ci all patterns p, q ∈ Ci are comparable.
The minimum chain partition might not be unique, but the lack of uniqueness is not necessarily a problem because of the exploratory nature of data mining: different partitions emphasize different aspects of the pattern collection. Clearly, this can be beneficial when trying to understand the essence of the data set. The maximum number of chains in a chain partition of P is |P|, as each element p in P is itself a chain (and an antichain, too). The minimum number of chains in a chain partition is at least the cardinality of the largest antichain in P, since no two distinct elements of the largest antichain can be in the same chain as they are not comparable. This bound is tight by Dilworth's Theorem: a partially ordered set P can be partitioned into m chains iff the largest antichain in P is of cardinality m. (For a clear exposition of chains, antichains and partial orders, see [29].) Moreover, as the maximal elements of a partially ordered set form an antichain, the number of chains needed is always at least the number of maximal elements in P. A chain partition can be even more than a structural clustering if the pattern collection has more structure than a partial order. As an example of further exploiting the chain structure, let us consider a collection of weighted sets, e.g., frequent sets with their frequencies or a binary matrix as a set collection with integer weights for the sets, and let the partial order relation be determined by the set inclusion. Let the set collection P be

  {{1}, {2}, {1, 3}, {2, 4}, {1, 2, 3}, {1, 2, 4}}

and the weights of the sets be w({1}) = 4, w({2}) = 5, w({1, 3}) = 3, w({2, 4}) = 4, w({1, 2, 3}) = 2 and w({1, 2, 4}) = 1. Then the pattern collection can be partitioned into two chains

  C1 = {{1}, {1, 3}, {1, 2, 3}} and C2 = {{2}, {2, 4}, {1, 2, 4}}.

We can associate to each set its distance from the minimal set in the corresponding chain and then write each chain as one set. If the distance from the minimal element in the chain is denoted as a subscript of an element, then the whole chains in the above example can be described as follows:

  C1 = {1_0, 2_2, 3_1} = 1_0 2_2 3_1 and C2 = {1_2, 2_0, 4_1} = 1_2 2_0 4_1.

This approach to representing pattern chains can be applied to a wide variety of different pattern collections such as substrings and graphs. Besides making the pattern collection more compact and hopefully more understandable, this approach can also compress the pattern collection: the total size of the sets of a length-k (k ≤ |R|) chain can be Θ(k|R| + |w|) in the worst case, but the size of a chain as one pattern is only O(|R| log k + |w|), where |w| is the size of the weight function.
This is the case when the data set is the collection of all suffixes of R (for any given ordering of the elements in R), i.e., the upper triangular binary matrix of size |R| × |R|. The whole set collection can be described as one chain, but the set collection still has |R| distinct sets.
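A minimal sketch of the chain encoding used in the example above: each element of the chain's union is annotated with the number of chain steps from the minimal set to the first set in which that element occurs. The weights are assumed to be stored separately, as discussed in the text.

```python
def encode_chain(chain):
    """Write a chain of sets (listed from its minimal to its maximal set) as a
    single pattern: map each element to the distance (number of chain steps)
    from the minimal set to the first set containing it."""
    first_seen = {}
    for distance, s in enumerate(chain):
        for element in s:
            first_seen.setdefault(element, distance)
    return first_seen

if __name__ == "__main__":
    C2 = [{2}, {2, 4}, {1, 2, 4}]
    print(encode_chain(C2))   # {2: 0, 4: 1, 1: 2}  -> the pattern 2_0 4_1 1_2
```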
3 Algorithmic Issues
A minimum chain partition C1, . . . , Cm of a partially ordered set P can be found efficiently by finding a maximum matching in the bipartite graph corresponding to the partial order [30]. A bipartite graph is a triplet G = (Vl, Vr, E), where Vl and Vr are two distinct sets called vertices, and E is a subset of Vl × Vr called edges. An edge e ∈ E is adjacent to a vertex v ∈ Vl ∪ Vr iff e = (p, q) or e = (q, p) for some q ∈ Vl ∪ Vr. A matching M in G is a set of pairwise disjoint edges. A matching M is maximal iff there is no edge in E that is disjoint from all edges in M. A matching M is maximum iff no matching M′ in G is larger than M, i.e., |M′| ≤ |M| for all matchings M′ in G. The bipartite graph corresponding to the partial order ≺ of P is G = (P, P, ≺). That is, G consists of the partial order ≺ and two copies of the pattern collection P. Any matching M in G determines a partition of P into chains: the matching M partitions the pattern collection P into directed paths in the partial order ≺ and each path determines one chain. The number of chains in the matching M is equal to the number of patterns p in P such that for no q ∈ P holds: (p, q) ∈ M. Thus a maximum (maximal) matching corresponds to a minimum (minimal) chain partition. A maximum matching M in the bipartite graph G = (Vl, Vr, E) can be found in time O(√(min{|Vl|, |Vr|}) |E|) [30,31]. Thus if the partial order ≺ is known explicitly then the minimum chain partition can be found in time O(√|P| |≺|), which is bounded above by O(|P|^{5/2}) as there are at most |P|² pairs in ≺ ⊆ P × P. It is possible to partition P also into the minimum number of trees or degree-constrained subgraphs in time polynomial in |P| by finding the maximum b-matching in the corresponding bipartite graph. The bipartite b-matching is a generalization of the bipartite matching such that for each vertex v ∈ Vl ∪ Vr there is an upper bound (a lower bound) that determines how many edges adjacent to v can (must) be chosen into M. In the case of the ordinary matching, the upper bound is one and the lower bound is zero. Another way to generalize the maximum matching is to search, in an edge-weighted bipartite graph, for a maximum matching whose sum of edge weights is smallest among all maximum matchings of that graph. That minimum weight maximum matching corresponds to a minimum chain partition with the smallest sum of weights for consecutive elements in each chain.
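A self-contained Python sketch of the construction described above: build the bipartite graph G = (P, P, ≺), compute a maximum matching, and read the chains off the matched edges. For brevity it uses Kuhn's O(|P| · |≺|) augmenting-path algorithm rather than the O(√|P| |≺|) Hopcroft–Karp algorithm quoted in the text; the chain-extraction step is the same either way.

```python
def minimum_chain_partition(P, precedes):
    """Minimum chain partition of P via maximum bipartite matching on G = (P, P, precedes).
    A matched edge (i, j) means pattern j immediately follows pattern i in its chain."""
    n = len(P)
    succ = [[j for j in range(n) if precedes(P[i], P[j])] for i in range(n)]
    match_right = [None] * n           # match_right[j] = i  <=>  edge (i, j) is matched

    def augment(i, seen):
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] is None or augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    for i in range(n):
        augment(i, set())

    match_left = {i: None for i in range(n)}
    for j, i in enumerate(match_right):
        if i is not None:
            match_left[i] = j

    # Chains start at patterns with no matched predecessor and follow matched edges.
    starts = set(range(n)) - {j for j in range(n) if match_right[j] is not None}
    chains = []
    for i in sorted(starts):
        chain, cur = [], i
        while cur is not None:
            chain.append(P[cur])
            cur = match_left[cur]
        chains.append(chain)
    return chains

if __name__ == "__main__":
    P = [frozenset(s) for s in [{1}, {2}, {1, 3}, {2, 4}, {1, 2, 3}, {1, 2, 4}]]
    for c in minimum_chain_partition(P, lambda p, q: p < q):
        print(c)   # recovers the two chains C1 and C2 from the example in Sect. 2
```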
However, finding a good partition of a pattern collection into chains faces the following two difficulties: 1. The pattern collections can be enormously large. 2. The partial order might not be known explicitly. The first problem can be overcome by searching for a maximal matching instead of a maximum matching. A maximal matching in G = (Vl, Vr, E) can be found in time O(|E|) by trying to add the edges one by one, in arbitrary order, to the matching. It is easy to show that a maximal matching is at least half of the maximum matching. Unfortunately this does not imply any nontrivial approximation quality guarantees for the chain partition. To see this, consider the set {1, 2, . . . , 2n} and let the partial order ≺ be {(i, j) : i < j}. The maximum matching {(1, 2), (3, 4), . . . , (2n − 1, 2n)} determines just one chain C = {1, 2, . . . , 2n}, whereas the worst maximal matching {(1, 2n), (2, 2n − 1), . . . , (n, n + 1)} determines n chains C1 = {1, 2n}, C2 = {2, 2n − 1}, . . . , Cn = {n, n + 1}. Thus in the worst case the solution found by the greedy algorithm is |P|/2 times worse than the optimal solution. The quality of a maximal matching, i.e., a minimal chain partition, can be improved by finding a total order that conforms to the partial order. If the partial order is known explicitly then a total order conforming to it can always be found using topological sorting. Sometimes it is easy to compute a total order even without knowing the partial order explicitly. This is the case, for example, with the frequent sets: the sets can be sorted w.r.t. their cardinalities. This kind of auxiliary information can significantly reduce the size of the chain partition found using a maximal matching algorithm. The amount of improvement depends on how well the total order is able to capture the essence of the partial order. For example, in the case of the set {1, 2, . . . , 2n}, matching the elements greedily in ascending (or descending) order produces the maximum matching. If the partial order is given implicitly then its explicit computation might itself be the major computational bottleneck of chaining. The time complexity of the brute force solution, i.e., testing all pairs of patterns in P, is O(|P|²). In the worst case this time bound is asymptotically optimal, as the whole set P can be an antichain: then each pair of patterns in P must be compared to verify that P is an antichain.
Partial orders have two useful properties that can be exploited when computing (the explicit representation of) the partial order: transitivity and irreflexivity. Because of transitivity we know that if p ≺ q and q ≺ r then p ≺ r. Irreflexivity (together with transitivity) guarantees that the graph G = (P, ≺) is acyclic. The partial order can be obtained as a side product of the minimal chain partition as follows:
Init. Initialize the number m of chains to zero.
Growth. For each p ∈ P, add p to one of the existing chains Ci, 1 ≤ i ≤ m, if possible, or create a new chain Cm+1 for p and increase m by one.
Comparison. For all chains Ci and Cj, compute the partial order of Ci ∪ Cj.
To make the complexity analysis of the above procedure simpler, we assume that any two patterns in the pattern collection can be compared in constant time. Then figuring out whether a pattern p can be added to a chain Ci, 1 ≤ i ≤ m, can be done in time O(|Ci|). The comparison step can be implemented in several ways. Some of the partial order is revealed already when each pattern p ∈ P has been tried against the existing chains. The partial order for the union Ci ∪ Cj of two chains Ci and Cj can be computed from the partial orders of Ci and Cj in time O(|Ci| + |Cj|). However, as the brute force solution is worst-case optimal, the efficiency of different comparison heuristics has to be evaluated experimentally.
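The Growth step above, combined with the cardinality-based total order suggested for frequent sets, can be sketched in a few lines of Python. Checking only the last element of each chain suffices here because, when the patterns are processed in an order conforming to the partial order, a pattern that succeeds a chain's last element also succeeds every earlier element of that chain; the Comparison step is omitted.

```python
def greedy_chain_partition(P, precedes, key=len):
    """Greedy (minimal, not necessarily minimum) chain partition: process the
    patterns in a conforming total order (here: by cardinality) and append each
    pattern to the first chain whose last element precedes it; otherwise open a
    new chain. This mirrors the Init and Growth steps described above."""
    chains = []
    for p in sorted(P, key=key):
        for chain in chains:
            if precedes(chain[-1], p):
                chain.append(p)
                break
        else:
            chains.append([p])
    return chains

if __name__ == "__main__":
    P = [frozenset(s) for s in [{1}, {2}, {1, 3}, {2, 4}, {1, 2, 3}, {1, 2, 4}]]
    for c in greedy_chain_partition(P, lambda p, q: p < q):
        print(c)   # on this example the greedy partition is also minimum (2 chains)
```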
4 Experiments
We tested the condensation abilities of the pattern chaining using two data sets: Internet Usage data consisting of 10104 rows and 10674 attributes and IPUMS Census data consisting of 88443 rows and 39954 attributes. The data sets were downloaded from the UCI KDD Repository¹. From the data sets we computed the closed frequent sets, the minimal and the minimum chain partitions of the closed frequent sets, and the maximal frequent sets, with different minimum frequency thresholds. The closed frequent sets were sorted by their cardinalities before finding the minimal chain partitions. Results are shown in Figures 1 and 2. The number of chains is smaller than the number of closed sets. Thus the idea of finding a minimum chain partition seems to be useful in that sense. Even more interesting results were obtained when comparing the minimal and the minimum chain partitions: the greedy heuristic produced almost as good solutions as the much more computationally demanding bipartite matching. We got similar results with other data sets we experimented with. However, it is not clear whether the quality of maximal matchings is specific to frequent sets or whether the results hold for other pattern collections. It is worth remembering that the fundamental assumption in frequent set mining is that only not very large sets are frequent, since all the subsets of a frequent set are frequent.
¹ http://kdd.ics.uci.edu
This implies that the chains cannot be very long, as the length of the longest chain is equal to the size of the largest frequent set. This observation makes the results even more satisfactory.

Fig. 1. IPUMS Census data: the number of patterns (closed frequent sets, chains from greedy matching, chains from optimal matching, maximal frequent sets) and the relative number of patterns w.r.t. the maximal patterns, as functions of the minimum frequency threshold.

Fig. 2. Internet Usage data: the number of patterns (closed frequent sets, chains from greedy matching, chains from optimal matching, maximal frequent sets) and the relative number of patterns w.r.t. the maximal patterns, as functions of the minimum frequency threshold.
5 Conclusions
In this paper we have introduced the chain partitions of partially ordered pattern collections as a high-level approach to condense and structure pattern collections, even already condensed ones, and also as a structural clustering of the pattern collection. We described how the minimum chain partitions can be found and how the computation of the minimum chain partition and the extraction of the explicit chain partitions can be made more efficient. Also, we showed that the chain partitions of pattern collections are useful in practice. However, there are still many important open problems related to pattern chains:
– What kind of additional constraints for the chain partitions are computationally tractable and useful, especially in data analysis?
– Are there efficient approximation algorithms (with approximation quality guarantees) for finding the chain partitions to cope with massive pattern collections?
– Can the minimum chain partitions make some existing pattern discovery algorithms run faster?
– What kind of consensus patterns of pattern collections, besides chain partitions, are valuable?
– In general, how should the structure of pattern collections be exploited?
Acknowledgments. I wish to thank Matti Kääriäinen, Heikki Mannila and Janne Ravantti for helpful comments on the manuscript and refreshing discussions.
References
1. Hand, D.J.: Pattern detection and discovery. In Hand, D., Adams, N., Bolton, R., eds.: Pattern Detection and Discovery. Volume 2447 of LNAI. Springer-Verlag (2002) 1–12
2. Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001)
3. Mannila, H.: Local and global methods in data mining: Basic techniques and open problems. In Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R., eds.: Automata, Languages and Programming. Volume 2380 of LNCS. Springer-Verlag (2002) 57–68
4. Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H., Sharma, R.S.: Discovering all most specific sentences. ACM Transactions on Database Systems 28 (2003) 140–174
5. Hipp, J., Güntzer, U., Nakhaeizadeh, G.: Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations 1 (2000) 58–64
6. Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1 (1997) 241–258
7. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996) 307–328
8. Boulicaut, J.F., Bykowski, A.: Frequent closures as a concise representation for binary data mining. In Terano, T., Liu, H., Chen, A.L.P., eds.: Knowledge Discovery and Data Mining. Volume 1805 of LNAI. Springer-Verlag (2000) 62–73
9. Boulicaut, J.F., Bykowski, A., Rigotti, C.: Approximation of frequency queries by means of free-sets. In Zighed, D.A., Komorowski, J., Żytkow, J., eds.: Principles of Knowledge Discovery and Data Mining. Volume 1910 of LNAI. Springer-Verlag (2000) 75–85
10. Boros, E., Gurvich, V., Khachiyan, L., Makino, K.: On the complexity of generating maximal frequent and minimal infrequent sets. In Alt, H., Ferreira, A., eds.: STACS 2002. Volume 2285 of LNCS. Springer-Verlag (2002) 133–141
11. Bykowski, A., Rigotti, C.: A condensed representation to find frequent patterns. In: Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM (2001)
12. Calders, T., Goethals, B.: Mining all non-derivable frequent itemsets. In Elomaa, T., Mannila, H., Toivonen, H., eds.: Principles of Data Mining and Knowledge Discovery. Volume 2431 of LNAI. Springer-Verlag (2002) 74–85
13. Calders, T., Goethals, B.: Minimal k-free representations of frequent sets. In Lavrac, N., Gamberger, D., Todorovski, L., Blockeel, H., eds.: Principles of Knowledge Discovery and Data Mining. LNAI, Springer-Verlag (2003)
14. Geerts, F., Goethals, B., Mielikäinen, T.: What you store is what you get (extended abstract). In: 2nd International Workshop on Knowledge Discovery in Inductive Databases. (2003)
15. Gouda, K., Zaki, M.J.: Efficiently mining maximal frequent itemsets. In Cercone, N., Lin, T.Y., Wu, X., eds.: Proceedings of the 2001 IEEE International Conference on Data Mining, IEEE Computer Society (2001) 163–170
16. Kryszkiewicz, M.: Concise representation of frequent patterns based on disjunction-free generators. In Cercone, N., Lin, T.Y., Wu, X., eds.: Proceedings of the 2001 IEEE International Conference on Data Mining, IEEE Computer Society (2001) 305–312
17. Mannila, H., Toivonen, H.: Multiple uses of frequent sets and condensed representations. In Simoudis, E., Han, J., Fayyad, U.M., eds.: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press (1996) 189–194
18. Mielikäinen, T.: Frequency-based views to pattern collections. In: IFIP/SIAM Workshop on Discrete Mathematics and Data Mining. (2003)
19. Mielikäinen, T., Mannila, H.: The pattern ordering problem. In Lavrac, N., Gamberger, D., Todorovski, L., Blockeel, H., eds.: Principles of Knowledge Discovery and Data Mining. LNAI, Springer-Verlag (2003)
20. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In Beeri, C., Buneman, P., eds.: Database Theory – ICDT'99. Volume 1540 of LNCS. Springer-Verlag (1999) 398–416
21. Pavlov, D., Mannila, H., Smyth, P.: Beyond independence: probabilistic methods for query approximation on binary transaction data. IEEE Transactions on Data and Knowledge Engineering (2003) To appear.
22. Pei, J., Dong, G., Zou, W., Han, J.: On computing condensed pattern bases. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9–12 December 2002, Maebashi City, Japan, IEEE Computer Society (2002) 378–385
23. Mielikäinen, T.: Finding all occurring sets of interest. In: 2nd International Workshop on Knowledge Discovery in Inductive Databases. (2003)
24. Pei, J., Han, J., Mao, T.: CLOSET: An efficient algorithm for mining frequent closed itemsets. In Gunopulos, D., Rastogi, R., eds.: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. (2000) 21–30
25. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Computing iceberg concept lattices with Titanic. Data & Knowledge Engineering 42 (2002) 189–222
26. Zaki, M.J., Hsiao, C.J.: CHARM: An efficient algorithm for closed itemset mining. In Grossman, R., Han, J., Kumar, V., Mannila, H., Motwani, R., eds.: Proceedings of the Second SIAM International Conference on Data Mining, SIAM (2002)
27. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing 61 (2001) 350–371
28. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In Chen, W., Naughton, J.F., Bernstein, P.A., eds.: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM (2000) 1–12
29. Jukna, S.: Extremal Combinatorics: With Applications in Computer Science. EATCS Texts in Theoretical Computer Science. Springer-Verlag (2001)
30. Lovász, L., Plummer, M.: Matching Theory. Volume 121 of Annals of Discrete Mathematics. North-Holland (1986)
31. Galil, Z.: Efficient algorithms for finding maximum matchings in graphs. ACM Computing Surveys 18 (1986) 23–38
An Algorithm for Discovery of New Families of Optimal Regular Networks

Oleg Monakhov and Emilia Monakhova

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, Pr. Lavrentieva, 6, Novosibirsk, 630090, Russia
{monakhov,emilia}@rav.sscc.ru
Abstract. This work describes a new algorithm for discovery of analytical descriptions of new dense families of optimal regular networks using evolutionary computation. We present the new families of the networks of degree 3 and 6 obtained by the discovery algorithm which improve previously known results.
1 Introduction
The design of interconnection networks for parallel computer system architectures and distributed memory computer systems requires the study of undirected dense regular graphs with small diameters. Graphs with these properties are included in the class of the Parametrically described, Regular, and based on Semigroups (PRS) networks introduced by Monakhov (1979) [10,11]. The PRS networks are a generalization of hypercubes, cube-connected cycles, circulant graphs [3], chordal ring networks [1] and other classes of graphs used as interconnection networks of computer systems. In [9] Gaussian cubes are considered as a generalization of hypercubes which also represents a subclass of the PRS graphs. Note that the PRS graphs find application not only in artificial communication networks but also in natural processes (in chemistry) as reaction graphs, in which the nodes symbolize chemical species (molecules or reaction intermediates) and the edges represent elementary reaction steps [2]. In this work we introduce a new template-based evolutionary algorithm for computer discovery of analytical descriptions of new dense families of optimal PRS networks over a very large range of orders. This approach differs from the standard way of searching for dense families of optimal networks by theoretical mathematical methods based on graph theory, group theory and regular tessellations of the plane [3,6,12,14,16].
2 Preliminary Definitions and Properties of PRS Networks
In this section, we review some graph theoretic terms we will need later and we define PRS graphs and their properties.
A graph G is a structure (V, E), where V is a set of nodes and E is a set of edges, with each edge joining one node to another. The order of a graph is the number of its nodes. The degree of a node a is the number of edges incident to node a. A graph is regular if all nodes have the same degree. A path in a graph is defined as a sequence of adjacent edges such that no edge occurs twice and the first and last nodes are distinct; the length of the path is the number of its edges. A circuit in the graph G is a path in which the initial and terminal nodes coincide. The minimum length of any circuit in G is called the girth of the graph. The distance between two nodes a and b is the length of a shortest path between a and b. The diameter of a graph is the maximum distance over all pairs of nodes. The diameter represents the worst message delay in the network. We now define the Parametrically described, Regular, and based on Semigroups (PRS) networks, or Rs(N, v, g) graphs, with the order N, the degree v, the girth g and the number of equivalence classes s. Let Rµ(N, v, g) be a graph with the order N, the degree v, the girth g and the set of nodes V = {1, 2, ..., N}, the set of edges E ⊆ V², the group of automorphisms Aut(R), and the equivalence relation µ forming a partition of the set of nodes V into m ≤ N classes Vi, such that for each pair of nodes k, j ∈ Vi, i = 1, m, an automorphism ϕ ∈ Aut(R) exists which transforms k to j:

  ∀(k, j ∈ Vi) ∃(ϕ ∈ Aut(R)) (ϕ(k) = j).   (1)

The equivalence µ on the set of nodes V of the graph Rs(N, v, g), which will be considered further, is the congruence modulo a divisor s of N, i.e.

  µ = {(a, b) ∈ V² | a ≡ b (mod s)},   (2)

where s ≤ N, N ≡ 0 (mod s). In this case, if the equivalence µ is defined by expression (2), we denote Rµ(N, v, g) graphs as Rs(N, v, g) graphs. We distinguish the class of Rs(N, v) graphs that includes all Rs(N, v, g) graphs with fixed values of s, N and v. Thus, the total set of nodes V of the graph Rs(N, v) is subdivided into s equivalence classes Vi: Vi = {a | a ∈ V, a ≡ i (mod s)}, where i = 1, s. Let r = N/s. It follows from the definition of Rµ(N, v, g) graphs that for each pair of nodes a, b ∈ Vi, i = 1, s, of the graph Rs(N, v) an automorphism ϕ ∈ Aut(R) exists such that ϕ(a) = b, that is:

  b ≡ a + js (mod N),  j = 1, r.   (3)
If in the graph Rs (N, v) two nodes a and c are connected by the edge (a, c) ∈ E and c−a ≡ l(mod N ), where l is a natural number and l < N , then we call l a mark of the edge (a, c). Note that the edge has also the mark l ≡ a − c(mod N ). Informally, we can describe the graph Rs (N, v) as a regular graph of degree v with nodes labelled with integers modulo N and, for any two nodes a and c, node a is joined to node c if and only if node a + s(mod N ) is joined to node c + s(mod N ), where s is a divisor of N .
Lemma. If two nodes a ∈ Vi, i ∈ 1, s, and c ∈ V of the graph Rs(N, v) are connected by the edge (a, c) ∈ E with the mark l, then an edge (b, d) ∈ E with the mark l is incident to each node b ∈ Vi.
Proof. Since a, b ∈ Vi, it is seen from (3) that there exists an automorphism of the graph Rs(N, v) such that ϕ(a) = b and congruence (3) is true. Let ϕ(c) = d; then we have

  d ≡ c + js (mod N),  j = 1, r.   (4)

If (a, c) ∈ E then we obtain (ϕ(a), ϕ(c)) ∈ E, i.e. (b, d) ∈ E. From expressions (3) and (4) we have d − b ≡ c − a (mod N), and since c − a ≡ l (mod N), then d − b ≡ l (mod N).
Corollary. Let the node a ∈ Vi, i ∈ 1, s, in the graph Rs(N, v). We denote by Li = {lik}, i ∈ 1, s, k = 1, v, the set of marks of the edges incident to the node a. Then the set of marks of the edges incident to any node b ∈ Vi is Li.
We call the set L = {lik}, i = 1, s, k = 1, v, a set of marks of edges (or generators) of the graph Rs(N, v). Two nodes a and b of the graph Rs(N, v) are connected by the edge (a, b) ∈ E if and only if there exists a natural number lik < N, where lik ∈ L, i ∈ 1, s, k ∈ 1, v, such that if a ≡ i (mod s), then b − a ≡ lik (mod N), i.e.

  (a, b) ∈ E ⇔ (∃ lik ∈ L)(a ≡ i (mod s)) & (b − a ≡ lik (mod N)).

Let Eik denote the set of edges with the mark lik: Eik = {(a, b) ∈ E | a ≡ i (mod s), b ≡ a + lik (mod N)}, where i ∈ 1, s, k ∈ 1, v. Edges from the set Eik also have the mark

  ljm = N − lik,   (5)

where j ≡ i + lik (mod s), i, j ∈ 1, s, k, m ∈ 1, v, and, hence, the sets Ejm and Eik coincide. Let L* denote the minimal necessary set of marks. In order to go from the set L to the set L*, we must delete from the set L one mark from each pair of marks connected by relation (5). For the inverse transition from L* to L it is necessary to find an additional mark by relation (5) for each mark from L*. Thus, if the number of nodes N, the number of equivalence classes s, and the set of marks L (or L*) are given, then the Rs(N, v, g) graph is completely defined, and we can also use the following notation for it: G(N; L).
Example. Let us consider the Petersen graph (Figure 1, left) on 10 nodes of degree 3 with g = 5, s = 2. One can see that the node i is joined to nodes i + 1, i + 2, i + 8 if i is odd and to nodes i + 4, i + 6, i + 9 if i is even (all the numbers are to be taken modulo 10). The Petersen graph is the graph R2(10, 3, 5) with the set of marks (generators) L = {1, 2, 8; 4, 6, 9}. The minimal set of marks is L* = {1, 2; 4}. Figure 1 (right) shows the graph R2(20, 4, 5) on 20 nodes of degree 4. It has two equivalence classes of nodes and the set of marks L = {1, 3, 4, 16; 8, 12, 17, 19}. The minimal set of marks is L* = {1, 3, 4; 8}.
Let us present some properties of the graphs Rs(N, v):
1. V = ⋃_{i=1}^{s} Vi, Vi ∩ Vj = Ø for i ≠ j, |Vi| = r, where i = 1, s.
2. E = ⋃_{i=1}^{s} ⋃_{k=1}^{v} Eik, |Eik| = r, where i = 1, s, k = 1, v, |E| = srv/2.
Fig. 1. Petersen graph (left). R2 (20, 4, 5) graph (right).
3. If nodes a, b ∈ V of the graph Rs(N, v) are connected, then b − a ≡ Σ_{i=1}^{s} Σ_{k=1}^{v} lik tik (mod N), where tik is the number of edges with the mark lik which belong to the path from a to b.
4. The graph Rs(N, v) [10] is described by a semigroup and it is isomorphic to the graph of the semigroup of transformations of the classes of equivalence.
We now give a representation of some known network topologies as subclasses of Rs(N, v, g) graphs. The hypercubes can be described as Rs(2^v, v, 4) graphs with s = 2^{v−2}. For example, for N = 2³ the description of the hypercube R2(2³, 3, 4) has the form L = {1, 2, 6; 2, 6, 7}. The circulant graphs [3,5,12] can be described as R1(N, v, 4) graphs with s = 1 and g = 4. Circulant graphs are intensively researched in computer science, graph theory and discrete mathematics and are realized as interconnection networks in some computer systems (MPP, Intel Paragon, Cray T3D, etc.). Let us recall the generally accepted description for them. A circulant is an undirected graph G(N; l1, l2, . . . , lv/2) with N nodes, labelled as 1, 2, . . . , N, having the nodes i ± l1, i ± l2, . . . , i ± lv/2 (mod N) adjacent to each node i. The marks L* = {li}, where 0 < l1 < . . . < lv/2 ≤ N/2, are generators of the finite Abelian group of automorphisms connected to the graph. Circulant graphs G(N; 1, l2, . . . , lv/2) with the identity generator are known as loop networks. For s = 2 and v = 3 the class of Rs(N, v, g) graphs includes the well-known classes of chordal ring networks [1], generalized chordal ring networks [14] and generalized Petersen graphs [16,3]. For example, the parametric description of the chordal ring networks R2(N, 3, g) has the form L* = {1, a; 1}, the generalized chordal ring networks R2(N, 3, g) have the form L* = {a, b; c}, where a, b, c are the chords (different odd integers), and the generalized Petersen graphs R2(N, 3, g) have the form L* = {a, b; c}, where a is even and b, c are odd integers.
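To make the parametric description concrete, the following Python sketch builds the adjacency lists of a graph G(N; L) from N, s and the mark set L, and checks it on the Petersen example above. The dictionary encoding of L (residue class → list of marks) is our own; the paper only fixes the notation L = {l_ik}.

```python
def prs_graph(N, s, L):
    """Adjacency sets of an R_s(N, v) graph G(N; L) on nodes 1..N: a node a with
    a ≡ i (mod s) is joined to a + l (mod N) for every mark l in L[i]. Classes are
    indexed 1..s, with residue 0 mapped to class s."""
    adj = {a: set() for a in range(1, N + 1)}
    for a in range(1, N + 1):
        i = a % s or s                   # equivalence class of node a
        for l in L[i]:
            b = (a + l - 1) % N + 1      # a + l (mod N), kept in the range 1..N
            adj[a].add(b)
            adj[b].add(a)                # edges are undirected
    return adj

if __name__ == "__main__":
    # Petersen graph R_2(10, 3, 5) with marks L = {1, 2, 8; 4, 6, 9}.
    petersen = prs_graph(10, 2, {1: [1, 2, 8], 2: [4, 6, 9]})
    assert all(len(nbrs) == 3 for nbrs in petersen.values())   # 3-regular
    print(sorted(petersen[1]))   # neighbours of node 1: [2, 3, 9]
```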
3 Optimization Problem for PRS Networks
The optimization problem considered in the paper is to find an optimal graph Rs(N, v) having the minimal diameter d for any given number of nodes N, degree v and number of equivalence classes s. This optimization problem is a problem of integer-valued programming with a nonlinear objective function. When implemented as interconnection networks in multimodule supercomputer systems, optimal graphs achieve the best efficiency characteristics with respect to information transmission delays, reliability, connectivity, and speed of communications [3,5,12]. The following algorithms were proposed and implemented for solving the problem: an exhaustive search algorithm, a branch-and-bound algorithm, a genetic algorithm, simulated annealing and a random search algorithm. However, these algorithms are usually used to solve the optimization problem for fixed values of N under given values of v, g, s, and they yield fixed values of the generators. So it is necessary to develop new effective methods for the search for dense families of PRS graphs with analytical descriptions of generators, which make it possible to compute the parameters of descriptions of optimal graphs by formulas and to generate infinite families of such graphs. In the literature the following dense families of PRS networks are known; they are either optimal or have diameter less than or equal to a prescribed value. We will focus on the PRS graphs of degree 3, 4 and 6 as the more interesting cases for interconnection networks.
– For v = 3, some families of generalized chordal ring networks R2(N, 3, g) with the minimum diameter d have been found in [14]. The analytical descriptions of the orders N and the generators a, b, c are shown in Table 1.

Table 1.
N              d       a    b        c
6p² + 4p       2p + 1  −1   −6p − 1  1
6p² + 2p − 2   2p + 1  −1   −6p + 1  1
6p² + 2p       2p + 1  −1   6p − 1   1
6p² − 2        2p + 1  1    −1       −6p + 3
6p² + −4       2p      1    −1       −6p + 1
– For generalized Petersen graphs (v = 3) [16], an analytical description has been found for graphs with the largest order for given value of diameter d: N = (2d − 3)2 + 1, a = 1, b = 2d − 4 , c = 2d − 2. – For v = 4, an analytical solution has been found in [4,5,13], namely: for any N > 4, optimal circulant graph has the following generators: √ l1 = ( 2N − 1 − 1)/2, l2 = l1 + 1. (6) – For v = 6, the authors in [6] and Delorme (private communication in [3])) found a dense infinite family of circulant graphs (loop networks) with diameter smaller than or equal to d ≥ 3, where
250
O. Monakhov and E. Monakhova
N = 32 d3 3 + 8 d3 2 + 2 d3 , L∗ = {1, 4 d3 , (4 d3 )2 }
(7).
Note that in the most cases the discovery of infinite families of graphs with extremal properties was produced by experience and intuition of researchers. In this work, we develop a new approach using evolutionary computation [7, 8,15] to automatically generate analytical parametric descriptions of families of PRS networks with good extremal properties and to obtain families with better diameters than ones known in the literature. We consider a solution of the following problem: to find functions f1 , f2 , ..., fn , n ≥ 2, for analytical descriptions of generators li = fi (d, N ), i = 1, n, of optimal PRS networks G(N ; l1 , ..., ln ): (a) for a given range of N = Nmin ÷ Nmax , or (b) for a given function N = fN (n, d).
4
Discovery Algorithm
The computer discovery algorithm is based on the evolutionary computation and the simulation of the survival of the fittest in a population of individuals, each being presented by a point in the space of solutions of the optimization network problem. The individuals are presented by strings of functions (analytical representations of sets of generators). Each population is a set of generator sets f1 , f2 , ..., fn , n ≥ 2, for families of PRS graphs with orders Nmin ≤ N ≤ Nmax taking in the range all the values or some of them. In the last case, N may be also assigned as a function of n and d. The main idea of these algorithms consists in evolutionary transformations over sets of analytical descriptions of graphs (formulas) based on a natural selection: “the strongest” survive. In our case these solutions are graphs giving the best possible diameter. In the algorithm the starting point is the generation of the initial population. All individuals of the population are created at random, the best individuals are selected and saved. To create the next generation, new solutions are formed through genetic operations named selection, mutation and adding new elements (for a variety of population). The function F named as fitness function evaluates the sum of diameters of the graphs G(N ; f1 , f2 , ..., fn ) with generator set f1 , f2 , ..., fn , n ≥ 2, and orders N , Nmin ≤ N ≤ Nmax . The purpose of the algorithm is to search for a minimum of F . 4.1
Data Representation
The basic data in our program realizing the algorithm are sets of functions f1 , f2 , ..., fn ,n ≥ 2. Based on the known descriptions of the optimal graph families, we propose the following two generalized templates for functions fi . The first template is used for each function fi , i = 1, n: h j=1
[sgj (a/b)N x (d + ∆d)z ]tr j ,
An Algorithm for Discovery of New Families of Optimal Regular Networks
251
where: a, b ∈ {C}, | x |≤ n, | z |≤ n, h ≤ n + 1, ∆d ∈ {0, 1, ..., 2n}, sgj ∈ {+, −}, tr ∈ { p, p} is a type of rounding, C is a set of natural constants. An example of the presentation of functions fi (a set of generators or a chromosome) for n = 2 is stated in Table 2. Table 2. Example of a chromosome for n = 2 h=2 f1 f2 j = 1 sg11 a11 b11 x11 z11 tr11 sg21 a21 b21 x21 z21 tr21 j = 2 sg12 a12 b12 x12 z12 tr12 sg22 a22 b22 x22 z22 tr22
Also, we use the second template for functions fi , i = 1, n: a[((yN + sg1 ∆N )x + sg2 z)/b]tr , where: a, b ∈ {C}, | x |≤ n, z, y, ∆N ∈ {0, 1, ..., 2n}, sgj ∈ {+, −}, tr ∈ { p, p} is a type of rounding. Using the templates, we create an expression for each function fi and can already produce all evaluations and modifications on it. Thus, for fixed values of d and parameters Π = {a, b, x, y, z, ∆d, ∆N, sg, tr} we can calculate the values of functions fi (d, N ), i = 1, n. 4.2
Fitness Function
In the program realizing the computer discovery algorithm, the fitness function F is determined by the following way. In a cycle we change N from Nmin to Nmax . Based on the given analytical description of the family of the tested graphs, the diameter d(G) of the graph G(N ; f1 , f2 , ..., fn ) is computed for every N . Then it is compared with the diameter d(N ) of the optimal graph. If they are equal, then the graph G(N ; f1 , f2 , ..., fn ) is optimal. The fitness function F evaluates the sum of deviations (quadratic deviations) of diameters for the family of the tested graphs from the diameters d(N ), that is: (d(G(N ; f1 , ..., fn )) − d(N ))2 . F = Nmin ≤N ≤Nmax
The fitness function shows the quality of analytical description of the graph G(N ; f1 , f2 , ..., fn ). This fitness function was used for obtaining families of optimal PRS networks. 4.3
Operators of the Algorithm
The mutation operator is applied to the generators chosen randomly from the current population with a probability pm ∈ [0, 1]. Mutation represents a modification of an individual whose number is randomly selected. A modification
252
O. Monakhov and E. Monakhova
is understood as a replacement of an randomly chosen parameter from Π by another value selected at random from the available list constants. The creation of a new element (individual) is the generation of random parameters for the template of functions. It allows to add an element of chance to the creation of a population. The selection operator realizes the principle of the survival of the fittest individuals. It selects the best individuals with the minimal diameters (i.e. descriptions of the best families of PRS graphs) in the current population. 4.4
Iteration Process
In the search for the optimum of fitness function F the iteration process in the computer discovery algorithm is organized by the following way. First iteration: a generation of the initial population. It is realized as follows. All individuals of the population are created by means of operator new element (with a test and rejection of all “impractical” individuals). After filling the whole array of the population, the best individuals are selected and saved in an array best. One iteration: a step from the current population towards the next population. The basic step of the algorithm consists of creating a new generation on the basis of array best using selection, mutation and also adding some new elements. After evaluation of fitness function for each individual of the generation, we spend a comparison of the value of this function to values of fitness function of those individuals which are saved in the array best. In the case, if an element from the new generation is better than an element best[i], for some i, we locate the new element on place i and shift all remaining ones per a unit of downwards. Thus, the best element is located at the top of the array best. Last iteration (the termination criterion): the iterations are finished either after a given number of steps T = t or after finding a given number of optimal graphs in the given range of orders N . By producing a given amount of the basic steps of the algorithm, we obtain a set of functions f1 , f2 , ..., fn which describes a family of optimum (or nearly optimum) PRS graphs.
5
Experimental Results
We applied the discovery algorithm for obtaining families of PRS graphs of degree 3, 4 and 6. The number of iterations and population size were chosen by experimental way based on parameters from [7],[8]. The result of execution of the program shown that the algorithm found a parametric description of the form (6) for optimal degree 4 circulants in a large range of values of N (N = 100 ÷ 5000). This solution has been found after 500 iterations with a population of 40. This result corresponds to the known exact description of optimal circulant graphs for v = 4 and any N .
An Algorithm for Discovery of New Families of Optimal Regular Networks
253
For circulants with degree 6 the discovery algorithm found the description (7) of the family of graphs described in [6]. It also found a new family of circulant graphs of degree 6 with significantly larger number of nodes for given diameter than (7) (see Table 3).
Table 3. Families of circulant networks v=6 N d l1 Known 42, 292, 942, 2184, 4210, 7212 3, 6, 9, 12, 15, 18 1 family [6] New 55, 333, 1027, 2329, 4431, 7525 3, 6, 9, 12, 15, 18 1 family
l2 4 d3 8 2 d 9
l3 (4 d3 )2
+ 23 d 89 d2 + 2d + 2
New results have also been obtained for well known class of generalized Petersen graphs (v = 3) [16,3], described by PRS network R2 (N, 3, g) in our notation. The algorithm has automatically generated previously unknown descriptions of infinite family of PRS network R2 (N, 3, g) in the form L∗ = {1, 2d − 4; 2d − 2}, with s = 2 and the minimum diameter d ≥ 4 for each even N ≥ 16. The new description and some examples are represented in Table 4. The value of d was varied from d = 3 to d = 200. This solution was found after 45 iterations with a population of 100 individuals. The discovery algorithm found also new unified optimal description for any even number of nodes N ≥√16 for this class√of PRS networks R2 (N, 3, g) in the following form: L∗ = {1, 2 N − 2/2; 2 N − 1/2}. This solution was found after 158 iterations with a population of 100 individuals.
Table 4. New family of PRS networks with v = 3 N ≥ 16 Diameter a b (2d − 5)2 + 3 ≤ N ≤ (2d − 3)2 + 1 d 1 2d − 4 (2d − 3)2 + 3 ≤ N ≤ (2d − 2)2 + 2 d + 1 1 2d − 4 16 ÷ 26 4 1 4 28 ÷ 38 5 1 4 28 ÷ 50 5 1 6 52 ÷ 66 6 1 6 52 ÷ 82 6 1 8 84 ÷ 102 7 1 8 38028 ÷ 38810 100 1 196 38812 ÷ 39208 101 1 196
c 2d − 2 2d − 2 6 6 8 8 10 10 198 198
254
6
O. Monakhov and E. Monakhova
Future Work
We are developing our template-based evolutionary algorithm in the following directions. – During the evolutionary process we allow to change not only the parameters of the given templates, but also the structure of the templates. – We can use not only the mathematical formulas for the templates but also the program templates (e.g. iterations, loops and cycles), which describe the scanning of the complex data structures (matrixes, arrays, graphs, trees) and which can contain the formula templates in the body. This approach can be applied not only for discovery of mathematical formulas, but also for invention of new computational algorithms for the given data sets.
7
Conclusions
Represented approach has been used successfully to automatically generate descriptions of families of PRS networks with good extreme properties. The realized discovery algorithm found easily previously known families for graphs of degree 4 and 6. It also generated descriptions of new optimal families of PRS networks with degree 3 and found a new family of graphs of degree 6 which significantly improve previously known results. Some of these descriptions became a basis for theoretical investigations. Their validity was theoretically proved for any value of parameter d. The proposed method can be applied to the check of hypotheses about the quality of tested descriptions of graphs and existence of families of graphs with the properties desired. The discovery algorithm can also be useful to find families of graphs with extremal properties for other known classes of networks with parametric description. The extension of the presented template-based evolutionary algorithm can be applied not only for discovery of mathematical formulas, but also for invention of new computational algorithms.
References 1. B.W. Arden and H. Lee, “Analysis of chordal ring networks,” IEEE Trans. Computers, C-30, 1981, pp. 291-295. 2. A.T. Balaban, “Reaction graphs,” in Graph Theoretical Approaches to Chemical Reactivity, D. Bonchev and O. Mekenyan (eds.), Kluwer Academic Publishers, Netherlands, 1994, pp. 137–180. 3. J.-C. Bermond, F. Comellas, and D.F. Hsu, Distributed loop computer networks: a survey, J. Parallel Distributed Comput., 24, 1995, pp. 2–10. 4. J.-C. Bermond, G. Illiades, and C. Peyrat, An optimization problem in distributed loop computer networks, In Proc. of Third International Conference on Combinatorial Math. New York, USA, June 1985, Ann. New York Acad. Sci., 555, 1989, pp. 45–55.
An Algorithm for Discovery of New Families of Optimal Regular Networks
255
5. F.T. Boesch, and J.-F. Wang, Reliable circulant networks with minimum transmission delay, IEEE Trans. Circuits Syst., CAS-32, 1985, pp. 1286–1291. 6. S. Chen, and X.-D. Jia, Undirected loop networks, Networks, 23, 1993, pp. 257– 260. 7. D.B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, Piscataway, NJ, IEEE Press, 1995. 8. J. Koza, Genetic Programming, Cambridge, M.I.T. Press, 1992. 9. D.-M. Kwai and B. Parhami, “Tight Bounds on the Diameter of Gaussian Cubes,” The Computer Journal, Vol. 41, No. 1, 1998, pp. 52-56. 10. O.G. Monakhov. Parametrical description of structures of homogeneous computer systems, Vychislitelnye sistemy (Computer systems), Novosibirsk, No. 80, 1979, pp. 3–17 (in Russian). 11. O.G. Monakhov, E. A. Monakhova, A Class of Parametric Regular Networks for Multicomputer Architectures. Int’l Scientific Journal “Computacion y Sistemas (Computing and Systems)”, Vol.4, No.2, 2000, pp.85-93. 12. O.G. Monakhov, and E.A. Monakhova, Parallel Systems with Distributed Memory: Structures and Organization of Interactions, Novosibirsk, SB RAS Publ., 2000 (in Russian). 13. E.A. Monakhova, On analytical representation of optimal two-dimensional Diophantine structures of homogeneous computer systems, Computing systems, 90, Novosibirsk, 1981, pp. 81–91 (in Russian). 14. P. Morillo, F. Comellas, and M.A. Fiol, The optimization of chordal ring networks, Communication Technology, World Scientific, 1987, pp. 295–299. 15. H.-P. Schwefel, T. Baeck, Artificial evolution: How and why?, Genetic Algorithms and Evolution Strategy in Engineering and Computer Science – Recent advances and industrial applications, Wiley, Chichester, 1997, pp. 1–19. 16. J.L.A. Yebra, M.A. Fiol, P. Morillo, and I. Alegre. The diameter of undirected graphs associated to plane tessellations. Ars Combinatoria 20B: (1985), pp. 159– 172.
Enumerating Maximal Frequent Sets Using Irredundant Dualization Ken Satoh and Takeaki Uno National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan {ksatoh, uno}@nii.ac.jp
Abstract. In this paper, we give a new algorithm for enumerating all maximal frequent sets using dualization. Frequent sets in transaction data has been used for computing association rules. Maximal frequent sets are important in representing frequent sets in a compact form, thus many researchers have proposed algorithms for enumerating maximal frequent sets. Among these algorithms, some researchers proposed algorithms for enumerating both maximal frequent sets and minimal infrequent sets in a primal-dual way by using a computation of the minimal transversal for a hypergraph, or in other words, hypergraph dualization. We give an improvement for this kind of algorithms in terms of the number of queries of frequency and the space complexity. Our algorithm checks each minimal infrequent set just once, while the existing algorithms check more than once, possibly so many times. Our algorithm does not store the minimal infrequent sets in memory, while the existing algorithms have to store them. The main idea of the improvement is that minimal infrequent sets computed from maximal frequent sets by dualization is still minimal infrequent even if we add a set to the current maximal frequent sets. We analyze the query complexity and the space complexity of our algorithm theoretically, and experimentally evaluate the algorithm to show that the computation time on average is in the order of the multiplication of the number of maximal frequent sets and the number of minimal infrequent sets.
1
Introduction
This paper presents an algorithm for enumerating all maximal frequent sets using dualization. Computing frequent sets in a huge amount of transaction data is an important task in data mining since it is related with computing association rules [Agrawal96]. However, the number of frequent sets is often quite large, thus a compact representation of the frequent sets had been necessary. For this aim, maximal frequent sets have been considered. Any subset of a maximal frequent set is a frequent set thanks to monotonicity of the supporting condition. Then, many algorithms have been proposed to compute maximal frequent sets such as Apriori [Agrawal96] and Max-Miner [Bayardo98]. Apriori is a level-wise bottom-up algorithm which constructs a maximal frequent set from the empty set by increasing the size of frequent sets one by one. Max-miner is basically a G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 256–268, 2003. c Springer-Verlag Berlin Heidelberg 2003
Enumerating Maximal Frequent Sets Using Irredundant Dualization
257
backtrack-based depth-first algorithm by adding an item one by one until the algorithm encounters a maximal frequent set and then backtracks to compute other maximal frequent sets. Yet another interesting algorithms to compute maximal frequent sets based on computing minimal transversals for a hypergraph, computing minimal hitting set, or, in other words, computing a dualization of monotone function [Fredman96] have been proposed [Gunopulos97a,Gunopulos97b]. These problems are equivalent to the problem of enumerating all minimal infrequent sets from the given all maximal frequent sets. The usage of these dualization algorithms is to compute a candidate of a maximal frequent set from the complements of the previously obtained maximal frequent sets. Then, the algorithm avoids frequency checks of any subset of the previous maximal frequent sets. In [Gunopulos97a,Gunopulos97b], they propose such kinds of algorithms. To find a candidate of a maximal frequent set, however, their algorithms computes all minimal hitting sets of the current maximal frequent sets, hence frequency checks for a candidate might be redundant. Moreover, if they use the algorithm by Fredman and Khachiyan [Fredman96], then the space complexity will be in the order of the number of maximal frequent sets and minimal infrequent sets obtained so far. In this paper, we propose a new algorithm which enumerates maximal frequent sets and minimal infrequent sets (minimal hitting sets) simultaneously. We firstly claim that minimal infrequent sets computed from some maximal frequent sets monotonically increase as a new maximal frequent set is added to the current one. Using this property, our algorithm only considers the newly appearing minimal hitting sets when we add a new maximal frequent set, and therefore, we can avoid the redundant frequency checks. In order to receive the benefit of this property, we use a particular type of minimal hitting set algorithms [Kavvadias99,Uno02]. These algorithms compute all minimal hitting sets in an incremental manner. The algorithms first consider the problem consisting of one subset of the given subset family, and find the minimal hitting sets of this problem. Then, the algorithms add a subset from the family one by one to the problem, and find all minimal hitting sets of each augmented problem. The algorithms do not store all the current minimal hitting sets in memory, but enumerate them in a depth first manner. In this paper, we interleave the execution of these algorithms in order to enumerate minimal infrequent sets incrementally and the computation of maximal frequent sets so that we can add a new subset, which correspond with a newly obtained maximal frequent set, to the current subset family during the execution. Thus, we can irredundantly enumerate all maximal frequent sets and minimal infrequent sets in an incremental manner. In this paper, we show that the number of queries in our algorithm is smaller compared with its estimation in the algorithms of [Gunopulos97a, Gunopulos97b], and the space complexity is linear order, w.r.t. the sum of the elements in all maximal frequent sets. Although the time complexity of the ir-
258
K. Satoh and T. Uno
redundant dualization algorithm is not bounded by a polynomial of input and output sizes, we show by experiments that the computation time is proportional to the product of the number of maximal frequent sets and the number of minimal infrequent sets.
2
Enumerating Maximal Frequent Sets by Dualization
We use a set representation in this paper. Note that we can extend our results to more general setting using a technique of [Gunopulos97a], but here we use a set notation for simplicity. However, we abstract a frequency check in a transaction data to check of the following monotonic property. Definition 1. A property of a set, p, is monotonic w.r.t. a set, Π, if the following holds. For every pair of subsets of Π, S1 and S2 , if S1 ⊆ S2 and S2 satisfies p then S1 satisfies p. We call Π the initial set. Definition 2. A subset of Π, S, is a maximal positive set w.r.t. a finite set Π and a monotonic property p if S satisfies p and there exists no proper superset of S, S , s.t. S ⊆ Π and S satisfies p. We denote {S|S is maximal positive w.r.t. Π and p} by Bd+ (p) and call it the positive border w.r.t. Π and p. Definition 3. A subset of Π, S, is a minimal negative set w.r.t. a finite set Π and a monotonic property p if S does not satisfy p and there exists no proper subset of S, S , s.t. S ⊆ Π and S does not satisfy p. We denote {S|S is minimal negative w.r.t. Π and p} by Bd− (p) and call it the negative border w.r.t. Π and p. If we regard p as f requency(S, T ) ≥ δ where T is a set of transaction data and f requency(S, T ) is the number of occurrences of an item set in T which contains S and δ is a threshold, then the positive border is the set of maximal frequent sets and the negative border is the set of minimal infrequent sets. Our task is to find all maximal positive sets which correspond to maximal frequent sets. There are relationships between the positive border and the minimal hitting sets of the negative border defined below. Definition 4. Let Π be a finite set and H be a subset family of Π. A hitting set HS of H is a set s.t. for every S ∈ H, S ∩ HS = ∅. A minimal hitting set HS of H is a hitting set s.t. there exists no other hitting set HS of H s.t. HS ⊂ HS (HS is a proper subset of HS).
Enumerating Maximal Frequent Sets Using Irredundant Dualization
259
Dualize and Advance[Gunopulos97a] 1. 2. 3. 4.
Bd+ := {go up(∅)} Compute M HS(Bd+ ). If no set in M HS(Bd+ ) satisfies p, output M HS(Bd+ ). If there exists a set S in M HS(Bd+ ) satisfying p, Bd+ := Bd+ ∪ {go up(S)} and go to 2. Fig. 1. Dualize and Advance Algorithm
We denote the set of minimal hitting sets of H by M HS(H). We denote the complement of a subset S w.r.t. Π by S. Let S be a set family. We denote {S|S ∈ S} by S. There is a strong connection between the positive border and the negative border by the minimal hitting set operation. Proposition 1. [Mannila96] Bd− (p) = M HS(Bd+ (p)) Using the following proposition, Gunopulos et al. proposed an algorithm called Dualize and Advance in Fig. 1 to compute the positive border w.r.t. Π and p [Gunopulos97a]. Proposition 2. [Gunopulos97a] Let Bd+ ⊆ Bd+ (p). Then, for every S ∈ M HS(Bd+ ), either S ∈ Bd− (p) or S satisfies p (but not both). In the above algorithm, go up(S) for a subset S of P is a maximal positive set which is computed as follows. 1. Select one element e from S and check S ∪ {e} satisfies p. 2. If so, S := S ∪ {e} and go to 1. 3. Else if there is no element e in S such that S ∪ {e} satisfies p, return S. In the above algorithm, we call a check of the satisfaction of p a query for p. Proposition 3. [Gunopulos97a] The number of queries for p in “Dualize and Advance” algorithm to compute Bd+ (p) is at most |Bd+ (p)| · |Bd− (p)| + |Bd+ (p)| · |Π|2 .
3
Algorithm to Avoid Redundant Dualization
We can reduce the above number of queries by using the following lemma. + + + + Lemma 1. Let Bd+ 1 and Bd2 be subsets of Bd (p). If Bd1 ⊆ Bd2 , + − − M HS(Bd+ 1 ) ∩ Bd (p) ⊆ M HS(Bd2 ) ∩ Bd (p)
260
K. Satoh and T. Uno
+ global integer bdpnum; sets bd+ 0 , bd1 ....; main() begin bdpnum := 0; construct bdp(0, ∅); output all the bd+ j (0 ≤ j ≤ bdpnum); end
construct bdp(i, mhs) begin if i == bdpnum /* minimal hitting set for ∪bdpnum bd+ j is found */ j:=0 then goto 1 else goto 2 1.
2.
if mhs does not satisfy p, return; /* new Bd− (p) element is found */ + bd+ bdpnum := go up2(mhs); /* new Bd (p) element is found */ bdpnum := bdpnum + 1; /* proceed to 2 */ + + + for every e ∈ bd+ i s.t. mhs∪{e} is a minimal hitting set of {bd0 , bd1 ..., bdi } do begin construct bdp(i + 1, mhs ∪ {e}); end return;
end Fig. 2. Algorithm to Check Minimal Negative Borders Only Once
− Proof: Suppose that there exists S s.t. S ∈ M HS(Bd+ 1 ) ∩ Bd (p) but + + − − S ∈ M HS(Bd2 ) ∩ Bd (p). Since S ∈ Bd (p), S ∈ M HS(Bd2 ). Thus, there + + exists bd ∈ Bd2 \Bd1 s.t. bd ∩ S = ∅. This means S ⊂ bd. However, since S ∈ Bd− (p), bd cannot satisfy p; contradiction. 2
Suppose that we have already found minimal negative sets corresponding with a subset Bd+ of the positive border. The above lemma means that if we add a maximal positive set to Bd+ , any minimal negative set we found is still a minimal negative set. Therefore, if we can use an algorithm to visit each element in the negative border only once, we no longer have to check the same element again even if maximal positive sets are newly found. Then, using such an algorithm, we can reduce the number of checks. The algorithm in Fig. 2 can do such pruning. In the algorithm in Fig. 2, we use an incremental version of the algorithm shown in Fig. 3 of computing minimal hitting sets. Let H be {S0 , ..., Sn }. We associate an index to each set in H to define a fixed order over sets. The algorithm in Fig. 3 exactly gives all minimal hitting sets of H without any redundant visit of the same minimal hitting set [Uno02]. We call compute mhs(0, ∅) and all minimal hitting sets will be output.
Enumerating Maximal Frequent Sets Using Irredundant Dualization
261
global S0 , ..., Sn ; compute mhs(i, mhs) /* mhs is a minimal hitting set of S0 , ..., Si */ begin if i == n then output mhs and return; else for every e ∈ S s.t. mhs ∪ {e} is a minimal hitting set of S0 , ..., Si do compute mhs(i + 1, mhs ∪ {e}); return; end Fig. 3. Algorithm to Compute Minimal Hitting Sets
Let us consider to use this algorithm in computing the minimal hitting sets of Bd+ (p) incrementally shown in Fig. 2. Suppose that we get a minimal hitting + − set mhs from {bd+ 0 , ..., bdi }. Then, either mhs ∈ Bd (p) or mhs satisfies p − according to Proposition 2. If mhs ∈ Bd (p) then, mhs will be included forever in the minimal hitting sets of any set of future maximal positive sets thanks to Lemma 1. Therefore, we can forget this mhs. If mhs satisfies p, we get a new maximal positive set by go up2(mhs), which is an improved version of go up. This is added to a set of maximal positive sets as bd+ i+1 . Then, this mhs becomes + + no longer a minimal hitting set for {bd+ 0 , ..., bdi , bdi+1 }. It is because that mhs ⊆ + bd+ i+1 and therefore any minimal hitting set for the new augmented set with bdi+1 should be a proper superset of mhs. In the algorithm in Fig. 2, we take the depth-first strategy to process mhs and so, we can always guarantee that processed mhs so far which is still in the − minimal hitting sets of bd+ i (i = 0, ..., n) is in Bd (p). Moreover, instead of using go up, we use the following procedure go up2.
1. For each element e in S, if S ∪ {e} satisfies p, then S := S ∪ {e} 2. Output S Let S be the output of this procedure. If this procedure does not add element e to S in Step 1, then S ∪ {e} does not satisfy p since S ∪ {e} ⊂ S ∪ {e}. Thus, this algorithm surely outputs a maximal positive set including the given set S. The number of queries in go up2 is |S| ≤ |Π|1 . Example 1. Let Π = {a, b, c, d} and Bd+ (p) = {{a, b}, {a, c}, {b, c, d}}. Note that Bd− (p) = {{a, b, c}, {a, d}}. We show the trace of our algorithm as follows. bdpnum := 0; construct bdp(0, ∅) Since i == bdpnum and ∅ satisfies p, we invoke go up2(∅). Suppose that {a, b} is obtained. bd+ 0 is set to {a, b}, and bdpnum := 1. Since bd+ is {c, d} and ∅ ∪ {d}(= {d}) and ∅ ∪ {c}(= {c}) are minimal hitting 0 sets of {{c, d}}, we invoke construct bdp(1, {d}) and construct bdp(1, {c}). 1
Actually, we can use this go up2 in “Dualize and Advance” algorithm as well.
262
K. Satoh and T. Uno
1. construct bdp(1, {d}) Since i == bdpnum and {d} satisfies p, we invoke go up2({d}). Suppose that {b, c, d} is obtained. bd+ 1 is set to {b, c, d}, and bdpnum := 2. + Since bd1 is {a} and {d} ∪ {a}(= {a, d}) is a minimal hitting set of {{c, d}, {a}}, we invoke construct bdp(2, {a, d}). a) construct bdp(2, {a, d}) Since i == bdpnum and {a, d} does not satisfies p, return to the caller. Note that {a, d} ∈ Bd− (p). Since a for-loop is finished return to the caller. 2. construct bdp(1, {c}) Since i = bdpnum, we go directly to 2. Since bd+ 1 is {a} and {c} ∪ {a}(= {a, c}) is a minimal hitting set of {{c, d}, {a}}, we invoke construct bdp(2, {a, c}). a) construct bdp(2, {a, c}) Since i == bdpnum, and {a, c} satisfies p, we compute go up2({a, c}). {a, c} is obtained. bd+ 2 is set to {a, c}, and bdpnum := 3. Since bd+ 2 is {b, d} and {a, c} ∪ {b}(= {a, b, c}) is a minimal hitting set of {{c, d}, {a}, {b, d}}, we invoke construct bdp(3, {a, b, c}). Note that {a, c} ∪ {d}(= {a, c, d}) is not a minimal hitting set of {{c, d}, {a}, {b, d}} and therefore construct bdp(3, {a, c, d}) is not invoked. i. construct bdp(3, {a, b, c}) Since i == bdpnum, but {a, b, c} does not satisfy p, return to the caller. Note that {a, b, c} ∈ Bd− (p). Since a for-loop is finished return to the caller. Since a for-loop is finished return to the caller. + + Therefore Bd+ (p) = {bd+ 0 , bd1 , bd2 } = {{a, b}, {b, c, d}, {a, c}}.
Note that {a, d} is obtained for M HS({{c, d}, {a}}), but since it is {a, d} ∈ Bd− (p), {a, d} is also for M HS({{c, d}, {a}, {b, d}}) according to Lemma 1. We can give an upper bound of the number of queries for the algorithm. The number of queries for each go up2(S) is at most |Π| and the number of times of calling go up2(S) is |Bd+ (p)|. If we find a new element in Bd+ (p), we add + + it as bd+ bdpnum and compute a minimal hitting set for {bd0 , ..., bdbdpnum }. Even − though this algorithm checks an element in Bd (p) only once, it is sufficient for computing Bd+ (p) thanks to Lemma 1. Therefore, the number of checks for which the algorithm encounters an element of Bd− (p) is |Bd− (p)|. Thus, we have the following theorem which states the correctness of the above algorithm and the upper bound of the number of queries for p. Theorem 1. The above algorithm outputs Bd+ (p) with at most |Bd− (p)| + |Bd+ (p)| · |Π| queries for p. Note that in the algorithm, we have to check the minimality of mhs ∪ {e}. In a naive way, we could do it by checking whether mhs ∪ {e} − {e } for every
Enumerating Maximal Frequent Sets Using Irredundant Dualization
263
e ∈ mhs is not a hitting set. By using Uno’s algorithm for computing minimal hitting sets[Uno02], we can do more efficient checking about this minimality. The algorithm in [Uno02] maintains the subset C of H such that any subset (an element of H) of C includes just one element of the current minimal hitting set. Subsets in C ensure the minimality of the hitting set, i.e., any element of the current minimal hitting set has to be included in a subset of C. When we add an element to the current minimal hitting set, we remove from C the subsets including the new element. Then, the minimality of the new hitting set can be done by checking whether any element of the hitting set is included in a subset of C. This is done in O(|Π|) time. We use this algorithm in the experiments. Note also that, the space complexity of the algorithm in Fig. 2 is O(ΣS∈Bd+ (p) |S|) since all we need to memorize is Bd+ (p) and once a set in Bd− (p) is checked, it is no longer necessary to be recorded. On the other hand, [Gunopulos97a] suggests a usage of Fredman and Khachiyan’s algorithm [Fredman96] which needs a space of O(ΣS∈(Bd+ (p)∪Bd− (p)) |S|) since the algorithm needs both Bd+ (p) and Bd− (p) at the last stage. We can also apply this result to learning a monotone DNF in a similar way to [Gunopulos97a]. Theorem 2. There is an algorithm to learn a monotone formula f for n propositions with at most |DN F (f )| + |CN F (f )| · n membership queries where |DN F (f )| and |CN F (f )| are the sizes of minimal DNF representation and minimal CNF representation of f respectively. The algorithm outputs both a DNF and a CNF representations of f . Unfortunately, the time complexity of the above algorithm may not be bounded by polynomial or quasi polynomial since the time complexity of dualization may take exponential time in |Π| whereas the time complexity of dualize and advance algorithm (if they use Fredman and Khachiyan’s algorithm for dualization) is O(tlogt ) where t = |Bd+ (p)| + |Bd− (p)|. However, in the next session, we empirically show that the computation time on average for random instances is in the order of the multiplication of the number of maximal frequent sets and the number of minimal infrequent sets.
4
Experiments
In this section, we show some computational experiments to evaluate the practical performance of our algorithm. In this experiment, we let p as f requency(S, T ) ≥ δ, that is, the original task of computing maximal frequent sets. All the examined problem instances were generated in a random manner. For any transaction of the instances, an element is included in the transaction with the same probability. To control the number and the average size of each maximal positive set, we set δ to 1. The maximal frequent sets of this setting actually corresponds with all maximal item sets in T in terms of set inclusion through frequency checks of an item set in T .
264
K. Satoh and T. Uno
The problem instances are classified into three groups: |Π| = 25, 100, and 400. For each group, we changed the number of transactions: 62, 250 and 1000. We also changed the probability so that we can control the average size of each maximal frequent set of a problem instance. In Tables 1, 2 and 3, we denote the average size of each maximal frequent set by |bd|, and the number of transactions by “#trans”. For each parameter, we generated 10 instances. |Bd+ (p)| and |Bd− (p)| are averages. Table 1. Problem name |Π| #trans |bd| |Bd+ (p)| |Bd− (p)| A1 25 62 3 49 440 A2 25 250 3 140 1505 A3 25 1000 3 378 3201 B1 25 62 12 58 6750 B2 25 250 12 221 32472 B3 25 1000 12 728 99770 C1 25 62 20 44 4514 C2 25 250 20 50 1178 C3 25 1000 20 17 27 Table 2. Problem name F1 F2 F3 G1 G2 G3 H1 H2 I1
|Π| #trans |bd| |Bd+ (p)| |Bd− (p)| 100 62 10 62 13369 100 250 10 249 94115 100 1000 10 993 438568 100 62 20 62 92832 100 250 20 250 1054813 100 1000 20 1000 9528606 100 62 30 62 590487 100 250 30 250 9896969 100 62 50 62 29280040 Table 3.
Problem name Y1 Y2 Y3 Z1 Z2
|Π| #trans |bd| |Bd+ (p)| |Bd− (p)| 400 62 20 62 85330 400 250 20 250 899584 400 1000 20 1000 6954018 400 62 40 62 746520 400 250 40 250 7175503
We examined our algorithm and a backtracking algorithm to evaluate the computation time, and the number of iterations and queries. For each parameter, we examined 10 instances and show the average result. Each instance was examined once. We stoped the computation if the execution time exceeded 10,000
Enumerating Maximal Frequent Sets Using Irredundant Dualization
265
sec, and wrote ’-’ on the corresponding cell of the table. These experiments had been done in a PC with a Pentium III 500MHz with 256MB memory, and the code was written in C. The results are shown in Table 4 to 6 as follows. “time” is the computation time (sec), “#query” is the number of the queries, and “#iter” is the number of iterations. The number of iterations of our algorithm is the number of calls of compute mhs, and the number of iterations of the backtrack algorithm is the number of subsets satisfying p (the number of frequent sets). Table 4. ours:time #query #iter backtrack:time #query #iter
A1 A2 A3 0.04 0.18 1.0 1126 3153 8299 572 1907 4467
B1 0.38 1202 11456
B2 4.0 4149 60268
B3 39 12302 214419
C1 0.25 750 12014
C2 0.17 646 4447
C3 0.07 195 140
0.007 0.02 0.08 2.3 7.3 27 65 165 609 2297 7229 19841 1778923 4759810 11538683 30287045 33484482 33554381 506 1594 4525 1286820 3307570 8090739 29345131 33401780 33554014 Table 5.
F1 F2 F3 G1 G2 G3 ours:time 1.2 15 184 6.8 139 3344 #query 6084 24269 96256 6053 24165 1013089 #iter 15096 106667 515798 109390 1241645 11625647 backtrack:time 1.4 7.8 27 – – – #query 3838932 23318373 104190659 – – – #iter 591524 3580526 12021442 – – – H1 H2 I1 ours:time 36 1232 1584 #query 6020 23960 5942 #iter 703055 12245554 37040987 backtrack:time – – – #query – – – #iter – – – Table 6. Y1 Y2 Y3 Z1 Z2 ours:time 14 203 3239 71 1521 #query 24718 99497 397354 24692 99370 #iter 91286 937355 7372595 782291 7868437 backtrack:time – – – – – #query – – – – – #iter – – – – –
In almost cases, our algorithm worked well. The backtrack algorithm was faster than our algorithm in few cases in which the average sizes of maximal
266
K. Satoh and T. Uno
positive sets are very small. In the other cases, our algorithm was quite faster than the backtrack algorithm. Particularly, our algorithm can solve several problems which the backtrack algorithm can not solve. The computation time of our algorithm is linear in |Bd+ (p)| · |Bd− (p)|. In almost cases, |Bd+ (p)| · |Bd− (p)| is in the range from 500,000× “time” to 2,500,000 × “time” (see Fig 4).
Fig. 4. Relation between |Bd+ (p)| × |Bd− (p)| and CPU time
We also checked the usage of the memory. We show three cases in Table 7 below. The column titled i/10 shows the number of the maximal positive sets which had been found in the first i/10 iterations. Roughly speaking, about 80% of maximal positive sets had been found in the first 20% iterations. This means that the algorithm takes almost necessary memory space at the beginning. This also means that if we stop the execution of our algorithm when the growing rate of the maximal positive sets becomes slow, the current maximal positive sets are a good approximation of the positive border. Table 7. Problem Name 1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10 B3 399 469 582 631 654 681 694 701 716 726 G2 215 243 249 250 250 250 250 250 250 250 Z1 57 61 62 62 62 62 62 62 62 62
5
Related Works
Although enumerating maximal frequent sets from a transaction data is proven to be NP-complete w.r.t. the size of transaction data [Boros02], it is important to pursue heuristics of the efficient computation of this problem since the problem is really important in knowledge discovery. Moreover, if we only allow a query
Enumerating Maximal Frequent Sets Using Irredundant Dualization
267
for p, by the result [Mannila96], we need at least |Bd+ (p)| + |Bd− (p)| queries. Therefore, the necessary number of queries |Bd+ (p)| + |Bd− (p)| · |Π| of our algorithm is closer to this lower bound. [Gunopulos97b] gives a randomized algorithm using the dualization where the randomness exists in computing some maximal frequent sets from a subset of Bd− (p). However, the algorithm always dualize all maximal frequent sets and this inherits a redundancy problem if they use a usual dualization algorithm. Note that our algorithm (as well as “Dualize and Advance” algorithm) is actually applicable not only to enumerating maximal frequent sets but also to any problem to enumerating maximal elements where the order relation has a monotone property. We currently investigate an application to computing a most preferable solution in soft constraints defined in [Satoh90] where maximal consistent sets of soft constraints are computed.
6
Conclusion
The contributions of this work are as follows. – We give an algorithm to enumerate maximal frequent sets using an irredundant dualization algorithm. – We give an analysis of complexities of the algorithm in that we show that the number of queries are at most |Bd− (p)| + |Bd+ (p)| · |Π| and the necessary space is O(ΣS∈Bd+ (p) |S|). – We empirically show that the computation time on average for random instances is in the order of the multiplication of the number of maximal frequent sets and the number of minimal infrequent sets.
References [Agrawal96] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I., “Fast Discovery of Association Rules”, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, (eds), Advances in Knowledge Discovery and Data Mining, chapter 12, pp. 307–328 (1996). [Bayardo98] Bayardo Jr., R. J., “Efficiently Mining Long Patterns from Databases”, Proc. of the 1998 ACM-SIGMOD, pp. 85–93 (1998). [Boros02] Boros, E., Gurvich, V., Khachiyan, L., and Makino, K., “On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets”, Proc. of STACS 2002, pp. 133–141 (2002). [Fredman96] Fredman, M. L. and Khachiyan, L., “On the Complexity of Dualization of Monotone Disjunctive Normal Forms”, Journal of Algorithms 21(3), pp. 618–628 (1996) [Gunopulos97a] Gunopulos, D., Khardon, R., Mannila, H. and Toivonen, H., “Data mining, Hypergraph Transversals, and Machine Learning”, Proc. of PODS’97, pp. 209–216 (1997). [Gunopulos97b] Gunopulos, D., Mannila, H., and Saluja, S., “Discovering All Most Specific Sentences using Randomized Algorithms”, Proc. of ICDT’97, pp. 215– 229 (1997).
268
K. Satoh and T. Uno
[Kavvadias99] Kavvadias, D. J., and Stavropoulos, E. C., “Evaluation of an Algorithm for the Transversal Hypergraph Problem”, Algorithm Engineering, pp 72–84 (1999). [Mannila96] Mannila, H. and Toivonen, T., “On an Algorithm for Finding All Interesting Sentences”, Cybernetics and Systems, Vol II, The Thirteen European Meeting on Cybernetics and Systems Research, pp. 973–978 (1996). [Satoh90] Satoh, K., “Formalizing Soft Constraints by Interpretation Ordering”, Proc. of ECAI’90, pp. 585–590 (1990). [Uno02] Uno, T., “A Practical Fast Algorithm for Enumerating Minimal Set Coverings”, SIGAL83, Information Processing Society of Japan, pp. 9–16 (in Japanese) (2002).
Discovering Exceptional Information from Customer Inquiry by Association Rule Miner 1
Keiko Shimazu1, Atsuhito Momma , and Koichi Furukawa2 1
Information Media Laboratory, Corporate Research Group, Fuji Xerox Co., Ltd. 430 Sakai Nakai-machi Ashigarakami-gun Kanagawa 259-0157 Japan {keiko.shimazu, atsuhito.momma}@fujixerox.co.jp 2 Graduate School of Media and Governance, Keio University 5322 Endo Fujisawa-shi Kanagawa 252-8520 Japan [email protected]
Abstract. This paper reports the results of our experimental study on a new method of applying an association rule miner to discover useful information from a text database. It has been claimed that association rule mining is not suited for text mining. To overcome this problem, we propose (1) to generate a sequential data set of words with dependency structure from a Japanese text database, and (2) to employ a new method for extracting meaningful association rules by applying a new rule selection criterion. Each inquiry was converted to a list of word pairs, having dependency relationship in the original sentence. The association rules were acquired regarding each pair of words as an item. The rule selection criterion derived from our principle of giving heavier weights to co-occurrence of multiple items than to single item occurrence. We regarded a rule as important if the existence of the items in the rule body significantly affected the occurrence of the item in the rule head. Based on this method, we conducted experiments on a customer inquiry database in a call center of a company and successfully acquired practical meaningful rules, which were not too general nor appeared only rarely. Also, they were not acquired by only simple keyword retrieval. Additionally, inquiries with multiple aspects were properly classified into corresponding multiple categories. Furthermore, we compared (i) rules obtained from a sequential data set of words with dependency structure, which we propose in this paper, and those without dependency structure, as well as (ii) rules acquired through the association rule selection criterion and those through the conventional criteria. As a result, discovery of meaningful rules increased 14.3-fold in the first comparison, and we confirmed that our criterion enables to obtain rules according to the objectives more precisely in the second comparison.
1 Introduction Recent studies on text mining, a research area in data mining, have been drawing attention among researchers [18]. This is due to the recent rapid increase in digital documents on the Internet [2, 7].
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 269–282, 2003. © Springer-Verlag Berlin Heidelberg 2003
270
K. Shimazu, A. Momma, and K. Furukawa
Our purpose is the extraction of important information from call center data. Call centers are now regarded as the most important interface for companies to communicate with their customers. Call center operators must respond to various requests from their customers without offending them. In addition, call center records are said to contain precious information for understanding the trends of customers. However, operators often overlook such hidden information in analyzing records because they tend to rely solely on static clustering measures and empirical keywords. Also, information that call center operators handle in their daily work routine is quite different from that of top management, who typically review information from global and longterm perspectives. Consequently, call center records are not fully utilized for discovering business opportunities or avoiding risks. This fact is called a “bottleneck” in electronic Customer Relationship Management (eCRM). In our experiments, we aimed to discover important information that could not have been extracted by a set of keywords specified by a domain specialist. In [16], we proposed a process for important information discovery. In this process, call center inquiry data were first converted into sets of items that consist of important words with syntactic dependencies (preprocessing). Then, meaningful item sequences were generated by selecting items necessary for capturing the meaning of the original sentences and sorting them accordingly (rule selection). In this paper, we demonstrate the effectiveness of these steps (i.e., preprocessing and rule selection) by confirming the increase of meaningful rules acquired from real-world call center inquiries. The remainder of this paper is structured as follows: Section 2 summarizes related work. Section 3 explains the framework that we adopted for important information discovery from text data. Section 4 reports the results of the experiments on the call center records, an analysis for which is given in Section 5. Section 6 provides a conclusion to this paper.
2
Related Work
2.1 Techniques for Keyword Discovery For discovery of important information and knowledge from large amounts of data, identification of important words that represent the content of each document is key [14]. While many systems that have been put into practice calculate a particular word’s importance based on its frequency [15], there exist methods that can get rid of such inherent shortcomings brought by the use of words’ frequencies. For example, Matsuo proposed an algorithm that extracts important words by excluding general words [7]. It first identifies frequent words within a document and then calculates co-occurrence frequency of each frequent word and other words and finally extracts words that have higher occurrence. This is an extension of KeyGraph [12] and obtains better results in some cases than the original. In addition, another study reports that nouns are more important than verbs [11] and that semantic relationship among phrases is more useful [5]. Still, another study claims that the objective of a sentence can be identified by analyzing expression at the end of it [21].
Discovering Exceptional Information from Customer Inquiry
271
Besides identifying important words, some researchers introduce a structure to represent relationship among these words. For example, Zaki proposed an efficient algorithm to obtain frequent trees in a forest consisting of ordered labeled rooted trees [23]. He also proposed an algorithm for inducing frequent sequential patterns in [22], where he reported experimental results of word sequential pattern acquision. In order to avoid a calculation explosion, we adopted important parts of the tree, not the whole tree, as items for rule acquisition. 2.2 Utilization of Dependency Information In general, the results of text data analysis are represented as either (1) grammatical structure among phrases and words, a research outcome of syntactical analysis for English sentences, or (2) dependencies among phrases (Figure 1). The former representation is useful when sentences are composed accurately conforming to the corresponding grammar and is often used for translation of newspaper articles and research papers into different languages, as well as generation of summaries. Meanwhile, word order is relatively flexible and components can often be inherently omitted in Japanese sentences. In particular, call center inquiries strongly possess these properties since they tend to be recorded as memorandums without grammar. The latter representation is suitable in representation of such properties. Previous studies that take advantage of dependency information include ones that utilize dependencies related to verbs [11] and ones that select dependencies related to words at the end of sentences, regardless of their part-of-speech information. However, given an inquiry shown in Figure 2, for example, it is difficult to capture it’s meaning precisely with information derived from these methods. Based on this consideration, we incorporated all dependency information into the sequential data. 2.3 Discovery of Exceptional Rules Apriori 4.03 [3] proposes new methods that effectively select a subset of a large amount of association rules by comparing prior confidence and posterior confidence.
Fig. 1. Representations of Text Data
272
K. Shimazu, A. Momma, and K. Furukawa
Fig. 2. Sequential Data Generation
It can be regarded as an application of Matsuo’s method [7] into association rule mining. In our experiment, we employed this method to extract interesting rules. In supervised learning, often employed as a data mining engine, positive examples and negative examples are given to acquire rules characterizing the target concept. Meanwhile, sometimes it is difficult to decide whether an example is a positive or a negative. Inoue et al. [6] proposed a novel-learning algorithm that is able to deal with incomplete information by means of extended logic programming. They claim that their method enables learning of default rules [13] including exceptions. Suzuki proposed a method to discover default rules and exception rules simultaneously, by regarding rules with high support and confidence as the defaults [20]. When a rule Y X is captured as a default rule, a related rule Z / X is identified and an exception X is acquired. Here, X refers to an atom with the same attributes as rule X,Z X and different attribute values, and / indicates that the left part is insufficient to explain the right part. J-Measure [17] is a criterion to select rules through prior and posterior confidence in classification problems. It first computes mutual information between prior and posterior probabilities and then averages this information over posterior probability, and subsequently adopts the average of information. On the other hand our method utilizes the difference of prior and posterior confidence as the criterion to select noteworthy instances, each of which is represented as an association rule. It should be noted that the averaging operation is not appropriate in our domain since association rules are concerned with instances rather than general rules with variables. 2.4 Text Mining from Call Center Information Representative applications of text mining methods into call center data include one performed by Nasukawa’s system that is based on natural language processing techniques [10]. It allows users to analyze call center data from various points of view such as categories of inquiries with similar contents and characteristics of inquiries that require longer handling times. This system has already been put into practice with practical features such as seamlessly analyzing inquiries regardless of their media type
Discovering Exceptional Information from Customer Inquiry
273
including voice over telephone and electronic mail contents. However, to the best of our knowledge, reports that discover clues for brand new business opportunities and ones that allow users to proactively avoid risks have not been published. In our experiment, we aimed to identify important information that cannot be discovered by simply utilizing keywords that call center operators customarily use in their daily operation.
3 Framework for Meaningful Information Discovery from Text With the purpose of identifying important information that cannot be obtained by methods such as keyword search and conventional text data classification techniques, we adopted the overall framework shown in Figure 3. The framework is based on propositions made by professionals who deal with a large amount of text data (inquiries from customers) in their daily work routine. They claim that important information can be intuitively and effectively discovered not by reading whole documents but by simply skimming sequences of important words within them. 3.1 Conversion to Sequential Data Figure 4 illustrates our proposed procedure to prepare sequential data from inquiries, consisting of two steps. The first step generates sequential data including word pairs with dependency relationship as items, and the second step omits unnecessary items to identify the meaning of original inquiries and attach the meaning to the sequential data. 3.1.1 Parsing and Dependency Information Attachment In our experiments, we obtained association rules by regarding words in the inquiries as basic components, and adopting a minimum difference between the prior confidence and the posterior confidence as the rule selection criterion, rather than the conventional threshold with minimum support and minimum confidence. We anticipated that our method would enable us to identify important information classes that cannot be obtained with conventional methods such as keyword search. As a first step, each sentence in the inquiries was segmented into words by a dictionary developed solely for the inquiries1. Then, sequential data were generated with dependency structure (step1 in Figure 4)2. We believe that this not only eliminates redundancy in interpretation, but also contributes to accurate meaning identification. Then, we converted inquiry records into sequences of combinations of two words, between which syntactical dependencies exist. For example, the sentence , which means I put papers in a tray. ,
1 2
Morphological Analyzer “ChaSen” [8] was employed for word segmentation. During word segmentation, meanings of auxiliary verbs were interpreted and were incorporated into the sequential data in order to accurately capture meanings of the original sentences.
274
K. Shimazu, A. Momma, and K. Furukawa
Fig. 3. Framework for Important Informaiton Discovery
is converted to ( ), ( ), ( ), ( ) , put), (a tray in), (in put), (paper put). which is like (I , which means Papers Similarly, sentences , which means I put paper in. , were put in a tray , and ), ( ) , which is like are individually converted to ( (paper put), (a tray put) , and ( ), ( ) , which is (I put), (papers put) . One can determine that these three sentences like have an identical meaning because they posses the same combination ( ) , which is like (papers put) . We believe that this method makes it possible to identify the meaning of a sentence reliably by excluding ambiguity in interpretation, even if various expressions for a single meaning exist. 3.1.2 Extraction of Meaningful Sequential Data Next, meaningful sequences of items (pairs of words where dependency relationship exists) were generated manually by selecting items necessary for capturing the meaning of the original sentences and sorting them accordingly. Then, following our new observation that sentence meaning can be obtained solely by browsing its association rules, association rules with bodies consisting of items and heads representing the class for the sequence were generated (step 2 in Figure 4). ( ), ( OS ), ( ) For example, while a rule Purchase Request of Product for a specific OS , which contains (since, use), Purchase Request of (change, use), (our company, common OS environment) Product for a specific OS can be made, it is excluded in this step since the items in the body are meaningless. In this step, our purpose is not to classify rules into multiple classes, but to identify the interpretation of each rule and generate clusters of rules with identical interpretation, examples of which include “Complaint on Operations and/or Functional Specification” and “Acknowledgement on Performance”.
Fig. 4. Preparation of Sequential Data
3.2 Important Pattern Discovery from Sequential Data
3.2.1 Important Information Acquisition with Prior and Posterior Confidence
In text mining, it is not always the case that frequent words are important, particularly when focusing on the contents [12]. Rather, we regarded a rule as important if the existence of the items in the rule body significantly affects the occurrence of the item in the rule head, and we applied this principle to rule selection. This criterion can be seen as a simplification of Matsuo's proposal [9], and it has already been incorporated in Apriori 4.03 [3]. A rule is selected if the difference between its prior and posterior confidences exceeds a given threshold. The prior confidence of an association rule is the confidence of a rule with the same head and an empty body; the posterior confidence is the confidence of the rule itself. For example, given an association rule {cheese, tomato} → {bread}, its prior confidence is the confidence of the rule { } → {bread} and its posterior confidence is the confidence of the rule itself (i.e., {cheese, tomato} → {bread}). Our assumption is that word co-occurrence is useful in extracting the meaning of sentences. Thus, we expected that meaningful and useful rules, even ones that differ from the common trends in the majority of the data, could be extracted by selecting rules whose difference between prior and posterior confidence is large.
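The following minimal sketch illustrates this selection criterion on market-basket-style transactions; the toy transactions are invented for illustration, and the 30% threshold echoes the one used later in Section 4.3.1.

```python
# Sketch: select association rules by the difference between posterior and
# prior confidence, as described above. The transactions are toy data.

def confidence(transactions, body, head):
    """conf(body -> head) = support(body and head) / support(body)."""
    body, head = set(body), set(head)
    covered = [t for t in transactions if body <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if head <= t) / len(covered)

def confidence_difference(transactions, body, head):
    posterior = confidence(transactions, body, head)   # conf(body -> head)
    prior = confidence(transactions, set(), head)      # conf({} -> head)
    return posterior - prior

transactions = [
    {"cheese", "tomato", "bread"},
    {"cheese", "tomato", "bread"},
    {"cheese", "milk"},
    {"tomato"},
    {"milk"},
]

diff = confidence_difference(transactions, {"cheese", "tomato"}, {"bread"})
if diff > 0.30:  # e.g., a 30% threshold on the confidence difference
    print(f"rule selected (difference = {diff:.2f})")
```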
3.2.2 Exception Rule Discovery by Default Rules We conducted experiments to verify our assumption that the rules derived by the method described in section 3.2.1 already include those obtained by Suzuki’s method. As a result, one meaningful exception rule was acquired by Suzuki’s method, which was one of the 20 rules that were obtained by our method.
4 Experiments on Call Center Records
4.1 Target Data
In this experiment, we used 626 inquiries about a specific product (hereinafter called "product A") received from April 1 to July 31, 2002. The same experiments were conducted on 725 inquiries about product A from August 1 to October 30, 2002, in order to identify differences in the results.
4.2 Meaningful Sequential Data Preparation
The inquiries were converted into sequential data consisting of words with all dependency information, referring to the dictionary dedicated to the data. The average number of items per inquiry is 14.9. After meaningful item sequences were obtained as described in section 3.1.2, each inquiry contains 7.1 items on average. The overall data contain 9,598 word occurrences, 1,950 distinct words, and 8,157 distinct items. When only dependencies on verbs were employed in sequential data generation, each inquiry contains 7.5 items on average.
4.3 Pattern Acquisition from Sequential Data
4.3.1 Meaningful Sequential Patterns Irrelevant to Frequency
In order to obtain important rules that are characterized by word co-occurrence, rules for which the difference between the prior confidence and the posterior confidence exceeds 30% were extracted (the minimum support was set to 0.02). Table 1 lists the 20 rule examples with the largest differences. Seven rules concern machine operations and/or functional specifications, including two questionnaires. In addition to three rules on purchase operations, inquiries on the compatibility among different operating systems and questions on performance, neither of which was extracted with the conventional rule selection criteria (i.e., a minimum support of 0.6 and a minimum confidence of 40%), were obtained. Four rules appeared under both rule selection criteria. The maximum and minimum numbers of inquiries matching each rule are 10 and 2, respectively, with an average of 4.6.
4.3.2 Exception Patterns
In order to compare our criterion with Suzuki's exception pattern extraction algorithm, we applied his algorithm to our domain. Treating rules obtained with the above
Table 1. Meaningful Rules Independent from Frequency
conventional selection criteria as the default rules, reference rules were searched for by switching attribute values (classification classes) in the head. Then, the two exception rules in Table 2 were obtained as rules whose confidence increases when the items in the reference rule's body are added to their bodies. The first rule turned out to have already been obtained in section 4.3.1. The confidence factors for the two acquired exception rules are 45.45 and 26.08.
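As a rough illustration of this search, the sketch below pairs each default rule with reference rules that share a different class in the head and reports a candidate exception rule when adding the reference body's items raises confidence in the switched head. This is one possible reading of the default/reference-rule scheme, not Suzuki's actual algorithm [19,20], and it reuses the confidence() helper from the previous sketch under the assumption that each labelled sequence is represented as a set containing its items together with its class label.

```python
# Sketch of the exception-pattern search described above; a simplified reading,
# not Suzuki's actual algorithm [19,20]. Each labelled sequence is a set that
# contains its items (word pairs) and its class label, so class labels can
# appear in rule heads of the earlier confidence() helper.

def find_exception_rules(sequences, default_rules, reference_rules, min_gain=0.0):
    """default_rules / reference_rules: lists of (body_item_set, class_label)."""
    exceptions = []
    for d_body, d_class in default_rules:
        for r_body, r_class in reference_rules:
            if r_class == d_class:          # reference rules switch the class in the head
                continue
            base = confidence(sequences, d_body, {r_class})
            extended = confidence(sequences, d_body | r_body, {r_class})
            if extended - base > min_gain:  # adding the reference body raises confidence
                exceptions.append((d_body | r_body, r_class, extended))
    return exceptions
```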
5 Discussion
5.1 Effectiveness of the Difference between Prior and Posterior Confidence
We compared the association rules filtered by our proposed rule selection criterion, the difference between prior and posterior confidence, with those selected by the conventional rule selection criteria, support and confidence (Table 3 and Table 4). In this experiment, we utilized the word sequences of the inquiries in order to demonstrate that our rule selection criterion contributes to meaningful rule acquisition independently of the preprocessing (i.e., sequential data generation with dependency information). Among the roughly 10,000 rules acquired using our proposed difference between prior and posterior confidence, more than 700 rules were found to be meaningful. In contrast, when the conventional criteria were applied, only 300 rules were acquired, of which 12 were meaningful, with the confidence threshold lowered to avoid a computational explosion. We believe that these results strongly support the argument that conventional association rule selection techniques are not directly applicable to text mining [2], while indicating the effectiveness of our method.
5.2 Effectiveness of Preprocessing
In addition to the sequential data with dependencies, we obtained meaningful rules from sequential data without dependencies and compared the two. With a target of acquiring approximately 10,000 rules in each experiment, we obtained the results summarized in Table 5. By incorporating dependency information, the number of meaningful rules increased 14.3-fold, from 741 to 10,568, and the coverage of the 626 inquiries by meaningful rules also improved from 7% to 95%. Thus, we claim that our proposed preprocessing, consisting of dependency information addition and meaningful item sequence selection, certainly contributes to effective rule acquisition.
Table 2. Exception Patterns
Table 3. Rules with Our Rule Selection Criterion
Table 4. Rules with Conventional Rule Selection Criteria
5.3 Meaningful Information Acquisition
Within the heads of the association rules obtained with the conventional selection criteria, "Machine Operation/Functional Specification" appears in half of the rules. "Pre-purchase Information Request/Purchase Procedure", "Operational Scheme", and "Complaint" each appear in two rules. Assuming that this is the overall trend
of the inquiries, this corresponds to the trend reported in the monthly reports periodically prepared by a call center staff member. In other words, these facts can already be obtained by accessing the inquiry database on the intranet. On the other hand, our experiments revealed previously unrecognized facts: (1) inquiries on the usage of files created by product A and attached to e-mail are recognizable among those on "Machine Operation/Functional Specification", and (2) inquiries and claims on the usage and specification of product A's new version, released within the past year, are pronounced. These facts show that our method is able to capture keywords that tend to be overlooked, buried in a pile of information. Among the association rules derived by the novel selection criterion, six of the 20 rules concerned file download operations from homepages. By closely observing these rules, we found that users are perplexed when selecting the proper connection protocol (i.e., HTTP or FTP). This fact, too, was previously unrecognized in the daily compiling operations. In addition, the fact that users of a particular scanner tend to issue complaints, expressed by a rule whose body is glossed as (manufactured by X, scanner), (scanner, use), (Ver10.2, use) and whose head is Complaint/Dissatisfaction/Uncertainty, was also overlooked by the professionals. Note that this rule was also obtained by the exception rule discovery method [20]. In general, relevant keywords are not provided in advance when extracting noteworthy information from a large amount of data. Thus, keywords with low frequency tend to be forgotten and not used in text retrieval. Conversely, we believe that our method is effective in discovering noteworthy trends, independent of the overall trends of the inquiries.
5.4 Effectiveness of Exception Rules
The exception rules obtained by applying Suzuki's algorithm (see Section 4.3.2) consist of one rule that was also acquired by our proposed method and one meaningless rule. From the other data set (i.e., from August to October), no exception rule was obtained. In the exception rule discovery, a rule selection criterion similar to the one based on prior and posterior confidences was employed. In the example on medical data shown in [20], rules whose support and confidence exceed 20% and 75% were selected as the default rules. The maximum confidence of the reference rules is 50%, and rules whose support and confidence exceed 3.6% and 80% were identified as the exception rules. In our experiments, lower thresholds were employed for identifying frequent patterns with the conventional rule selection criterion, and the minimum difference between prior and posterior confidences was set to less than 50% in the exception rule extraction, since the thresholds given in [20] did not provide successful results. We believe that this is caused by the difference in target data. The call center inquiries in our experiments differ from the data in [20] in that various expressions for the same status exist, and these were processed as different items.
Table 5. Effects of Incorporating Dependencies
For example, a rule glossed as (Ver10.2, purchase), (xxxx, yyyy) → Purchase Request and a rule glossed as (Ver10.2, buy), (xxxx, yyyy) → Purchase Request were individually obtained as distinct rules, because the two Japanese expressions for purchasing were treated as different items. Due to this fact, both support and confidence tend to be low. We consider that the key reason for the low support and confidence lies in the characteristics of the target data (total word occurrences: 9,598; distinct words: 1,950; distinct items: 8,157), and that improvements in the dictionary (thesaurus) used for preprocessing would allow higher thresholds to be employed and lead to more useful results. However, we also assume that the complete elimination of variations in transcription is impossible as long as we deal with "natural language". In addition, association rules for medical data tend to employ definite situations such as "recovery" or "death" in the rule head, while the default rules for call center data change dynamically over time. We believe that, in domains with such volatile default rules, our method is able to obtain important data without omission better than the exception rule discovery method.
6 Summary
In this paper, we conducted experiments to identify meaningful information by applying an association rule acquisition algorithm to text mining. The distinctive features of our method are that (1) dependencies among words are incorporated into the sequential data, and (2) association rules are selected based on the difference between the prior confidence and the posterior confidence. As a result, meaningful classes covering relatively few pieces of data were successfully extracted. When applying the exception rule discovery method based on non-monotonic reasoning to the same data set, we could not obtain effective results because of its strict rule selection thresholds. Conversely, our method, with a relatively relaxed selection criterion, was able to acquire useful rules. In addition, we indicated the following two reasons for the effectiveness of our proposal, in which sequential data with dependency information are generated during preprocessing: (1) it preserves the meaning of the raw data, and (2) it is applicable to a conventional data mining technique (i.e., association rule acquisition). In particular, reason (1) confirms a suggestion posed by a domain professional, and we expect that our method will help lighten the arrangement and reporting workload of call center staff members. Meanwhile, Zaki claims that the meaning of text can be preserved in text
mining by taking word order into consideration during preprocessing [23]. We are concerned that incorporating word order may cause further diffusion of the association rules (i.e., lower support and confidence); we will closely examine this issue. To enhance our method for selecting exception rules from association rules, we intend to adopt Suzuki's proposal in [19] as the next step. It dynamically manipulates four different rule selection thresholds according to the number and importance of the rules tentatively acquired. He claims that this makes it possible to acquire noteworthy rules in accordance with the profile of the application domain's database. In this experiment, we found a rule that can be seen as a clue for proactive risk avoidance (issues caused by the combination of a specific scanner and a software product). However, no such rule was obtained from the other data set. We suppose that a novel method must be developed to identify information that can be used as a predictor of future events.
Acknowledgements. We express our deep gratitude to Mr. Takemi Yamazaki, who has provided invaluable support and consideration in our research and experiments. In addition, we thank Mr. Yohei Yamane and Mr. Tetsushi Sakurai for their devoted assistance in our experiments. Further, we would like to take this opportunity to thank the reviewers and the program committee members for their beneficial and indispensable advice.
References
1. Agrawal R.: Fast Algorithms for Data Mining Applications. In Proceedings of the 20th International Conference on Very Large Databases (Santiago, Chile, 1994), 487–489
2. Arimura H., Abe J., Fujino R., Sakamoto H., Shimozono S., and Arikawa S.: Text Data Mining: Discovery of Important Keywords in the Cyberspace. In Proceedings of the Kyoto International Conference on Digital Libraries 2000 (Kyoto, Japan, 2000), 121–126
3. Borgelt C.: Apriori: Finding Association Rules/Hyperedges with the Apriori Algorithm. http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/
4. Hearst M. A.: Untangling Text Data Mining (invited paper). In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (College Park, MD, 1999)
5. Hisamitsu T., Niwa Y., and Tsujii J.: A Method of Measuring Term Representativeness – Baseline Method Using Co-occurrence Distribution. In Proceedings of the 18th International Conference on Computational Linguistics (Saarbrücken, Germany, July 2000), 320–326
6. Inoue K., and Kudoh Y.: Learning Extended Logic Programs. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (Nagoya, Japan, August 1997), 176–181
7. Lawrence S., and Giles L.: Searching the World Wide Web. Science, 280, 5360 (1998), 98–100
8. Matsumoto Y., Kitauchi A., Yamashita T., Hirano Y., Matsuda H., Takaoka K., and Asahara M.: Morphological Analysis System ChaSen version 2.2.1 Manual. http://chasen.aist-nara.ac.jp/chasen/doc/chasen-2.2.1.pdf
9. Matsuo Y., Ohsawa Y., and Ishizuka M.: KeyWorld: Extracting Keywords in a Document as a Small World. In Proceedings of the Fourth International Conference on Discovery Science (Washington, D.C., 2001), 271–281
10. Nasukawa T., and Nagano T.: Text Analysis and Knowledge Mining System. IBM Systems Journal 40, 4 (Winter 2001), 967–984
11. Nagano T., Takeda K., and Nasukawa T.: Information Extraction for Text Mining. In IPSJ SIG Notes FI60-5 (2000), 31–38 (in Japanese)
12. Ohsawa Y., Benson N. E., and Yachida M.: KeyGraph: Automatic Indexing by Co-occurrence Graph Based on Building Construction Metaphor. In Proceedings of the 5th Advanced Digital Library Conference (Santa Barbara, CA, April 1998), 12–18
13. Reiter R.: A Logic for Default Reasoning. Artificial Intelligence 13, 2 (1980), 81–132
14. Sakurai S., Ichimura Y., Suyama A., and Orihara R.: Inductive Learning of a Knowledge Dictionary for a Text Mining System. In Proceedings of the 14th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (Budapest, Hungary, June 2001), 247–252
15. Segal R., and Kephart J.: MailCat: An Intelligent Assistant for Organizing E-Mail. In Proceedings of the 3rd International Conference on Autonomous Agents (Seattle, WA, May 1999), 276–282
16. Shimazu K., Momma A., and Furukawa K.: Experimental Study of Discovering Essential Information from Customer Inquiry. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 2003)
17. Smyth P., and Goodman R. M.: An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering 4, 4 (1992), 301–316
18. Smyth P., Pregibon D., and Faloutsos C.: Data-driven Evolution of Data Mining Algorithms. Communications of the ACM 45, 8 (2002), 33–37
19. Suzuki E.: Scheduled Discovery of Exception Rules. In Proceedings of the Second International Conference on Discovery Science (Tokyo, Japan, December 1999), 184–195
20. Suzuki E., and Tsumoto S.: Evaluating Hypothesis-Driven Exception-Rule Discovery with Medical Data Sets. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (Kyoto, Japan, April 2000), 208–211
21. Tanabe T., Yoshimura K., and Shudo K.: Modality Expressions in Japanese and Their Automatic Paraphrasing. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (Tokyo, Japan, 2001), 507–512
22. Zaki M. J.: Efficient Enumeration of Frequent Sequences. In Proceedings of the Seventh International Conference on Information and Knowledge Management (Bethesda, MD, November 1998), 68–75
23. Zaki M. J.: Efficiently Mining Frequent Trees in a Forest. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, July 2002), 71–80
Automatic Classification for the Identification of Relationships in a Meta-data Repository Gerd Beuster, Ulrich Furbach, Margret Gross-Hardt, and Bernd Thomas Universität Koblenz-Landau, Institut für Informatik {gb,uli,margret,bthomas}@uni-koblenz.de
Abstract. For a large company, a prototype for the automatic detection of similar objects in database systems has been developed. This task has been accomplished by transforming the database object classification problem into a text classification problem and applying standard classification algorithms. Although the data provided for the task did not look promising due to the small number of positive examples, the results turned out to be very good.
1 Introduction
Large companies manage huge amounts of data (i.e., data about their customer base, their suppliers, products, etc.). Usually, there are many databases and applications that store and provide these data. Since these databases have been developed and managed independently, they are often heterogeneous in logical structure, attribute naming, and semantics. Nowadays, companies face the need for an integrated view of their data. That is, they want to understand the relationships between data in different databases or between applications using the same database. Detecting similar objects and relationships between objects is a crucial integration task in enterprise application integration and business-to-business applications. In order to achieve these goals, heterogeneities within the data stored in different databases have to be recognized or even resolved. There are various approaches to dealing with heterogeneous data sources. Some are more tightly coupled and define common views on multiple data sources, whereas more loosely coupled approaches maintain the autonomy of distributed databases [10,2]. A recent development is the creation of meta-data repositories to manage meta-data about systems, databases, and the data therein [4]. Meta-data repositories play the role of information brokers and provide applications and users with the information necessary to determine dependencies between data sources. A meta-data repository contains information about objects, tables, and relationships in the various databases used in a company. It uses this information to analyze business processes, to provide information about marketing campaigns, etc. The amount and accuracy of the meta-data is critical for the quality of the meta-data repository. Since most databases used in a company are developed independently of each other, information about the same real-world entity (e.g., customer name and address information) is stored multiple times at different places, under different names, and in different formats. In order to make the conglomerate of different database structures manageable and to avoid inconsistencies, these kinds of dependencies should be detected and stored in the meta-data repository.
The success of a meta-data repository is based on the accuracy and completeness of its data, and it has to be maintained continuously. This requires a lot of additional work, because meta-data is usually maintained manually. When a set of new objects is to be added to the meta-data repository, an expert uses her knowledge about the meta-data repository and the new objects in order to add them.
2 Motivation
In order to reduce the amount of work needed to maintain a meta-data repository, the company wizAI did a study on how this process can be automated. To illustrate the basic problem and our approach, consider Figure 1. It is important to point out that it is not the table rows or attribute names used within an application that are stored as attribute values of a meta-data object instance. Rather, the administrative repository user determines the attribute values of the meta-data object describing an application, or certain data of interest stored in this application. This is also the reason why the number of relationship types (reference types) and the actual attribute values are not fixed: a meta-data repository is in almost all cases under constant change. In our case, the company is running more than 150 different applications, where each application's data scheme can change from time to time. Though we can assume that the meta-data classes/objects are fixed with respect to their attributes, we do not know very much about some of the possible attribute values, except their data type. For example, attributes like description can take up to 2000 characters of freely formulated text, whereas the values of the attribute object_type_id are only known from the already inserted instances of the class object_type_id (not shown in the figure). For example, object_type_id 1 stands for "application + database". Although this might yield important additional information, in a first step we focused only on the information given by the attribute values of the meta-data objects. Thus, we turned our attention to only two data sources of the given meta-data repository, the object instances and the reference instances, in order to develop an approach for the automatic identification of relationships for objects that have to be added to the repository. Besides the two mentioned classes, there are additional instances of classes like reference_type_id, object_type_id, attribute_type, etc. stored in the meta-data repository.

Fig. 1. Sketch of a meta-data repository (OBJECT and REFERENCE instances describing two applications and the reference between them)
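For concreteness, the sketch below mirrors the OBJECT and REFERENCE fields visible in Figure 1 as plain data classes. The field names follow the figure, but the class names are our own and the sample values are only loosely reconstructed from the figure, not the repository's actual schema.

```python
# Sketch of the two meta-data classes used in this study; field names follow
# Figure 1, sample values are approximate, and only a subset of REFERENCE
# fields is shown. Not the repository's actual schema.
from dataclasses import dataclass

@dataclass
class MetaObject:
    object_id: int
    it_id: str
    name: str
    object_type_id: int
    valid_from: str
    valid_to: str
    creator: str
    description: str = ""

@dataclass
class Reference:
    reference_typ_id: int
    objekt_id_owner: int      # spelling as shown in the figure
    object_id_member: int
    valid_from: str
    valid_to: str
    creator: str

repo = MetaObject(47, "ZEBRA", "Repository ZEBRA", 1, "1900-01-01", "3000-01-01", "BREX")
pps = MetaObject(45, "PPS-Z", "PPS-Z", 282, "1900-01-01", "3000-01-01", "BREX")
ref = Reference(83, repo.object_id, pps.object_id, "1900-01-01", "3000-01-01", "JAWNCCE")
```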
The basic idea of the system is to automatically learn classifiers that are able to determine whether certain meta-data objects are in a relationship to each other, and if so, of which type these relationships are. From a conceptual point of view this appeared to be a standard classification problem as known in the machine learning area, but a closer look at the non-finite, non-disjunctive attribute values encouraged us to start first experiments with our readily available text classification system MIC [1]. It turned out that, after some transformations of the input data, the standard text classification methods were able to identify dependencies between meta-data objects with very promising results. Thus, within the study we developed a prototypical system that aids the user in adding objects, together with the right relationships to objects already stored, to the meta-data repository. Whenever a new object has to be added to the repository, the automatic classification system lists potential relationships of the object to other objects. Among these proposals, the user decides which of the system's suggestions are correct. This user task requires a lot less expert knowledge and time than completely manual maintenance.
3 Problem Definition
The classification task was to identify relationships between meta-data objects. We were provided with a list of relationship types, with descriptions of meta-data objects, and with some information about relationships between objects, as shown in Figure 1. We will use the following terms: an object is an entity of a domain, e.g., a system, data model, business object, etc. Let O1, ..., On be objects. Each object has a number of characteristics A1, ..., Am called attributes. An object is characterized by its attributes; we write this characterization as <Oi.A1, ..., Oi.Am>. Objects of the same kind are of the same object type, and the type of an object is an attribute of the object. In the following, we deal with typed relationships R over a set of objects S = {O1, ..., On}. R is a subset of {(Oi, Oj, t) | Oi, Oj ∈ S, 1 ≤ t ≤ n}. We call the third element of the triple the type of the relationship; n is the total number of relationship types. We deal only with typed relationships in the rest of this paper. It should be noted that the database objects are descriptions of data types, applications, business objects, etc., and not actual instances of database entries. A meta-data repository is a set of objects together with a set of typed relationships over these objects: M = (S, R).

Fig. 2. Classification system scenario 1 (the attributes of two objects are fed to one classifier per relationship type; post-processing yields the list of possible relationships for the two objects)
To support the user in identifying relationships between database objects, two typical usage scenarios were identified. The first scenario models the situation where the user has identified two objects which might be in one or more relationships to each other. The system should make suggestions about the potential kinds of relationships between the objects. Since the number of relationship types is fairly large (in the example data provided by the client, there were about 100 relationship types), an aid of this kind can significantly speed up the process of updating the meta-data repository with new data. This can be formally defined as follows:
Scenario 1. Given two objects O1 and O2 and a meta-data repository M = (S, R), find all potential relationships P that might hold between O1 and O2, and present them to the user. The user selects P' ⊆ P. An extended repository M' is created by adding these relationships: M' = (S ∪ {O1, O2}, R ∪ P').
In the following we assume independence between the relationship types. This means that, in order to decide whether a given object is in a certain relationship with some other object, we do not take into consideration whether the object is in some other relationship with the same or a third object. Therefore, we can treat each relationship type separately. This allows us to reduce the problem of scenario 1 as follows: given two objects, decide whether they are in the given relationship or not. Thus, the input for the classifier is a tuple of two objects (O1, O2), and the output a binary value indicating whether the two objects are in the relationship with each other or not. It would also be possible to give a confidence value indicating how likely it is that the two objects are in the given relationship. For scenario 1 (see Figure 2) the objects are combined and transferred into a feature representation. This feature vector is then classified by each of the classifiers for the various relationship types. Note that these classifications can be done in parallel for all relationship types. The classification results of all classifiers are then presented to the user.
In scenario 2, the user presents a single object to the system, and the system returns a list of other objects from the meta-data repository together with the potential relationships between the given object and the objects found by the system. Formally this is defined as follows:
Scenario 2. Given an object O1 and a meta-data repository M = (S, R), find all potential relationships P that might hold between O1 and some O2 ∈ S, and present them to the user. The user selects P' ⊆ P. An extended repository M' is created by adding these relationships: M' = (S ∪ {O1}, R ∪ P').

Fig. 3. Classification system scenario 2 (a first classification step selects promising relationship types from the attributes of object 1; candidate objects drawn from the meta-data repository are then run through scenario 1 for those types)

There is an obvious relationship between scenario 2 and scenario 1: in scenario 2, each object from the meta-data repository is combined with the new object and run through scenario 1. But following this naïve approach is not advisable for two reasons:
Since the new object would have to be checked against all existing objects for all relationship types, with N objects and R relationship types in the meta-data repository, N × R classifications would be necessary for each new object. This would unnecessarily increase the complexity of the classification process. For this reason, the classification process in scenario 2 is split into two parts (depicted in Figure 3): in the first step, the object is used as input to a relationship type classification process. In the second step, the classification system draws each of the objects from the meta-data repository (in Figure 3 shown as object 2) and feeds the combination of the two objects into all the classifiers for the relationship types from step one (scenario 1). Since the new object has to be checked against all other objects, the number of classifications to compute is still proportional to the number of objects already in the meta-data repository. The difference is that the new object is not checked for all relationship types with all other objects, but only for a reduced set of relationship types which look promising. Thus, for scenario 2 we learn n binary classifiers, one for each relationship type, and the classification information they provide in step one is then used for the selection of potential relationship partners for the given object.
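The following sketch outlines this two-step procedure under the simplifying assumptions that `type_classifiers` maps each relationship type to a classifier over a single object's features and `pair_classifiers` to a classifier over a combined object pair, as in scenario 1; the function and variable names are illustrative, not the prototype's actual interface.

```python
# Sketch of scenario 2 as described above: step 1 narrows down promising
# relationship types from the new object alone, step 2 runs scenario 1 only
# for those types against every object already in the repository.
# `featurize` turns one or two objects into a feature vector (see Section 4).

def scenario2(new_object, repository_objects, type_classifiers, pair_classifiers,
              featurize):
    # Step 1: which relationship types look promising for this object?
    single = featurize(new_object)
    promising = [t for t, clf in type_classifiers.items() if clf.predict(single)]

    # Step 2: scenario 1, restricted to the promising relationship types.
    suggestions = []
    for other in repository_objects:
        pair = featurize(new_object, other)
        for t in promising:
            if pair_classifiers[t].predict(pair):
                suggestions.append((other, t))
    return suggestions  # presented to the user for confirmation
```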
4 Transforming the Problem into a Text Classification Problem
Section 2 already gave a sketch of our approach for identifying relationships in a meta-data repository, but so far the only argument for using a text classification system for this task was its availability. A closer look at the data in the repository and the properties of the object attribute values suggests that a text classification system is, in general, a good choice. Since standard classification methods take vector representations of the objects to be classified as inputs, it would be reasonable to use the vector representation of the meta-data objects (Section 3) directly as the input data for classification algorithms. But this representation is not well suited for our classification task, because the nature of the attribute values (non-disjunctive and non-finite sets of possible values, as illustrated in Section 2) makes it hard or nearly impossible to find a binary encoding following the standard transformation that uses one binary attribute for each possible value of each attribute. Therefore, the main idea of our approach is to assume that database objects are similar to text objects and thus can be handled by our system for automatic text classification (MIC [1]). We used a well-known method from text classification: the attribute values of objects are treated as texts. For the classification tasks in scenario one and the second step of scenario two, the input data for the classifiers are combinations of two objects; in these cases, the textual representations of the two objects are concatenated. The textual representation of an object is created by treating attribute values as strings and concatenating all attributes of the object, with each attribute string separated by a whitespace character. As with standard classification, text classification also requires a text to be transformed into a vector representation. We use the relative frequency of words in a text as the input feature vector. For each word in the vocabulary of the training data set, the relative frequency of the word in the textual representation of the object is calculated. This is the number of appearances of the word in the text, divided by the total number of
words in the text. This is a common method for text representation, described e.g. in [6, page 183]. Since we need fixed vector sizes for some of the classifiers, only the n = 100 most informative words for the classification task, determined according to Shannon’s formula [9], are used in the vector representation of the text.
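A minimal sketch of this representation step is given below. It assumes objects are given as dictionaries of attribute values and uses a generic information-gain ranking as the "most informative words" criterion, which is our reading of the Shannon-based selection; the helper names are illustrative rather than MIC's actual API.

```python
# Sketch: turn a meta-data object (or a pair of objects) into a fixed-size
# vector of relative word frequencies over the k most informative words.
# Objects are assumed here to be dicts mapping attribute names to values.
from collections import Counter

def textual_representation(*objects):
    """Concatenate all attribute values of the given objects, whitespace-separated."""
    return " ".join(str(v) for obj in objects for v in obj.values())

def relative_frequencies(text, vocabulary):
    words = text.split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocabulary]

# A function such as select_informative_words(texts, labels, k) would rank the
# vocabulary by an entropy-based (information gain) score and keep the top
# k = 100 words; its exact form is an assumption here.

def featurize(vocabulary, *objects):
    return relative_frequencies(textual_representation(*objects), vocabulary)
```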
5 Learning and Results
For this project, we used three standard classification methods: Naïve Bayes classifiers [5], ID-3 decision trees [7], and fully connected feed-forward, one-hidden-layer, backpropagation neural networks [8]. We assume the reader to be familiar with these machine learning methods.
Two data sets were provided for this classification task. The first data set contained 3579 objects, 79 different relationship types, and 18605 relationships between objects. The second, considerably smaller data set contained 748 objects, 92 different relationship types, and 1080 relationships between objects. Since we used separate classifiers for each of the relationship types, we had on average 235 instances of relationships between two objects for the first data set, and 12 instances of relationships between two objects for the second data set. Since there were on average 235 relationship instances (data set one) resp. 12 relationship instances (data set two) for each relationship type, we had ≈ 6.57% resp. ≈ 1.60% positive examples in the data sets. This got even worse for scenario 1, because each object can be combined with every other object. For the first data set, there are 6402831 combined objects; thereby, the number of positive examples was reduced drastically to ≈ 0.004%. For the second data set, we had 279378 combined objects, resulting in a ratio of ≈ 0.004% positive examples. This drastic disproportion between positive and negative examples did not allow us to use automatic classification algorithms successfully on these data sets. Therefore we used two additional methods to alleviate the disproportion in the data sets:
Only objects of the same type: A constraint of the meta-data repository is that only objects of the same type can be in a given relation. There are 67 object types in the first data set, and 74 object types in the second data set.
Restricted number of negative examples: The number of negative examples used for the training of classifiers was limited to 500 (S1), 2000 (S2), and 5000 (S3) examples.
For training and testing, the positive data was split randomly into two halves and negative examples were added to the training and testing data sets until the desired size
(S1 to S3) of the data sets was reached. Negative examples were selected according to the following classes for scenarios 1 and 2:
N3. Scenario 1: the set of all object pairs, minus the positive examples. Scenario 2: the set of all objects not in the set of positive examples.
N2. Scenario 1: object pairs that are in a relation, but not in the relation the classifier is trained for; N2 is a subset of N3. Scenario 2: objects that are in a relation, but not in the relation the classifier is trained on.
N1. Scenario 1: only object pairs of the same object types as the objects in the positive examples; N1 is a subset of N2. Scenario 2: objects of the same type as the objects in the positive examples.
For each scenario, type of object selection, and example size, the quality of the classifier was calculated. Since we assume independence between relationship types, these calculations were done independently for each relationship type (see Section 3). Results were accumulated over all relationship types. Figure 4 shows the accumulated classification results of the learned decision tree classifiers, where e is the total number of examples in the testing data; tp is the number of positive examples in e; precision is the percentage of correctly classified positive examples among all examples that were classified positive; recall is the percentage of correctly classified positive examples among all positive examples; F1 [3] is the harmonic mean of precision and recall; fp is the number of negative examples which have been classified wrongly as positive; and fn is the number of positive examples which have been classified wrongly as negative.

Fig. 4. Decision Tree results
scenario 1
test    e      tp   precision  recall  F1     fp   fn
S1,N2   8997   557  98,41      97,55   97,98  9    14
S2,N2   18287  549  79,57      78,54   79,05  124  150
S3,N2   18090  525  75,54      83,20   79,19  170  106
scenario 2
S1,N1   2193   660  94,82      95,65   95,23  36   30
S1,N2   9024   698  96,67      84,61   90,24  24   127
S2,N1   2254   671  93,32      92,94   93,13  48   41
S2,N2   10974  718  97,28      81,13   88,47  20   167
S3,N1   2218   645  93,89      96,56   95,21  42   23
S3,N2   11026  672  97,39      85,50   91,06  18   134

Some test constellations were not applicable, because for some classes of examples the data set is empty. In some cases, for a given relationship type, there are no objects O1 and O2 such that O1 and O2 are of the same object type as the objects in the set of positive examples and in some relationship to each other, but not in the relationship that should be learned (e.g., test scenario 1 and setting [S1,N1]). Figure 5 shows the average of the F1 values for each classification method and the two scenarios.

Fig. 5. Average of F1 results of all tests
            d.t.    n.b.    n.n.
scenario 1  85,40   70,25   84,02
scenario 2  92,22   90,69   92,13
(d.t., n.b., and n.n. denote the decision tree, naïve Bayes, and neural network classifiers, respectively)
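To make the evaluation in Fig. 4 explicit, the snippet below accumulates counts over all relationship types and computes precision, recall, and F1. It assumes the standard definitions with the number of correctly classified positives equal to tp − fn under Fig. 4's column conventions; the function names are ours, not the study's code, and small rounding differences from the published numbers are possible.

```python
# Sketch: accumulate per-relationship-type counts and compute the metrics
# reported in Fig. 4. Assumes tp counts positive test examples, so the
# correctly classified positives are tp - fn.

def accumulate(per_type_counts):
    """per_type_counts: iterable of dicts with keys e, tp, fp, fn."""
    total = {"e": 0, "tp": 0, "fp": 0, "fn": 0}
    for c in per_type_counts:
        for k in total:
            total[k] += c[k]
    return total

def metrics(counts):
    correct_pos = counts["tp"] - counts["fn"]
    precision = correct_pos / (correct_pos + counts["fp"])
    recall = correct_pos / counts["tp"]
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with the accumulated S1,N2 row of scenario 1.
print(metrics({"e": 8997, "tp": 557, "fp": 9, "fn": 14}))
```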
6 Conclusion
Existing machine learning techniques for text classification proved to be surprisingly robust when applied to a very different task. The data provided for this special classification task was not well suited for three reasons: it was not actual text, but descriptions of meta-data objects; and there was a blatant disproportion between positive and negative examples. Using S1 for scenario 1 and arbitrary settings for scenario 2, all precision and recall values were between 71% and 98,5%, which are promising results given that standard learning algorithms were used for this task. It should be noted that a good precision value alone does not prioritize one learning technique over the other. In the scenarios illustrated in this paper, potential relationships are presented to a human user who decides whether the system's suggestions are
correct (there is indeed a relationship) or not (there is no relationship). In these scenarios, false negatives (existing relationships that are not detected) are a lot worse than false positives (non-existing relationships that are wrongly detected). False negatives literally get lost: they are not presented to the user and therefore cannot be added manually. On the other hand, false positives are presented to the user, so she can remove them. Based on these considerations, we can conclude that the optimization that restricts the negative example sets according to N1 yields the overall best results (recall values increased by 3% to 11% for all three techniques). Obviously, this also results in precision values decreased by up to 10%, which is acceptable given the improved recognition of positive examples. Comparing all three techniques by their F1 values (Figure 5), the decision tree classifiers showed the best classification results. Using the presented classification system reduced the necessity for human intervention drastically. Still, the system is not completely autonomous: although the error rate is fairly low, it is still advisable and necessary to let a human review the suggestions of the system. Besides reducing the amount of human intervention, the human supervisor also needs less expert knowledge than a human classifying database objects unaided. When a decision tree algorithm is used for classification, the researcher can additionally gain insights into the structure of the data and may develop meta-knowledge about what kinds of objects are in relationships. We think the approach presented in this paper can be improved further by changing the representation of the objects. So far, we treat them as plain text, ignoring all structural information (e.g., additional information about relationship types, object ids that are present in the meta-data repository, other relationships an object has). We expect improved behavior from better representations which preserve the structural information of the objects.
References
1. G. Beuster. MIC — A System for Classification of Structured and Unstructured Texts. Master's thesis, University Koblenz, 2001. http://www/~gb/papers/thesis mic/mic.pdf.
2. A. Bouguettaya, B. Benatallah, and A. K. Elmagarmid. Interconnecting Heterogeneous Information Systems. Kluwer Academic Publishers, 1998.
3. N. Chinchor. MUC-4 evaluation metrics. In Fourth Message Understanding Conference, pages 22–29. Morgan Kaufmann, 1992.
4. D. Marco. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. John Wiley & Sons, 2000.
5. M. Maron. Automatic indexing: An experimental inquiry. Journal of the ACM (JACM), 8:404–417, 1961.
6. T. M. Mitchell. Machine Learning. McGraw-Hill International Editions, 1997.
7. J. Quinlan. Discovering rules by induction from a large collection of examples. In D. Michie, editor, Expert Systems in the Micro-Electronic Age, pages 168–201. Edinburgh University Press, Edinburgh, 1979.
8. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, pages 533–536, 1986.
9. C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
10. A. Sheth and J. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22:183–236, 1990.
Effects of Unreliable Group Profiling by Means of Data Mining Bart Custers Tilburg University, Faculty of Law, P.O. Box 90153, 5000 LE Tilburg, The Netherlands [email protected]
Abstract. With the rise of data mining technologies, group profiling, i.e., ascribing characteristics to groups of people, has increasingly become a useful tool for policy-making, direct marketing, etc. However, group profiles usually contain statistics, and therefore the characteristics in a group profile may be valid for the group and for individuals as members of that group, though not for individuals as such. When individuals are judged by group characteristics they do not possess as individuals, this may strongly influence the advantages and disadvantages of using group profiles. However, striving for more reliable group profiles provides only a partial solution to this problem, since perfectly reliable group profiles may still result in unjustifiable treatment of people. A broader solution to deal with the disadvantages of group profiles may be found in developing new ethical, legal, and technological standards that adequately recognize the possible harmful consequences of particular types of information. Keywords: Data mining, KDD, group profiling, personal data, data protection, reliability, distributivity, security, selection, stigmatization, confrontation, ethics.
1 Introduction Information and communication technologies have resulted in large databases with enormous amounts of data. From the need to discover knowledge from these large amounts of data, data-mining techniques have been developed in order to find patterns and relations in data. When characteristics are ascribed to people, we speak of profiles. Profiles concerning individuals are called personal profiles, sometimes also referred to as individual profiles or customer profiles. A personal profile is a property or a collection of properties of a particular individual. Profiles concerning a group of persons are referred to as group profiles. Thus, a group profile is a property or a collection of properties of a particular group of people. Ascribing characteristics to individuals may be done either correctly or incorrectly.1 If an individual is being judged upon information that was wrongly ascribed to him, most legal systems provide opportunities to have the information changed or deleted, possibly combined with compensation of damages.
1 For inference errors that may occur when ascribing characteristics, see [1].
Group characteristics are more complex: they may be correct for the group as a whole and members of that group, though not for individuals as such. To explain the difference, we may use the following example. Suppose in street A 80 percent of the people wear glasses. Without any further knowledge, it may be suggested that there is a high probability (80 percent) that a person living in street A wears glasses. This is when this person is regarded as a member of the group of people living in street A. When these persons are considered as individuals as such, it will be clear immediately who wears glasses and who does not. It may be argued that, when group characteristics are incorrectly ascribed to individuals, there should be a right for people to have information changed or deleted. However, since group data is often anonymous data, it is usually not protected by data protection laws. Besides, most people are unaware of the group profiles they are being judged upon.2
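The street A example can be made concrete in a few lines of code: the sketch below contrasts the group-level statistic with the individual-level facts and checks whether the profile property is distributive (true for every member). The data are invented for illustration only.

```python
# Sketch of the street A example: the group profile (a fraction) versus the
# individual-level facts. Invented data, for illustration only.
residents = {"Ann": True, "Bob": True, "Cem": True, "Dia": True, "Eli": False}

wears_glasses = list(residents.values())
group_profile = sum(wears_glasses) / len(wears_glasses)  # 0.8 for this street

# As a member of the group, each resident is assigned an 80% probability;
# as an individual, each either wears glasses or does not.
distributive = all(wears_glasses)  # False: the property does not hold for everyone
print(group_profile, distributive)
```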
2 Risks and Benefits of Group Profiles
The use of group profiles may have various advantages and disadvantages. Starting with some general advantages, the search for patterns and relations in data may provide overviews of large amounts of data, facilitate the handling and retrieving of information, and help the search for immanent structure in nature. More closely related to the goals of particular users, group profiles may enhance efficacy (achieving more of the goal) and efficiency (achieving the goal more easily); here, efficiency often means cost efficiency. For group profiles, usually less information is required than for individual profiles (although reliability may not be as good). Group data is usually anonymous data and is therefore in most (notably European) countries not protected by data protection law, which means that no costly and time-consuming effort to obtain informed consent has to be made. Group profiling also provides more opportunities for selecting targets. For instance, members of a high-risk group for lung cancer may be identified and treated earlier, or people not interested in cars will no longer receive direct mail about the subject. So-called hit ratios will increase with the help of profiling, but new groups of customers or risk-bearers may also be discovered. Most of the disadvantages of using group profiles are closely connected to their advantages. One of the main applications of group profiles is selection, as indicated above. However, much selection may be unwanted or unjustified. When selection for jobs is performed on the basis of medical profiles, this may soon lead to discrimination.3 Unjustified selection may also occur in cases of purchasing products, acquiring services, applying for loans, applying for benefits, etc.
2 Many authors urge for more openness concerning the collection and use of data towards data subjects and the public in general. See for instance [2].
3 A case study in the U.S. showed that discrimination as a result of access to genetic information resulted in loss of employment, loss of insurance coverage, or ineligibility for insurance. All cases of discrimination were based on the future potential of disease rather than existing (symptoms of) diseases [3].
Some of the group profiles constructed by companies, governments, or researchers may also become 'public knowledge,' which may lead to the stigmatization of particular groups. Another disadvantage may occur when people are confronted with information about a group they belong to. When supposedly healthy people are confronted with the fact that they have only a limited lifetime left, this may upset their lives and the lives of others; in some cases, people may prefer not to know their prospects while they are healthy. Although it may seem that group profiles lead to a more individual approach (e.g., through customization), their use may in fact lead to de-individualization. This is a paradox, but group profiles result in a tendency to judge and treat people on the basis of their group characteristics instead of on their own individual characteristics and merits [4]. Thus, the use of profiles may lead to a more one-sided treatment of individuals. As I will show in the next section, the effects of all these risks and benefits of group profiles are strongly influenced by the reliability of the profiles and their use.
3 Reliability When discussing the reliability of group profiles, it is important to distinguish distributive group profiles from non-distributive group profiles. Distributivity means that a property in a group profile is valid for each individual member of a group; nondistributivity means that a property in a group profile is valid for the group and for individuals as members of that group, though not for those individuals as such [5]. The reliability of a group profile may influence the effects, both positive and negative, of the use of the profile. The reliability of a group profile may be divided into two factors. The first is the reliability of the profile itself and the second is the reliability of its use. The creation of group profiles consists of several steps, in which errors may occur [6]. First, the data on which a group profile is based may contain errors, or the data may not be representative for the group it tries to describe. Furthermore, to take samples, the group should be large enough to give reliable results. In the data preparation phase, data may be aggregated, missing data may be searched for, superfluous data may be deleted, etc. All these actions may lead to errors. For instance, missing data is often made up, which is proved by the fact that a significantly large number of people in databases tend to have been born on the 1st of January (1-1 is the easiest to type) [7]. The actual data mining consists of a mathematical algorithm. There are different algorithms, each having its strengths and weaknesses. Using different data-mining programs to analyse the same database may lead to different group profiles. The choice of algorithm is very important and the consequences of this choice for the reliability of the results should be realized. For instance, in the case of a classification algorithm, the chosen classification criteria determine most of the resulting distribution of the subjects over the classes. As far as the reliability of the use of group profiles is concerned, this depends on the interpretation of the group profile and the actions that are taken upon (the interpretation of) the group profile. As was explained above, both the interpretation
and the actions determined depend on whether people are regarded as members of the group or as individuals as such. It should be noted that a perfectly reliable use of a group profile, i.e., 100 percent of the group members sharing the characteristic, does not necessarily imply that the results of its use are fair or desirable. This may occur especially in the case of negative characteristics, for instance, when all members of a group consisting only of handicapped people are refused a particular insurance. Although the use of the group profile is perfectly reliable, it is not justified. Note that the difference between regarding people as group members or as individuals is not applicable to future properties. For instance, an epidemiological group profile with the characteristic that 5 percent of a particular group will die from a heart attack does not provide any information on the question whether Mr. Smith, who is a member of this group, will die from a heart attack. And since Mr. Smith himself has no additional information on this, his perspective as a group member is no different from the perspective of someone outside the group. The fact that in non-distributive profiles not every group member has the group characteristic has different consequences depending on whether the characteristic is generally regarded as negative or positive. This is illustrated in Figure 1. People in category A have the disadvantages of sharing the negative group characteristic and of being treated on the basis of this negative profile. This may result in an accumulation of negative things: first, there is the negative health prospect; on the basis of this prospect, stigmatization and selection for jobs, insurances, etc., may follow. In category B, people have the disadvantage of being treated as if they have the negative characteristic, although this is not the case. There may be an opportunity for these people to prove or show they do not share the characteristic, but they are 'guilty until proven innocent.'

                                              Group characteristic:
                                              Negative    Positive
People not having the group characteristic:      B           D
People having the group characteristic:          A           C

Fig. 1. Not every group member necessarily has a group characteristic. This has different consequences depending on whether the characteristic is negative or positive.
Sometimes, proving exceptions is useless anyway, for instance when a computer system does not allow exceptions or when handling exceptions is too costly or time-consuming. Sometimes people in category B may have an advantage. This is the case when measures are taken to improve the situation of the people with the negative characteristic. For instance, when the government decides to grant extra money to a group with a very low income, some group members not sharing this characteristic may profit from this. People in category C have the advantage of having the (positive) group characteristic as well as being treated on it. Similar to the people in category A, this may be accumulative: the group may get the best offers for jobs, insurance, loans, etc. Finally, category D contains the people who do not share the positive group characteristic. Their advantage may be that they are being treated on a positive characteristic, but the disadvantage is that they are not recognized as not having the positive characteristic, or even as having a negative characteristic. Lack of such recognition may become a problem when measures are taken to help the people with negative characteristics. For instance, people in category D may not be recognized as people running a great risk of getting colon cancer and are thus easily forgotten in government screening programs. From Figure 1 it becomes clear that there is a difference between correct treatment and fair treatment. People in categories A and C can be said to be treated correctly, since they are treated on a characteristic they in fact have. Whether this treatment is also fair remains to be seen. Accumulation of negative things for people in category A and of positive things for people in category C may lead to polarization. People in categories B and D do not have the group profiles of the groups they belong to and are therefore being treated incorrectly. Incorrect treatment very probably also implies unfair treatment, since it does not take into account the actual situation people are in.
4 Concluding Remarks
As was shown in the previous sections, the reliability of a group profile may strongly influence its advantages and disadvantages. It is, however, clear that striving for more distributive group profiles will provide only a partial solution to this problem: perfectly reliable profiles may still be used unjustifiably. The other end of the spectrum, i.e., prohibiting group profiles altogether, will not be a realistic solution either. A broader solution to the disadvantages of group profiles will have to be sought in new ethical and legal standards posing smart restrictions on the availability and use of particular types of information. Such restrictions may be enforced by law and regulations in combination with several security techniques [8]. Security techniques with regard to data mining concern not only access controls, but also flow controls and inference controls [7], [9]. Particularly those types of information that are likely to result in harmful group profiles after data mining should be used with special care. For instance, it will be
advisable not to put racial data and criminal data in a joint database, as the relations that data mining may discover are likely to be harmful to groups of people. Data miners and technicians sometimes suggest that they only provide information and that it is the decision-makers who wrongfully use such information. Decision-makers, in turn, often reply that they simply based their decisions on the information provided. Instead of blaming each other, a joint responsibility for the careful use of information would be preferable.
References
1. Harvey, J. (1990) Stereotypes and Group-Claims: Epistemological and Moral Issues, and Their Implications for Multi-Culturalism in Education, Journal of Philosophy of Education, Vol. 24, No. 1, 39–50.
2. Black, S.K. (2002) Telecommunications Law in the Internet Age, San Francisco: Morgan Kaufmann Publishers.
3. Geller, L.N., Alper, J.S., Billings, P.R., Barash, C.I., Beckwith, J., and Natowicz, M. (1996) Individual, Family, and Societal Dimensions of Genetic Discrimination: A Case Study Analysis, Science and Engineering Ethics, Vol. 2, No. 1, 71–88.
4. Vedder, A.H. (1999) KDD: The challenge to individualism, Ethics and Information Technology, No. 1, 275–281.
5. Vedder, A.H. (1996) Privacy en woorden die tekort schieten, in: Privacy in het informatietijdperk, S. Nouwt and W. Voermans (eds.), Den Haag: SDU Uitgevers, 17–30.
6. Frawley, W.J., Piatetsky-Shapiro, G., and Matheus, C.J. (1993) Knowledge Discovery in Databases: An Overview, in: Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley (eds.), Menlo Park, California: AAAI Press / The MIT Press.
7. Denning, D.E. (1983) Cryptography and Data Security. Amsterdam: Addison-Wesley.
8. Custers, B.H.M. (2001) Data Mining and Group Profiling on the Internet, in: Anton Vedder (ed.), Ethics and the Internet, Antwerpen/Groningen/Oxford: Intersentia, 2001.
9. Denning, D.E. and Schlörer, J. (1983) Inference Controls for Statistical Databases, IEEE Computer, Vol. 16, No. 7, 69–82.
Using Constraints in Discovering Dynamics
Sašo Džeroski, Ljupčo Todorovski, and Peter Ljubič
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Abstract. We present a constraint-based approach to discovering differential equations. The approach is based on heuristic search through the space of polynomial equations and can use subsumption and evaluation constraints on polynomial equations. Constraints can be used to effectively guide the discovery of equations: using such guidance can make the difference between success and failure in the discovery of laws describing more complex systems. We illustrate this on the problem of reconstructing the differential equations describing a network of chemical reactions.
1 Introduction
Equation discovery [7] is the area of machine learning that aims at developing methods for the computational discovery of quantitative laws, expressed in the form of equations, in collections of measured data. More formally, the task of equation discovery can be defined as follows: given a set of measurements or observations of the system variables, find a set of quantitative laws that summarize the observations. The laws discovered by equation discovery methods typically take the form of algebraic equations [7,12]. While algebraic equations are mainly used to establish models of static systems that have reached an equilibrium state, ordinary differential equations can be used for modeling the behavior of dynamic systems, i.e., systems that change their state over time. The equation discovery method Lagrange is capable of discovering ordinary differential equations [3]. Lagrange exhaustively searches the space of polynomial equations with a limited degree and with a limited number of terms. The polynomial equations can include terms that are based on the observed system variables and their time derivatives, which are introduced by numerical derivation. The search through the space of polynomial equations can become infeasible, especially for realistic modeling tasks where many system variables are observed. In these cases the reductions based on maximal degree and maximal number of terms are not enough to make the exhaustive search in Lagrange feasible. In this paper, we explore two different ways to address this problem. First, we define a set of subsumption language constraints that can be easily understood and used by domain experts to tailor the space of candidate equations. Second, we order the search space with a refinement operator that operates on polynomial equations and allows us to apply heuristic beam search through the space of polynomial equations. We then illustrate the use of our approach in the domain of modeling the dynamics of networks of chemical reactions, where language constraints on polynomial equations can be naturally defined and used.
2 Constraints for Polynomial Equation Discovery
The use of constraints in knowledge discovery has been studied within the frameworks of inductive databases [5] and constraint-based data mining [1]. Here we consider inductive databases in the domain of polynomial equations. We define polynomial equations, syntactic/subsumption constraints on these, as well as evaluation primitives. We then discuss inductive queries in this domain.

2.1 Constraints in Inductive Databases
Inductive databases [5] embody a database perspective on knowledge discovery, where knowledge discovery processes are considered as query processes. In addition to normal data, inductive databases contain patterns (either materialized or defined as views). Data mining operations looking for patterns are viewed as queries posed to the inductive database. In addition to patterns (which are of a local nature), models (which are of a global nature) can also be considered. A general formulation of data mining [8] involves the specification of a language of patterns and a set of constraints that a pattern has to satisfy with respect to a given database. The set of constraints can be divided into two parts: language constraints and evaluation constraints. The former concern only the pattern itself, while the latter concern the validity of the pattern with respect to a database. Inductive queries consist of constraints. The primitives of an inductive query language include language constraints (e.g., find association rules with item A in the head) and evaluation primitives. The latter are functions that express the validity of a pattern on a given dataset. We can use these to form evaluation constraints (e.g., find all item sets with support above a threshold) or optimization constraints (e.g., find the 10 association rules with highest confidence). Constraints thus play a central role in data mining, and constraint-based data mining is now a recognized research topic [1]. The use of constraints enables more efficient induction, as well as focussing the search for patterns on those likely to be of interest to the end user. While many different types of patterns have been considered in data mining, constraints have mostly been considered in mining frequent patterns. Few approaches exist that use constraints for other types of patterns/models, such as size and accuracy constraints in decision trees [4].

2.2 The Language of Polynomial Equations
Given a set of variables V and a dependent variable v_d ∈ V, a polynomial equation has the form v_d = P, where P is a polynomial over V \ {v_d}. A polynomial P has the form P = Σ_{i=1}^{r} c_i · T_i, where the T_i are multiplicative terms and the c_i are real-valued constants. Each term is a finite product of variables from V \ {v_d}, T_i = Π_{v ∈ V \ {v_d}} v^{d_{v,i}}, where d_{v,i} is the (non-negative integer) degree of the variable in the term. A degree of 0 denotes that the variable does not appear in the term. The sum of the degrees of all variables in a term is called the degree of the term, i.e., deg(T_i) = Σ_{v ∈ V \ {v_d}} d_{v,i}. The degree of a polynomial is the maximum degree of a term in that polynomial, i.e., deg(P) = max_{i=1,...,r} deg(T_i). The length of a polynomial is the sum of the degrees of all terms in that polynomial, i.e., len(P) = Σ_{i=1}^{r} deg(T_i).

2.3 Syntactic Constraints
We will consider two types of syntactic constraints on polynomial equations: parametric and subsumption constraints. Parametric constraints set upper limits on the degree of a term (in both the LHS and the RHS of the equation), as well as on the number of terms in the RHS polynomial. For example, one might be interested in equations of degree at most 3 with at most 4 terms. Such parametric constraints are taken into account by the equation discovery system Lagrange [3]. Of more interest are subsumption constraints, which bear some resemblance to subsumption/generality constraints on terms/clauses in first-order logic. A term T_1 is a sub-term of a term T_2 if the corresponding multi-set M_1 is a subset of the corresponding multi-set M_2. For example, xy² is a sub-term of x²y⁴z. The sub-polynomial constraint is defined in terms of the sub-term constraint. A polynomial p_1 is a sub-polynomial of a polynomial p_2 if each term in p_1 is a sub-term of some term in p_2. There are two options here: one may, or may not, require that each term in p_1 is a sub-term of a different term in p_2. In looking for interesting equations, one might consider constraints of the form: LHS is a sub-term of t, t is a sub-term of LHS, RHS is a sub-polynomial of p, and p is a sub-polynomial of RHS. Here t and p are a term and a polynomial, respectively. These set upper and lower bounds on the lattice of equation structures induced by the relations sub-term and sub-polynomial. Consider the following constraint: LHS is a sub-term of x²y and both xy and z are sub-polynomials of the RHS. The equation xy = 2x²y² + 4z satisfies this constraint under both interpretations of the sub-polynomial constraint. The equation x²y = 5xyz, however, only satisfies the constraint under the first interpretation (different terms in p_1 can be sub-terms of the same term in p_2).
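To make the sub-term and sub-polynomial relations, and the two readings of the latter, concrete, the following small Python sketch (our illustration, not code from the paper; the dictionary-based term representation and all names are assumptions of the sketch) checks both interpretations:

from itertools import permutations

def is_sub_term(t1, t2):
    # t1, t2 are terms given as {variable: degree} dicts;
    # t1 is a sub-term of t2 if no variable has a higher degree in t1 than in t2
    return all(t2.get(v, 0) >= d for v, d in t1.items())

def is_sub_polynomial(p1, p2, distinct=False):
    # p1, p2 are polynomials given as lists of terms (constants are irrelevant here);
    # with distinct=True, each term of p1 must match a different term of p2
    if not distinct:
        return all(any(is_sub_term(t1, t2) for t2 in p2) for t1 in p1)
    return len(p1) <= len(p2) and any(
        all(is_sub_term(t1, t2) for t1, t2 in zip(p1, perm))
        for perm in permutations(p2, len(p1)))

# The second example above: both xy and z against the RHS of x^2y = 5xyz
xy, z, xyz = {"x": 1, "y": 1}, {"z": 1}, {"x": 1, "y": 1, "z": 1}
print(is_sub_polynomial([xy, z], [xyz]))                 # True under the first reading
print(is_sub_polynomial([xy, z], [xyz], distinct=True))  # False under the second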
2.4 Evaluation Primitives
The evaluation primitives for equations calculate different measures of the degree of fit of the equation to a given dataset/table. Assume that i is an index that runs through the records/rows of a database table. Denote by y_i the value of the LHS of a given equation on record i, by ŷ_i the value of the RHS as calculated for the same record, and by ȳ the mean value of y_i over the table records. Six measures of the degree of fit of an equation to a dataset, well known from statistics, are: the multiple correlation coefficient R and the normalized standard deviation S,

    R² = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)² / Σ_{i=1}^{N} (y_i − ȳ)²,
    S² = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² / (ȳ² + e^{−ȳ²}),

the mean/maximum absolute error,

    MeanAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|,    MaxAE = max_{i=1,...,N} |ŷ_i − y_i|,

and the mean/root mean square error,

    MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²,    RMSE = √MSE.
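The six primitives are straightforward to compute; the following sketch (ours, using numpy, and not part of the paper) collects them in one function:

import numpy as np

def evaluation_primitives(y, y_hat):
    # y: observed LHS values per record, y_hat: RHS values computed per record
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n, y_bar = len(y), y.mean()
    sse = np.sum((y - y_hat) ** 2)
    mse = sse / n
    return {
        "R2": 1.0 - sse / np.sum((y - y_bar) ** 2),
        "S2": mse / (y_bar ** 2 + np.exp(-y_bar ** 2)),   # normalized standard deviation
        "MeanAE": np.mean(np.abs(y_hat - y)),
        "MaxAE": np.max(np.abs(y_hat - y)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }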
2.5 Inductive Queries in the Domain of Equations
Inductive queries are typically conjunctions of primitive constraints, e.g., language and evaluation constraints. Evaluation constraints set thresholds on acceptable values of the evaluation primitives: M(e, D) < v or M(e, D) > v, where v is a positive constant and M is one of the measures defined above. Instead of evaluation constraints, one can consider optimization constraints: here the task is to find (the n best) e such that M(e, D) is maximal/minimal. Language constraints, as discussed above, can be parametric and/or subsumption constraints. It is rarely the case that an inductive query consists of a single constraint. Most often, at least one syntactic and at least one evaluation or optimization constraint would be part of the query. For example, we might look for the equations where the LHS is a sub-polynomial of X²Y³ and X + Z is a sub-polynomial of the RHS, and which have the highest multiple correlation coefficient.
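As an illustration (ours, not taken from the paper), such a query can be phrased as a filter plus a ranking over candidate equations, reusing the sketches above; since the LHS is a single term, the sub-term check suffices for it, and evaluate_rhs() is a hypothetical helper standing in for evaluating the fitted RHS on the data:

def inductive_query(candidates, data, target, lhs_bound, rhs_lower_bound, n_best=10):
    # keep equations whose LHS is a sub-term of lhs_bound and whose RHS contains
    # rhs_lower_bound as a sub-polynomial, then rank by the R^2 primitive
    admissible = [(lhs, rhs) for lhs, rhs in candidates
                  if is_sub_term(lhs, lhs_bound)
                  and is_sub_polynomial(rhs_lower_bound, rhs)]
    scored = [(evaluation_primitives(target, evaluate_rhs(rhs, data))["R2"], lhs, rhs)
              for lhs, rhs in admissible]
    return sorted(scored, key=lambda t: t[0], reverse=True)[:n_best]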
3 Heuristic Search for Polynomial Equations
Using constraints in the discovery of polynomial equations helps the user focus on interesting hypotheses and can drastically reduce the space of possible equations. However, in realistic problems with many variables, these reductions are not enough to make the exhaustive search in Lagrange [3] feasible. This motivates us to consider heuristic search. In this section, we will present a heuristic search algorithm that searches through the space of possible polynomial equations. First, a refinement operator will be introduced that orders the space of polynomial equations. Then, the heuristic function is defined that measures the quality of each equation considered during the search. Finally, the search algorithm based on a beam search strategy is presented.

3.1 The Refinement Operator
In order to apply heuristic search methods to the task of discovering polynomial equations, we first have to order the search space of candidate equations. We will define a refinement operator, based on the syntactic constraints above, that orders the space of equations according to their complexity. Starting with the simplest possible equation and iteratively applying the refinement operator, all candidate polynomial equations can be generated. Assume we measure the complexity of the polynomial equation v_d = P as len(P). The refinement operator increases len(P) by 1, either by adding a new linear term or by adding a variable to an existing term. Given an equation v_d = Σ_{i=1}^{r} c_i · T_i, the refinement operator produces equations that increase r by adding a new linear term (at most one for each v ∈ V \ {v_d}): v_d = Σ_{i=1}^{r} c_i · T_i + c_{r+1} · v, where ∀i: v ≠ T_i. It also produces refined equations that increase the degree of an existing term (at most one for each T_j and v ∈ V \ {v_d}): v_d = Σ_{i=1, i≠j}^{r} c_i · T_i + T_j · v, where ∀i ≠ j: T_j · v ≠ T_i. Special care is taken that the newly introduced/modified term is different from all the terms in the current equation.
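The two kinds of refinement can be enumerated directly; the following sketch (ours, reusing the dictionary-based term representation from above) generates all refinements of a given RHS polynomial:

def refinements(terms, variables):
    # terms: current RHS as a list of {variable: degree} dicts; returns all
    # polynomials whose length len(P) is larger by one unit
    results = []
    # (a) add a new linear term c_{r+1} * v, provided v differs from every existing term
    for v in variables:
        if all(t != {v: 1} for t in terms):
            results.append(terms + [{v: 1}])
    # (b) multiply an existing term T_j by a variable v, provided the modified
    #     term differs from all other terms of the equation
    for j, t in enumerate(terms):
        for v in variables:
            new_t = dict(t)
            new_t[v] = new_t.get(v, 0) + 1
            rest = terms[:j] + terms[j + 1:]
            if all(new_t != other for other in rest):
                results.append(rest + [new_t])
    return results

# Refining v_d = c1*x over {x, y, z} yields x + y, x + z, x^2, xy and xz
print(refinements([{"x": 1}], ["x", "y", "z"]))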
Note that the refinements of a given polynomial P are superpolynomials of P. They are minimal refinements in the sense that they increase the complexity of P (len(P)) by one unit. The branching factor of the refinement operator depends on the number of variables |V| and the number of terms r in the current equation. The upper bound of the branching factor is O((|V| − 1)(r + 1)) = O(|V|r), since there are at most |V| − 1 possible refinements that increase r and at most (|V| − 1)r possible refinements that increase d. The defined refinement operator is not optimal, in the sense that each polynomial equation can be derived in several different ways; e.g., z = x + y can be derived by adding x then y, or vice versa by adding y then x. This is due to the commutativity of the addition and multiplication operators. An optimal refinement operator can easily be obtained by taking into account the lexical ordering of the variables in V. Then, only variables (and/or terms) with a higher lexical rank than those already present should be added to the terms and/or equations. While an optimal refinement operator is desired for complete/exhaustive search (to avoid redundancy), it may prevent the generation of good equations in greedy heuristic search. Suppose the polynomials x and z have low heuristic value, while y has a high heuristic value, and x + y is actually the best. Greedy search would choose y, and the optimal refinement operator that takes into account the lexicographic order would not generate x + y.

3.2 The Search Heuristic
Each polynomial equation structure considered during the search contains a number of generic constant parameters (denoted by c_i). In order to evaluate the quality of an equation, the values of these generic constants have to be fitted against training data consisting of the observed values of the variables in V. Since the polynomial equations are linear in the constant parameters, the standard linear regression technique can be used, just as in Lagrange [3]. The quality of the obtained equation is usually evaluated using an evaluation primitive, such as the mean squared error (MSE), defined in the previous section. We will use an MDL (minimal description length) based heuristic that combines the degree of fit with the complexity of the equation, defined as MDL(v_d = P) = len(P) log N + N log MSE(v_d = P), where N is the number of training examples. The term len(P) log N introduces a penalty for the complexity of the equation. Thus, the MDL heuristic function introduces a preference toward simpler equations.
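Since the structures are linear in the constants, the fitting and the MDL score can be computed with ordinary least squares, as in this sketch (ours, not the authors' code; it assumes a constant term is always present, as in the initial equation v_d = c):

import numpy as np

def mdl_score(terms, data, target):
    # terms: RHS structure as a list of {variable: degree} dicts;
    # data: dict mapping each variable name to a numpy array of observations;
    # target: numpy array with the observed values of v_d
    n = len(target)
    cols = [np.prod([data[v] ** d for v, d in t.items()], axis=0) for t in terms]
    X = np.column_stack([np.ones(n)] + cols)        # constant term plus one column per term
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    mse = np.mean((X @ coeffs - target) ** 2)
    length = sum(sum(t.values()) for t in terms)    # len(P)
    return coeffs, length * np.log(n) + n * np.log(mse)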
3.3 The Search Algorithm
Our algorithm Ciper (Ciper stands for Constrained Induction of Polynomial Equations for Regression) employs beam search through the space of possible equations. The algorithm takes as input the training data D, i.e., the training examples, each of them containing the measurements of the observed (numerical) system variables, and a designated dependent variable v_d. In addition, a set of language constraints C can be specified. The output of Ciper consists of the b polynomial equations for v_d that satisfy the constraints C and are best with respect to the data D according to the MDL heuristic function defined in the previous section.

procedure Ciper(D, v_d, C, b)
 1  E_0 = simplest polynomial equation (v_d = c)
 2  E_0.MDL = FitParameters(E_0, D)
 3  Q = {E_0}
 4  repeat
 5      Q_r = {refinements of equations in Q that satisfy the language constraints C}
 6      foreach equation structure E ∈ Q_r do
 7          E.MDL = FitParameters(E, D)
 8      endfor
 9      Q = {best b equations from Q ∪ Q_r according to MDL}
10  until Q unchanged during the last iteration
11  print Q

Before the search procedure starts, the beam Q is initialized with the simplest possible polynomial equation of the form v_d = c. The value of the constant parameter c is fitted against the training data D using linear regression and the MDL heuristic function is calculated (lines 1–3). In each search iteration, the refinements of the equations in the current beam are computed first, using the refinement operator defined in Section 3.1, and checked for consistency with the specified language constraints C. Those that satisfy the constraints are collected in Q_r (line 5). When redundant equations are generated due to the non-optimality of the refinement operator, the duplicate equations are filtered out from the set Q_r. Linear regression is used to fit the constant parameters of the refinements against the training data D, and the MDL heuristic function is calculated for each refinement (lines 6–8). At the end of each search iteration, the best b equations, according to the MDL heuristic function, are kept in the beam (line 9). The search proceeds until the beam remains unchanged during the last iteration. Ciper can also discover differential equations: it can introduce time derivatives numerically, and then look for equations of the form ẋ_i = P(x_1, x_2, ..., x_n), where ẋ_i denotes the time derivative of x_i and P(...) denotes a polynomial.
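A compact executable rendering of this pseudocode (ours, wiring together the refinements() and mdl_score() sketches from above; the constraint check is passed in as a predicate) could look as follows:

def ciper(data, target, variables, satisfies_constraints, beam_size=16):
    def key(terms):  # order-independent signature of an equation structure
        return tuple(sorted(tuple(sorted(t.items())) for t in terms))
    beam = {key([]): ([], mdl_score([], data, target)[1])}       # v_d = c
    while True:
        candidates = dict(beam)
        for terms, _ in beam.values():
            for ref in refinements(terms, variables):
                if satisfies_constraints(ref) and key(ref) not in candidates:
                    candidates[key(ref)] = (ref, mdl_score(ref, data, target)[1])
        best = dict(sorted(candidates.items(), key=lambda kv: kv[1][1])[:beam_size])
        if best.keys() == beam.keys():               # beam unchanged: stop
            return [terms for terms, _ in beam.values()]
        beam = best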
4 Using Constraints in Modeling Chemical Reactions
To illustrate the use of constraints in discovering dynamics, we address the task of reconstructing a partially specified network of chemical reactions:

    {x5, x7} → {x1};    {x1} → {x2, x3}
    {x1, x2, x7} → {x3};    {x3} → {x4}
    {x4} → {x2, x6};    {x4, x6} → {x2}

The part of the network given in bold is assumed to be unknown, except for the fact that x6 and x7 are involved in the network. This is a task of revising an equation-based model. A network of chemical reactions can be modeled with a set of polynomial differential equations (see, e.g., [6]). The reaction rate of a reaction is proportional to the concentrations of the inputs involved (their product, e.g., x5 · x7). It influences the rate of change of all inputs (negatively) and all outputs (positively). The equation structures and full equations, corresponding to the partial and full network respectively, are given below.
Partial structure:
    ẋ1 = −c · x1 + c · x5 − c · x1 · x2
    ẋ2 = c · x1 + c · x4 − c · x1 · x2
    ẋ3 = c · x1 + c · x1 · x2 − c · x3
    ẋ4 = c · x3 − c · x4
    ẋ5 = −c · x5

Full equations:
    ẋ1 = 0.8 · x5 · x7 − 0.5 · x1 − 0.7 · x1 · x2 · x7
    ẋ2 = 0.7 · x1 + 0.2 · x4 + 0.1 · x4 · x6 − 0.3 · x1 · x2 · x7
    ẋ3 = 0.4 · x1 + 0.3 · x1 · x2 · x7 − 0.2 · x3
    ẋ4 = 0.5 · x3 − 0.7 · x4 · x6
    ẋ5 = −0.6 · x5 · x7
    ẋ6 = 0.2 · x4 − 0.8 · x4 · x6
    ẋ7 = −0.1 · x1 · x2 · x7 − 0.1 · x5 · x7
The full equations were simulated for 1000 time steps of 0.01 from a randomly generated initial state (each variable randomly initialized in the interval (0,1)), thus providing a trace of the behavior of the 7 system variables over time. The domain of modeling networks of chemical reactions lends itself naturally to the use of constraints in polynomial equation discovery. On the one hand, parametric constraints have a natural interpretation. A limit on r, the number of terms in an equation, corresponds to a limit on the number of reactions a compound is involved in. A limit on d, the degree of terms, corresponds to a limit on the number of compounds that are inputs to a chemical reaction. On the other hand, subsumption constraints can also be used in a natural way. A partially specified reaction network gives rise to equations that involve sub-polynomials of the polynomials modeling the entire network. The knowledge of the partial network can be used to constrain the search through the space of possible equations. The polynomial structures in the equations for ẋ1 ... ẋ5 in the partial network should be sub-polynomials of those in the corresponding equations of the complete network. These sub-polynomial constraints were given to Ciper together with the behavior trace for all 7 variables. No subsumption constraints were used for the equations defining ẋ6 and ẋ7, and no parametric constraints were used for any of the equations. Ciper successfully reconstructs the equations for the entire network, i.e., for each of the 7 system variables, for each of the two beam sizes used (64 and 128). Discovery without constraints, however, fails for two equations: for ẋ1 (beam 128) and ẋ2 (for both beams). Unconstrained search inspects more equations than constrained search (for ẋ2 and beam 128, 18643 versus 12901 equations); exhaustive search with d ≤ 3 and r ≤ 4 considers 637393 equations.
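For concreteness, a behavior trace like the one used here can be generated directly from the full equations; the sketch below (ours) uses forward Euler integration as a simple stand-in, since the paper does not state which integration method was used:

import numpy as np

def derivatives(s):
    # right-hand sides of the full equations of the reaction network above
    return {
        "x1": 0.8*s["x5"]*s["x7"] - 0.5*s["x1"] - 0.7*s["x1"]*s["x2"]*s["x7"],
        "x2": 0.7*s["x1"] + 0.2*s["x4"] + 0.1*s["x4"]*s["x6"] - 0.3*s["x1"]*s["x2"]*s["x7"],
        "x3": 0.4*s["x1"] + 0.3*s["x1"]*s["x2"]*s["x7"] - 0.2*s["x3"],
        "x4": 0.5*s["x3"] - 0.7*s["x4"]*s["x6"],
        "x5": -0.6*s["x5"]*s["x7"],
        "x6": 0.2*s["x4"] - 0.8*s["x4"]*s["x6"],
        "x7": -0.1*s["x1"]*s["x2"]*s["x7"] - 0.1*s["x5"]*s["x7"],
    }

state = {f"x{i}": np.random.uniform(0, 1) for i in range(1, 8)}   # random start in (0,1)
trace = [dict(state)]
for _ in range(1000):                                             # 1000 steps of size 0.01
    d = derivatives(state)
    state = {v: state[v] + 0.01 * d[v] for v in state}
    trace.append(dict(state))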
5 Discussion
We have presented a constraint-based approach to discovering dynamics, using parametric and subsumption constraints on polynomials. We have formulated a refinement operator for polynomial equations and developed a heuristic equation discovery method. We have used this approach to discover the dynamics of chemical reaction networks. In this domain, both parametric and subsumption constraints have a natural interpretation. Subsumption constraints can be used to specify, e.g., a partially known network of chemical reactions as prior knowledge. Constraints proved crucial for the successful reconstruction of an example network. Our work is related to recent efforts to formulate primitives for inductive databases and constraint-based data mining in a number of different pattern
domains [2]. While most of the work on this topic is concerned with the discovery of frequent patterns, our work focusses on equation-based predictive models. From the area of equation discovery, Lagramge [10] is most closely related: it performs heuristic search and allows for the specification of general constraints in the form of context-free grammars (which can also specify non-polynomial equations). But grammars are not always easy to understand and specify for the domain expert, whereas the subsumption constraints in Ciper are natural to interpret and use. In addition, while Lagramge uses expensive non-linear optimization for parameter fitting, Ciper uses linear regression. Work on the revision of equation-based models [11,9] is also related. Similarly to [11] and unlike [9], the presented approach allows for a straightforward encoding of existing partial models in the form of subsumption constraints. The approach of [11] is more general and allows for encoding arbitrary initial models, but this comes at the cost of decreased understandability and efficiency of parameter estimation, as discussed above. Directions for further work include the formulation and exploitation of further (types of) constraints in the domain of modeling reaction networks, such as considering a given maximal network of reactions and looking for a subnetwork. In this case, superpolynomial constraints might be used to focus/constrain the search for equations. Experiments on real data are also needed to thoroughly evaluate our approach. The formulation and use of constraints in other application domains of equation discovery is also an interesting direction for further work.
Acknowledgement. This work was supported in part by the project cInQ (Consortium on discovering knowledge with Inductive Queries), funded by the European Commission under the FET arm of the IST programme.
References
1. R. Bayardo. Constraints in data mining. SIGKDD Explorations, 4(1), 2002.
2. L. De Raedt. Data mining as constraint logic programming. In Computational Logic: From Logic Programming into the Future. Springer, Berlin, 2002.
3. S. Džeroski and L. Todorovski. Discovering dynamics: from inductive logic programming to machine discovery. Journal of Intelligent Information Systems, 4:89–108, 1995.
4. M. Garofalakis and R. Rastogi. Scalable data mining with model constraints. SIGKDD Explorations, 2(2):39–48, 2000.
5. T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, 1996.
6. J.R. Koza, W. Mydlowec, G. Lanza, J. Yu, and M.A. Keane. Reverse engineering of metabolic pathways from observed data using genetic programming. In Proc. 6th Pacific Symposium on Biocomputing, pp. 434–445. World Scientific, Singapore, 2001.
7. P. Langley, H.A. Simon, G.L. Bradshaw, and J.M. Żytkow. Scientific Discovery. MIT Press, Cambridge, MA, 1987.
8. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
9. K. Saito, P. Langley, T. Grenager, C. Potter, A. Torregrosa, and S.A. Klooster. Computational revision of quantitative scientific models. In Proc. 4th Intl. Conference on Discovery Science, pp. 336–349. Springer, Berlin, 2001.
10. L. Todorovski and S. Džeroski. Declarative bias in equation discovery. In Proc. 14th Intl. Conference on Machine Learning, pp. 376–384. Morgan Kaufmann, San Francisco, CA, 1997.
11. L. Todorovski and S. Džeroski. Theory revision in equation discovery. In Proc. 4th Intl. Conference on Discovery Science, pp. 390–400. Springer, Berlin, 2001.
12. T. Washio and H. Motoda. Discovering admissible models of complex systems based on scale-types and identity constraints. In Proc. 15th Intl. Joint Conference on Artificial Intelligence, pp. 810–817. Morgan Kaufmann, San Francisco, CA, 1997.
SA-Optimized Multiple View Smooth Polyhedron Representation NN
Mohamad Ivan Fanany and Itsuo Kumazawa
Imaging Science and Engineering, Tokyo Institute of Technology
{fanany,kumazawa}@isl.titech.ac.jp
http://kumazawa-www.cs.titech.ac.jp/~fanany/MV-SPRNN/mv-sprnn.html
Abstract. Simulated Annealing (SA) is a powerful stochastic search method that can produce very high quality solutions for hard combinatorial optimization problems. In this paper, we apply the SA method to optimize our 3D hierarchical reconstruction neural network (NN). This NN deals with the complicated task of reconstructing a complete representation of a given object relying only on a limited number of views and on erroneous depth maps of shaded images. The depth maps are obtained by the Tsai-Shah shape-from-shading (SFS) algorithm. The experimental results show that the SA optimization enables our reconstruction system to escape from local minima. Hence, it gives more exact and stable results with a small additional computation time.
1 Introduction
Solving the task of obtaining 3D models of real-world objects from their two-dimensional (2D) images is a very active area of inquiry in the Computer Vision and Computer Graphics communities. This is an important problem in these areas, with many applications such as virtual reality, animation, and object recognition. The task, however, is complicated due to three fundamental difficulties, namely the under-determined constraints provided by the images, the presence of local minima, and the lack of robustness of the reconstruction algorithm [1]. In this paper, we present a robust NN that fuses several erroneous depth maps of a conventional SFS technique, taken from a few views, into a more complete and more accurate representation of a given object. This NN is named MV-SPRNN (Multiple View Smooth Polyhedron Representation NN). To deal with multiple views, our NN stores the vertices of an initial 3D polyhedron model and represents its projectional depth images. By comparing the depth images of the model with the depth images of the object for each view, our NN updates the model's vertices using the error back-propagation method to approximate the object's shape. Three characteristics distinguish our approach from other approaches to reconstructing a 3D shape from its images using NNs or other function minimization techniques. First, our NN incorporates multiple views observed under a fixed light source condition, whereas other approaches use a single view either with a fixed light source [2] or a varying light source [3] (the NN of photometric stereo). Second, our NN receives erroneous depth maps from the conventional Tsai-Shah [4] SFS
algorithm and tries to correct them using its capability of handling multiple-view data, whereas other approaches [5,6] receive shaded images and try to obtain accurate depth maps before using them for 3D reconstruction. Third, our NN performs SA optimization in a hierarchical reconstruction framework, whereas another approach [7] performs the SA optimization in a non-hierarchical framework. Due to its inherent multiple-view structure, our NN scheme can correct the erroneous depth in one view using information obtained from other views. Due to its hierarchical strategy, our NN can generate a finer representation from a coarser model and smaller-size images.
2 MV-SPRNN
In our NN, we use a polyhedron model to represent the 3D shape of an object. The MV-SPRNN (Multiple-View Smooth Projected Polyhedron Representation NN) iteratively represents and refines the model to reconstruct the object. The representation process (also called the mapping process) refers to mapping the polyhedron model onto its projectional depth-map representations as seen from multiple views. The refinement process (also called the learning process) refers to learning how to update the vertices of the polyhedron model based on the error between the teacher's depth maps and the NN's depth maps produced by the mapping process. Before the two processes can be performed, the camera parameters are computed using a self-calibration method [8,9]. For the details of the mapping and learning processes of the NN, please refer to [10]. In contrast to [10], we add a momentum term in the learning process.
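As an illustration of the added momentum term (ours, not the authors' code; the actual error gradient comes from the depth-map comparison in the NN and is not reproduced here), a vertex update with momentum has the usual form:

def update_vertices(vertices, gradients, velocity, eta=1.0e-9, mu=0.9):
    # eta and mu correspond to the learning rate and momentum values used in Sect. 4
    new_velocity = [mu * v - eta * g for v, g in zip(velocity, gradients)]
    new_vertices = [x + v for x, v in zip(vertices, new_velocity)]
    return new_vertices, new_velocity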
Fig. 1. (a) Hierarchical reconstruction scheme; (b) model's shape refinement.
3 Hierarchical and Annealing Strategies
In order to achieve a good balance between the size of the training images and the number of vertices of the polyhedron model in each reconstruction stage, we perform a hierarchical reconstruction process as follows. We heuristically determine an appropriate image size for a given number of vertices to be trained. Several levels of input images and several levels of initial shape resolution are provided, as shown in Fig. 1(a). Then we subdivide the reconstruction result of
the previous stage of learning as a new initial 3D polyhedron for the next stage of learning. We perform a simple subdivision procedure to increase or refine the resolution of the initial 3D shape, as depicted in Fig. 1(b). In this study, we consider each vertex of the 3D polyhedron model as if it were a crystalline molecule in a liquid or a metal, for the following reasons. First, our system is a complex system with many degrees of freedom, whose complexity sharply increases as we increase the number of vertices to be trained. Second, it is possible to get stuck in local minima or metastable results during the training process. Third, it is possible to destroy a near-optimal state that has been learned in previous steps of the training process. To deal with these problems, we refer to the SA optimization method summarized in [11].
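The SA reinforcement builds on the standard acceptance rule with a geometric cooling schedule; the following generic sketch (ours — the paper does not spell out its exact scheme) shows the two ingredients that the FCR and FIT settings of Section 4 parameterize:

import math, random

def sa_accept(delta_error, temperature):
    # accept a vertex perturbation that changes the reconstruction error by
    # delta_error; worse moves are accepted with the Metropolis probability
    return delta_error <= 0 or random.random() < math.exp(-delta_error / temperature)

T, zeta = 1.0e7, 0.99     # e.g. the FIT setting: T0 = 1.0E7 with cooling rate 0.99
for epoch in range(1600):
    # ... propose vertex updates, evaluate delta_error, call sa_accept(...) ...
    T *= zeta             # geometric cooling once per epoch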
4 Experiments
In this paper, we performed three kinds of experiments: an experiment without the momentum term (Plain); an experiment with the momentum term (Momentum); and an experiment with the momentum term in conjunction with simulated annealing reinforcement (Momentum+SA). In the Momentum+SA experiment, we tried two settings: the Fixed-Cooling-Rate (FCR) setting and the Fixed-Initial-Temperature (FIT) setting. In all three experiments, we kept the learning rate fixed at η = 1.0E−9, and used the three first-level training images (300×382 pixels) and a first-level initial polyhedron model (642 vertices, 1280 faces).
Fig. 2. Performance of (a) the Plain and (b) the Momentum experiments; (c) Momentum+SA with the FCR setting (ζ = 0.9); (d) Momentum+SA with the FIT setting (T0 = 1.0E7).
Fig. 3. The initial 3D model and reconstruction results of the first-level stage in hierarchical learning. In each column, we observed the 3D model from three horizontal viewing angles: 0, −45, and +45 degrees. (a) The initial 3D model generated by subdividing the zero-level model. (b) The Plain experiment results after 400 iterations, and (c) after 800 iterations. (d) The Momentum+SA experiment results after 1200 epochs with the FIT setting, where the cooling rate is set to ζ = 0.99.

The reconstruction result is evaluated for each view by comparing the mean square error (MSE) of the depth map recovered by our reconstruction scheme and of the depth map recovered by the SFS method alone, each measured against a true depth map obtained by a 3D scanner. The reconstruction performance was defined as follows. The performance index of our NN was defined as

    P = MSE_SFS − MSE_NN.    (1)

For N views, instead of a single performance index, we measured the average performance index, defined as

    P̄ = E[P].    (2)

The rate of convergence was defined as the number of epochs needed to converge, i.e., to attain P̄ > ρ, where ρ is set to 0.003. The reconstruction stability was defined as the standard deviation of P̄ over 1600 epochs, denoted S(P̄). The computational time was defined as the average computation time over 1600 epochs. We summarize the best results obtained from each experiment in Table 1. The Momentum+SA (FIT: ζ = 0.99) experiment attained the highest P̄, with a convergence rate not much different from that of the Momentum experiment and without needing much additional computation time. Even though the overall stability index hardly differs numerically from that of the Momentum experiment, we observed a final stable result in Momentum+SA after about 800 iterations. This final stable state was not attained in the Momentum experiment.
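The quantities above are easy to compute from per-view and per-epoch MSE values; a small sketch (ours):

import numpy as np

def performance_index(mse_sfs, mse_nn):
    # P = MSE_SFS - MSE_NN per view; P_bar is its mean over the N views
    p = np.asarray(mse_sfs) - np.asarray(mse_nn)
    return p, p.mean()

def convergence_epoch(p_bar_per_epoch, rho=0.003):
    # number of epochs needed to attain P_bar > rho (None if never reached)
    for epoch, p_bar in enumerate(p_bar_per_epoch, start=1):
        if p_bar > rho:
            return epoch
    return None

def stability(p_bar_per_epoch):
    # S(P_bar): standard deviation of P_bar over the training epochs
    return float(np.std(p_bar_per_epoch))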
Table 1. Results summary.

                        Plain        Momentum (µ = 0.9)   Momentum+SA (FIT: ζ = 0.99)
    Highest P̄           0.00384      0.00374              0.00390
    Convergence rate    625 epochs   70 epochs            90 epochs
    Stability (S(P̄))    4.25E−4      1.31E−3              1.21E−3
    Computation time    39.05 sec    40.67 sec            40.80 sec

5 Conclusions
In this paper, we have presented an integrated framework for a 3D shape reconstruction system based on neural network and multiple-view approaches. This system is capable of fusing erroneous depth maps of shaded images from only a few views to produce a complete and accurate 3D model. Using SA, the system showed better reconstruction performance than the standard Tsai-Shah SFS approach and stabilized the reconstruction process, while retaining a fast convergence rate. In future work, we plan to reduce the dimensionality of our input data using principal component analysis to further cut the computational cost.
References
1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
2. Lehky, S.R., Sejnowski, T.J.: Network model of SFS: neural function arises from both receptive and projective fields. Nature 333, 452–454, 1988.
3. Iwahori, Y., Woodham, R.J., Ozaki, M., Tanaka, H., Ishii, N.: Neural Network Based Photometric Stereo with a Nearby Rotational Moving Light Source. IEICE Trans. Inf. & Syst., E80-D, 948–957, 1997.
4. Tsai, P.S., Shah, M.: Shape from Shading Using Linear Approximation. Image and Vision Computing Journal, 12(8), 487–498, 1994.
5. Wei, G.Q., Hirzinger, G.: Learning Shape from Shading by a Multilayer Network. IEEE Trans. on Neural Networks, 7(4), 985–995, 1996.
6. Wei, G.Q., Hirzinger, G.: Parametric Shape-from-Shading by Radial Basis Functions. IEEE Trans. PAMI, 19(4), 353–365, 1997.
7. Chowdhury, R., Chellappa, R., Krisnamurthy, S., Vo, T.: 3D Face Reconstruction from Video Using a Generic Model. Proceedings ICME, 2002.
8. Faugeras, O., Luong, Q.T., Maybank, S.: Camera Self-calibration: theory and experiments. Proc. ECCV'92, 321–334, 1992.
9. Hartley, R.: In defence of the 8-point algorithm. Proc. of the 5th Int. Conf. on Computer Vision, IEEE Comp. Soc. Press, Boston, MA, 1064–1070, 1995.
10. Kumazawa, I., Ohno, M.: 3D Shape Reconstruction from Multiple Silhouettes: Generalization from Few Views by Neural Network Learning. LNCS 2059, 687–695, 2001.
11. Romeo, F., Sangiovanni-Vincentelli, A.: A theoretical framework for Simulated Annealing. Algorithmica 6, 302–345, 1991.
Elements of an Agile Discovery Environment
Peter A. Grigoriev and Serhiy A. Yevtushenko
TU Darmstadt, Alexanderstr. 10, 64283 Darmstadt, Germany
{peter,sergey}@intellektik.informatik.tu-darmstadt.de
Abstract. Machine learning methods and data mining techniques have proved to be quite helpful in a number of discovery tasks. However, the most popular modern tools in this area tend not to support the discovery process properly. In this paper we investigate the reasons that prevent modern data mining tools from becoming convenient and productive discovery environments. We come up with principles of an agile discovery environment, i.e., data mining-driven software designed to support the process of discovery.
1 Introduction
Machine Learning (ML) and Data Mining (DM) methods and tools have proved to be quite helpful in a number of activities that human beings tend to spend time on. One of the most exciting of those activities is apparently the process of discovery. Whatever the other activities are, in this paper we will concentrate on how ML/DM tools could be useful for the discovery process. What we aim to show is that they could be much more useful for it than they are now.
Modern Data Mining tools. We aim to show that the ideology of the current DM tools is heading in a direction somewhat different from supporting the discovery process: current DM tools tend to support business processes rather than the discovery process. We also aim to show that the major weakness of the current data mining tools for the discovery process is that they underestimate the role of the discoverer, which, from our perspective, should remain central. Finally, we aim to show that the current data mining tools are totalitarian systems rather than partner systems, as they should be.
Agile Discovery Environment. Looking at all those weaknesses of the current DM tools, we dream of an agile discovery environment. We dream of an environment where:
– the intuition of a discoverer is boosted to the maximal extent,
– personal knowledge evolution is the way better models are achieved,
– a discoverer can look at the regularities in his data from totally different viewpoints, including diverse model types and different elaboration levels,
– discoverers, as a community, have all the tools to maintain, communicate and share their knowledge.
Extreme Programming. The core values of agile methodologies for software development were expressed in the "Manifesto for Agile Software Development" [1]. It introduced several principles, two of which we consider very important also for agile discovery:
1. individuals and interactions over processes and tools,
2. responding to changes over following a plan.
Extreme Programming (XP) is the most popular among the agile methodologies. Acceptance of agile methodologies, and XP in particular, has led to significant positive changes in the working process of many software developers. Thus we believe that some values and practices of agile software development are also useful in support of the discovery process. We dream of an agile discovery environment that would bring significant improvements to the work of researchers.
QuDA. So we dream of an agile discovery environment. We dream of one, and we work on implementing one. QuDA¹ is a data mining toolbox oriented toward backing up the discovery process. It would not be a surprise if we announced that we develop this software using the principles of agile software development and the methodology of extreme programming. Please note that we do not aim to prove here that QuDA is somewhat superior to other systems (even with respect to the requirements that we formulate in this paper). QuDA is nothing more than a working and constantly evolving prototype of an agile discovery environment, which implements some of the features and fulfills some of the requirements that we state in this paper. QuDA is intended to be a test-bed for the community to try whether our ideas are useful or not. If some of them appear useful, we would like to encourage other data mining software developers to implement them in their own tools.
2 A Perfect Discovery Workbench
We understand the process of discovery as the process of creating a good model for a certain phenomenon of interest (although this does not seem to be quite terminologically correct, as C. Piscopo and M. Birattari pointed out in [2]). There are many diverse understandings of which models can be called good ones. We will not go into deep detail on this topic; we will assume instead that a good model is a model which the discoverer is confident of and satisfied with. Later in this section we will see how this understanding affects the list of requirements on a convenient discovery workbench.
2.1 Schneiderman's Recommendations
Apart from the data mining community, several other scientific communities contribute to the creation of better discovery tools. The three most active of them are the
¹ QuDA used to be an acronym for "Qualitative Data Analysis". However, as more and more quantitative methods have been implemented, this name has lost its original meaning and became just a proper name.
communities of automated scientific discovery, data visualization, and knowledge representation. Our major claim is that modern discovery tools involving data mining techniques could be improved and should be improved. This point of view is not new to the researchers among the data mining community as well as the other communities we’ve named. Schneiderman in [3] gives the following four recommendations: 1. 2. 3. 4. 2.2
integrate data mining and information visualization to invent discovery tools, allow users to specify what they are seeking and what they find interesting, recognize that users are situated in a social context, respect human responsibility when designing discovery tools. Elements of an Agile Discovery Environment
As we have mentioned in the introduction, a limited number of principles has made it possible to create agile environments for software development. Basic principles of the agile programming, namely evolution and responding to changes are, from our perspective, also the key issues for an agile discovery. Respecting these key issues we will now extend and rearrange the Schneiderman’s recommendation list and expand some of its items. Here come our requirements for an agile discovery environment (ADE). Requirement 1. An ADE should respect and support human responsibility in discovery process. This implies following abilities that an ADE should have: 1. present the model in a way discoverer can comprehend it; 2. allow discoverer to incorporate his personal background knowledge; 3. allow discoverer to intervene on and roll-back each of the model creation steps; 4. allow discoverer to modify any created model the way he wants2 ; 5. support incremental model refinement (both handmade and automatic) when new information arise. Requirement 2. An ADE should support means for boosting the discoverers intuition. For that purpose data/knowledge visualization as well as other machine-human interaction languages could be involved. We will not go any deeper on this topic. Consider [3] for details. Requirement 3. An ADE must support the evolution of discoverers personal knowledge of the topic under investigation. Thus any machine learning tools used in an ADE should be oriented to support human learning. This point again implies with necessity that models should be presented in a way discoverer can comprehend them3 . Another implication is that an ADE should be capable of 2 3
Of cause under his responsibility. This does not prohibit any use of black-box-like models, such as Neural Networks, Support Vector sets, or frequency tables of the Naive Bayes classifier. Those models just cannot be the output, the final point of the discovery process. They can, however, give a good hint on the causality nature of the corresponding domain, and sometimes they can serve as an estimate on accuracy which one can (theoretically) achieve.
314
P.A. Grigoriev and S.A. Yevtushenko
turning itself into a convenient reference book (if not an encyclopedia) on a particular application domain. Putting it in Bill Gates words, an ADE should be capable of providing domain-specific information “at discoverers fingertip”. Requirement 4. An ADE should support team discovery. In most cases, a discoverer indeed operates in a social context. Thus discoverers, as a community, should have all the tools they may need to maintain, communicate and share their knowledge. This implies at least the following three points: 1. an ADE should support standard data and model representation formats, as different discoverers can use different tools; 2. an ADE should scrupulously track back the history of the discovery process (not only other discoverers have a perfect right to verify the results, they may also want to apply a successful scheme to another problem); 3. an ADE should support model conversion tools, (You prefer to work with decision trees, and I prefer to work with rule-sets? Fine! My ADE will break down your tree into a set of rules; your one will build a tree from my ruleset4 ) Requirement 5. An ADE should keep the discoverer open-minded. Totalitarian systems have proved unpromising not only in the social life. Thus with an ADE, a discoverer should be able to look at the regularities in his data from totally different viewpoints, including diverse model types and different elaboration levels, switching his viewpoint rapidly (and willingly). This actually implies once again the need for model conversion tools.
3
Modern Data Mining Tools and Agile Discovery
Let us now see how good the modern wide-spread data mining tools are to fulfill our expectations. For this review we have selected the 5 most popular tools, based on the KD-Nuggets poll, held in June 2002 [4]. 3.1
Modern Data Mining Tools ...
Based on the KD-Nuggets Poll results we selected the five most popular data mining tools: SPSS Clementine [5], Weka [6], SAS [7], CART/MARS [8], and SPSS/AnswerTree [9]. SPSS Clementine provides a visual programming environment for specifying sequence of steps for performing data mining tasks. It supports a rich set of data transformation utilities as well as graphical reports and provides a vast amount of data mining algorithms. Weka is one of the best known open source data mining toolkits. It provides an impressive amount of data mining algorithms and several advanced facilities for their comparison. The last version of Weka provides a visual programming environment, similar to the one of Clementine. Drawbacks of Weka include limited visualization facilities, bad interoperability with other data mining tools and 4
Which seems to be a little bit more complicated task, but still it’s achievable.
Elements of an Agile Discovery Environment
315
limited facilities for exploration of models. For many algorithms, the user can only get a textual representation of a model, which is often not easy to interpret and navigate. SAS Enterprise Miner provides a comprehensive set of tools for all the stages of a data mining process. The system has a powerful set of data visualization tools and provides machine learning algorithms from seven categories (decision trees, linear and logistic regression, neural networks, principal components, two-stage memory-based reasoning and ensemble models). It also provides rich reporting facilities. CART/MARS is actually a set of single-mining task tools. CART provides ability to build decision trees and regression trees. MARS provides a modeling technique called Multi-Variate Regression Splines. These tools implement quite sophisticated learning algorithms, but unfortunately, from the whole data mining cycle they fully support only stages of model creation and creation of reports. The support for the stage of data assessment and preparation is pretty limited one. AnswerTree is a decision tree induction system, that allows users to build and interactively explore decision or regression trees using one of the four algorithms implemented in system. Positive tendencies. If one analyses features of aforementioned tools, the following strong points can be selected: 1. high level of automation for performing/repeating sequence of atomic data mining steps (SPSS, Weka, SAS), 2. good visualization techniques (SPSS, SAS), 3. support for interactive model exploration (SPSS/Answer tree), 4. the latest versions of tools tend to support model interchange with a help of PMML standard, 5. decision tree tools usually allow transformation of decision trees into rule sets. The negative tendency that we see is that the modern data mining tools... 3.2
... Aren’t Agile Discovery Environments
Here follows the list of typical limitations of the reviewed tools. 1. The data mining process is actually finished as a model is built. One can evaluate the model on a new dataset, but there are no obvious means to “tune”, or evolve it with respect to the information change. 2. The machine learning methods, which support background knowledge usage, such as Inductive Logic Programming [10] or Quasi-Axiomatic Theories of the JSM-method [11] are out of the modern popular data mining toolbox. None of the five systems has any hint on background knowledge usage. 3. The models are usually bound to only one possible knowledge representation, produced by the output algorithms, and can not be translated into other representations.
316
P.A. Grigoriev and S.A. Yevtushenko
4. The abilities of a discoverer to modify models per hand are extremely limited, if any5 . As you can see from this list, the most popular data mining tools stay far away from agile discovery environments. To stress this point, let us name one more circumstance. We strongly believe that a perfect discovery framework for data mining must support a work-flow that does not include any use of machine learning algorithms at all. The discoverer should be able to hypothesize on his own, seeing how the model he is creating evolves (e.g. using a test dataset) without employing the intuition of the others, packed into machine learning methods. This “hand-craft” mode of the model creation is not supported by any of the five systems. The only exception in this tendency appears to be Weka, were somebody implemented this mode for decision trees6 . This tendency supports our claim that the role of a human being in discovery process is being considerably undervalued in the current data mining systems.
4
Elements of Agile Discovery Environment in QuDA
QuDA [12] is a data mining toolbox designed to back up the process of discovery. The authors develop this software for about two years now at the Intellectics lab of the Darmstadt University of Technology. In this section we will address how several principles of the agile discovery environment are implemented in QuDA. Please note once again that by no means we aim show that QuDA is somewhat superior to the other data mining tools that we have mentioned in Section 3. This doesn’t hold even with respect to the the list of requirements that we have formulated earlier in this paper. At the moment, QuDA does not fulfill all of the requirements7 . However, it already fulfills a good combination of them, and thus can serve for the community as a test-bed for defining whether or not our ideas indeed useful for the discovery process. Requirement 1. An ADE should respect and support human responsibility in discovery process. 1. ... present the model in a way discoverer can comprehend it... QuDA operates with decision trees, decision lists, and rulesets. Most effort is put on working with rulesets (Figure 1). 2. ...allow discoverer to incorporate his personal background knowledge... QuDA supports different modes for background knowledge incorporation. For example, users can create and manage hierarchies 8 , which define custom generalization procedure for an attribute (or a group of attributes) and determine 5 6 7 8
I.e. with SPSS/AnswerTree you can grow or cut a certain subtree, but you cannot modify the tests in its nodes. But note: only for decision trees, although Weka supports numerous other model types. Of cause we hope that someday it will. Intrinsically this equals introducing a custom attribute type.
Elements of an Agile Discovery Environment
317
the way missing values are treated. Apart from that, users can add their own rules to any ruleset as well as manually create a ruleset, and then merge it into any generated one. 3. ...allow discoverer to modify any created model the way he wants... Users of QuDA can add, delete and modify single rules in any ruleset. They can also filter out rules according to a number of build-in heuristics and even specify their own criteria to do so. As a user performs changes, he gets an immediate feedback on how these changes affect the model consistence/coverage over a given test/validation dataset (Figure 1).
Fig. 1. Ruleset navigation in QuDA
Requirement 2. ...support means for boosting the discoverers intuition. QuDA supports a number of techniques for data visualization, model visualization, and also model comparison visualization. One of the techniques, that somehow integrates three of the above-mentioned is called Generic Logic Diagram (GLD)9 . Surprisingly enough, none of the 5 most popular data-mining toolboxes has this feature... QuDA does have it. Requirement 3. ... support the evolution of discoverers personal knowledge. QuDA integrates all the information about a data mining project in a single document. This includes one or more related datasets, obtained models, model comparison results in forms of graphics and tables, researcher’s own notes, documentation, etc. QuDA documents are intended to help researchers keep all the relevant information on a given problem domain within easy reach. Requirement 4. ...support team discovery. QuDA supports PMML [14], as the machine-readable standard on data mining models interchange. It also 9
9 We learned this technique from Machine Learning in C++ (MLC); see [13].
The tracking of the model creation history is not yet implemented, but we are working on it and consider this feature extremely important.
Requirement 5. ... keep the discoverer open-minded... Like several other tools reviewed in Section 3, QuDA supports diverse model types and includes several model comparison and estimation utilities, such as cross-validation, learning curves, and GLDs, to help users choose an appropriate model type.
5 Conclusions
We have characterized a number of principles for an agile discovery environment. The two most important of these principles are: 1. models should be built in an evolutionary manner, and 2. the right place to build them is the head of his majesty, the discoverer. We have shown that the most popular data-mining toolboxes do not qualify as agile discovery environments at the moment. Moreover, it seems that they are not going to become such in the near future. Using the QuDA system as an example, we demonstrated how several features of an agile discovery environment can be implemented. We encourage the developers of data mining software to try these features with QuDA and to implement the ones they find appropriate in their own systems.
Acknowledgments. We would like to thank the German Federal Ministry of Education and Research for supporting the DaMiT project (damit.dfki.de). QuDA was first developed as a supplementary system for this online tutorial. We would also like to thank the reviewers for their valuable comments.
References
1. Manifesto for agile software development, 2001. http://agilemanifesto.org/
2. C. Piscopo and M. Birattari. Invention vs. discovery: a critical discussion. In Discovery Science, 5th International Conference, DS 2002, pages 457–462, Berlin, 2002. Springer-Verlag.
3. Ben Shneiderman. Inventing discovery tools: combining information visualization with data mining. In Discovery Science, 4th International Conference, DS 2001, pages 17–29, New York, 2001. Springer-Verlag.
4. KDnuggets. Polls: Data mining tools you regularly use, June 2002. http://www.kdnuggets.com/polls/2002/data mining tools.htm
5. http://www.spss.com/SPSSBI/Clementine/
6. Weka 3: Machine learning software in Java. http://www.cs.waikato.ac.nz/ml/weka/
7. http://www.sas.com
8. http://www.salford-systems.com/products-cart.html
9. http://www.spss.com.sg/products/answertree.htm
10. Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. Lecture Notes in Artificial Intelligence. Springer, 1997.
11. Viktor Finn. Plausible reasoning of JSM-type in open (+/-) worlds. In Architectures for Semiotic Modeling and Situation Analysis in Large Complex Systems, Monterey, CA, USA, 1995.
12. http://ki-www2.intellektik.informatik.tu-darmstadt.de/~jsm/QDA
13. http://www.sgi.com/tech/mlc/
14. Data Mining Group. Predictive Model Markup Language (PMML). http://www.dmg.org/
Discovery of User Preference in Personalized Design Recommender System through Combining Collaborative Filtering and Content-Based Filtering

Kyung-Yong Jung, Jason J. Jung, and Jung-Hyun Lee

Department of Computer Science & Engineering, Inha University, Inchon, Korea
[email protected], [email protected], [email protected]
Abstract. More and more recommender systems build close relationships with their users by adapting to their needs and thereby providing a personal experience. One aspect of personalization is the recommendation and presentation of information and products so that users can access the recommender system more efficiently. However, powerful filtering technology is required in order to identify relevant items for each user. In this paper we describe how collaborative filtering and content-based filtering can be combined to provide better performance for information filtering. We propose a personalized design recommender system for textile design that applies both technologies, as one of the methods for material development centered on the customer's sensibility and preference. Finally, we plan to conduct empirical applications to verify the adequacy and the validity of our personalized design recommender system.
1 Introduction

More and more services are available on the Internet through recommender systems. In general, these recommender systems are focused on providing information or on selling products. Often both are done at the same time, e.g., e-commerce sites that provide detailed information about their products. The user has been recognized as a very valuable asset to the recommender system. Therefore, recommender systems try to tie users to their service by letting them access the pieces of information and products they prefer more efficiently (i.e., in a less time-consuming way). Often, users may define preferences through user profiles, which are then used to personalize visits to the recommender system, so that it presents a customized view adapted to the user's interests. This trend toward personalized design recommender systems, in contrast to static collections of hypertext documents, necessitates new technologies and tools to adapt to users. One key technology is information filtering, so that important objects can be automatically identified and presented to users. Techniques which have been proposed for information filtering fall into two classes: collaborative filtering and content-based filtering. In order to build a better performing filtering system, both techniques can be combined. In recent
research, approaches combining both techniques have been studied [1,7]. The purpose of our research is to explore the combination of collaborative filtering and content-based filtering on textiles. The rest of the paper is organized as follows: In Section 2 we describe sensibility engineering collaborative filtering. In Section 3 we describe content-based filtering. Section 4 describes our combined approach, and an experimental study is presented in Section 5. Finally, we conclude in Section 6.
2 Sensibility Engineering Collaborative Filtering

2.1 The Extraction of Representative Sensibility Adjective from Database
We selected 18 pairs of sensibility adjectives using a method that extracts adjectives from dictionaries, magazines, the literature, etc. The pairs of adjectives about textiles used in the questionnaire are {(spaced, compact), (dark, bright), (rural, urban), (childish, adult), (ungraceful, graceful), (pure, sexy), (conservative, progressive), (female, male), (curve, linear), (dull, clear), (cold, hot), (simple, complicated), (mechanical, natural), (old, new), (static, active), (retro, modern), (oriental, western), (soft, hard)}. 512 users (259 males and 253 females) evaluated each textile on a semantic differential scale with ratings from -2 to +2 (5 levels). The textile database indexed by the sensibility adjectives consists of the user ratings, user profiles, and information about the textiles. All the textiles are classified into 4 categories, Imaginal, Natural, Geometry, and Artificial, with 20 textiles each. After selecting each pair of adjectives expressing sensibility, we obtained 36 sensibility adjectives. The questionnaires on the web (http://hci.inha.ac.kr/SulTDS) are based on the degree of sensibility that users feel about 60 textiles. The web-based questionnaires are composed so that 10 phases are answered for 6 textiles on one page. It is difficult to evaluate human sensibility objectively due to its ambiguity and qualitative nature. In addition, it is difficult to grasp the image of abstract users because it can be expressed only through the restricted adjectives. In this paper, we tried to grasp the general sensibility of users through a method that extracts sensibility adjectives for textiles. The algorithm below derives the representative sensibility adjectives based on the data evaluated for each textile by the users. The reason for using 5 adjectives is that one adjective is not sufficient to express the sensibility of a textile. The first step is classifying the sensibility data for each textile. The second step is calculating the means and standard deviations of each sensibility adjective. The third step is sorting the sensibility adjectives by their means (selecting the top 5 adjectives in descending order). The last step is establishing the database of sensibility adjectives on textiles. The results of the survey are available at http://hci.inha.ac.kr/SulTDS/result.htm.
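The four steps above can be sketched as follows; the data layout (ratings grouped per textile and per adjective) is an assumption for illustration, not the system's actual code.

```python
# Illustrative sketch of the four-step selection of representative sensibility adjectives.
from statistics import mean, stdev

def representative_adjectives(ratings_by_textile, top_k=5):
    """ratings_by_textile: {textile_id: {adjective: [user ratings in -2..+2]}}.
    Returns {textile_id: [(adjective, mean, std)] for the top_k adjectives}."""
    database = {}
    for textile, by_adjective in ratings_by_textile.items():      # step 1: classify per textile
        stats = []
        for adjective, values in by_adjective.items():            # step 2: mean and std deviation
            m = mean(values)
            s = stdev(values) if len(values) > 1 else 0.0
            stats.append((adjective, m, s))
        stats.sort(key=lambda t: t[1], reverse=True)              # step 3: sort by mean
        database[textile] = stats[:top_k]                         # step 4: keep the top 5
    return database
```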
2.2 Textile Based Collaborative Filtering

Collaborative filtering systems recommend objects to a target user based on the opinions of other users, by considering how much the target user and the other users have agreed on other objects in the past [6]. This allows the technique to be used with any type of object and thus to build a large variety of services, since collaborative
filtering systems consider only human judgments on the value of objects. These judgments are usually expressed as numerical ratings, revealing the user's preference for objects. In the personalized design recommender system, we apply a commonly used algorithm, proposed in the GroupLens project [4,8] and also applied in Ringo [2,3], which is based on vector correlation using the Pearson correlation coefficient. Usually the task of a collaborative filtering technique is to predict the rating of a particular user u for a textile i. The system compares user u's ratings with the ratings of all other users who have rated the considered textile i. Then a weighted average of the other users' ratings is used as a prediction. If I_u is the set of textiles that user u has rated, then we can define the mean rating of user u by Equation (1).

\bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{u,i}    (1)
Collaborative filtering algorithms predict the rating based on the ratings of similar users. When the Pearson correlation coefficient is used, the similarity between user u and another user a is determined from the correlation of their rating vectors by Equation (2).

w(u,a) = \frac{\sum_{i \in I_u \cap I_a} (r_{u,i} - \bar{r}_u)(r_{a,i} - \bar{r}_a)}{\sqrt{\sum_{i \in I_u \cap I_a} (r_{u,i} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_u \cap I_a} (r_{a,i} - \bar{r}_a)^2}}    (2)

It can be noted that w ∈ [-1, +1]. The value of w measures the similarity between the two users' rating vectors: a high absolute value signifies high similarity, a low absolute value dissimilarity. The general prediction formula is based on the assumption that the prediction is a weighted average of the other users' ratings. The weights refer to the amount of similarity between user u and the other users, as in Equation (3). The factor k normalizes the weights.

p^{collab}(u,i) = \bar{r}_u + k \sum_{a \in U_i} w(u,a)\,(r_{a,i} - \bar{r}_a), \qquad k = 1 / \sum_{a \in U_i} w(u,a)    (3)
Sometimes the correlation coefficient between two users is undefined because they have not rated any common objects (I_u ∩ I_a = ∅). In such cases the correlation coefficient is estimated by a default voting (w_default = 2), which is the measured mean of typically occurring correlation coefficients [2,5,6].
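A minimal sketch of this memory-based predictor, assuming ratings stored as nested dictionaries, is given below. Using absolute values for the normalizing factor k is a common stabilizing choice and an assumption here; everything else follows Equations (1)-(3) and the default-voting rule.

```python
# Sketch of the collaborative predictor of Equations (1)-(3).
# Ratings are stored as {user: {textile: rating}}; all names are illustrative.
from math import sqrt

def mean_rating(ratings, u):
    r = ratings[u]
    return sum(r.values()) / len(r)                       # Equation (1)

def pearson(ratings, u, a, w_default=2.0):
    common = set(ratings[u]) & set(ratings[a])
    if not common:
        return w_default                                  # default voting when I_u ∩ I_a = Ø
    ru, ra = mean_rating(ratings, u), mean_rating(ratings, a)
    num = sum((ratings[u][i] - ru) * (ratings[a][i] - ra) for i in common)
    den = sqrt(sum((ratings[u][i] - ru) ** 2 for i in common)) * \
          sqrt(sum((ratings[a][i] - ra) ** 2 for i in common))
    return num / den if den else 0.0                      # Equation (2)

def predict_collab(ratings, u, i):
    others = [a for a in ratings if a != u and i in ratings[a]]   # the set U_i
    if not others:
        return mean_rating(ratings, u)
    weights = {a: pearson(ratings, u, a) for a in others}
    k = 1.0 / (sum(abs(w) for w in weights.values()) or 1.0)      # normalizing factor
    deviation = sum(weights[a] * (ratings[a][i] - mean_rating(ratings, a)) for a in others)
    return mean_rating(ratings, u) + k * deviation                # Equation (3)
```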
3 Content-Based Filtering

It is reasonable to expect that textiles with similar content will be almost equally interesting to users. The problem is that defining textile content and textile similarity is still open. Ongoing research in multimedia indexing is focusing on two directions. First, each textile is described by a textual caption, and captions are compared using techniques derived from document retrieval. Second, analysis and recognition techniques are applied to the textile pixels to automatically extract features that are compared using some distance measure in the feature space. We focus on the latter approach, because it can be entirely automated. In our prototype, we have currently implemented two feature extraction components, derived from work on color histograms and textile design coefficients.
3.1 Color Histograms and Textile Design Coefficients
The original textiles are available in RGB format, where each pixel is defined by the values (0-255) of the three components red, blue, and green. We project these values into the HSV (Hue, Saturation, Value) space, which models the human perception of colors more accurately. The HSV coefficients are quantized to yield 166 different colors. For each textile, the histogram of these 166 colors is computed (the proportion of pixels with a given quantized color). To compare two textile designs, we compute the L1 distance between their color histograms with Equation (4), where h_i(j) represents the percentage of pixels of textile i with color j.

d_{color}(k,l) = L_1(h_k, h_l) = \sum_j |h_k(j) - h_l(j)|, \qquad d_{color} \in [0, 2]    (4)
While color histograms do not take into account the arrangement of pixels, textile design coefficients can be computed to characterize local properties of the textile. We use a wavelet decomposition based on the two-dimensional Haar transform, by which a number of sub-textiles corresponding to the frequency decomposition are generated. These sub-textiles are quantized to binary values, so that each pixel of the original textile is associated with a binary vector of length 9. The histogram of these 9-bit vectors (of length 512 = 2^9) is the feature vector associated with the analysis of the textile. As previously for the color distance, the L1 distance of Equation (4) is used to measure the distance between textiles. In order to determine the similarity between textile designs, all the textiles in the database are decomposed into sub-textiles using the wavelet decomposition. From the decomposition a feature histogram is derived, which can then be compared by means of a vector metric. We use a linear estimator for the content-based prediction, given in Equation (5), where p^{color}(u,i) represents the prediction of textile i for user u.

p^{color}(u,i) = \sum_j \lambda_j \, \frac{\sum_{a \in C_j(i)} r_{u,a}}{|C_j(i)|}    (5)
If a prediction is to be made for a user u and a target textile i, all the textiles previously rated by user u are grouped into distance classes C_j(i) ⊆ I_u according to their color-based distance d_{color} to the target textile i. Each class is associated with a weight λ_j. The prediction is then the weighted sum of the mean ratings of each class. The weights λ_j are estimated through linear regression using a separate subset of the ratings.
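The following sketch illustrates Equation (4) and the class-based estimator of Equation (5); the histogram representation, the class boundaries, and the regression weights λ_j are illustrative placeholders rather than the values used in the paper.

```python
# Sketch of the L1 histogram distance (Equation (4)) and the distance-class
# estimator of Equation (5). Boundaries and lambda weights are placeholders.

def d_color(h_k, h_l):
    """L1 distance between two normalized color histograms; lies in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(h_k, h_l))

def predict_content(target_hist, rated, boundaries, lambdas):
    """rated: list of (histogram, rating) pairs for textiles the user has rated.
    boundaries: upper bounds of the distance classes C_j; lambdas: their weights."""
    classes = [[] for _ in boundaries]
    for hist, rating in rated:
        d = d_color(target_hist, hist)
        for j, upper in enumerate(boundaries):       # put the rating into its class C_j(i)
            if d <= upper:
                classes[j].append(rating)
                break
    # weighted sum of the mean rating of each non-empty class
    return sum(lam * (sum(c) / len(c)) for lam, c in zip(lambdas, classes) if c)
```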
3.2 Correlation between Content-Based Distance and Ratings Assigned by Users

Using the individual ratings that users assigned to textiles, and the previously described content-based distances between textiles, we measured the correlation between textile distance and the difference between the ratings which the same user assigned to the textiles. The results for color histograms and textile coefficients are plotted in Figure 1. For each user and for each textile that the user rated, all occurring distances between the textiles were collected together with the corresponding absolute differences of ratings. The distances were then sorted and grouped. The mean distances of each group
determine the value for the x-axis (d_color(i,j), d_textile(i,j)). For each group the mean absolute rating difference determines the y-coordinate (|r_{u,i} - r_{u,j}|).
Fig. 1. Correlation of distance between color-histograms of textiles and rating differences
These measurements suggest that textiles which are close in color or in textile design receive, in general, similar ratings from the same users. Later, we describe how we derive artificial ratings that exploit this relationship, which can then be used to improve collaborative filtering.
4 Combining Collaborative Filtering and Content-Based Filtering in Personalized Design Recommender System

In cases where collaborative filtering is limited by an insufficient number of users and ratings, a combination of collaborative filtering and content-based filtering should lead to better filtering performance. Besides the improvement of performance in the case of sparsity, a system which uses a combined approach can also recommend items which have not yet received any ratings (e.g., new items), which is not possible for a system relying only on collaborative filtering. In the following we present the combination of collaborative filtering and content-based filtering.
4.1 Deriving Artificial Users from Textile Design Metrics
We extend the database used by the collaborative filtering technique by inserting artificial ratings which are coherent with the content-based distances. For each distance metric described in Section 3 and for each real user u, corresponding artificial users u_color and u_textile are derived. The artificial users are assigned the same ratings as the original user u, so that if r_{u,i} is defined, then r_{u_color,i} = r_{u_textile,i} = r_{u,i}. Additionally, artificial ratings are derived for some textiles which the original user u has not rated. The artificial ratings are content-based predictions for that particular
user. That means that some unrated items are assigned a predicted rating, based on the similarity between the rated items and the item whose score should be predicted. In order to perform a content-based prediction, we define a restricted neighborhood of a textile i within the user profile of a user u, which contains the textiles rated by user u whose distance is below a threshold T. These neighboring textiles are then used to predict a score for the artificial ratings. The prediction formula for color is shown in Equation (6).

p^{color}(u,i) = \mathrm{mean}_{j \in \{j \in I_u \,:\, d_{color}(i,j) \le T_{color}\}} \,(r_{u,j})    (6)

In this perspective, the database is extended for color as in Equation (7).

r_{u_{color},i} = \begin{cases} r_{u,i} & \text{if } r_{u,i} \text{ is defined} \\ p^{color}(u,i) & \text{if } \{j \in I_u \mid d_{color}(i,j) \le T_{color}\} \text{ is not empty} \\ \text{undefined} & \text{else} \end{cases}    (7)
The extended database is then used with the collaborative filtering algorithm described earlier. By extending existing users, the possibility of correlation with the artificial users is increased. In fact, a user u correlates perfectly with his counterparts u_color and u_textile, which causes the content-based prediction to be a strong part of the collaborative prediction for user u and, transitively, also for all other users according to their similarity to user u.
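A small sketch of Equations (6) and (7), assuming dense color histograms and a caller-supplied distance function and threshold T, could look as follows.

```python
# Sketch of Equations (6)-(7): deriving an artificial color user u_color whose
# profile copies u's ratings and fills some unrated textiles with content-based
# predictions. The distance function and threshold T are supplied by the caller.

def extend_user(ratings_u, histograms, T, dist):
    """ratings_u: {textile: rating}; histograms: {textile: color histogram}."""
    extended = dict(ratings_u)                       # r_{u_color,i} = r_{u,i} where defined
    for i in histograms:
        if i in ratings_u:
            continue
        neighbours = [ratings_u[j] for j in ratings_u
                      if dist(histograms[i], histograms[j]) <= T]
        if neighbours:                               # Equation (6): mean over close rated textiles
            extended[i] = sum(neighbours) / len(neighbours)
        # otherwise the rating stays undefined (third case of Equation (7))
    return extended
```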
4.2 Extensive Prediction of User Preference through Combining Collaborative Filtering and Content-Based Filtering

For the following considerations we assume an existing collaborative filtering system as the design recommender system. The combination with content-based filtering is therefore rather an extension of collaborative filtering. As research leads to additional content-analysis tools, the extension approach should not limit the number of content-based extensions. Therefore we combine the content-based filtering predictors with the collaborative filtering predictor p^{collab}(u,i), as described in Section 2, linearly using Equation (8).

p^{comb}(u,i) = \mu_{collab}\, p^{collab}(u,i) + \mu_{color}\, p^{color}(u,i) + \mu_{textile}\, p^{textile}(u,i), \qquad \textstyle\sum \mu = 1    (8)

The weights \mu_{\{collab, color, textile\}} are estimated by linear regression on a set-aside subset of the ratings, so that the weights are adapted to the relevance of a predictor, e.g., higher for color-based than for textile-based prediction. The estimation of the weights should be repeated as the rating database grows, in order to take into account the increasing precision of the collaborative predictor. Both presented combination approaches use parameters to control the mix of collaborative filtering and content-based filtering. For the linear combination approach the parameters (\mu_{collab}, \mu_{color}, \mu_{textile}) determine the weight of each predictor in the sum. The algorithm is designed so that the parameters are adjusted automatically. Through our experiments we found the following values for these parameters: \mu_{collab} = 0.54, \mu_{color} = 0.43, \mu_{textile} = 0.03. This shows that the most important component in the linear combination is collaborative filtering, closely followed by color-based prediction. Textile-based prediction has only negligible importance.
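The linear combination of Equation (8) can be sketched as below; fitting the weights by ordinary least squares and renormalizing them to sum to 1 is an assumption about the regression step, which the paper does not spell out.

```python
# Sketch of Equation (8). Mixing weights are fitted on a set-aside subset of
# known ratings (numpy is assumed; the normalization step is an assumption).
import numpy as np

def fit_weights(p_collab, p_color, p_textile, true_ratings):
    """Each argument is an array of predictions/ratings for the same held-out items."""
    X = np.column_stack([p_collab, p_color, p_textile])
    w, *_ = np.linalg.lstsq(X, np.asarray(true_ratings), rcond=None)
    return w / w.sum()                      # enforce that the weights sum to 1

def predict_combined(weights, p_collab, p_color, p_textile):
    mu_collab, mu_color, mu_textile = weights
    return mu_collab * p_collab + mu_color * p_color + mu_textile * p_textile
```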
5 Evaluation

In order to measure the performance of the prediction more robustly, the division into test set and training set was repeated 20 times. After each run the prediction was evaluated using the mean absolute error (MAE) and the correlation between the test set and the predicted set. These measurements were then averaged [2,5,6,7].
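The evaluation protocol can be sketched as follows; the split ratio and the data layout are assumptions, since the paper only states that the split was repeated 20 times and the measurements averaged.

```python
# Sketch of the protocol: 20 random test/training splits, the mean absolute
# error (MAE) averaged over the runs, and its standard deviation (DEV).
import random
from statistics import mean, stdev

def evaluate(pairs, predictor, runs=20, test_fraction=0.2, seed=0):
    """pairs: list of (user, textile, true_rating); predictor(train, user, textile) -> rating."""
    rng = random.Random(seed)
    maes = []
    for _ in range(runs):
        data = pairs[:]
        rng.shuffle(data)
        cut = int(len(data) * test_fraction)
        test, train = data[:cut], data[cut:]
        errors = [abs(predictor(train, u, i) - r) for u, i, r in test]
        maes.append(mean(errors))
    return mean(maes), stdev(maes)          # (MAE, DEV)
```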
Fig. 2. Variation of the color and textile thresholds T_color and T_textile: (a) mean absolute error (MAE); (b) standard deviation of MAE (DEV)
The graph in Figure 2(a) illustrates how the prediction precision in terms of MAE changes depending on the choice of the parameters T_color and T_textile. A lower MAE can be observed for both parameters T_color and T_textile greater than zero (a threshold equal to zero corresponds to no artificial ratings), indicating that each criterion can lead to extensions of the rating database which improve the results of the collaborative filtering algorithm. However, the impact of color seems to be more significant than that of textile. Figure 2(b) depicts the standard deviation of the MAE of Figure 2(a). It is notable that with increasing T_color the standard deviation of the MAE decreases, indicating that the predictions become more robust, i.e., high absolute errors are more likely avoided; for textile, similar measurements were made. Table 1 lists the measured precision for the previously discussed predictors. Here, it is interesting to note the improvements of the combined approach compared to the collaborative, content-based, and sensibility adjective approaches. An improvement in mean absolute prediction error of the combined prediction over the collaborative prediction can be identified. Further, an improvement in the standard deviation of the absolute error (DEV) can be observed, indicating that the predictions are more robust when using the combination, i.e., large prediction errors are likely avoided. The increase in the mean correlation (COR) indicates that the overall ordering of the textiles in the test set is better respected by the prediction when the combination is used instead of the collaborative prediction by itself.
Table 1. Prediction precision of collaborative, content-based, sensibility adjective, and combined predictor
Prediction method        MAE    DEV    COR
Collaborative filtering  0.704  1.397  0.353
Content-based filtering  0.735  1.405  0.477
Sensibility adjective    0.709  1.301  0.395
Combined predictor       0.681  1.159  0.383
6 Conclusions

It is important for product sales strategy to investigate the customer's sensibility and degree of preference in an environment where the process of material development has shifted its focus to the customer. In this paper we identified collaborative filtering and content-based filtering as independent technologies for information filtering. However, information filtering is a hard problem and cannot be addressed by one filtering technology alone. Due to the limitations of both collaborative filtering and content-based filtering, it is useful to combine these independent approaches to achieve better filtering results and therefore a better design recommender system. In the future, we plan to evolve the extension algorithm to achieve better performance.
References
1. M. Balabanovic and Y. Shoham, "Fab: Content-based, Collaborative Recommendation," Communications of the ACM, 40(3), pp. 66–72, 1997.
2. J. S. Breese, D. Heckerman, C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," Proc. of the 14th Conference on Uncertainty in AI, 1998.
3. J. Herlocker, et al., "An Algorithmic Framework for Performing Collaborative Filtering," Proc. of ACM SIGIR'99, 1999.
4. Badrul M. Sarwar, et al., "Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system," Proc. of ACM CSCW'98, 1998.
5. K. Y. Jung, J. K. Ryu, and J. H. Lee, "A New Collaborative Filtering Method using Representative Attributes-Neighborhood and Bayesian Estimated Value," Proc. of ICAI'02, USA, pp. 709–715, 2002.
6. K. Y. Jung, J. H. Lee, "Prediction of User Preference in Recommendation System using Association User Clustering and Bayesian Estimated Value," LNAI 2557, 15th Australian Joint Conference on Artificial Intelligence, pp. 284–296, 2002.
7. M. Pazzani, "A Framework for Collaborative, Content-Based and Demographic Filtering," AI Review, pp. 393–408, 1999.
8. P. Resnick, et al., "GroupLens: An Open Architecture for Collaborative Filtering of Netnews," Proc. of ACM CSCW'94, pp. 175–186, 1994.
Discovery of Relationships between Interests from Bulletin Board System by Dissimilarity Reconstruction

Kou Zhongbao, Ban Tao, and Zhang Changshui

State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100084, P.R. China
{kzb98, bantao00}@mails.tsinghua.edu.cn, [email protected]
Abstract. In this paper, we propose a new method to analyze people's interests by simulating the gradually changing and transferring mechanism among interests. Different boards in a Bulletin Board System (BBS), which focus on various topics, serve as representations of people's interests. A technique named Dissimilarity Reconstruction (DSR) is put forward to discover relationships between the interests. DSR tries to grasp the intrinsic structure of the data set by the following steps. First, Vector Space Model (VSM) representations of the interests are obtained by taking the users in the BBS as terms and the numbers of messages they post as weights. Second, dissimilarities are calculated from the interest vectors. Finally, the nonlinear technique Isomap is employed to map the interests into the intrinsic dimensional space of the data set, where the Euclidean distance between two interests well represents their relationship.
1 Introduction
A Bulletin Board System (BBS) is an electronic message center where one can review messages left by others and post his/her own messages. In a BBS, people with common interests may communicate with each other and discuss their topics of concern. A BBS can thus be deemed a social network whose actors are users and whose connections are set up by users' common interests [1]. Bulletin boards in a BBS often serve specific interest groups, and messages posted on a board often concern the theme of the board. So we can deem each board a specific kind of interest. In this paper, we put forward a technique named Dissimilarity Reconstruction (DSR) to represent the interests, i.e., the boards, in a low-dimensional space where the Euclidean distance between two interests well represents their relationship. Discovery of relationships between interests may not only be of great significance to the area of sociology, but may also be applied in other fields such as collaborative filtering and recommender systems [2,3]. The remainder of this paper is organized as follows. Section 2 describes the proposed method and the processing of the BBS data. Section 3 demonstrates the results. Concluding remarks are given in Section 4.
2 Dissimilarity Reconstruction

2.1 VSM Representations of Interests
First, we represent interests as vectors in a multi-dimensional space using the Vector Space Model (VSM) [4]. VSM is a conventional model for representing documents and queries in the information retrieval field. One's posting activity on a specific board in the BBS may well reflect his/her interest in the topic: the more messages one posts, the greater one's interest is. Thus we can build the VSM representation of the interests by taking the users in the BBS as terms and the numbers of messages they post on different boards as weights. We downloaded 1.5 million messages from SMTH BBS, the biggest BBS of P.R. China. The messages were posted by 51793 users from Oct. 28, 2001 to Dec. 29, 2001 on 185 boards. Each board has a name denoting its topic, and all boards are grouped into 7 blocks. The 7 blocks in SMTH BBS are computer science, amusement, culture, social information, subjects, sports, and sentiment. Boards in the same block often focus on various aspects of the theme of the block; for example, the boards "Basketball" and "Football" are included in the block concerning sports. After removing users with too few messages (their behavior cannot demonstrate their preferences properly; here, users with fewer than 100 messages are omitted) and those who have posted on only one specific board (they do not help to combine different interests), 2922 users are preserved. Thus, each interest is denoted by a 2922-dimensional vector.
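A small sketch of this construction, assuming the raw data is available as a list of (user, board) records, one per posted message, is given below; the filtering thresholds follow the description above.

```python
# Sketch of building the VSM interest vectors: boards are the "documents", users
# are the terms, and the number of messages a user posted on a board is the weight.
# The message log format (one (user, board) record per message) is an assumption.
from collections import Counter, defaultdict

def build_interest_vectors(messages, min_messages=100):
    """messages: iterable of (user, board). Returns {board: {user: count}} after
    dropping users with too few messages or activity on a single board only."""
    per_user = Counter(u for u, _ in messages)
    boards_of = defaultdict(set)
    counts = defaultdict(Counter)
    for user, board in messages:
        boards_of[user].add(board)
        counts[board][user] += 1
    keep = {u for u in per_user
            if per_user[u] >= min_messages and len(boards_of[u]) > 1}
    return {board: {u: c for u, c in users.items() if u in keep}
            for board, users in counts.items()}
```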
2.2 Dissimilarities between Interests
In order to acquire the dissimilarities between interest vectors, we first give the definition of "similarity" between two boards. When a user posts messages on a number of boards, he/she becomes a common poster of those boards. Two boards are defined to be "similar" when they have many common posters and the numbers of messages from each poster are comparable. Similarity between two boards here is not content-based, but is defined on users' ratings. In this way, two boards with a great similarity may have few common topics. For example, "Family-life" in SMTH BBS discusses affairs in family life such as everyday matters, family relationships, happiness and depression, etc., while "Shopping" concerns information and experiences on shopping. These two boards do not have much relationship in content; however, as females often favor both of the topics, they become "similar" due to the large number of common users. Since boards with related topics, for example, different boards in the same block, often attract lots of common posters, they share great "similarities". The similarity sim(k1, k2) between two interests can be computed as the cosine of the angle between their associated vectors [4]:
sim(k_1, k_2) = \frac{\bar{v}_{k_1} \cdot \bar{v}_{k_2}}{\sqrt{|\bar{v}_{k_1}|^2 \cdot |\bar{v}_{k_2}|^2}} = \frac{\sum_{i=1}^{m} v_{k_1,i}\, v_{k_2,i}}{\sqrt{\left(\sum_{i=1}^{m} v_{k_1,i}^2\right) \cdot \left(\sum_{i=1}^{m} v_{k_2,i}^2\right)}}    (1)
where k_1, k_2 denote two arbitrary interests, \bar{v}_{k_1}, \bar{v}_{k_2} are their associated vectors, respectively, and m is the dimensionality of the vectors. From sim(k_1, k_2), the dissimilarity D(k_1, k_2) between two interests can be computed as

D(k_1, k_2) = 1 - sim(k_1, k_2)    (2)

Obviously, D(k_1, k_2) ∈ [0, 1]. The greater D(k_1, k_2) is, the more different the interests k_1 and k_2 are. This accords with the intuitive concept of distance. Compared with similarity, dissimilarity is more convenient in the later computation and statement, and thus it is deemed the measurement of distance in the later discussion.
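Equations (1) and (2) on sparse interest vectors can be sketched as follows (the sparse dictionary representation is an assumption).

```python
# Sketch of Equations (1)-(2): cosine similarity between two interest vectors
# stored as sparse {user: weight} maps, and the derived dissimilarity.
from math import sqrt

def similarity(v1, v2):
    dot = sum(w * v2.get(u, 0.0) for u, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def dissimilarity(v1, v2):
    return 1.0 - similarity(v1, v2)      # D(k1, k2) in [0, 1] for non-negative weights
```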
2.3 Dimensionality Reduction with Isomap
Since the VSM representations of interests obtained above are high-dimensional and sparse, they are susceptible to noise, and the underlying semantic structure is difficult to capture. To solve the problem of retrieving rational relationships between these representations, dimensionality reduction techniques are often involved. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are popular techniques for dimensionality reduction based on matrix decomposition [5]. Beyond these methods, we borrow the idea of a nonlinear dimensionality reduction technique, Isomap [6], to achieve the goal of dimensionality reduction. Isomap can retrieve the structure of data and find the intrinsic relationships among objects hidden in complex natural observations. It is usually more efficient than linear techniques in dimensionality reduction, i.e., the intrinsic dimensionalities it discovers are often much lower than those found by its linear alternatives. Fig. 1a gives an example of an S-like manifold residing in 3-dimensional space. The manifold has an intrinsic dimensionality of 2. Isomap first constructs the neighborhood graph based on the Euclidean distance, and then computes the geodesic distances between pairs of samples in the data set. Finally, based on the geodesic distances, Isomap scatters the samples into a reconstructed low-dimensional space by the Multidimensional Scaling (MDS) [7] technique. Fig. 1b shows the reconstructed two-dimensional embedding recovered by Isomap, which well preserves the geodesic distances. In Ref. [6], experiments on real-world data sets were carried out using Isomap. The rational low-dimensional representations retrieved by the method showed its ability to discover the essential geometric structure. Why should we adopt Isomap for dimensionality reduction of interest vectors? What merits will Isomap bring in building the relationships of interests? To answer these questions, we should take the intrinsic mechanism within
Fig. 1. Using Isomap to capture the underlying global geometry of a data set. (a) An S-like manifold with intrinsic dimensionality of 2 in 3-dimensional space. Bold segments show the geodesic path between two points (marked with squares). Gray lines between samples show the neighborhood graph built by connecting each point with its k (here k=7) nearest neighbors. (b) 2-dimensional embedding of the manifold in (a) found by Isomap. Bold segments show the geodesic path corresponding to that in (a), while the length of the bold line is the Euclidean distance between the two points.
the transferring process of interests into account. Take the following case as an example. When browsing a web page on the World-Wide-Web, we often follow the hyperlinks in the page to a new one. After shuttling dozens of times, the page we finally arrive at may have nothing to do with the one we started from. Interests may change in a similar way. We often become interested in new things related to our original scope of interests, while we rarely care for unacquainted things. Thus we enlarge our scope of interests in a gradually transferring manner. This gradually changing and transferring mechanism among interests may well reflect the underlying structure of the VSM representations of the interests. The neighborhood-graph construction and geodesic-distance calculation steps in Isomap can serve as a simulation of this mechanism. DSR also includes these two steps, which enable DSR to find the intrinsic relationships among interests. Different from Isomap, these steps in DSR are not based on Euclidean distances in the high-dimensional space, but on the dissimilarities between vectors. From Section 2.2, the dissimilarity array D_{n×n} can be obtained, where D(k_1, k_2) is calculated by Eq. (1) and Eq. (2). The next step is to map the interests into a low-dimensional space following the main procedure of Isomap:
1. Determine which interests are neighbors based on the dissimilarity array, and build a graph whose vertices are the interests and whose edges are constructed by connecting each interest to all of its k nearest neighbors;
2. Calculate the geodesic distances between all pairs of interests by summing up the dissimilarities between neighboring interests along the shortest path in the graph;
3. Construct an embedding of the data in a low-dimensional space by classical MDS [7] according to the geodesic distance array.
The intrinsic dimensionality of the data set is determined from the Residual Variance (RV)-dimensionality curve [6] of DSR. A new configuration of the interests is built in the low-dimensional space. Thus we obtain the mapping from the high-dimensional space to the intrinsic dimensional space of the data set. A small code sketch of these three steps is given below.
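The sketch follows the three steps on a precomputed dissimilarity matrix D, using numpy/scipy shortest paths and classical MDS; the value of k, the target dimensionality, and the assumption that the neighborhood graph is connected are illustrative choices, not the authors' implementation.

```python
# Sketch of the three DSR steps on an n x n dissimilarity matrix D:
# k-nearest-neighbour graph, geodesic distances by shortest paths, classical MDS.
import numpy as np
from scipy.sparse.csgraph import shortest_path

def dsr_embed(D, k=7, dim=11):
    n = D.shape[0]
    # Step 1: keep only the k smallest dissimilarities per interest (neighbourhood graph)
    graph = np.full((n, n), np.inf)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:          # skip the interest itself
            graph[i, j] = graph[j, i] = D[i, j]
    # Step 2: geodesic distances = shortest paths through the graph
    G = shortest_path(graph, directed=False)
    # Step 3: classical MDS on the squared geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ (G ** 2) @ J
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:dim]           # keep the dim largest eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))
```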
3 Results
Fig. 2a shows the RV-dimensionality curve of DSR on the SMTH BBS data set. There is an "elbow" on the curve, i.e., a point at which the curve ceases to decrease significantly with added dimensions. The corresponding dimensionality can be deemed the intrinsic dimensionality of the data set. Here the intrinsic dimensionality is estimated as 11, and each interest is represented by an 11-dimensional vector in that space. Compared with the original 2922-dimensional VSM representations, dimensionality reduction through DSR is efficient. To test the rationality of the DSR representation, the Euclidean distances between pairs of interests in the 11-dimensional space were calculated. To make this more perceptible, the average distances between interests in the same block and those across blocks are shown in Fig. 2b. In Fig. 2b, interests are grouped by blocks, and the gray level of a square denotes the average distance between the two corresponding blocks. The darker a square is, the closer the interests in the two corresponding blocks are. Apparently, the squares along the diagonal are usually darker than those away from the diagonal. As we have commented before, boards in the same block are more likely to have common posters; the vicinity of these interests in the intrinsic space reconstructed by DSR is therefore reasonable. Table 1 shows the ten pairs of closest interests and the ten pairs of farthest interests. The Euclidean distances between them in the 11-dimensional space reconstructed by DSR are also listed; all distances are normalized to 1. Obviously, the pairs of interests on the left of Table 1 share some common features. For example, arts and literature are two of the most important elements of culture, while movies and pop music are two of the most popular entertainments. On the other hand, the pairs of interests on the right of Table 1 have little to do with each other. For example, we can hardly put chess, a mental game, together with things like mathematical tools and programming tools such as C++ Builder. We can see that distances in the reconstructed space of DSR do give good representations of the dissimilarities between interests. To explore the relationships among interests, the 2-dimensional projection of the embedding found by DSR is shown in Fig. 3. Here, the distance between two interests in the 2-dimensional embedding space is an approximation of the corresponding distance in the intrinsic dimensional space because the other dimensions are omitted. Despite the error caused by the projection, some interesting features
Fig. 2. DSR on the SMTH BBS data set. (a) RV-dimensionality curve of DSR. The arrow shows the intrinsic dimensionality found by DSR. (b) Average distances between interests in different blocks in the intrinsic space. Interests are grouped and ordered by blocks. Distances between interests are calculated in the 11-dimensional space reconstructed by DSR and have been normalized to 256 gray levels for visualization. The darker a square is, the closer the interests in the two corresponding blocks are.

Table 1. Ten pairs of closest interests and ten pairs of farthest interests in the 11-dimensional space reconstructed by DSR

Distance  Interest 1    Interest 2     |  Distance  Interest 1   Interest 2
0.093     Literature    Arts           |  0.927     MathTools    Banquet
0.126     Mud Builder   Mud            |  0.937     Chess        VR 3D
0.141     Shopping      Food           |  0.938     Wisdom       AI
0.146     Entrepreneur  Business       |  0.942     Chess        MathTools
0.155     Beauty        Shopping       |  0.945     Graduation   C++Builder
0.174     Embedded      Circuit        |  0.956     Chess        ClassicMusic
0.183     PopMusic      Movie          |  0.965     Chess        NumComp
0.184     Hardware      CompMarket     |  0.970     Signal       Banquet
0.185     OceanScience  Aero           |  0.980     Chess        C++Builder
0.185     Linux         FreeBSD        |  1         Astronomy    Banquet
emerge. The paths in Fig. 3 are the shortest paths used when calculating the geodesic distances in the graph. (Please refer to the second step of Isomap introduced in Section 2.3.) The paths give examples of how similarity transfers among the interests concerned. The first path goes from "AI" to "Unix" through "Programming", "FreeDevelop", and "FreeBSD". A probable explanation for the emergence of this path is: researchers in artificial intelligence require knowledge of programming; many programmers are keen on free software development; FreeBSD is free software based on the General Public License (GPL); and FreeBSD is an
operating system developed from Unix. The second path starts from "Joke", passes through "Emprise", "Comic", "Game", and "SportsGame", and finally arrives at "WorldSoccer". The explanation can be: jokes and emprise novels are two widely accepted entertainments; both emprise novels and cartoons are about fantastic plots; cartoons and computer games are two attractive arts in modern society; sports games are a kind of computer game; and football games such as Fifa2002 are the most welcomed sports games. Thus, DSR discovers the internal relationships among interests and captures the structure of the data set.
Fig. 3. 2-dimensional projection of the embedding acquired by DSR and the gradual transferences among interests. Each point or square in the figure stands for an interest. Segments connecting the squares show the shortest paths from “AI” to “Unix” and that from “Joke” to “WorldSoccer” in the neighborhood graph.
4 Conclusion
In this paper, we propose a new method to analyze people's interests. The boards in a BBS are deemed representations of people's interests, and a technique named DSR is put forward to represent the interests in the intrinsic dimensional space of the data set. DSR is carried out in the following steps: first, obtain the VSM representations of the interests; second, compute the dissimilarities between pairs of interests and deem them measurements of distance; finally, reconstruct the intrinsic dimensional space and obtain the configuration of the interests by Isomap. By simulating the gradually changing and transferring mechanism among interests, DSR grasps the structure of the data set. With the DSR configuration, relationships between interests are well represented by Euclidean distances in the intrinsic dimensional space, and some valuable features appear in the intrinsic dimensional embedding of the data set.
References
1. Kou Z.B., Zhang C.S.: Reply Networks on a Bulletin Board System. Phys. Rev. E 67 (2003) 036117
2. Balabanovic M.: Learning to Surf: Multiagent Systems for Adaptive Web Page Recommendation. Ph.D. dissertation, Stanford University, Department of Computer Science (1998)
3. Billsus D., Pazzani M.J.: Learning collaborative information filters. Proc. Fifteenth International Conference on Machine Learning, Madison, WI (1998) 46–53
4. Salton G., McGill M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
5. Duda R.O. et al.: Pattern Classification. John Wiley & Sons, New York (2001)
6. Tenenbaum J.B. et al.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (2000) 2319–2323
7. Steyvers M.: Multidimensional Scaling. In: Encyclopedia of Cognitive Science. Macmillan, London (2002)
A Genetic Algorithm for Inferring Pseudoknotted RNA Structures from Sequence Data*

Dongkyu Lee and Kyungsook Han**

School of Computer Science and Engineering, Inha University, Inchon 402-751, Korea
[email protected], [email protected]
Abstract. Pseudoknotted RNA structures are much more difficult to predict than non-pseudoknotted RNA structures, both from the computational viewpoint and from the practical viewpoint. This is in part due to the unavailability of an exact energy model for pseudoknots, the structural complexity of pseudoknots, and the high time complexity of prediction algorithms. Therefore, existing approaches to predicting pseudoknotted RNA structures mostly focus on so-called H-type pseudoknots of small RNAs. We have developed a heuristic energy model and a genetic algorithm for predicting RNA structures with various types of pseudoknots, including H-type pseudoknots. This paper analyzes the predictions by the genetic algorithm and compares them to those by a dynamic programming algorithm.
1 Introduction

An RNA pseudoknot is a tertiary structural element, which forms when nucleotides in a loop base-pair with nucleotides outside the loop. Pseudoknots are not only widely occurring structural motifs in all kinds of viral RNA molecules, but are also responsible for several important functions of RNA. RNA structures with pseudoknots are much more difficult to predict than RNA secondary structures because prediction of pseudoknots must consider tertiary interactions as well as secondary interactions. In our previous work on classic, H-type pseudoknots [1, 2], we showed that a genetic algorithm often predicts suboptimal structures in terms of free energy which are closer to known structures than the optimal structures predicted by a dynamic programming algorithm. There have been several attempts to predict pseudoknotted RNA structures using a genetic algorithm [3, 4, 5, 6, 7]. Most of these works focus on predicting H-type pseudoknots only and cannot be applied to RNAs with complex pseudoknots or to large RNAs. Difficulties in predicting pseudoknotted RNA structures arise from several sources. First, there currently exists no energy model available for pseudoknots of the nonclassic
* This work has been supported by the Korea Science and Engineering Foundation (KOSEF) under grant R05-2001-000-01037-0.
** To whom correspondence should be addressed.
type. According to the broad definition of pseudoknots [8], 14 types of topologically distinct pseudoknots are possible. The most commonly occurring pseudoknots are of the H-type, where H stands for hairpin loop. 76% of the known pseudoknots in PseudoBase [9] are H-type pseudoknots, and the remaining 24% are complex pseudoknots of other types [10]. Therefore, we should be able to predict these complex pseudoknots, too. Second, pseudoknotted RNA structures are much more complex than RNA secondary structures, and therefore most algorithms for predicting pseudoknotted RNA structures have time complexities too high to be practical. This paper describes an approximate energy model for RNA pseudoknots of any type and a genetic algorithm for predicting pseudoknotted structures of long RNA sequences.
2 Genetic Algorithm

We developed a steady-state genetic algorithm that uses thermodynamic free energy as the fitness function. Given an RNA sequence, the algorithm first identifies all possible stems in the sequence and constructs a stem pool as the initial population. It computes the value of the fitness function for each structure of the initial population and evolves the structures through crossover and mutation with probabilities of 0.2 and 0.001, respectively. The algorithm terminates when there is no change in the structures over the generations. A binary string is used to represent a genome. When the total number of possible stems is n, the genome is represented as a string of n bits, where the i-th bit is set to 1 if the i-th stem is included in the structure. Fig. 1 shows an example of a stem table, genome representation, and structure for the sequence of NFG_L6 RNA [9].

input sequence: CGCUCAACUCAUGGAGCGCACGACGGAUCACCAUCGAUUCGACAUGAG
index  start  end  size  energy
0      1      18   5     -16.73
1      8      48   6     -11.82
2      24     41   6     -8.70
3      3      48   4     -7.29
4      11     34   4     -8.12
5      21     36   3     -6.24
6      21     41   3     -5.76
7      8      16   4     -4.50
8      28     47   3     -5.46
9      13     40   3     -3.18
10     26     35   3     -6.12
11     32     46   3     -4.26

genome (12 bits): 1 1 1 0 0 0 0 0 0 0 0 0
Fig. 1. The genome representation of NFG_L6 RNA. When stems 0, 1, and 2 are selected from the stem table for the pseudoknot shown on the right, the binary string for the genome becomes 111000000000.
The algorithm generates 3 stem pools. First, it generates all possible stems of at least 3 base pairs identified from a covariation matrix and calculates the stacking energy for each stem. In order to consider the interaction distance of a stem, the stacking energy of a stem is modified by Equation (1) when the ratio of the stem size to the interaction distance of the stem is < 1. The stacking energy of a stem for which the ratio is ≥ 1 is not modified, since the modified energy becomes too low.

modified stacking energy = stacking energy + stacking energy × (stem size / interaction distance of the stem)    (1)
We sort the stems in increasing order of their energy values. The list of these stems becomes what we call the fully zipped stem pool. We then remove consecutive wobble pairs at both ends of a stem, since these wobble pairs are not stable enough. After removing all consecutive wobble pairs at both ends of a stem, we delete short stems consisting of 1 or 2 base pairs from the list and recalculate the stacking energy for the remaining stems. The remaining stems make up the second stem pool, which we call the partially zipped stem pool. Finally, we generate a pseudoknot stem pool by finding all possible pairs of stems that form a typical H-type pseudoknot. At this step, we consider only the number of connecting loops and the size of the pseudoknot stems. The three stem pools can be used selectively in generating initial populations. Without the pseudoknot stem pool, the genetic algorithm predicts secondary structures only. We generate structures that include every stem in the stem pools. After choosing a stem from a stem pool, we insert all other stems that can coexist with the chosen stem topologically (the detailed method of the topology test is described in Section 3). If we use the pseudoknot stem pool, we first select 2 stems from the pseudoknot stem pool and then insert all possible stems from the other stem pools. Since the topology test examines only the overlapping relation of stems, structures with various types of pseudoknots are generated.
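Equation (1) and the construction of the fully zipped stem pool can be sketched as follows; the stem record fields mirror the stem table of Fig. 1, and the exact definition of the interaction distance is an assumption.

```python
# Sketch of the stem-pool step: Equation (1) lowers the (negative) stacking
# energy of a stem in proportion to stem_size / interaction_distance when
# that ratio is below 1. Field names are illustrative.

def modified_stacking_energy(stacking_energy, stem_size, interaction_distance):
    ratio = stem_size / interaction_distance
    if ratio < 1:                                   # Equation (1)
        return stacking_energy + stacking_energy * ratio
    return stacking_energy                          # ratio >= 1: left unmodified

def fully_zipped_pool(stems):
    """stems: list of dicts with 'energy', 'size', 'distance' keys.
    Returns the stems sorted by increasing modified stacking energy."""
    for s in stems:
        s["energy"] = modified_stacking_energy(s["energy"], s["size"], s["distance"])
    return sorted(stems, key=lambda s: s["energy"])
```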
3 Topology of a Pseudoknot

To calculate the free energy of the RNA structures of various types formed during the evolution process, we developed a topology decision algorithm that uses a linked list. The binary string representation of a genome is converted to a linked list, and each node of the linked list represents a stem. The nodes are sorted with respect to the starting positions of the stems, so that a stem with a smaller starting position is examined earlier than one with a larger starting position. The topology decision algorithm is composed of 2 different topology tests: a crossing test and a nesting test. In the crossing test, it examines whether there is a stem crossed by the given stem. In the nesting test, it examines whether the given stem contains other stems within it. If the given stem contains < 2 stems, the topology of the given stem is determined easily. If the given stem contains ≥ 2 stems, the algorithm determines whether it belongs to a nested loop or a multiple loop (see Fig. 2).
Fig. 2. (A) Nested loop. (B) Multiple loop.
The topology of a pseudoknot is determined by the following algorithm (a small sketch of the two interval tests is given after this list):
1. Read the start index and end index of the first stem.
2. For the first stem in the structure, do the crossing test and the nesting test.
3. If there exists a crossing stem, do the crossing test and the nesting test for the crossing stem.
4. If there exist included stems, do the crossing test and the nesting test for the included stems.
5. If there are no crossing stems or included stems, decide the topology of the first stem, move to the next stem, and do the crossing test and the nesting test.
6. Repeat the crossing test and the nesting test until the last stem of the structure.
Pseudoknot structures generated during the evolution process of the genetic algorithm can be classified into simple pseudoknots and complex pseudoknots. A simple pseudoknot is one of the 14 basic pseudoknot types [8]. A complex pseudoknot is composed of a simple pseudoknot and a secondary structure element. Some of the complex pseudoknots are actually found in natural RNA, but others are not. Coxsackie B3 virus [11], alfalfa mosaic virus [9], and DiGIR 1 RNA [12], for example, contain the complex pseudoknots shown in Fig. 3. Our algorithm can predict the complex pseudoknots in Fig. 3. The complex pseudoknots shown in Fig. 4 are somewhat hypothetical, since they have not been identified in natural RNA, and are not considered by our algorithm.
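The two interval tests can be sketched as below, reducing each stem to the interval spanned by its outermost base pair; the handling of boundary cases is an assumption based on the description above.

```python
# Sketch of the two interval tests behind the topology decision: each stem is
# reduced to the (start, end) interval of its outermost base pair.

def classify_pair(stem_a, stem_b):
    """stem_a, stem_b: (start, end) with start < end. Returns 'crossing',
    'nested', or 'disjoint' for the stem whose start comes second."""
    (i, j), (k, l) = sorted([stem_a, stem_b])      # ensure i <= k
    if k > j:
        return "disjoint"                          # second stem lies entirely after the first
    if l <= j:
        return "nested"                            # second stem sits inside the first
    return "crossing"                              # i < k <= j < l: a pseudoknot

# Example with the stems of Fig. 1: stem 0 = (1, 18), stem 1 = (8, 48)
print(classify_pair((1, 18), (8, 48)))             # crossing -> H-type pseudoknot
print(classify_pair((3, 48), (11, 34)))            # nested
```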
Fig. 3. Complex pseudoknots that can be predicted by our algorithm. (A) Chain of H-type pseudoknots. (B) Pseudoknots with nested loops. (C) H-type pseudoknots in a multiple loop.

Fig. 4. Hypothetical complex pseudoknots. (A) Three stems crossing each other. (B) H-type pseudoknot within a pseudoknotted stem.
Fig. 5. Example of energy calculation for complex pseudoknots.
Our algorithm uses an approximate energy model of the H-type pseudoknot [13] to calculate the free energy of pseudoknots of various types. It first separates a pseudoknot element from the other structural elements and calculates the pseudoknot energy by applying the energy rule of the H-type pseudoknot, and it repeats this process for the other pseudoknot elements. Fig. 5 shows an example of computing the energy of a complex pseudoknot. The complex pseudoknots shown in Fig. 4 are discouraged by giving them a high energy value during the energy calculation. A multiple loop is given a penalty when computing its energy, since the current energy model for a multiple loop tends to give too low an energy.
4 Results

The genetic algorithm was implemented in a program called PseudoFolder using C++ Builder 5.0 from Inprise on a 1.61 GHz Pentium 4 PC with 256 MB memory. The structures predicted by the genetic algorithm can be immediately visualized by another program called PseudoViewer [10, 14]. We also implemented a user interface for selecting an initial population method. Since PseudoBase [9] contains structure data for the pseudoknot region only, instead of the entire structure, the structure data in PseudoBase were used as test cases for short RNAs. As test cases for long RNAs with various types of pseudoknots, we used structure data from the literature. We also ran dynamic programming algorithms [15, 16, 17] on the same test cases to compare their predictions with those of our algorithm. Fig. 6A shows the known structure of Coxsackie B3 virus RNA [11]. It is a very complex structure with several H-type pseudoknots within a multiple loop. The structure predicted by PseudoFolder, shown in Fig. 6B, is similar to the known structure although not identical. The pseudoknot stem pool was used for the initial population, and free energy was used as the fitness function. It takes 9.28 seconds on average to predict this structure. The dynamic programming algorithm took much longer and predicted a structure with H-type pseudoknots only or with no pseudoknot. PseudoFolder successfully predicted the pseudoknotted structures of the short RNA sequences in PseudoBase. However, for pseudoknots whose energy cannot be computed using the current energy model, PseudoFolder fails to predict a correct structure. Fig. 7A shows the known structure of DiGIR 1 RNA [12]. It has a complex pseudoknot with a multiple loop and nested internal loops. PseudoFolder could not calculate the free energy of a pseudoknot of this type, and therefore predicted a structure different
Fig. 6. (A) Known structure of Coxsackie B3 virus RNA. (B) Structure predicted by PseudoFolder.
from the known structure, as shown in Fig. 7B. The current energy model should be improved so that it can handle this type of pseudoknot. For long sequences with more than 200 bases, PseudoFolder could predict structures similar to the known structures in a few hundred seconds, whereas a dynamic programming algorithm [17] failed to predict structures for long sequences due to its computational complexity. Dynamic programming algorithms can predict optimal structures with the most stable free energy. However, the energy model of RNA structure is not complete, and dynamic programming algorithms take too much time to be practical.
5 Conclusions

Predicting pseudoknotted RNA structures is a more difficult problem than predicting non-pseudoknotted RNA structures. This is partly because no energy model is currently available for pseudoknots of the nonclassic type, and because the time complexity of prediction algorithms is very high due to the structural complexity of the pseudoknots themselves. We have developed an approximate energy model for complex pseudoknots and an algorithm for classifying pseudoknot types based on the topology of structural elements. We have also developed a program called PseudoFolder that uses a genetic algorithm for predicting RNA structures with pseudoknots of any type. Experimental results showed that PseudoFolder often predicted suboptimal structures in terms of free energy, but the structures it predicted were better than those produced by dynamic programming algorithms.
Fig. 7. (A) Known structure of DiGIR 1 RNA. (B) Structure predicted by PseudoFolder with free energy as the fitness function.
From our experience with the genetic algorithm, the initial population and the selection step seem to be the most essential parts when predicting RNA pseudoknots. In principle, the initial population should contain all possible structures. In practice, however, all possible structures cannot be considered because the total number of possible structures is exponential in the sequence length. The mutation and crossover steps generate variants, but not necessarily improvements in solutions. Good solutions are chosen and bad solutions are rejected based on a fitness function during the selection step. Therefore, choosing a proper fitness function is important for generating a good solution. Free energy is most often used as the fitness function when predicting RNA secondary structures. However, an exact energy model is not available for pseudoknots, and thus establishing a good energy model for pseudoknots is important for predicting correct structures. The development of the algorithm is not complete, and it is being extended in several directions. First, the energy model is being extended to handle other types of pseudoknots, which were not considered in the current version of PseudoFolder. Second, the program will be made available as a web-based application so that it can be executed anywhere using a web browser.
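A minimal sketch of the genetic-algorithm loop discussed above is given below, with free energy as the fitness function. The stem-pool encoding (a candidate structure as a subset of a precomputed stem pool), the operators, and all parameter values are simplifying assumptions, not the actual PseudoFolder implementation.

import random

def fitness(candidate, free_energy):
    return -free_energy(candidate)            # lower free energy -> higher fitness

def run_ga(stem_pool, free_energy, pop_size=50, generations=200, p_mut=0.05):
    # Initial population: random subsets of the stem pool.
    pop = [[random.random() < 0.5 for _ in stem_pool] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(c, free_energy), reverse=True)
        survivors = ranked[: pop_size // 2]   # selection: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(stem_pool))            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [not g if random.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, free_energy))

# Toy usage with a made-up energy function: selecting more stems lowers the energy.
best = run_ga(list(range(10)), lambda c: -sum(c))
print(sum(best))    # tends toward 10 under this toy energy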
References
1. Lee, D., Han, K.: Prediction of RNA pseudoknots – comparative study of genetic algorithms. Genome Informatics 13 (2002) 414–415
2. Lee, D., Han, K.: A Genetic Algorithm for Predicting RNA Pseudoknot Structures. LNCS 2659 (2003) 130–139
3. Gultyaev, A.P., van Batenburg, F.H.D., Pleij, C.W.A.: The computer simulation of RNA folding pathways using a genetic algorithm. Journal of Molecular Biology 250 (1995) 37–51
4. Shapiro, B.A., Wu, J.C.: An annealing mutation operator in the genetic algorithms for RNA folding. Computer Applications in the Biosciences 12 (1996) 171–180
5. Shapiro, B.A., Wu, J.C., Bengali, D., Potts, M.J.: The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation. Bioinformatics 17 (2001) 137–148
6. Benedetti, G., Morosetti, S.: A genetic algorithm to search for optimal and suboptimal RNA secondary structures. Biophysical Chemistry 55 (1995) 253–259
7. Shapiro, B.A., Navetta, J.: A massively parallel genetic algorithm for RNA secondary structure prediction. Journal of Supercomputing 8 (1994) 195–207
8. Pleij, C.W.A.: Pseudoknots: a new motif in the RNA game. Trends in Biochemical Sciences 15 (1990) 143–147
9. van Batenburg, F.H.D., Gultyaev, A.P., Pleij, C.W.A., Ng, J., Olihoek, J.: PseudoBase: a database with RNA pseudoknots. Nucleic Acids Res. 28 (2000) 201–204
10. Han, K., Byun, Y.: PseudoViewer2: visualization of RNA pseudoknots of any type. Nucleic Acids Res. 31 (2003) 3432–3440
11. Deiman, B.A., Pleij, C.W.A.: A vital feature in viral RNA. Seminars in Virology 8 (1997) 166–175
12. Einvik, C., Nielsen, H., Nour, R., Johansen, S.: Flanking sequences with an essential role in hydrolysis of a self-cleaving group I-like ribozyme. Nucleic Acids Res. 28 (2000) 2194–2200
13. Abrahams, J.P., van den Berg, M., van Batenburg, E., Pleij, C.: Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res. 18 (1990) 3035–3044
14. Han, K., Lee, Y., Kim, W.: PseudoViewer: automatic visualization of RNA pseudoknots. Bioinformatics 18 (2002) S321–S328
15. Rivas, E., Eddy, S.R.: A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology 285 (1999) 2053–2068
16. Akutsu, T.: Dynamic programming algorithm for RNA secondary structure prediction with pseudoknots. Discrete Applied Mathematics 104 (2000) 45–62
17. Reeder, J., Giegerich, R.: http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/
Prediction of Molecular Bioactivity for Drug Design Using a Decision Tree Algorithm Sanghoon Lee, Jihoon Yang, and Kyung-whan Oh Department of Computer Science, Sogang University 1 Shinsoo-Dong Mapo-Ku Seoul 121-742, Korea [email protected], {jhyang, kwoh}@ccs.sogang.ac.kr
Abstract. A machine learning-based approach to the prediction of molecular bioactivity in new drugs is proposed. Two important aspects are considered for the task: feature subset selection and cost-sensitive classification. These are needed to cope with the huge number of features and the unbalanced samples in a dataset of drug candidates. We designed a pattern classifier with such capabilities based on information theory and re-sampling techniques. Experimental results demonstrate the feasibility of the proposed approach. In particular, the classification accuracy of our approach was higher than that of the winner of the KDD Cup 2001 competition.
1
Introduction
Drugs consist of small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug usually involves identifying and isolating the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones [1]. Machine learning can thus be an appropriate choice for the classification task. In general, the problem of analyzing the structure, function, and localization of biological data can be solved by classifying feature patterns of the data [2]. We can understand and identify key characteristics of data by classifying feature vectors. However, there are several issues we need to consider when using classification algorithms for biological data (including the dataset for drug design used in this paper). First, a dataset can contain a number of irrelevant, redundant features. In this case, including inappropriate features can make the classification result less accurate. Second, the examples in the training set might not be drawn from the same distribution as the test examples. Furthermore, the class distribution of patterns (i.e. the number of patterns in each class) can be quite biased. Third, the number of patterns in a dataset is relatively much smaller than the number of features, which incurs a high chance of over-fitting.
This research was partially supported by Korea Research Foundation Grant (KRF2002-003-D00133) to Jihoon Yang.
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 344–351, 2003. c Springer-Verlag Berlin Heidelberg 2003
We aim to produce an efficient classifier for biological data using a decision tree learning algorithm. Our classifier is designed considering the three issues mentioned above: feature subset selection, cost-sensitive classification, and over-fitting avoidance. By feature selection, irrelevant, redundant features are eliminated to produce a subset with relevant features only. Since most biological data consist of a very large number of features, feature selection is important. It is also necessary to consider non-uniform costs for misclassification. For instance, predicting a good drug target as mediocre will be more expensive than predicting a mediocre one as good. In addition, it is also important to check how well a training set reflects the distribution of real-world data, especially when the training set is relatively small compared to the instance space. Preventing over-fitting is thus also of interest. Against this background, we introduce a decision-tree based classifier using entropy-based feature selection, re-sampling-based cost-sensitive classification, and a cross-validation-based stopping criterion, and verify its outstanding performance on real-world biological data (from the KDD Cup 2001 competition), which will be described in detail in Section 2.3.
2
Related Work
This section briefly introduces related techniques for feature subset selection and cross-validation, and summarizes the approaches proposed in KDD Cup 2001.
2.1 Feature Subset Selection
A number of approaches to feature subset selection have been proposed in the literature [3][4][5]. These approaches involve searching for an optimal subset of features based on some criteria of interest. Feature selection algorithms can be broadly classified into the following three categories according to the characteristics of the search strategy employed: exhaustive search, heuristic search, and randomized search [6]. An exhaustive search strategy is most appropriate when the number of features is sufficiently small, since it finds the optimal feature subset. However, statistical heuristics [7][8][9] or randomized heuristics [6][10][11][12] are commonly used since there are too many features in most cases. Each strategy has advantages as well as disadvantages in a specific domain. In many cases, however, as the search space becomes large, statistical heuristics may become more reasonable than randomized heuristics because of their relatively low computational cost.
2.2 Cross-Validation for Accuracy Estimation
There are several methods to validate a learning model. One of the most widely used techniques is k-fold cross-validation. In k-fold cross-validation, the training data are partitioned into k disjoint folds of the same size. Then the classification accuracy for each fold is computed as follows: each time, one of the k folds is used as a validation set and the others as a training set. The average accuracy of
the k runs is called the k-fold cross-validation accuracy. It is known that cross-validation can make reliable predictions on an unknown test set [13]. Generally, 10-fold cross-validation yields the best performance in accuracy estimation [14].
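A compact sketch of k-fold cross-validation is shown below. The classifier interface (a training function that returns a prediction function) is an assumption; any learner, such as a C4.5-style decision tree, could be plugged in.

import random

def k_fold_cv(examples, labels, train_fn, k=10, seed=0):
    # Average validation accuracy over k folds (assumes len(examples) >= k).
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k disjoint folds
    accuracies = []
    for i in range(k):
        held_out = set(folds[i])
        train_idx = [j for j in idx if j not in held_out]
        predict = train_fn([examples[j] for j in train_idx],
                           [labels[j] for j in train_idx])
        correct = sum(predict(examples[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k

# Toy usage: a "classifier" that always predicts the majority training label.
def majority_trainer(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

print(k_fold_cv(list(range(20)), ['I'] * 15 + ['A'] * 5, majority_trainer, k=10))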
2.3 KDD Cup 2001
KDD Cup 2001 focused on data from genomics and drug design [1]. Among the tasks, Task 1 is about the prediction of molecular bioactivity for designing a hemostatic. The dataset used here is the thrombin dataset (also used in our experiments), and it has many characteristics representative of biological data. The training set consists of 1,909 compounds (i.e. samples or patterns) tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. Among the compounds, 42 are active (i.e. bind well to the target site) and the others are inactive. Each compound was described by a single feature vector comprised of a class value (A for active, I for inactive) and 139,351 binary features that describe three-dimensional properties of the molecule [1]. The test set contains 636 additional compounds that were in fact generated based on the assay results recorded for the training set. Therefore the test set has a different class distribution from the training set. In the KDD Cup 2001 competition, a total of 114 groups submitted predictions for Task 1, the thrombin binding problem. In evaluating the accuracy, an average cost model was used, since the dataset contains far fewer active examples than inactive ones. In other words, the average of the true positive and true negative accuracies (i.e. the weighted average accuracy) is used for assessing the performance of a classifier [1]. The winner of Task 1 achieved a test accuracy of 71.1% and a weighted average accuracy of 68.4% [1], and the second-place winner achieved a test accuracy of 72% and a weighted average accuracy of 64.3%.
3 Our Approaches

3.1 Feature Subset Selection Using GINI Index
Although there are many sophisticated feature subset selection techniques, we employ a simple statistical heuristic to reduce the computational cost. The proposed technique is composed of the following steps: 1) Information gains are computed to measure the amount of information that each feature contains. 2) A feature subset consisting of the features whose information gain is above a specific threshold δ is created. The 10-fold cross-validation accuracy is then computed using the C4.5 algorithm for the dataset, considering the selected features only (details of the optimality check are explained in Section 3.3). The process terminates if the feature subset satisfies the stopping criterion; otherwise step 2) is repeated with a decreased value of δ. Generally, the information gain means Shannon's information gain [15], but it can be formulated with other information measures (e.g. chi-square, GINI index, etc. [15]). Since the computational cost of the GINI index is less than that of
Shannon's entropy, we used the GINI index to calculate the information gain. The GINI index is defined as

G(S) = 1 - \sum_{i=1}^{c} p_i^2

where c is the number of classes, and p_i is the proportion of S belonging to class i. The information gain is defined as

IG(S, A) = G(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} G(S_v)
where Values(A) is the set of all possible values of feature A, and S_v is the subset of S for which feature A has value v. A feature with larger information gain contains more of the information needed to predict classes. Therefore, we can order the features by their information gain in decreasing order and form subsets from the top. In the case of biological data with a large number of features, most of the features are often meaningless. Thus, it is possible to obtain an optimal feature subset without irrelevant features by selecting features with high information gain.
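The following sketch computes the GINI-based information gain and keeps the features whose gain exceeds a threshold δ, as in step 2) of Section 3.1; the list-of-rows data representation is an assumption.

from collections import Counter

def gini(labels):
    # G(S) = 1 - sum_i p_i^2
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def info_gain(feature_values, labels):
    # IG(S, A) = G(S) - sum_v |S_v|/|S| * G(S_v) for one feature A
    n = len(labels)
    gain = gini(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= len(subset) / n * gini(subset)
    return gain

def select_features(X, y, delta):
    # X is a list of samples, each a list of binary feature values.
    return [a for a in range(len(X[0]))
            if info_gain([row[a] for row in X], y) > delta]

# Toy usage: feature 0 separates the classes, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = ['A', 'A', 'I', 'I']
print(select_features(X, y, delta=0.1))   # -> [0]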
3.2 Bootstrap
In the case of the thrombin data, relatively few active examples are included in the training set. Therefore, a feature with high discrimination capability on the active examples can have a small information gain. Since the misclassification cost of an active example is higher than that of an inactive one, an appropriate technique, a re-sampling technique in this paper, should be developed. We can define a cost-weighted information gain by re-sampling the active examples of the original data. To be precise, we simply duplicate the active examples at a certain rate. When we compute the information gain (i.e. in the process of feature subset selection), we use the re-sampled active examples. Although this may decrease the information gain of features that are important for predicting the inactive class, the information gain of features with high discrimination capability on the active class is increased. This means that we can obtain a feature subset consisting of features that can predict the class with the higher misclassification cost.
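A minimal sketch of this duplication-based re-sampling is shown below; the active-class label and the duplication factor are assumptions. The information gain of Section 3.1 is then simply computed on the re-sampled data.

def resample_actives(X, y, active_label='A', factor=2):
    # Duplicate each active example 'factor' times; keep inactive examples as they are.
    X_out, y_out = [], []
    for row, label in zip(X, y):
        copies = factor if label == active_label else 1
        X_out.extend([row] * copies)
        y_out.extend([label] * copies)
    return X_out, y_out

# Usage together with the earlier sketch:
#   X_rs, y_rs = resample_actives(X, y, factor=2)
#   selected = select_features(X_rs, y_rs, delta)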
3.3 Cross-Validation for Deciding Optimal Feature Subset
In Section 3.1 we mentioned that, in order to find out whether a feature subset produced by a value of δ is an optimal one, we use 10-fold cross-validation. If we choose the point with the highest cross-validation accuracy as the optimum, there is a high chance that biological data will be over-fitted [13]. This is because the training set occupies only a small portion of the instance space and is therefore not sufficient to represent the instance space. So a result that is optimized on the training set might have a high accuracy by
itself, but it can produce a bad result on the real test set. This implies that we need an alternative criterion to avoid over-fitting. When the number of features included in the feature subset increases, due to the decrease of the threshold δ, we can see that the cross-validation accuracy also increases. But since most of the features of biological data are often irrelevant, the steep improvement of cross-validation accuracy only takes place at the beginning, while the relevant features are being included in the subset. Later, the improvement of cross-validation accuracy becomes flat even though features with smaller information gain are included. Therefore, the increase rate of the subset's cross-validation accuracy becomes close to 0 or begins to fluctuate. From these facts, we can say that the point where nothing changes or the accuracy begins to fluctuate is the point at which irrelevant features start to be included. For this reason, we use the first optimum point of cross-validation to decide the optimal feature subset. If the accuracy holds at a similar level, choosing a smaller number of features may yield more general predictions without over-fitting.
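One way to read this first-optimum criterion in code is sketched below, under the assumption that the thresholds are examined in decreasing order (so the feature subsets grow) and that cv_accuracy(δ) returns the 10-fold cross-validation accuracy of the corresponding subset.

def first_optimum(deltas, cv_accuracy, eps=1e-6):
    # Stop at the first threshold whose successor no longer improves the CV accuracy.
    best_delta, best_acc = deltas[0], cv_accuracy(deltas[0])
    for d in deltas[1:]:
        acc = cv_accuracy(d)
        if acc <= best_acc + eps:      # accuracy stops increasing or starts fluctuating
            return best_delta, best_acc
        best_delta, best_acc = d, acc
    return best_delta, best_acc

# Toy usage: the accuracy improves and then plateaus.
curve = {0.014: 0.70, 0.012: 0.73, 0.010: 0.749, 0.008: 0.748, 0.006: 0.75}
deltas = sorted(curve, reverse=True)
print(first_optimum(deltas, curve.get))    # -> (0.01, 0.749)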
4
Experiments
To demonstrate the feasibility of the proposed approaches, we conducted two experiments. The first one is to find out how well the proposed feature subset selection technique performs and whether the optimality check of the feature subset is suitable for biological data. In the second experiment, we focused on how well we can take the misclassification cost into account by using the example re-sampling technique. In each experiment, we obtained the accuracy by using the C4.5 algorithm, and we used the weighted average accuracy (see Section 2.3) to consider the misclassification cost. The datasets used were the thrombin training set and test set (introduced in Section 2.3), respectively.
4.1 Cross-Validation and Test Accuracies (without Re-sampling)
Using the thrombin training set, we first computed the information gain of each feature and organized feature subsets by selecting the features with the highest information gain. The threshold was varied between 0.014 and 0.0055, decreasing in steps of 0.0005. The initial threshold value and the step size are determined by the characteristics of the data, so these values are somewhat arbitrary. The number of feature subsets generated was 17, and the number of features in each feature subset was between 5 and 2932. Next, we computed the 10-fold cross-validation accuracy of each feature subset and, following the change of the cross-validation accuracy, determined the optimal feature subset by using the criterion presented in Section 3.3. The cross-validation accuracy on the training set is depicted in Fig. 1 (bold line). In the figure, we can see that the accuracy of the feature subset increases monotonically up to the point with 49 features. But at the point with 83 features, the accuracy becomes smaller compared to 49, and after that point the accuracy fluctuates around similar values. Therefore, if we use the criterion above, we can take the
Fig. 1. Cross-validation and Test Accuracies (without Re-sampling)
optimal feature subset to be the subset with 49 features. Next, to verify whether the determined feature subset is indeed optimal, we measured its test accuracy. The result is also depicted in Fig. 1 (thin line). We can see that the change of the test accuracy is similar to that of the cross-validation accuracy, but the feature subset with the highest cross-validation accuracy does not produce the highest test accuracy. As we can see, the feature subset with 49 features produces the highest test accuracy, higher than that of the subset with 31 features. This subset is identical to the previously determined optimal subset, which implies that the criterion we used to determine the optimal subset was appropriate. Using this feature subset, we obtained a test accuracy of 75.39% (unweighted) and a weighted average accuracy of 67.55%.
4.2 Accuracies of Feature Subset with Re-sampling
In the second experiment, we tried to verify whether we could improve the accuracy by using the re-sampling technique. First, to determine a suitable re-sampling ratio, we generated feature subsets using various values of the re-sampling ratio. The values used here are between 2 and 8 (i.e. we duplicated the active examples by multiples of 2 through 8). We then computed the cross-validation accuracy of the subsets generated by the method described above. From the result, we discovered that when the re-sampling ratio was 3 or higher, there was a steep decline of the cross-validation accuracy. Therefore, we omitted those results and instead, in Table 1, show the cross-validation accuracy of the subsets generated using re-sampling ratios 2 and 3.

Table 1. Cross-Validation Accuracy according to Re-sampling Ratio

Number of Features   without Re-sampling   Re-sampling P*2   Re-sampling P*3
 31                  73.68 %               55.95 %           55.95 %
 49                  74.89 %               55.95 %           55.95 %
 83                  72.40 %               72.19 %           70.50 %
216                  74.92 %               78.20 %           68.50 %

In the table, we can see that the approach without re-sampling tends to be better when the number of features considered is small. But as the
number of features considered increases (i.e. more than 83 features), the cross-validation accuracy with a re-sampling ratio of 2 tends to be higher than the others. Generally, the cross-validation accuracy is highest when the ratio is 2, so we generated several feature subsets with the ratio fixed to 2. We computed the cross-validation accuracy of each feature subset and depicted the result in Fig. 2 (bold line).
Fig. 2. Cross-Validation and Test Accuracies (Re-sampling Ratio 2)
In Fig. 2, we can see that the fluctuation of the cross-validation accuracy is larger than in the case without re-sampling. However, as in the case without re-sampling, the improvement of the cross-validation accuracy occurs only up to a certain level. From this result, we can determine the feature subset with 83 features or the one with 216 features as an optimal feature subset. To verify whether the determined feature subsets are optimal, we measured the test accuracy of each subset. The result is also shown in Fig. 2 (thin line). The result shows that we obtained the highest accuracy when the subset has 83 features and the second highest accuracy when the subset has 216 features, as we had expected. It can be confirmed that both subsets yield good test accuracy, although it was difficult to decide the optimal subset clearly because of the fluctuation of the cross-validation accuracy. When the number of features in the subset is 83 and 216, the test accuracy (unweighted) is 80.28% and 79.02%, and the weighted average accuracy is 72.13% and 71.08%, respectively. These results are a significant improvement over the KDD Cup 2001 winner in both unweighted accuracy and weighted average accuracy.
5
Conclusion
In this paper, we proposed approaches that handle the issues of feature subset selection, cost-sensitive classification, and over-fitting avoidance to solve the problems of classifying high-dimensional, imbalanced, and non-representative data, which are characteristics of most biological data. These techniques are already in use in different fields, but the significance of this paper is that we have shown how to effectively classify complex biological data using only these simple techniques. Experiments with the thrombin data were conducted and produced outstanding results.
The performance of the winner in KDD Cup 2001 (Task 1) is an unweighted accuracy of 71.1% and a weighted average accuracy of 68.4%. Through the methods proposed in this paper, we achieved a significant improvement in both, with an unweighted accuracy of 80.28% and a weighted average accuracy of 72.13%. In many problems of classifying high-dimensional, biased, and non-representative data, entropy-based feature subset selection and accuracy estimation using cross-validation can be used for good data preprocessing as well as accurate classification.
References
[1] Hatzis, C., Page, D. (2001). KDD-2001 Cup: The Genomics Challenge.
[2] Gibas, C., Jambeck, P. (2001). Developing Bioinformatics Computer Skills. O'Reilly.
[3] Siedlecki, W. and Sklansky, J. (1988). On automatic feature selection. International Journal of Pattern Recognition, 2:197–220.
[4] Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance, pages 1–5, New Orleans, LA. AAAI Press.
[5] Dash, M. and Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(3).
[6] Yang, J. and Honavar, V. (1997). Feature subset selection using a genetic algorithm. In Proceedings of GP-97, Stanford, CA, pp. 380–385.
[7] Nucciardi, A. and Gose, E. (1971). A comparison of seven techniques for choosing subsets of pattern recognition. IEEE Transactions on Computers, 20:1023–1031.
[8] Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550.
[9] Al-Ani, A. and Deriche, M. (2002). Feature selection using a mutual information based measure. In Proceedings of the 16th International Conference on Pattern Recognition, Volume 4, pages 82–85.
[10] Siedlecki, W. and Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. IEEE Transactions on Computers, 10:335–347.
[11] Brill, F., Brown, D., and Martin, W. (1992). Fast genetic selection of features for neural network classifiers. IEEE Transactions on Neural Networks, 3(2):324–328.
[12] Richeldi, M. and Lanzi, P. (1996). Performing effective feature selection by investigating the deep structure of the data. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 379–383. AAAI Press.
[13] Ng, A. Y. (1997). Preventing "over-fitting" of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (ICML), pp. 245–253, Nashville, TN.
[14] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
[15] Duda, R. O., Hart, P. E., Stork, D. G. (2001). Pattern Classification. Wiley Interscience.
Mining RNA Structure Elements from the Structure Data of Protein-RNA Complexes Daeho Lim and Kyungsook Han* School of Computer Science and Engineering, Inha University, Inchon 402-751, Korea [email protected], [email protected]
Abstract. Mining biological data in databases has become the subject of increasing interest over the past several years, but most data mining research in bioinformatics is limited to the sequence data of molecules. Biological sequences are easy to understand due to their sequential nature, and there are many well-developed algorithms to handle them since they can be treated as strings of characters. The structure of a molecule, on the other hand, is much more complex but plays an important role since it determines the biological function of the molecule. We have developed a set of algorithms to recognize all the secondary and tertiary structure elements of RNA from the three-dimensional atomic coordinates of protein-RNA complexes. Although computational methods have been developed for assigning secondary structure elements in proteins, similar methods have not been developed for RNA, due in part to the small number of structure data available for RNA. Therefore, extracting secondary or tertiary structure elements of RNA has depended on a significant amount of manual work. This is the first attempt at extracting RNA structure elements from the atomic coordinates in structure databases. The patterns in the structure elements discovered by the algorithms will provide useful information for predicting the structure of RNA-binding proteins.
1 Introduction

Most of the currently known structures of molecules were determined by X-ray crystallography or Nuclear Magnetic Resonance (NMR). These methods generate a large amount of structure data, mostly three-dimensional coordinate values of the atoms, even for a small molecule. The coordinate values at the atomic level are useful information for analyzing molecular structure, but structure elements at a higher level are also required for a complete understanding of the structure and, in particular, for predicting structures. There have been computational approaches to assigning secondary structural elements in proteins from the atomic coordinates of proteins [1, 2]. However, similar methods have not been developed for RNA, due in part to the very small number of structure data available for RNA so far. Therefore, extracting secondary or tertiary structural elements of RNA depends on a significant amount of manual work. As the number of three-dimensional structures of RNA molecules is
* To whom correspondence should be addressed.
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 352–359, 2003. © Springer-Verlag Berlin Heidelberg 2003
increasing, we need a more systematic and automated method for extracting structural elements in RNA. We have developed a set of algorithms for recognizing secondary or tertiary structural elements of RNA in the protein-RNA complexes obtained from the Protein Data Bank (PDB), which provides a rich source of structural data [3]. The structure data were first cleaned up so that all atoms are accurately named and ordered and no atoms have alternate locations. The algorithms identify hydrogen bonds and base pairs, and classify the base pairs into one of 28 types [4]. These base pairs include non-canonical pairs such as purine-purine pairs and pyrimidine-pyrimidine pairs as well as canonical pairs such as Watson-Crick pairs and wobble pairs. The algorithms also extract RNA sequences to integrate them with the base-pair data. Secondary or tertiary structural elements consisting of base pairs are then visualized for user scrutiny. To the best of our knowledge, this is the first attempt at extracting RNA structural elements from the atomic coordinates in structure databases.
2 Background

A nucleotide of RNA is made of a molecule of sugar, a molecule of phosphoric acid, and a molecule called a base. A base-pair is formed when a base of one nucleotide pairs with a base of another nucleotide via hydrogen bonds. Base-pairs can be classified into canonical base-pairs (such as Watson-Crick pairs) and non-canonical base-pairs. In this study we consider base-pairs of 28 types [4] to include both canonical and non-canonical base-pairs. A hydrogen bond is formed by three atoms: one hydrogen atom and two electronegative atoms (often N or O). The hydrogen atom is covalently bound to one of the electronegative atoms and is called the hydrogen bond donor. The other electronegative atom is known as the hydrogen bond acceptor. Hydrogen bonds can be identified by finding all proximal atom pairs that satisfy given geometric criteria between the hydrogen bond donors (D) and acceptors (A). The positions of the hydrogen atoms (H) are theoretically inferred from the surrounding atoms, because hydrogen atoms are invisible in purely X-ray-derived structures. The criteria used to identify hydrogen bonds in this study are: contacts with a maximum D-A distance of 3.9 Å, a maximum H-A distance of 2.5 Å, and a minimum D-H-A angle and H-A-AA angle of 90°, where AA is an acceptor antecedent.
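The geometric criteria above translate directly into a small check. The vector utilities below are generic; the coordinates in the example are made up for illustration only.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle(a, b, c):
    # Angle (in degrees) at vertex b formed by points a-b-c.
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (dist(a, b) * dist(c, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def is_hydrogen_bond(D, H, A, AA):
    # Criteria used in this study: D-A <= 3.9 A, H-A <= 2.5 A,
    # D-H-A angle >= 90 degrees and H-A-AA angle >= 90 degrees.
    return (dist(D, A) <= 3.9 and dist(H, A) <= 2.5
            and angle(D, H, A) >= 90.0 and angle(H, A, AA) >= 90.0)

# Made-up coordinates (in Angstroms) for illustration.
D, H, A, AA = (0, 0, 0), (1.0, 0, 0), (2.9, 0, 0), (3.9, 1.0, 0)
print(is_hydrogen_bond(D, H, A, AA))   # True for this toy geometry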
3 Algorithms

Fig. 1 shows the framework of our approach. We use HB-plus [5] to find all hydrogen bonds in a PDB file. Algorithm 1 extracts the RNA sequence data from the PDB file and records the sequence data in RNA-SEQ. Algorithm 2 extracts, from the hydrogen bonds obtained by HB-plus, only the hydrogen bonds between a base of one nucleotide and a base of another. These hydrogen bonds between pairs of bases are recorded in the Base-Base List, and the base pairs are classified into one of 28 types. Fig. 2 shows 4 of these base pairs.
Fig. 1. Framework for extracting base-pairs of 28 types and secondary and tertiary structure elements of RNA from PDB files.
Fig. 2. G-C and A-U Watson-Crick pairs, G-A purine-purine pair, and C-U pyrimidine-pyrimidine pair.
The atoms of a base are numbered, and hydrogen bonds are formed between fixed atoms for each base-pair. For example, the A-U Watson-Crick pair has a hydrogen bond between atom N1 of adenine (A) and atom N3 of uracil (U), and between atom N6 of A and atom O4 of U (Fig. 2). These atom numbers are important when determining base-pairs. The hydrogen bond data in the Base-Base List are analyzed by Algorithm 2, and the base-pairs extracted are recorded in the Base-Pair List with their types.
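A hedged sketch of this table-driven classification is given below. Only two of the 28 pair types are listed, the atom assignments for the G-C pair are standard values assumed here, and the hydrogen-bond representation is an assumption.

# Each pair type is identified by the base-base hydrogen bonds it requires.
PAIR_TYPES = {
    frozenset({('A', 'N1', 'U', 'N3'), ('A', 'N6', 'U', 'O4')}): 'A-U Watson-Crick',
    frozenset({('G', 'N1', 'C', 'N3'), ('G', 'N2', 'C', 'O2'),
               ('G', 'O6', 'C', 'N4')}): 'G-C Watson-Crick',
}

def classify_pair(hbonds):
    # hbonds: set of (base1, atom1, base2, atom2) bonds observed between two bases.
    for required, name in PAIR_TYPES.items():
        if required <= hbonds:          # all required bonds are present
            return name
    return 'unclassified'

observed = {('A', 'N1', 'U', 'N3'), ('A', 'N6', 'U', 'O4')}
print(classify_pair(observed))          # -> 'A-U Watson-Crick'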
Algorithm 3 integrates the RNA sequence data obtained by Algorithm 1 with the base-pair data. During this process the algorithm analyzes all nucleotides of the RNA in RNA-SEQ and matches each nucleotide in the Base-Pair List to a nucleotide in RNA-SEQ to find the hydrogen-bonding relation of each nucleotide. The final output of the three algorithms is the secondary and tertiary structure elements of the RNA in the protein-RNA complex. The algorithms are outlined below.
Algorithm 1
  Given a PDB file P
  for all atoms a in P do
    if a is a nucleotide of RNA then
      RNA-SEQ <- RNA-SEQ ∪ {a}
    end if
  end for

Algorithm 2
  Given hydrogen bond data in H-Bonds
  for all h-bonds b in H-Bonds do
    if b is an h-bond between RNA nucleotides then
      call it bb
      if bb is an h-bond between bases then
        Base-Base-List <- Base-Base-List ∪ {bb}
      end if
    end if
  end for
  Let the hydrogen bonds between bases in the Base-Base-List be bb-bonds
  for all bb-bonds in Base-Base-List do
    if a base-pair is constructed by the bb-bond then
      Base-Pair-List <- Base-Pair-List ∪ {bb-bond}
    end if
  end for

Algorithm 3
  for all nucleotides n in RNA-SEQ do
    for all bp-bonds in the Base-Pair-List do
      if n is assigned a bp-bond then
        paired-nucleotide <- (n, bp-bond)
      else
        n is an unpaired-nucleotide
      end if
      RNA-structure-List <- RNA-structure-List ∪ {paired-nucleotide or unpaired-nucleotide}
    end for
  end for
4 Experimental Results

Table 1 shows the base-pairs extracted by our algorithms from a PDB file (PDB identifier 1RNK) for mouse mammary tumor virus (MMTV). Nucleotides in columns 3 and 4 of the table represent the nucleotides that base-pair with the nucleotides in columns 1 and 2. Column 5 indicates the type of the base pair. If a nucleotide does not pair with another nucleotide, columns 3, 4 and 5 are left blank. The base-pairs in Table 1 form a pseudoknot, as displayed in Fig. 3A. A pseudoknot is a tertiary structure element formed when bases of a single-stranded loop pair with complementary bases outside the loop. The pseudoknot structure found by our algorithm indeed agrees with the actual structure of MMTV.

Table 1. Base-pairs of MMTV extracted from a PDB file (PDB identifier 1RNK).

nucleotide         paired nucleotide
number   symbol    number   symbol    base-pair type
0001     G         0019     C         G-C Watson-Crick
0002     G         0018     C         G-C Watson-Crick
0003     C         0017     G         G-C Watson-Crick
0004     G         0016     C         G-C Watson-Crick
0005     C         0015     G         G-C Watson-Crick
0006     A
0007     G
0008     U         0033     A         A-U Watson-Crick
0009     G         0032     C         G-C Watson-Crick
0010     G         0031     C         G-C Watson-Crick
0011     G         0030     C         G-C Watson-Crick
0012     C         0029     G         G-C Watson-Crick
0013     U         0028     G         G-U Wobble
0014     A
0015     G         0005     C         G-C Watson-Crick
0016     C         0004     G         G-C Watson-Crick
0017     G         0003     C         G-C Watson-Crick
0018     C         0002     G         G-C Watson-Crick
0019     C         0001     G         G-C Watson-Crick
0020     A
0021     C
0022     U
0023     C
0024     A
0025     A
0026     A
0027     A
0028     G         0013     U         G-U Wobble
0029     G         0012     C         G-C Watson-Crick
0030     C         0011     G         G-C Watson-Crick
0031     C         0010     G         G-C Watson-Crick
0032     C         0009     G         G-C Watson-Crick
0033     A         0008     U         A-U Watson-Crick
0034     U
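The pseudoknot in Table 1 can be recognized mechanically from the base-pair list: two pairs (i, j) and (k, l) cross when i < k < j < l. A small sketch of this check follows; the pair representation as nucleotide-number tuples is an assumption.

def crossing_pairs(pairs):
    # Return the crossing (pseudoknot-forming) base pairs; each pair is (i, j) with i < j.
    crossings = []
    for idx, (i, j) in enumerate(pairs):
        for (k, l) in pairs[idx + 1:]:
            if i < k < j < l or k < i < l < j:
                crossings.append(((i, j), (k, l)))
    return crossings

# One pair from each helix of Table 1: (1, 19) and (8, 33) cross,
# which is what makes the MMTV structure a pseudoknot.
print(crossing_pairs([(1, 19), (8, 33)]))   # -> [((1, 19), (8, 33))]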
Fig. 3. (A) Pseudoknot structure of MMTV, formed by the base pairs in Table 1. (B) Secondary structure of 2 RNA chains (chains M and N), extracted from a protein-RNA complex (PDB identifier 1DFU). The structures were visualized by PseudoViewer [6].
Our algorithm can extract structure elements formed between different RNA sequences. Fig. 3B shows 2 RNA chains (chains M and N), extracted from a protein-RNA complex (PDB identifier 1DFU). In addition to structure elements formed by base-pairs, it can also identify base-triples. A base-triple is one of the tertiary interactions in RNA structure, in which a base-pair interacts with a third base [7]. Fig. 4 shows a secondary structure of tRNA (PDB identifier 1FFY) with 5 tertiary interactions, obtained by our algorithms. Each of the interactions 1 and 2 in Fig. 4 corresponds to a base-triple (see Fig. 5 for the structure of the base-triples at the atomic level). A base-triple has secondary and tertiary interactions at the same time. For example, in the C-G-G base-triple of Figs. 4 and 5, the G-C pair is a secondary interaction as a Watson-Crick pair and the G-G pair is a tertiary interaction as a purine-purine pair. The tertiary interactions shown in Fig. 4 are in agreement with those found by experimental methods [8, 9]. It is not straightforward to extract tertiary interactions such as base-triples and pseudoknots from a large amount of 3D coordinate values of the atoms in a PDB file, or from a 3D display of the molecular structure. Our algorithm automatically extracts tertiary interactions of RNA structure from a PDB file. The structure elements extracted by our algorithm also suggest that tertiary interactions such as pseudoknots and base-triples are formed after secondary structure elements in RNA structure.
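Base-triples can likewise be read off the base-pair list: a nucleotide that participates in two base pairs forms a triple with its two partners. A sketch, with the same assumed pair representation as above:

from collections import defaultdict

def base_triples(pairs):
    # Return (partner1, center, partner2) for every nucleotide that occurs in two pairs.
    partners = defaultdict(list)
    for i, j in pairs:
        partners[i].append(j)
        partners[j].append(i)
    return [(p[0], n, p[1]) for n, p in partners.items() if len(p) == 2]

# Toy example: nucleotide 12 pairs with 23 (secondary) and with 45 (tertiary).
print(base_triples([(12, 23), (12, 45), (3, 30)]))   # -> [(23, 12, 45)]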
5 Conclusions

So far, extracting secondary or tertiary structure elements of RNA from the three-dimensional atomic coordinates has relied on a significant amount of manual work, due
Fig. 4. Secondary structure of tRNA with 5 tertiary interactions (PDB identifier 1FFY).
Fig. 5. Structures of U-A-A base-triple and C-G-G base-triple at the atomic level, corresponding to interactions 1 and 2 of Fig. 4, respectively.
in part to the very small number of structure data available for RNA. However, manual analysis of the structure data is becoming increasingly challenging as the complexity and number of structures increase. This study developed a set of algorithms for recognizing secondary or tertiary structure elements of RNA in the protein-RNA complexes obtained from PDB. Our algorithms are data mining algorithms that extract information about secondary or tertiary structure from a PDB file, and therefore involve several complex steps to extract the necessary information. To the best of our knowledge, this is the first attempt at extracting RNA structure elements from the
atomic coordinates in structure databases. Experimental results showed that our algorithms are capable of extracting base-triple structures and all secondary or tertiary structure elements formed by hydrogen bonding automatically and easily. We expect our algorithms to help research on RNA structure, and the patterns in the structure elements discovered by the algorithms will provide useful information for predicting the structure of RNA-binding proteins.
Acknowledgements. This work was supported by the Ministry of Information and Communication of Korea under grant 01-PJ11-PG9-01BT00B-0012.
References
1. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 (1983) 2577–2637
2. Frishman, D., Argos, P.: Knowledge-based protein secondary structure assignment. Proteins 23 (1995) 566–579
3. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Res. 28 (2000) 235–242
4. Tinoco, I., Jr.: In: The RNA World (R.F. Gesteland, J.F. Atkins, eds.), Cold Spring Harbor Laboratory Press (1993) 603–607
5. McDonald, I.K., Thornton, J.M.: Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 238 (1994) 777–793
6. Han, K., Byun, Y.: PseudoViewer2: visualization of RNA pseudoknots of any type. Nucleic Acids Res. 31 (2003) 3432–3440
7. Akmaev, V.R., Kelley, S.T., Stormo, G.D.: Phylogenetically enhanced statistical tools for RNA structure prediction. Bioinformatics 16 (2000) 501–512
8. Du, X., Wang, E.-D.: Tertiary structure base pairs between D- and TψC-loops of Escherichia coli tRNA(Leu) play important roles in both aminoacylation and editing. Nucleic Acids Res. 31 (2003) 2865–2872
9. DNA-RNA structure tutorials: http://www.tulane.edu/~biochem/nolan/lectures/rna/intro.htm
Discovery of Cellular Automata Rules Using Cases Ken-ichi Maeda and Chiaki Sakama Department of Computer and Communication Sciences Wakayama University [email protected]
Abstract. Cellular automata (CAs) are used for modeling the problem of adaptation in natural and artificial systems, but it is hard to design CAs having desired behavior. To support the task of designing CAs, this paper proposes a method for automatic discovery of cellular automata rules (CA-rules). Given a sequence of CA configurations, we first collect cellular changes of states as cases. The collected cases are then classified using a decision tree, which is used for constructing CA-rules. Conditions for classifying cases in a decision tree are computed using genetic programming. We perform experiments using several types of CAs and verify that the proposed method successfully finds correct CA-rules.
1
Introduction
Cellular automata (CAs) [4] are discrete dynamical systems whose behavior is specified by the interaction of local cells. Because of their simple mathematical constructs and distinguished features, CAs have been used for modeling advanced computation such as massively parallel computers and evolutionary computation, and also for simulating various complex systems in the real world. On the other hand, the complex behavior of CAs is difficult to understand, which makes it hard to design CAs having desired behavior. The task of designing CAs usually requires domain knowledge of a target problem, and it is done by human experts manually and experientially. This task becomes harder as a target problem becomes complex, since there are a number of possible automata to specify behavior. The difficulty also comes from the feature of CAs that a small change of local interaction can affect the global behavior of a CA, and that the result of emergence depends on the initial configuration of cells. To automate CA designing, we develop techniques for automatic discovery of CA-rules which reflect cellular changes in observed CA configurations. Reconstruction of CA-rules from input configurations is known as the identification problem [1]. However, we aim not only at reconstructing the original CA-rules but also at discovering new CA-rules. Automatic discovery of CA-rules is also studied in the context of the density classification task [3]. The objective of this task is to find a 1-dimensional 2-state CA that can classify the density of 1's in the initial
Current address: Hitachi System & Services, Ltd. [email protected]
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 360–368, 2003. c Springer-Verlag Berlin Heidelberg 2003
configuration, which is different from our goal. Technically, our goal is achieved by the following steps. Given a sequence of CA configurations we first collect cellular changes of states as cases. The collected cases are then classified using a decision tree which is used for constructing CA-rules. Conditions for classifying cases in a decision tree are computed using genetic programming. We perform experiments using several types of CAs and verify that the proposed method not only reconstructs the original CA-rules, but also discovers new CA-rules. The rest of this paper is organized as follows. Section 2 presents a brief introduction of cellular automata. Section 3 provides techniques for automatic discovery of CA-rules. Section 4 shows experimental results to verify the proposed method, and Section 5 summarizes the paper.
2
Cellular Automata
A cellular automaton (CA) consists of a regular grid of cells, each of which has a finite number of possible states. The state of each cell changes synchronously in discrete time steps according to local and identical transition rules (called CA-rules). The state of a cell in the next time step is determined by its current state and the states of its surrounding cells (called a neighborhood of a cell). The collection of all cellular states in the grid at some time step is called a configuration. A CA has an n-dimensional cellular space. Consider a sequence of configurations S0, . . . , Sn in which each configuration has a finite number of cells. Here, S0 is the initial configuration of a CA and St (0 ≤ t ≤ n) represents the configuration of the CA at a time step t. A configuration St consists of the states of cells C^t_{x,y}, where x and y represent the coordinates of the cell in the configuration (Figure 1).
Fig. 1. Configurations of a CA
Given a sequence of configurations as an input, we want to output CA-rules which, applied to the initial configuration, reproduce the change of patterns of input configurations. We pursue the goal in the following steps. 1. Determine an appropriate neighborhood of a cell. 2. Collect cellular changes of states as cases from input configurations.
3. Construct a decision tree to classify cases and extract CA-rules. In the next section, we explain techniques using 2-dimensional 2-state CAs, but the techniques are applied to m-dimensional n-state CAs in general.
3
Discovering CA-Rules
3.1
Collecting Cases
We first describe a method of collecting cases. A case is defined as a pair (R^t_{x,y}, C^{t+1}_{x,y}), where R^t_{x,y} is a neighborhood of a cell C^t_{x,y} at some time step t and C^{t+1}_{x,y} is the state of the cell at the next time step t+1.¹ Cases are collected using the following procedure.

Procedure: Collecting cases.
Input: a sequence of configurations S0, . . . , Sn.
Output: a set CASE of cases.
Initially put CASE = ∅, and do the following.
1. From the configuration St, choose a cell C^t_{x,y} and extract its neighborhood R^t_{x,y}, where R^t_{x,y} contains C^t_{x,y} as its central cell.
2. From the configuration St+1, extract the cell C^{t+1}_{x,y}.
3. If the pair (R^t_{x,y}, C^{t+1}_{x,y}) is not in CASE, add it to CASE.
4. Iterate the above steps 1–3 for all the coordinates (x, y) of each configuration S0, . . . , Sn−1.
Figure 2 illustrates an example of a 2-dimensional 2-state CA in which cases are collected using a neighborhood of the 3 × 3 square.
Fig. 2. Collecting Cases
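A minimal sketch of this case-collection step for a 2-dimensional 2-state CA is shown below; the grid representation and the square neighborhood shape are assumptions, and border cells are skipped for simplicity.

def collect_cases(configs, radius=1):
    # configs: list of 2-D grids (lists of lists of states) S0, ..., Sn.
    # The neighborhood is the (2*radius+1) x (2*radius+1) square centered on the cell.
    cases = {}
    height, width = len(configs[0]), len(configs[0][0])
    for t in range(len(configs) - 1):
        for y in range(radius, height - radius):
            for x in range(radius, width - radius):
                neigh = tuple(configs[t][y + dy][x + dx]
                              for dy in range(-radius, radius + 1)
                              for dx in range(-radius, radius + 1))
                cases[neigh] = configs[t + 1][y][x]   # R^t_{x,y} -> C^{t+1}_{x,y}
    return cases

# Tiny usage: a single live cell and the next configuration give one case.
S0 = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
S1 = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
print(collect_cases([S0, S1]))   # {(0, 0, 0, 0, 1, 0, 0, 0, 0): 1}

For a deterministic CA each neighborhood determines a unique next state, so storing cases in a dictionary is equivalent to the duplicate check in step 3 of the procedure.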
In the above, a neighborhood is selected such that (i) it uniquely determines the next state of a target cell, and (ii) it does not contain cells which are irrelevant to the cellular changes of states. The condition (i) is examined by checking whether there exist two different cases (R^t_{x,y}, C^{t+1}_{x,y}) and (R^t_{z,w}, C^{t+1}_{z,w}) in CASE such that R^t_{x,y} = R^t_{z,w} and C^{t+1}_{x,y} ≠ C^{t+1}_{z,w}. If there is one, the neighborhood is changed by increasing its size. To remove redundant cells in (ii), we use hill-climbing search with a heuristic function which gives a higher score to a neighborhood that can distinguish every case with fewer cells.

¹ The scripts t and (x, y) are often omitted when they are not important in the context.
3.2 Decision Trees
To construct CA-rules, we classify cases using a decision tree. A decision tree used here has properties such that (i) each layer except the lowest one has a classification condition Coni (R) where R is a neighborhood; and (ii) each node Ni,j except the root node has a condition value Vi,j and the next state N Ci,j of a cell (Figure 3).
Fig. 3. A decision tree
Classification conditions classify cases according to the state of a neighborhood. For example, given a neighborhood R of the 3 × 3 square, the classification condition Con(R) = Σ_{ci∈R} ci represents the sum of the states of the cells in R. Classification conditions are computed using genetic programming (GP). In GP, classification conditions are expressed by a tree structure (called a condition tree) in which Con0(R), Con1(R), Con2(R), . . . are cascaded (cf. Figure 4(b)). GP applies genetic operations to a condition tree to find classification conditions which correctly classify all cases. A condition value Vi,j is calculated by Coni−1(R), and NCi,j represents the state of a cell at the next time step. Given a neighborhood R^t_{x,y} of a cell in a configuration St, a decision tree returns the next state of the cell in the configuration St+1 as an output. A decision tree is built using cases as follows.
1. At the root node N0,0 , if it has a child node N1,j (j ≥ 0) with V1,j = Con0 (Rk ), apply the steps 2–3 to N1,j . Otherwise, add the new node N1,j with V1,j = Con0 (Rk ) and N C1,j = Ck to the tree. 2. For each node Ni,j (1 ≤ i < l), if there is a node Ni+1,h with Vi+1,h = Coni (Rk ), apply the steps 2–3 to Ni+1,h . Else if there is no node Ni+1,h with Vi+1,h = Coni (Rk ) and it holds that N Ci,j = Ck , add a new node Ni+1,h with Vi+1,h = Coni (Rk ) and N Ci+1,h = Ck . Otherwise, the case (Rk , Ck ) is successfully classified. 3. If there is a node Nl,j with N Cl,j = Ck , the case (Rk , Ck ) is successfully classified. Otherwise, the construction of a decision tree fails. The procedure searches a node Ni+1,j which has the value Vi+1,j equal to Coni (Rk ). When the node Ni+1,j has the next state N Ci+1,j which is not equal to Ck , the tree is expanded by adding a new node.2 The depth l of a decision tree is determined by the number of classification conditions. A decision tree is expanded by introducing additional classification conditions until every case is classified. At the bottom level of a tree, no such expansion is performed. So, if there is a node Nl,j such that the next state N Cl,j does not coincide with Ck , the construction of a decision tree fails. This means that classification conditions used for the construction of a decision tree are inappropriate, then it is requested to re-compute new classification conditions. Those classification conditions which fail to classify all cases are given penalties in the process of the evaluation of GP, and they are not inherited to the next generation. This enables us to find more appropriate classification conditions. Once a decision tree is successfully built, it can classify all cases from CASE. t When a neighborhood Rx,y is given as an input to a decision tree, the tree outputs the next state N Ci,j of a node Ni,j satisfying the following conditions: (1) every antecedent node Nk,l (0 < k < i) of Ni,j satisfies the condition Vk,l = t t ) and Ni,j satisfies the condition Vi,j = Coni−1 (Rx,y ), and (2) Conk−1 (Rx,y t Ni,j has no child node Ni+1,h satisfying Vi+1,h = Coni (Rx,y ). For example, in t , the tree outputs the next the decision tree of Figure 3, given the input Rx,y t state N C2,2 of N2,2 if the next conditions are met: (1) Con0 (Rx,y ) = V1,1 and t t t Con1 (Rx,y ) = V2,2 , and (2) Con2 (Rx,y ) = V3,1 and Con2 (Rx,y ) = V3,2 . The first condition ensures that the condition values of the nodes N1,1 and N2,2 satisfy the t t ) and Con1 (Rx,y ), respectively. The second classification conditions Con0 (Rx,y condition ensures that the condition values of the nodes N3,1 and N3,2 do not t ). Using these conditions a decision tree effectively searches satisfy Con2 (Rx,y a node which has the condition value satisfying classification conditions with t respect to the input Rx,y , and outputs the next state of a cell. The node N2,2 represents the following if-then rule: t t t ) = V1,1 and Con1 (Rx,y ) = V2,2 and Con2 (Rx,y ) = V3,1 if Con0 (Rx,y t and Con2 (Rx,y ) = V3,2 then N C2,2 . 2
² A similar way of expanding a decision tree is in [2].
Every node except the root node represents such an if-then rule. Given an t a decision tree is built so as to satisfy the condition of only one such input Rx,y rule, so that the output N Ci,j is uniquely determined by the input.
4
Experiments
To verify the effect of the proposed techniques, we present the results of two experiments such that: (a) given 2-dimensional 2-state CA configurations produced by a 2-dimensional 2-state CA, find 2-dimensional 2-state CA-rules which reproduce the same configurations; and (b) given 2-dimensional 2-state CA configurations produced by a 1-dimensional 2-state CA, find 2-dimensional 2-state CA-rules which reproduce the same configurations. The purpose of these experiments is as follows. In the experiment (a), we verify that our procedure can find the original CA-rules which produce observed 2-dimensional 2-state configurations. In the experiment (b), on the other hand, we show that our procedure can discover new CA-rules which produce observed configurations but have different dimensions from the original one.
Fig. 4. Experimental result
4.1
Finding the Original 2-Dimensional 2-State CA-Rules
In this experiment, we use the following 2-dimensional 2-state CA.
– A neighborhood consists of 9 square cells: a central cell and its 8 orthogonally and diagonally adjacent cells. The state of a cell is either 0 or 1.
– A configuration consists of 100 × 100 cells. The initial configuration S0 is randomly created.
– A sequence of configurations S0, . . . , S20 is produced by the CA-rules such that: (1) if the central cell has exactly 2 surrounding cells of the state 1, the next state of the cell does not change; (2) else if the central cell has exactly 3 surrounding cells of the state 1, the next state of the cell is 1; (3) otherwise, the next state of the central cell is 0.³
³ This is known as the Game of Life.
From the input configurations S0, . . . , S20, the neighborhood, the condition tree, and the decision tree were constructed as shown in Figure 4. The condition tree represents the classification conditions Con0(R) = c4 and Con1(R) = c0 + c1 + c2 + c3 + c5 + c6 + c7 + c8, where each ci corresponds to a cell in the neighborhood (a) and takes the value of either 0 or 1. The decision tree (c) was built from these classification conditions. In each node, the value on the left-hand side expresses the condition value Vi,j and the value on the right-hand side expresses the next state NCi,j. The nodes of this decision tree represent the following if-then rules:

N1,0: if Con0(R) = 0 and Con1(R) ≠ 3 then NC1,0 = 0,
N2,0: if Con0(R) = 0 and Con1(R) = 3 then NC2,0 = 1,
N1,1: if Con0(R) = 1 and Con1(R) ≠ 3 and Con1(R) ≠ 2 then NC1,1 = 0,
N2,1: if Con0(R) = 1 and Con1(R) = 3 then NC2,1 = 1,
N2,2: if Con0(R) = 1 and Con1(R) = 2 then NC2,2 = 1.

Comparing these 5 rules with the original CA-rules, N2,0 and N2,1 correspond to the rule (2); N2,2 corresponds to the rule (1) in the case where the state of the central cell is 1; N1,0 and N1,1 correspond to the rule (3) and to the rule (1) in the case where the state of the central cell is 0. Thus, it is verified that the CA-rules constructed by the decision tree coincide with the original CA-rules which produce the input configurations. The result of this experiment shows that in 2-dimensional 2-state CAs the original CA-rules are reproduced from a sequence of input configurations. We also conducted a similar experiment for 2-dimensional 3-state CAs, and verified that the proposed method successfully finds the original CA-rules.
Fig. 5. Evolution of a 1-dimensional CA
4.2
Discovering New 2-Dimensional 2-State CA-Rules
In this experiment, we use the following 1-dimensional 2-state CA. – A neighborhood consists of 3 square cells: a central cell and its adjacent neighbors on each side. The state of a cell is either 0 or 1. – A configuration consists of 100 arrayed cells. The initial configuration S0 has a centered cell with the state 1 and all the other cells have the state 0. – A sequence of configurations S0 , . . . , S20 are produced by the CA-rules such that: (1) if a neighborhood contains exactly one cell of the state 1, the next state of the central cell is 1; (2) otherwise the next state of the cell is 0.
Such a 1-dimensional CA produces 2-dimensional patterns. Figure 5 illustrates an example of an evolving 1-dimensional CA with 7 cells after 3 applications of the CA-rules. Thus, a 1-dimensional configuration Si (0 ≤ i ≤ 20) is identified with the corresponding 2-dimensional configuration S′i, which is obtained by vertically arranging the Sj (j ≤ i) downward in the 100 × 21 grid. We used such 2-dimensional configurations S′0, . . . , S′20 as an input. As a result, we obtained the neighborhood, the condition tree, and the decision tree of Figure 6. The meaning of the figure is the same as that of Figure 4.
Fig. 6. Experimental result
It is worth noting that the obtained neighborhood (a) is 2-dimensional. The condition tree (b) means that Con0(R) = c3 and Con1(R) = c0 + c1 + c2. The decision tree (c) represents the following if-then rules:
N1,0: if Con0(R) = 0 and Con1(R) ≠ 1 then NC1,0 = 0,
N2,0: if Con0(R) = 0 and Con1(R) = 1 then NC2,0 = 1,
N1,1: if Con0(R) = 1 then NC1,1 = 1.
Viewing c3 as the central cell of the neighborhood, these rules are interpreted as follows: (N1,0) when the state of a central cell is 0 and the number of cells of the state 1 in the neighborhood is not 1, the state remains 0; (N2,0) when the state of a central cell is 0 and the number of cells of the state 1 in the neighborhood is exactly 1, the state changes to 1; (N1,1) when the state of a central cell is 1, the state remains 1. Applying these rules to the initial configuration S0 produces configurations S1, ..., S20, which coincide with the input configurations. This result shows that the procedure discovers new 2-dimensional CA-rules which reproduce the same pattern produced by the 1-dimensional CA. By this experiment, it is observed that the proposed method not only reproduces the original CA-rules, but can also discover new rules in different dimensions.
5 Conclusion
In this paper, we developed techniques for the automatic generation of cellular automata rules. We first extracted cases from input CA configurations and then built
a condition tree and a decision tree. A decision tree correctly classifies every case and expresses CA-rules which reproduce the input CA configurations. We showed by experiments that the proposed method can successfully find the original CA-rules which generate given 2-dimensional, 2/3-state CA configurations. It was also shown that new 2-dimensional 2-state CA-rules are discovered from input 1-dimensional 2-state CAs. In this paper, we performed experiments to reconstruct CA-rules from input CA configurations which were produced by existing CA-rules. In real-life problems, on the other hand, input configurations generally include noise, and the goal is then to discover unknown CA-rules if any exist. In this case, the relevant neighborhood of a cell would probably fail to converge for a moderate amount of noise, and it would be necessary to introduce an appropriate threshold on the number of conflicting cases. To apply the proposed method in practice, we will work on further refinement of these techniques in future research.
Discovery of Web Communities from Positive and Negative Examples

Tsuyoshi Murata¹,²

¹ National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430 JAPAN
[email protected]
http://research.nii.ac.jp/~tmurata
² Japan Science and Technology Corporation, Yoyogi Community Building, 1-11-2 Yoyogi, Shibuya-ku, Tokyo, 151-0053 JAPAN
Abstract. Several attempts have been made at Web structure mining, whose goals are to discover Web communities or to rank important pages based on the graph structure of hyperlinks. The discovery of Web communities, groups of related Web pages sharing common interests, is important for assisting users' information retrieval from the Web. There are several different granularities of overlapping Web communities, and this makes the identification of objective boundaries of Web communities difficult. This paper proposes a method for discovering Web communities from given positive and negative examples. Since the boundary of a Web community is hard to define only from positive examples, negative examples are used for limiting its boundary from the outer side of the Web community. Experimental results are shown and the effectiveness of our new method is discussed.
1 Introduction

As research on Web structure mining, which focuses on the graph structure of hyperlinks, the discovery of Web communities (related Web pages sharing common interests) is very important. Discovered Web communities are useful for Web page recommendation, and they will give insight into the phenomena of real human communities since the Web often reflects the trends of the real world. Several attempts have been made at Web community discovery. In general, personal Web pages often contain hyperlinks to Web pages of several genres. Web communities are not rigidly separated and often overlap with each other. Therefore, it is not easy to find objective boundaries of Web communities only from the graph structure of hyperlinks. Boundaries of Web communities depend heavily on users' viewpoints for recognizing relatedness among Web pages. For example, the discovery of Web communities about relatively broad topics, such as cars, is successful in our previous methods [8][9][10]. However, the discovery of Web communities of cars that are manufactured by a certain company is not easy since hyperlink-based methods in general often suffer from topic drift. The degree of relatedness among Web pages
depends heavily on users' information needs, and it is not an easy task to specify strict definitions of target Web communities in advance for discovery. Although there are several approaches to Web community discovery, most of them just discover related Web pages; discussions about the granularities or boundaries of Web communities are not sufficient. This paper describes a method for discovering Web communities from both positive and negative example Web pages. Positive example pages exemplify the target Web community and specify its boundary from the inside. Negative examples counter-exemplify the Web community and specify its boundary from the outside. Explicit specification of negative examples brings the following advantages: 1) interactive discovery of Web communities is possible, and 2) since the output of discovered Web communities depends on the given positive and negative examples, resultant Web communities can be improved by suitable strategies for giving examples.
2 Related Work
2.1 Discovery of Web Communities

There are two major approaches to discovering Web communities: 1) searching for a fixed-size graph structure in Web snapshot data, and 2) disconnecting a given Web graph into some densely connected subgraphs. Kumar's trawling [7] is a famous example of the first approach. In this approach, a bipartite graph is regarded as an indication of a Web community sharing a common interest. For example, Web pages of aircraft enthusiasts often have hyperlinks to the companies of aircraft manufacturers. The hyperlinks of these pages (enthusiasts and companies) compose a bipartite graph, and all of these pages are closely related. Kumar performed an experiment searching for bipartite graph structures and verified randomly selected samples by manual inspection. The results show that most of the pages composing a bipartite graph are closely related. As an example of the other approach to discovering Web communities, Flake [3] applies the maximum-flow minimum-cut theorem of network flow theory in order to discover densely connected subgraphs of hyperlinks, which can be regarded as Web communities. His approach is often explained by the following metaphor: if edges are water pipes and vertices are pipe junctions, the maximum flow problem tells us how much water we can move from one junction to another, and the maximum flow is proved to be identical to the minimum cut. Therefore, if you know the maximum flow between two points, you also know which edges you would have to remove to completely disconnect the same two points; these edges are called the cut set. The approach accepts some Web pages as seeds of the target Web community and finds a cut set that disconnects a subgraph containing the seed pages from the other (greater) part of the Web graph. Experimental results of his approach show that some important Web communities are discovered from a few given seed pages. Girvan [4] defines the edge betweenness of an edge as the number of shortest paths between pairs of vertices that run along it, and discovers densely connected subgraphs by removing edges of high edge betweenness.
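Flake's method involves additional machinery (an artificial source and sink and an iterated expansion), so the snippet below only illustrates the underlying max-flow/min-cut primitive on a toy hyperlink graph with networkx; the graph and its capacities are made up for illustration.

    import networkx as nx

    # Toy directed hyperlink graph with unit capacities on every edge:
    # "seed", "a", "b" link densely to each other, with a single link to the outside.
    g = nx.DiGraph()
    g.add_edges_from(
        [("seed", "a"), ("a", "seed"), ("seed", "b"), ("b", "seed"),
         ("a", "b"), ("b", "a"), ("b", "c"), ("c", "rest"), ("rest", "c")],
        capacity=1,
    )

    # The max-flow value equals the min-cut capacity between the two chosen pages,
    # and the returned partition separates the seed's densely linked side from the rest.
    cut_value, (seed_side, other_side) = nx.minimum_cut(g, "seed", "rest")
    print(cut_value, sorted(seed_side))    # 1 ['a', 'b', 'seed']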
Both of these approaches require relatively large-scale Web snapshot data. Moreover, discussion about the boundaries of Web communities cannot be found in previous research. Although there are some new attempts at discovering relations among several topics [1], more analysis is needed in order to clarify detailed relations among Web communities.

2.2 Data Acquisition from a Search Engine

Since the Web is huge and growing, collecting data for Web mining is not an easy task. It has been pointed out that the difference between the data used for Web mining and the data of the actual Web may cause outdated discovery of Web communities [7]. Major search engines contain abundant updated Web data, and they can be used for Web data acquisition. Some search engines allow users to access the contained data, such as the Google API [5]. Although most users use search engines in order to find Web pages about some keywords, search engines also enable us to follow hyperlinks backward. By attaching an option (such as "link:") to an input URL, Web pages that contain hyperlinks to the URL can be searched; these are called backlinks. Since hyperlinks to related Web pages often co-occur, backlink search enables us to find related Web pages. Although there are some attempts at data acquisition from a search engine [6], they are regarded as pre-processing or an expedient in the process of Web mining. This paper shows that data acquisition from a search engine brings advantages to Web community discovery.

2.3 Boundaries of Web Communities

As claimed by Kumar, pages whose hyperlinks compose a complete bipartite graph are surely related. However, the discovery of Web communities by searching for a fixed-size graph structure has limitations since the size and the structure of target Web communities are in general not known in advance. There might be other graph structures that correspond to meaningful collections of Web pages. Therefore, searching for a fixed-size graph structure is not always suitable for discovering real Web communities. The other approach, the discovery of dense subgraphs from a given Web graph, also has many problems. In general, several granularities of Web communities overlap since many personal link pages contain hyperlinks to the several different topics their owners are interested in. In addition, boundaries of Web communities depend heavily on users' viewpoints of relatedness and on their intention in using Web communities. It is not practical to discover objective boundaries of Web communities only from hyperlink information since several granularities of Web communities exist. Such problems are not discussed in previous research on Web community discovery. The author has been working on the discovery of Web communities [8][9][10]. These are attempts at visualizing, discovering, and purifying Web communities using a search engine as a data source. As the first step toward discovering different granularities of Web communities, this paper proposes a method for discovering Web communities from positive and negative example Web pages.
Previous approaches for discovering related Web pages, such as Dean's co-citation algorithm [2], use only some (positive) examples; there has been no attempt to use negative examples. By using negative examples explicitly, the following advantages are expected:
1) Easy exemplification of the target Web community. When a user wants to discover a Web community on a specific topic, it is not easy to describe precise definitions of the community. Exemplifying some Web pages is much easier for users, and they can easily specify the extent of target Web communities by providing negative examples. This ability enables interactive discovery of Web communities.
2) Clarifying relatedness among Web communities. As mentioned above, hyperlinks to pages of different topics often co-occur in many personal Web pages. The degree of such co-occurrence of hyperlinks can be regarded as the distance between the topics. If pages of two different topics are co-referred to by many Web pages, there are many people who take an interest in both topics. Web community discovery from positive and negative examples clarifies such distances between the topics of positive and negative pages.
3 Discovery of Web Communities from Positive and Negative Examples

In order to discover the boundary that divides the inside and outside of a Web community, a positive group and a negative group of Web pages are given as inputs. Related pages of either group are repeatedly added until both groups share a common Web page. Figure 1 shows the outline of our discovery method. Both positive examples (inside the oval) and negative examples (outside the oval) are increased one by one until neither can be increased anymore.
Fig. 1. Discovery of Web community from positive and negative examples
3.1 Addition of Related Web Pages

The method for finding related Web pages in this paper is the same as the one the author proposed previously [9]. The method is suitable for finding related pages from local Web data acquired from a search engine. This method is based on the assumption
that hyperlinks to related Web pages often co-occur. As shown in Figure 2, the method is a repetition of the following steps:
1) When a group of positive or negative URLs is given as input (centers), Web pages that contain hyperlinks to all of the given URLs are searched for by using a search engine (backlink search). The searched URLs are called fans in the following explanations.
2) All of the hyperlinks contained in the acquired fans are extracted, and the one URL that is most frequently pointed to by the extracted hyperlinks is added to the centers. Since hyperlinks to this URL and to the given centers co-occur many times, it is assumed that the URL is closely related to the given centers.
3) The above two steps are repeatedly applied in order to find more related pages.
Fig. 2. Addition of related Web pages using backlink search
3.2 Terminating the Addition of Related Web Pages

In the new method proposed in this paper, the procedure for discovering related pages is alternately applied to the positive and the negative examples in order to find more positive and negative examples. When the two groups overlap, or when no more related positive and negative pages are found (in the case that no backlinks are retrieved), the repetition is terminated.
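The following sketch outlines the alternating expansion and its termination test; it is not the author's Java system, and search_backlinks and most_frequent_link are hypothetical helpers standing in for the backlink search and hyperlink extraction described in Section 3.1.

    def expand_once(centers, search_backlinks, most_frequent_link):
        """One step of Section 3.1: find fans linking to every center and
        add the URL their hyperlinks point to most frequently."""
        fans = search_backlinks(centers)        # pages that link to all center URLs
        if not fans:
            return None                         # no backlinks -> this group cannot grow
        return most_frequent_link(fans, exclude=centers)

    def discover_community(positive, negative, search_backlinks, most_frequent_link):
        """Alternately grow the positive and negative groups (Section 3.2)."""
        positive, negative = set(positive), set(negative)
        while True:
            grown = False
            for group in (positive, negative):
                url = expand_once(group, search_backlinks, most_frequent_link)
                if url is not None:
                    group.add(url)
                    grown = True
            # Stop when the groups overlap or neither side can be extended.
            if positive & negative or not grown:
                return positive, negative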
4 Experiments

Based on the above method, a system for discovering Web communities was implemented in Java. The input to the system is some positive pages that are included in the target Web community and some negative pages that are not included in the target community. The number of given positive and negative pages affects the discovery of Web communities. Let us suppose that only a few pages regarding baseball are given as input positive pages. As the result of backlink search on a search engine, many backlinks will be acquired. However, many of them may not be hubs about baseball, since
they accidentally contain hyperlinks to only a few baseball pages, and such low-quality backlinks often cause distortion of the topics of Web communities. Fewer input pages will improve the quantity of backlinks at the expense of quality, and more input pages will improve the quality of backlinks at the expense of quantity. In the experiments shown below, one, two, or three pages are given as inputs. The number was determined by trial and error. The combination of the contents of positive and negative pages is also important for discovery. If the given positive and negative pages are completely unrelated, detecting the boundary between them is not easy, and the repetitive addition of related pages continues for a long time. In order to achieve quick termination, relatively close topics are used for the positive and negative examples.

4.1 Discovery Using Pages of Newspaper Publishing Companies and TV Stations

As an example of discovering a Web community, the following positive and negative URLs are given to our system:
– Positive examples: www.asahi.com, www.nikkei.co.jp, www.mainichi.co.jp
– Negative examples: www.ntv.co.jp, www.tbs.co.jp, www.fujitv.co.jp
Since newspaper publishing companies and TV stations are closely related in Japan, it is worth trying whether our system has the ability to find the boundary between these two groups. The results of the discovery are as follows:
– Positive: www.asahi.com, www.nikkei.co.jp, www.mainichi.co.jp, www.yomiuri.co.jp, www.sankei.co.jp, www.yahoo.co.jp
– Negative: www.ntv.co.jp, www.tbs.co.jp, www.fujitv.co.jp, www.nhk.or.jp, www.tv-tokyo.co.jp, www.yahoo.co.jp
As shown above, three pages were added to both the positive and the negative examples. www.yahoo.co.jp was added to both, and this caused the termination of the discovery process. The other two pages added to the positive examples (www.yomiuri.co.jp and www.sankei.co.jp) are indeed newspaper publishing companies, and the two pages added to the negative examples (www.nhk.or.jp and www.tv-tokyo.co.jp) are TV stations. This result shows that our system successfully discovers the Web communities of newspaper publishing companies and TV stations. At first glance, it may seem strange that www.yahoo.co.jp is located at the boundary of both Web communities. This phenomenon is often observed in our experiments. Since many Web pages contain hyperlinks to portal sites such as Yahoo, such sites are regarded as similar to several other topics based on hyperlink co-occurrence.

4.2 Discovery Using Pages of Japanese and American Newspaper Publishing Companies

There are many cases in which few hyperlinks co-occur although their contents are related. In order to show such an example, the following positive and negative URLs are given to the system:
– Positive examples: www.asahi.com, www.nikkei.co.jp, www.mainichi.co.jp
– Negative examples: www.nytimes.com, www.washingtonpost.com, www.usatoday.com
The positive pages are Japanese newspaper publishing companies, and the negative pages are those in the United States. The results of the discovery are as follows:
– Positive: www.yahoo.co.jp, www.lycos.co.jp, www.asahi.com, www.mainichi.co.jp, www.nikkansports.com, www.nikkei.co.jp, www.sanspo.com, www.sponichi.co.jp, www.excite.co.jp, 167.216.215.11/CHART/charttop.html, a.hatena.ne.jp, adios.edhs.ynu.ac.jp/natsu3, ame.x0.xom, angellink.milkybox.jp, auctions.yahoo.co.jp, wvexnettv.avexnet.or.jp, ...
– Negative: abcnews.go.com, www.cnn.com, www.cnnfn.com, www.csmonitor.com, www.drudgereport.com, www.latimes.com, www.nytimes.com, www.salon.com, 1banana.com, 7metasearch.com, Bartleby.com/61, abcnews.go.com/Sections/GMA, abcnews.go.com/Sections/Nightline, abcradio.com, ajr.newslink.org/busin.html, ...
Although the positive examples are the same as those of the previous experiment, the results are quite different since the negative examples are different. Both the positive and the negative examples in this experiment are indeed newspaper publishing companies. However, this result shows that pages in different languages are (almost) unrelated in terms of hyperlink graph structure, so the steps for adding pages are repeated many times. It should be mentioned that some positive pages that were included in the previous experiment (such as www.yomiuri.co.jp and www.sankei.co.jp) are not included in this result. Since positive and negative pages are added independently, these pages would be expected to appear in this result as well. They are missing because our system depends on Web data acquired from several Web servers over the Internet and gives up acquisition from Web servers whose responses are delayed by network conditions. This sometimes causes instability of the experimental results. Such instability is regarded as a characteristic of systems that depend on data acquired from the Internet.
5 Discussion

As shown in the experimental results, our system succeeds in discovering some pages related to the positive and negative examples, and the relatedness between the positive examples and the negative examples (based on hyperlink graph structure) is partly clarified by the number of related pages added in the repetitive process. Some of the results show that relatedness based on hyperlink graph structure is different from content-based relatedness. Some topics are closely related according to the results of our system although they are located separately in a conceptual hierarchy such as Yahoo's categories. This is not a weakness of our system. Discrepancies in relatedness between our system and Yahoo's categories mean that there are many people who have interests in both (categorically different) topics. Such information is important for suitable Web recommendation and Web advertisements.
It is often pointed out that topic-dependent ranking algorithms such as HITS [6] sometimes output pages that are not closely related to the topic of the initial pages, which is called topic drift. Since a topic-dependent ranking algorithm sorts Web pages in the order of relevance based solely on hyperlink graph structure, finding a boundary that divides related and unrelated pages is not easy. Our method is expected to avoid such topic drift by providing suitable negative examples. Providing all negative example pages in advance is not an easy task. In our new method, both positive and negative pages are increased by the above procedure, and the task of specifying negative examples is partly relieved.
6 Concluding Remarks

This paper describes a method for discovering Web communities from both positive and negative examples. By specifying negative examples of the target Web community explicitly, users can specify the extent of the target Web community rather easily, and interactive discovery of Web communities is enabled. In order to clarify the relations among several granularities of Web communities, there are many tasks to be done, such as discovering more members of target Web communities and giving explanations of discovered Web communities.
References
1. S. Chakrabarti, M. M. Joshi, K. Punera, D. M. Pennock: "The Structure of Broad Topics on the Web", Proc. of the 11th WWW Conference, pp. 251–262, 2002.
2. J. Dean, M. R. Henzinger: "Finding Related Pages in the World Wide Web", Proc. of the 8th WWW Conference, 1999.
3. G. W. Flake, S. Lawrence, C. L. Giles, F. M. Coetzee: "Self-Organization and Identification of Web Communities", IEEE Computer, Vol. 35, No. 3, pp. 66–71, 2002.
4. M. Girvan, M. E. J. Newman: "Community Structure in Social and Biological Networks", online manuscript, http://arxiv.org/abs/cond-mat/0112110/, 2001.
5. Google: "Google API", online document, http://www.google.com/apis/, 2002.
6. J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins: "The Web as a Graph: Measurements, Models, and Methods", Proc. of COCOON'99, LNCS 1627, pp. 1–17, 1999.
7. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins: "Trawling the Web for Emerging Cyber-Communities", Proc. of the 8th WWW Conference, pp. 403–415, 1999.
8. T. Murata: "Machine Discovery Based on the Co-occurrence of References in a Search Engine", Proc. of DS99, LNAI 1721, pp. 220–229, Springer, 1999.
9. T. Murata: "Discovery of Web Communities Based on the Co-occurrence of References", Proc. of DS2000, LNAI 1967, pp. 65–75, Springer, 2000.
10. T. Murata: "Finding Related Web Pages Based on Connectivity Information from a Search Engine", Poster Proc. of the 10th WWW Conference, pp. 18–19, 2001.
Association Rules and Dempster-Shafer Theory of Evidence

Tetsuya Murai¹, Yasuo Kudo², and Yoshiharu Sato¹

¹ Graduate School of Engineering, Hokkaido University, Sapporo 060-8628, JAPAN
² Muroran Institute of Technology, Muroran 050-8585, JAPAN
Abstract. The standard definition of confidence for association rules was proposed by Agrawal et al. based on the idea that co-occurrences of items in one transaction are evidence for association between the items. Since such a definition of confidence is nothing but a conditional probability, even weights are a priori assigned to every transaction that contains the items in question at the same time. Not all such transactions, however, necessarily give us such evidence, because some co-occurrences might be contingent. Thus the D-S theory is introduced to discuss how each transaction is estimated as evidence.
1 Introduction
Since Agrawal et al. [1] proposed association rules in data mining, a huge amount of practical research has been contributed, while their theoretical foundation still seems not to be fully developed. In our previous papers [7,8,9], we discussed association rules as conditionals in conditional logic in the sense of Chellas [2]. There, association rules were shown to be translatable into probability-based graded conditionals. In this paper, we examine co-occurrence as the basis for the confidence of association rules and then, using Dempster-Shafer theory [3,10] (D-S theory, for short), evidence-based confidence is introduced. Since the standard confidence defined by Agrawal et al. [1] based on the idea of co-occurrences is nothing but a conditional probability, even weights are a priori assigned to every transaction that contains the items at the same time. Not all such transactions, however, necessarily give us such evidence, because some co-occurrences might be contingent. We therefore introduce the D-S theory into association rules in order to describe such evidence in a more sophisticated way.
2 Association Rules
Let I be a finite set of items. Any subset X of I is called an itemset in I. A database is comprised of transactions, which are actually obtained or observed itemsets. In this paper, we define a database D on I as a pair ⟨T, V⟩, where T = {1, 2, ..., n} (n is the size of the database) and V : T → 2^I. For an itemset X, its degree of support s(X) is defined by

    s(X) = |{t ∈ T | X ⊆ V(t)}| / |T|,

where | · | is the size of a finite set.
Table 1. Movie database.

    No.  Transaction (movie)       AH  HM
     1   Secret people              1
     2   Monte Carlo baby           1
     3   Roman holiday              1
     4   My fair lady               1
     5   Breakfast at Tiffany's     1   1
     6   Charade                    1   1
     7   Two for the road           1   1
     8   Wait until dark            1   1
     9   Days of wine and rose          1
    10   The great race                 1
    11   The pink panther               1
    12   Sunflower                      1
    13   Some like it hot
    14   12 Angry men
    15   The apartment
    ...  ......
    100  Les aventuriers
Given a set of items I and a database D on I, an association rule is an implication of the form X ⇒ Y, where X and Y are itemsets in I with X ∩ Y = ∅. Two indices were also introduced in [1]. An association rule r = (X ⇒ Y) holds with confidence c(r) (0 ≤ c(r) ≤ 1) in D if and only if c(r) = s(X ∪ Y)/s(X). An association rule r = (X ⇒ Y) has a degree of support s(r) (0 ≤ s(r) ≤ 1) in D if and only if s(r) = s(X ∪ Y). For example, consider the movie database in Table 1, where AH and HM mean Audrey Hepburn and Henry Mancini, respectively. If you have watched several (famous) Ms. Hepburn's movies, you might have heard some wonderful music composed by Mr. Mancini. This can be represented by the association rule r = ({AH} ⇒ {HM}) with confidence c(r) = s({AH} ∪ {HM})/s({AH}) = 0.5 and degree of support s(r) = |{t ∈ T | {AH} ∪ {HM} ⊆ V(t)}|/|D| = 4/100 = 0.04.
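A minimal sketch (ours) of the two indices on the movie database of Table 1, with the database represented as a mapping t → V(t); only the AH and HM items are kept, which is all that matters for the rule r.

    def support(itemset, db):
        """s(X) = |{t in T : X is a subset of V(t)}| / |T|."""
        return sum(itemset <= items for items in db.values()) / len(db)

    def confidence(x, y, db):
        """c(X => Y) = s(X u Y) / s(X)."""
        return support(x | y, db) / support(x, db)

    # Table 1 as a dict t -> V(t): movies 1-8 contain AH, movies 5-12 contain HM,
    # and the remaining movies up to 100 contain neither item.
    db = {t: set() for t in range(1, 101)}
    for t in range(1, 9):
        db[t].add("AH")
    for t in range(5, 13):
        db[t].add("HM")

    print(support({"AH", "HM"}, db))       # s(r) = 4/100 = 0.04
    print(confidence({"AH"}, {"HM"}, db))  # c(r) = 0.04 / 0.08 = 0.5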
3 Association Rules in Graded Conditional Models
Given a finite set I of items as atomic sentences, a language L_gCL(I) for graded conditional logic is formed from I as the set of sentences closed under the usual propositional operators such as ⊤, ⊥, ¬, ∧, ∨, →, and ↔, as well as ⇒_k (graded conditional) for 0 < k ≤ 1. For a given database D on I, a graded conditional model is a structure M_gD = ⟨W, {g_k}_{0<k≤1}, v⟩ such that (1) W = T, (2) each g_k is defined from a fuzzy measure m as g_k(t, X) = {Y ∈ 2^W | m(Y, X) ≥ k}, and (3) for any item x in I, v(x, t) = 1 iff x ∈ V(t). The truth condition for ⇒_k in a graded conditional model is given by

    M_gD, t |= p ⇒_k q  iff  ‖q‖ ∈ g_k(t, ‖p‖),

where ‖p‖ denotes {t ∈ W | M_gD, t |= p}.
Table 2. Soundness results of graded conditionals by probability measures: for each of the rules and axiom schemata RCEA, RCEC, RCM, RCR, RCN, RCK, CM, CC, CR, CN, CP, CK, CD, and CDc, the table indicates the grade ranges (0 < k ≤ 1/2, 1/2 < k < 1, k = 1) at which it is sound.
The basic idea of this definition is the same as in the fuzzy-measure-based semantics for graded modal logic defined in [4,5,6]. For example, the usual degree of confidence [1] is nothing but the well-known conditional probability, so we define the functions g_k by conditional probability.

Definition 2. For a given database D on I, its corresponding probability-based graded conditional model M_gD is defined as a structure ⟨W, {g_k}_{0<k≤1}, v⟩, where g_k(t, X) = {Y ∈ 2^W | Pr(Y|X) ≥ k}, and Pr is the familiar conditional probability: for any X (≠ ∅) and Y in 2^W, Pr(Y|X) = |X ∩ Y| / |X|.

The truth condition of the graded conditional is given by M_gD, t |= p ⇒_k q iff Pr(‖q‖ | ‖p‖) ≥ k. We have several soundness results based on the probability-measure-based semantics (cf. [4,5,6]), shown in Table 2. Further, given a database D on I and its corresponding probability-based graded conditional model M_gD, for an association rule X ⇒ Y, we have c(X ⇒ Y) ≥ k iff M_gD |= p_X ⇒_k p_Y.
4 Dempster-Shafer-Theory-Based Confidence
D-S Theory and Confidence. The standard confidence [1] is based on the idea that co-occurrences of items in one transaction are evidence for association between the items. Since the definition of confidence is nothing but a conditional probability, even weights are a priori assigned to every transaction that contains the items in question at the same time. Not all such transactions, however, necessarily give us such evidence, because some co-occurrences might be contingent. Thus we need a framework that can differentiate proper evidence
Table 3. Soundness results of graded conditionals by belief and plausibility functions: for each of the rules and axiom schemata RCEA, RCEC, RCM, RCR, RCN, RCK, CM, CC, CR, CN, CP, CK, CD, and CDc, the table indicates the grade ranges (0 < k ≤ 1/2, 1/2 < k < 1, k = 1) at which it is sound under the belief-function and the plausibility-function semantics.
from contingent evidence, and we introduce the Dempster-Shafer theory of evidence [3,10] to describe such a more flexible framework for computing confidence. There are a variety of ways of formalizing D-S theory, and in this paper we adopt the multivalued-mapping-based approach, which was used by Dempster [3]. In this approach, we need two frames, one of which has a probability defined on it, and a multivalued mapping between the two frames. Given a database D = ⟨T, V⟩ on I and an association rule r = (X ⇒ Y) in D, one of the frames is the set T of transactions. The other one is defined by R = {r, r̄}, where r̄ denotes the negation of r. The remaining tasks are (1) to define a probability distribution Pr on T, Pr : T → [0, 1], and (2) to define a multivalued mapping Γ : T → 2^R. Given Pr and Γ, we can define the two well-known kinds of functions in Dempster-Shafer theory: for X ⊆ R,

    Bel(X) = Pr({t ∈ T | Γ(t) ⊆ X})  and  Pl(X) = Pr({t ∈ T | Γ(t) ∩ X ≠ ∅}),

called the belief and plausibility functions. Now we have the following double-indexed confidence: c(r) = ⟨Bel(r), Pl(r)⟩.

Multi-graded Conditional Models for Databases. Given a finite set I of items as atomic sentences, a language L_mgCL(I) for graded conditional logic is formed from I as the set of sentences closed under the usual propositional operators as well as two graded conditionals, ⇒_k (belief-graded) and ⇉_k (plausibility-graded), for 0 < k ≤ 1. For a given database D on I, a multi-graded conditional model is a structure M_mgD = ⟨W, {g_k}_{0<k≤1}, {ḡ_k}_{0<k≤1}, v⟩ such that (1) W = T, (2) g_k and ḡ_k are defined by the belief and plausibility functions as g_k(t, X) = {Y ∈ 2^W | Bel(Y, X) ≥ k} and ḡ_k(t, X) = {Y ∈ 2^W | Pl(Y, X) ≥ k}, and (3) for any item x in I, v(x, t) = 1 iff x ∈ V(t).
The truth conditions for ⇒_k and ⇉_k are given by M_mgD, t |= p ⇒_k q iff ‖q‖ ∈ g_k(t, ‖p‖), and M_mgD, t |= p ⇉_k q iff ‖q‖ ∈ ḡ_k(t, ‖p‖), respectively. The basic idea is again the same as in the fuzzy-measure-based semantics for graded modal logic defined in [4,5,6]. Several soundness results based on the belief- and plausibility-function-based semantics (cf. [4,5,6]) are shown in Table 3.

Two Typical Cases. First we define a probability distribution on T by

    Pr(t) = 1/a  if t ∈ ‖p_X‖,  and  Pr(t) = 0  otherwise,

where a = |‖p_X‖|. This means that each world (transaction) t in ‖p_X‖ is given an even mass (weight) 1/a. To generalize the distribution is of course another interesting task. Next we shall see two typical cases of the definition of Γ.

First we describe the strongest case, in which we define the mapping Γ by

    Γ(t) = {r}  if t ∈ ‖p_X ∧ p_Y‖,  and  Γ(t) = {r̄}  otherwise.

This means that the transactions in ‖p_X ∧ p_Y‖ contribute as evidence to r, while the transactions in ‖p_X ∧ ¬p_Y‖ contribute as evidence to r̄. This is the strongest interpretation of co-occurrences. Then we can compute Bel(r) = (1/a) × b and Pl(r) = (1/a) × b, where b = |‖p_X ∧ p_Y‖|. Thus the induced belief and plausibility functions collapse to the same probability measure Pr: Bel(r) = Pl(r) = Pr(r) = b/a, and thus c(r) = ⟨b/a, b/a⟩. Hence this case represents the usual confidence. According to this idea, in our movie database, we can define Pr and Γ in the way shown in Figure 1. That is, any movie in ‖AH ∧ HM‖ contributes as evidence that the rule holds (r), while every movie in ‖AH ∧ ¬HM‖ contributes as evidence that the rule does not hold (r̄). Thus we have c({AH} ⇒ {HM}) = ⟨0.5, 0.5⟩.

Next we describe the weakest case. In general, co-occurrences do not necessarily mean actual association. The weakest interpretation of co-occurrences is to consider such transactions totally unknown, as follows: we define the mapping Γ by

    Γ(t) = {r, r̄}  if t ∈ ‖p_X ∧ p_Y‖,  and  Γ(t) = {r̄}  otherwise.

This means that the transactions in ‖p_X ∧ p_Y‖ contribute as evidence only to R = {r, r̄}, while the transactions in ‖p_X ∧ ¬p_Y‖ contribute as evidence to r̄. Then we can compute Bel(r) = 0 and Pl(r) = (1/a) × b, and thus c(r) = ⟨0, b/a⟩. According to this idea, in our movie database, we can define Pr and Γ in the way shown in Figure 2.
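A small sketch (ours, with our own variable names) of the double-indexed confidence ⟨Bel(r), Pl(r)⟩ for a given mass distribution Pr and multivalued mapping Γ, reproducing the strongest and weakest cases on the movie database.

    def ds_confidence(pr, gamma):
        """Return (Bel(r), Pl(r)) for Pr : T -> [0,1] and Gamma : T -> subsets of {r, not-r}."""
        # Gamma(t) is nonempty, so Gamma(t) being a subset of {r} means Gamma(t) == {r}.
        bel = sum(p for t, p in pr.items() if gamma[t] == {"r"})
        pl = sum(p for t, p in pr.items() if "r" in gamma[t])      # Gamma(t) meets {r}
        return bel, pl

    # Movies 1-8 contain AH; among them, movies 5-8 also contain HM (Table 1).
    ah = range(1, 9)
    ah_and_hm = set(range(5, 9))
    pr = {t: 1 / 8 for t in ah}                                    # even mass 1/a, a = 8

    strongest = {t: {"r"} if t in ah_and_hm else {"not-r"} for t in ah}
    weakest = {t: {"r", "not-r"} if t in ah_and_hm else {"not-r"} for t in ah}

    print(ds_confidence(pr, strongest))   # (0.5, 0.5) -- the usual confidence
    print(ds_confidence(pr, weakest))     # (0.0, 0.5)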
Fig. 1. An example of the strongest case: each of the eight movies containing AH (movies 1–8 in Table 1) receives the mass Pr = 1/8; Γ maps movies 5–8 (which also contain HM) to {r} and movies 1–4 to {r̄}, so that m({r}) = m({r̄}) = 1/2.
AH 1 1 1 1
HM
Pr
5
Breakfast at Tiffany’s
1
1
1 8 1 8 1 8 1 8 1 8
6
Charade
1
1
1 8
7
Two for the road
1
1
1 8
8
Wait until dark
1
1
1 8
9 10 11 12 13 14 15 100
Days of wine and rose The great race The pink panther Sunflower Some like it hot 12 Angry men The apartment ······ Les aventuriers
1 1 1 1
;694 r} iimimrv{r, imimimrimrvmrvrv {r} i i i i rv +,./ {r} iiii mmmrrvv iiiimmmmrmrrvrvv i i ∅ i i m v r m mm rrrrvvv m m mm rrr vvv Γ r rr vv rr vvv vv vv
1 2
0 1 2
0
0 0 0 0 0 0 0 0
Fig. 2. An example of the weakest cases.
That is, every movie in ‖AH ∧ ¬HM‖ contributes as evidence that the rule does not hold (r̄), while we cannot tell whether each movie in ‖AH ∧ HM‖ contributes as evidence that the rule holds (r). Thus we have c({AH} ⇒ {HM}) = ⟨0, 0.5⟩. In this case, the induced belief and plausibility functions, denoted Bel_bpa and Pl_bpa respectively, become necessity and possibility measures in the sense of Dubois and Prade. We have several soundness results based on the necessity- and possibility-measure-based semantics (cf. [4,5,6]), shown in Table 4.
Fig. 3. An example of the general case: Γ maps movies 1, 2, and 7 to {r, r̄}, movies 5, 6, and 8 to {r}, and movies 3 and 4 to {r̄}, so that m({r}) = 3/8, m({r̄}) = 1/4, and m({r, r̄}) = 3/8.

Table 4. Soundness results of graded conditionals by necessity and possibility measures: for each of the rules and axiom schemata, the table indicates the grade ranges at which it is sound under the necessity-measure and the possibility-measure semantics.
General Cases. In the previous two typical cases, one of which coincides with the usual confidence, every transaction in ‖AH ∧ HM‖ (or in ‖AH ∧ ¬HM‖) has the same weight as evidence. It is, however, possible that some transactions in ‖AH ∧ HM‖ (or ‖AH ∧ ¬HM‖) work as positive evidence for r (or r̄) while others do not. Thus we have a tool that allows us to introduce various kinds of 'a posteriori' pragmatic knowledge into the logical setting of association rules. As an example, we assume that (1) the music of the first and second movies was not composed by Mancini, but this fact does not affect the validity of r because they are not very important movies, and (2) the music of the seventh movie was composed by Mancini, but this fact does not affect the validity of r. Then we can define Γ in the
way shown in Figure 3. Thus we have c({AH} ⇒ {HM}) = ⟨0.375, 0.75⟩. In general, users have this kind of knowledge 'a posteriori.' Thus the D-S-based approach allows us to introduce various kinds of 'a posteriori' pragmatic knowledge into association rules.
5 Concluding Remarks
Users have, in general, the kind of 'a posteriori' knowledge described in the previous section. Thus the D-S-based approach allows a sophisticated way of calculating confidence by introducing various kinds of 'a posteriori' pragmatic knowledge into association rules. This research was partially supported by Grant-in-Aid No. 14380171 for Scientific Research (B) of the Japan Society for the Promotion of Science.
References
1. Agrawal, R., Imielinski, T., Swami, A. (1993): Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Conf. on Management of Data, 207–216.
2. Chellas, B.F. (1980): Modal Logic: An Introduction. Cambridge Univ. Press, Cambridge.
3. Dempster, A.P. (1967): Upper and Lower Probabilities Induced by a Multivalued Mapping. Ann. Math. Stat., 38, 325–339.
4. Murai, T., Miyakoshi, M., Shimbo, M. (1993): Measure-Based Semantics for Modal Logic. In: R. Lowen and M. Roubens (eds.), Fuzzy Logic: State of the Art, Kluwer, Dordrecht, 395–405.
5. Murai, T., Miyakoshi, M., Shimbo, M. (1994): Soundness and Completeness Theorems Between the Dempster-Shafer Theory and Logic of Belief. Proc. 3rd FUZZ-IEEE (WCCI), 855–858.
6. Murai, T., Miyakoshi, M., Shimbo, M. (1995): A Logical Foundation of Graded Modal Operators Defined by Fuzzy Measures. Proc. 4th FUZZ-IEEE/2nd IFES, 151–156.
7. Murai, T., Sato, Y. (2000): Association Rules from a Point of View of Modal Logic and Rough Sets. Proc. 4th AFSS, 427–432.
8. Murai, T., Nakata, M., Sato, Y. (2001): A Note on Conditional Logic and Association Rules. In: T. Terano et al. (eds.), New Frontiers in Artificial Intelligence, LNAI Vol. 2253, Springer, 390–394.
9. Murai, T., Nakata, M., Sato, Y. (2002): Association Rules as Relative Modal Sentences Based on Conditional Probability. Communications of Institute of Information and Computing Machinery, 5(2), 73–76.
10. Shafer, G. (1976): A Mathematical Theory of Evidence. Princeton University Press.
Subgroup Discovery among Personal Homepages

Toyohisa Nakada and Susumu Kunifuji

Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa 923-1292, Japan
{t-nakada, kuni}@jaist.ac.jp
Abstract. This paper discusses our algorithm for finding subgroups among personal homepages. Assuming that personal homepages usually carry personal information, we have developed an algorithm that allows us to automatically find potential patterns from them. For example, when the algorithm is applied to personal homepages at some school, we can approximate the ratio between the number of students interested in information science and that of students interested in social science. In the experiment, we successfully created subgroups that showed characteristics of the school. Also, we found relations between subgroups that are important for enhancing human activity.
1 Introduction
The study of relations between the real world and cyberspace has recently received much attention from researchers. The relations are such that the real world is influenced by cyberspace, and vice versa. CommunityWare, e.g., [2], [8], is research based on the former relation; its purpose is mainly to enhance human activities by using cyberspace. On the other hand, the purpose of research on the latter relation is mainly to understand real-world phenomena through cyberspace. For example, in order to understand user behavior, e-mail, news, Web server logs, and so on have been analyzed, e.g., [1], [7]. Our study belongs to the latter type of research; the purpose is to develop an unsupervised learning algorithm for finding interesting patterns in personal homepages. The algorithm finds potential relations between personal homepages by finding subgroups of them. In other words, our algorithm finds subgroups among personal homepages, which are sets of web documents. Web document grouping has been applied in the field of information retrieval to achieve better efficiency and to summarize results from search engines. Our study differs from these in two ways. First, we find subgroups for knowledge discovery. Second, instead of web pages, we treat web sites, i.e., sets of web pages, as our input data so that the resulting subgroups carry additional information, namely the owners of the personal homepages. This paper is organized as follows: Section 2 describes our proposed algorithm for finding subgroups among personal homepages. Section 3 goes over the two experiments we conducted and provides an evaluation of our algorithm. Conclusions are given in Section 4.
2 Finding Subgroups among Personal Homepages
The input of our algorithm is a set of top pages of personal homepages. In our study, a site is defined as a set of web pages that consist of one top page and the rest located under the top page directory. For example, http://www.jaist.ac.jp/~t-nakada/ is the top page of the first author's homepage. Similarly, http://www.jaist.ac.jp/~t-nakada/myself.html is a page that belongs to the author's homepage, while http://www.jaist.ac.jp/ is a page that does not belong to the author's homepage. The output of our algorithm is a set of subgroups of personal homepages and a list of keywords that describe each of them.
2.1 Gathering Words and Hyperlinks from Personal Homepages
Our system takes a set of top pages of personal homepages as its input and starts processing by gathering all words and hyperlinks from all pages underneath. It travels from the top personal page recursively through subpages to pick up all hyperlinks and words that show up. The words are then transformed by a light stemming algorithm (deleting word prefixes and suffixes and reducing plurals to singular), non-nouns are removed, and non-word tokens (such as numbers, HTML tags, and most punctuation) are stripped. Words and hyperlinks that occur in only a single personal homepage are ignored to reduce computation time. This hyperlink and word information is organized into the data structure shown in Fig. 1. On the left-hand side lies a list of personal homepages, and on the right-hand side lies a list of all words and hyperlinks from the personal homepages. An arrow pointing from the left side to the right side means that the personal homepage at the arrow tail has the content at the arrowhead.
Fig. 1. Data Structure of content information. Contents are either words or hyperlinks in personal homepages
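A rough sketch (ours) of how such a table from homepages to contents could be built; the regular expressions only approximate the hyperlink extraction, stemming, and part-of-speech filtering described above.

    import re
    from collections import defaultdict

    def build_content_table(site_pages):
        """site_pages: {top_page_url: [text of each page under that top page]}.
        Returns {top_page_url: set of contents}, a content being a word or a hyperlink."""
        table = {}
        for site, pages in site_pages.items():
            contents = set()
            for page in pages:
                contents.update(re.findall(r'href="([^"]+)"', page))                  # hyperlinks
                contents.update(w.lower() for w in re.findall(r"[A-Za-z]{3,}", page)) # crude word tokens
            table[site] = contents
        return table

    def drop_singletons(table):
        """Ignore contents that occur in only a single personal homepage."""
        counts = defaultdict(int)
        for contents in table.values():
            for c in contents:
                counts[c] += 1
        return {site: {c for c in contents if counts[c] > 1}
                for site, contents in table.items()}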
2.2 Finding Subgroups Algorithm
The following shows the algorithm that is applied to the list of contents on the right hand side in Fig. 1 to construct subgroups. The bold strings are to be explained later.
1. Choose a seed content from the list of contents based on the criterion discussed in Section 2.3.
2. From the list of contents, get the contents similar to the seed content (the similarity measure is discussed in Section 2.4).
3. Construct a subgroup from the seed content and the contents from step 2.
4. Iterate steps 1–3 until the expected number of subgroups is obtained.
We need to repeat these steps as many times as the number of subgroups desired. The time taken to compute this algorithm is O(mn), where m is the number of subgroups to be created and n is the number of contents. Querying a search engine is used in steps 1 and 2 (the details are discussed later). Because our algorithm depends on these queries, it is very slow in real time even though its computational cost is O(mn).
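A compact sketch of steps 1–4; select_seed and is_similar are placeholders for the criterion of Section 2.3 and the similarity test of Section 2.4, and how already-grouped contents are treated in later iterations is our assumption, since the paper does not specify it.

    def find_subgroups(contents, num_subgroups, select_seed, is_similar):
        """Greedily build subgroups of contents around high-scoring seeds."""
        used_seeds = set()
        subgroups = []
        for _ in range(num_subgroups):                      # step 4: repeat
            candidates = [c for c in contents if c not in used_seeds]
            if not candidates:
                break
            seed = select_seed(candidates)                  # step 1 (Definition 1)
            group = [seed] + [c for c in contents
                              if c != seed and is_similar(seed, c)]   # step 2 (Definition 2)
            subgroups.append(group)                         # step 3
            used_seeds.add(seed)
        return subgroups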
2.3 Criterion for Selecting Seed Content
The criterion for selecting a seed content depends on the purpose of finding subgroups. For example, if the purpose is to find the most predominant contents, the criterion should be to select a content that is used in a large number of personal homepages. However, the result may be trivial because, in the case of hyperlinks as contents, everyone already knows the most famous hyperlinks such as http://www.yahoo.com/, so the resulting subgroups are not very interesting, and the algorithm cannot be expected to find the characteristics of the input data. Therefore, we developed the following criterion in order to find subgroups from contents that are famous in a given domain but not in general. We use a score measurement, denoted score, for a given content_i and choose the content whose score in Definition 1 is the largest.
Definition 1. score(content_i) for selecting a seed content is defined as the score in the domain (d_score(content_i)) minus the score in general (g_score(content_i)):

    score(content_i) = d_score(content_i) − g_score(content_i)                             (1)
    d_score(content_i) = Σ_{n ∈ personal homepages that have content_i} content_score(n)   (2)
    content_score(n) = 1 / (# of all contents in the personal homepage n)                  (3)
    g_score(content_i) = # of pages that have content_i obtained from a search engine      (4)

where, in equation (1), d_score(content_i) and g_score(content_i) are standardized to mean 0 and standard deviation 1 in order to make the two types of score comparable. Equation (3) keeps the obtained subgroups from depending too heavily on personal homepages that have a huge number of pages.
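A sketch (ours) of Definition 1; search_hit_count is a hypothetical wrapper around a search-engine query that returns the general-Web page count for a content.

    import statistics

    def seed_scores(table, search_hit_count):
        """table: {homepage: set of contents}.  Returns {content: score} per Definition 1."""
        contents = {c for cs in table.values() for c in cs}
        d_score = {c: sum(1.0 / len(cs) for cs in table.values() if c in cs)  # eqs. (2), (3)
                   for c in contents}
        g_score = {c: float(search_hit_count(c)) for c in contents}           # eq. (4)

        def standardize(values):
            mean = statistics.mean(values.values())
            sd = statistics.pstdev(values.values())
            return {c: (v - mean) / sd if sd else 0.0 for c, v in values.items()}

        d_z, g_z = standardize(d_score), standardize(g_score)
        return {c: d_z[c] - g_z[c] for c in contents}                         # eq. (1)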
2.4 Similarity Measure between Two Contents
We use a similarity measure introduced in REFERRAL [3] and in [4]. In these systems, similarity between hyperlinks has the following definition, which we extend to measure similarity between words and similarity between a hyperlink and a word.

Definition 2. Similarity between content_i and content_j is defined by the Jaccard coefficient [6]:

    similarity(content_i, content_j)
        = (# of pages that have content_i and content_j)
          / (# of pages that have content_i + # of pages that have content_j)   (5)
In equation (5), we used a search engine such as Infoseek (http://www.infoseek.co.jp/) in order to count the number of pages that have a given content, as opposed to counting the number of such pages within the target domain, since the precision of similarity(content_i, content_j) depends on the size of the data. similarity(content_i, content_j) is compared with a cut-off point above which the two contents are regarded as similar and below which they are regarded as not similar. To determine our cut-off point, we performed an experiment in which the authors picked 200 pairs of contents and determined whether they are similar or not. The result suggested that the cut-off point should be 0.04.
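A sketch (ours) of equation (5) and of the cut-off test; hit_count and hit_count_both are hypothetical wrappers around search-engine queries for single contents and for co-occurrences, respectively.

    CUTOFF = 0.04

    def similarity(content_i, content_j, hit_count, hit_count_both):
        """Equation (5): co-occurrence count divided by the sum of the individual counts."""
        denom = hit_count(content_i) + hit_count(content_j)
        return hit_count_both(content_i, content_j) / denom if denom else 0.0

    def is_similar(content_i, content_j, hit_count, hit_count_both):
        return similarity(content_i, content_j, hit_count, hit_count_both) >= CUTOFF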
3 Experiment and Evaluation
We randomly picked two hundred personal homepages under Stanford University's official homepage (http://www.stanford.edu/leland/dir.html) and ran our algorithm to find subgroups of them. Table 1 shows the eleven subgroups we discovered. Our algorithm stopped searching for subgroups when it had found eleven seed contents, because there were only eleven seed contents in the personal homepages at Stanford University. The first column shows the number of personal homepages in each subgroup, the second column shows the number of contents (words or hyperlinks), and the third column shows the contents. The first content in the contents field is, in particular, the seed content. A content with an http:// prefix is a hyperlink, and the others are word contents. In the first subgroup, the characteristic of our algorithm is conspicuous in that both word and hyperlink contents describe the subgroup. However, the rest of the subgroups turned out to have only one type of content. The reason is that the value of the similarity measure (Definition 2) becomes closer to zero when there is a big difference between the two numbers of pages. Even when two contents are actually similar to the human eye, they are determined to be not similar by the similarity measure. Also, generally speaking, the number of pages with hyperlink contents is much smaller than the number of pages with word contents. Thus, the algorithm tends to construct subgroups that can be described only by word contents or only by hyperlink contents, and not both.
Table 1. Results from personal homepages at Stanford University. This list is arranged in the order of score(content_i) (Definition 1), where content_i is a seed content.

    Subgroup #  # of Personal Homepages  # of Contents  Contents
    1           160                      7              stanford, http://www.stanford.edu, harvard, cornell, berkeley, yale, princeton
    2           67                       1              webauth
    3           77                       4              apache, index, perl, linux
    4           94                       22             university, science, student, college, research, faculty, institute, school, department, professor, ...
    5           43                       7              instructor, classroom, undergraduate, curriculum, enrollment, lecture, exam
    6           8                        1              coworker
    7           9                        2              alta, vista
    8           11                       2              http://www.altavista.com, http://www.google.com
    9           15                       8              yay, ork, gander, xia, mso, csg, cali, cuz
    10          24                       7              humanity, anthropology, literary, psychology, interdisciplinary, scholar, discipline
    11          5                        1              psychologist
Looking at the second column, we are able to see the size of each subgroup at the university as a whole. For instance, seventy-seven people belong to the information science subgroup (the third subgroup from the top), and this is about three times as big as the humanities and social science subgroup (the tenth subgroup). Although this may not reflect the actual situation at the university, since we can expect that a larger percentage of people from information science own homepages than people from the humanities and social sciences, the algorithm can still help us get some idea of what the large entity looks like. Table 2 shows another result, from MIT. In the same way as for Stanford University, we randomly picked two hundred personal homepages under MIT's official homepage (http://www.mit.edu/Home-byUser.html). We see some differences between Stanford University and MIT. For example, the humanities and social science subgroup appeared only at Stanford University (the tenth subgroup in Table 1), while the subgroup described by brain, disease, and so on appeared only at MIT. Fig. 2 shows our visualization tool, which makes clear two types of relations. One is the relation between individuals (i.e., people in a subgroup are related to each other from the viewpoint of sharing the same interests). The other is the relation between subgroups. A large rectangle represents a subgroup, and a small rectangle inside a subgroup represents a personal homepage. The size of a large rectangle is proportional to the number of its members.
Table 2. Results from personal homepages at MIT. This list is arranged in the order of score(content_i) (Definition 1), where content_i is a seed content.

    Subgroup #  # of Personal Homepages  # of Contents  Contents
    1           125                      1              mit
    2           74                       2              http://web.mit.edu, http://www.mit.edu
    3           99                       111            engineer, assistant, guitar, component, keyboard, scientist, genre, installation, contract, ...
    4           112                      31             research, institute, science, university, analysis, laboratory, publication, analyst, researcher, ...
    5           82                       30             cambridge, oxford, boston, massachusetts, vienna, netherlands, denmark, greece, austria, hungary, ...
    6           43                       3              design, engine, career
    7           108                      87             class, function, level, object, process, value, cource, context, example, java, bulk, solution, ...
    8           20                       1              mechanic
    9           30                       4              pdf, acrobat, adobe, format
    10          44                       26             brain, disease, disorder, blood, patient, tissue, ear, diagnosis, cell, cancer, muscle, pain, symptom, ...
    11          25                       3              http://www.yahoo.com, yahoo, japan
    12          27                       1              Photo
    13          52                       21             biology, chemistry, ecology, physics, mathematics, taxonomy, biochemistry, species, medicine, ...
    14          4                        2              watt, transmitter
All subgroups are shown in the right panel. When a subgroup is selected, it appears in the left main panel. We found relations among subgroups in the result for Stanford University. Subgroup #10 can be seen as a connection between subgroups #3 and #11, and in this case ten people become potential key people who have the ability to create interaction between subgroups #3 and #11 from the viewpoint of sharing the same interests. We think it is important to find such potential key people in order to enhance human activity.
4 Conclusion
We discussed our algorithm for finding subgroups among personal homepages. In the experiments, we successfully created subgroups described by both types of contents and, at the same time, made clear some issues to be solved in the future.
Fig. 2. Sample relations between subgroups and between individuals at Stanford University
We think that one application of our algorithm is to enhance human activity. It is possible to construct a new real-world community and produce new interactions between people if subgroups of personal homepages can be found, because although a subgroup is not yet a real-world community, the people in a subgroup share some interest. Another application is to understand human dynamics. Although the update frequency of personal homepages is uneven, personal homepages are changed by their owners. It is possible to learn about human dynamics if our algorithm is applied periodically to pick up any changes.
References
1. Judith S. Donath: Visual Who: Animating the Affinities and Activities of an Electronic Community. In: ACM Multimedia 95, 1995.
2. Toru Ishida: Towards Communityware. New Generation Computing, 16(1), pp. 5–22, 1998.
3. H. Kautz, B. Selman, and M. Shah: The Hidden Web. AI Magazine, Vol. 18, No. 2, pp. 27–36, 1997.
4. Tsuyoshi Murata: Discovery of the Structures of Web Communities. JSAI SIG-KBS-A002-2, pp. 7–12, 2000 (in Japanese).
5. Toyohisa Nakada, Tu Bao Ho, and Susumu Kunifuji: Finding Potential Human Communities from Personal Homepages. In: Proceedings of the IASTED International Conference ACI 2002, pp. 191–196, 2002.
6. G. Salton: Automatic Text Processing. Addison-Wesley, Reading, MA, 1989.
7. Myra Spiliopoulou, Lukas C. Faulstich, Karsten Winkler: A Data Miner Analyzing the Navigational Behavior of Web Users. In: Proceedings of the Workshop on Machine Learning in User Modeling of the ACAI 99, 1999.
8. Yasuyuki Sumi, Kenji Mase: Supporting Awareness of Shared Interests and Experiences in Community. In: Proceedings of the International Workshop on Awareness and the WWW, held at the ACM CSCW'2000 Conference, 2000.
Collaborative Filtering Using Projective Restoration Operators

Atsuyoshi Nakamura, Mineichi Kudo, Akira Tanaka, and Kazuhiko Tanabe

Division of Systems and Information Engineering, Graduate School of Engineering, Hokkaido University, Sapporo 060-8628 JAPAN
{atsu, mine, takira, tanabe}@main.eng.hokudai.ac.jp
Abstract. We propose a modified version of our collaborative filtering method using restoration operators, which was proposed in [6]. Our previous method was designed so as to minimize the expected squared error of predictions for users' ratings, and we experimentally showed that, for users who have evaluated only a small number of items, the mean squared error of our method is smaller than that of correlation-based methods. After further experiments, however, we found that, for users who have evaluated many items, the best correlation-based method has a smaller mean squared error than our method. In our modified version, we incorporate the idea of projecting on a low-dimensional subspace into our method using restoration operators. We experimentally show that our modification overcomes the shortcoming stated above.
1
Introduction
Information filtering is one of the indispensable technologies of this age for managing the large volume of information on the Internet. Collaborative filtering has attracted a lot of attention because its filtering quality is expected to be high, based on the fact that it recommends items which were actually preferred by other users having similar tastes. There are two collaborative filtering approaches, the memory-based and the model-based approach [2]. Methods of the memory-based approach calculate similarities between users and make predictions based on similar users' ratings. The methods belonging to this category are correlation-based methods [8,9], methods using vector similarity [2], and methods using weighted majority prediction algorithms [5]. The model-based approach needs to learn model parameters, which is usually done offline, while the memory-based approach requires no learning. Employing probabilistic models is popular among the methods of the model-based approach; for example, methods using Bayesian clustering, Bayesian network models [2] and finite mixture models [4] have been proposed. Generally, the model-based approach requires less memory and less computation than the
This work was partly supported by Grant-in-Aid for Scientific Research (B), No. 14380151, from Japan Society for the Promotion of Science.
memory-based approach to provide online recommendation services, which is very important when the number of users is large. According to the experimental results reported in [2], the prediction performance of both approaches is comparable. Among the methods of the non-probabilistic model-based approach, methods using Singular Value Decomposition (SVD) are popular. Billsus and Pazzani [1] made use of SVD to reduce the dimensionality of the feature space, which made it possible to efficiently apply machine learning methods to collaborative filtering. In [10], a more direct approach was taken, and its good prediction performance was demonstrated. This approach directly utilizes the fact that, for an arbitrary k, SVD finds the best k-rank matrix approximating the original matrix in terms of the distance measured by the Frobenius norm. In this method, a user-item rating matrix, in which every user's entry for a non-evaluated item is filled with the average rating for the item, is approximated by a k-rank matrix plus the matrix whose column values are the average rating for the item corresponding to the column, and the entry values of the approximated matrix are used for predictions. This method can be seen as a projection of users' rating vectors (translated by a mean vector) onto a k-dimensional subspace extracted by SVD. Another non-probabilistic model-based approach is our method using restoration operators, which was proposed in [6]. Our method was designed so as to minimize the expected squared error of predictions for users' ratings, and can be regarded as an application of the Wiener filter originally used in digital image restoration [7]. In [6], we experimentally showed that, for users who have evaluated only a small number of items, the mean squared error of our method is smaller than that of correlation-based methods. After further experiments, however, we found that, for users who have evaluated many items, the best correlation-based method has a smaller mean squared error than our method. In this paper, we propose a modified version of our method into which we incorporate the idea of projecting on a low-dimensional subspace. According to our experimental results using the "EachMovie" data set, our new method showed good prediction performance in terms of mean squared error (MSE) for users who have rated many items. This paper is organized as follows. In Section 2, we review our previous method using restoration operators. Section 3 describes our new method using projective restoration operators. We report our experimental results in Section 4. Section 5 discusses the computational and space costs of our methods.
2 Method Using Restoration Operators
2.1 Problem Formalization and Its Solution
Let m denote the number of items and x ∈ R^m denote a vector representing the user's preference for all items. Assume that x is generated according to an arbitrary distribution D over R^m. For J = {i_1, i_2, ..., i_k} ⊆ {1, 2, ..., m}, let P_J denote a restriction transformation (matrix) that restricts the components of a vector to those in set J, namely, P_J is a transformation such that P_J x = (x_{i_1}, x_{i_2}, ..., x_{i_k})^T for x = (x_1, x_2, ..., x_m)^T, where v^T for a vector v denotes the transpose of v. The task of collaborative filtering can be seen as an estimation of x from an observed partial vector P_J x for some set J. The collaborative filtering problem considered here is formalized as follows.

Problem 1. Given a set J with |J| = k (here, |J| denotes the number of elements in J), find a linear operator B_opt that satisfies

B_opt = arg min_B E_x ||x − B P_J x||^2,   (1)

where the minimization is with respect to all possible linear transformations B : R^k → R^m. Note that ||·|| denotes the Euclidean norm. In this problem setting, we assume that E_x x = 0, because the average vector becomes 0 by translating each vector by the vector −E_x x. Thus, precisely speaking, we try to find the best affine transformation of the form B P_J (x − E_x x) + E_x x. We call E_x x the mean vector [6]. This problem setting can be seen as that of an unbiased Wiener filter without additive noise (see [7]). The solution of this problem is known to be as follows. Note that tr A for a matrix A denotes the trace of A.

Solution 1. B_opt = R P_J^T (P_J R P_J^T)^+ and min_B E_x ||x − B P_J x||^2 = tr(R − B_opt P_J R), where R = E_x x x^T and (P_J R P_J^T)^+ is the Moore-Penrose inverse of P_J R P_J^T.
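To make Solution 1 concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation): it builds the restriction matrix P_J, forms B_opt = R P_J^T (P_J R P_J^T)^+, and predicts a full rating vector after undoing the mean translation. The function name, the use of NumPy, and the input layout are assumptions.

```python
import numpy as np

def restoration_predict(R, mean, J, observed):
    """Predict all item ratings of a user from the ratings 'observed' at item
    indices J, using the restoration operator of Solution 1.
    R: estimated m x m matrix E[x x^T] of mean-centered rating vectors."""
    mean = np.asarray(mean, dtype=float)
    m, k = R.shape[0], len(J)
    P_J = np.zeros((k, m))
    P_J[np.arange(k), J] = 1.0                          # restriction transformation
    B_opt = R @ P_J.T @ np.linalg.pinv(P_J @ R @ P_J.T)  # Moore-Penrose inverse
    x_centered = np.asarray(observed, dtype=float) - mean[J]  # translate by the mean vector
    return B_opt @ x_centered + mean                    # predicted full vector
```

A hypothetical call would be restoration_predict(R_hat, item_means, [0, 4, 7], [0.8, 0.2, 1.0]).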
2.2 Prediction by Mean Vector
Assume that there is no correlation between any pair of component values of a vector representing the user's preference. In this case, matrix R in Solution 1 is a diagonal matrix, and B_opt is P_J^T. Then, the vector B_opt P_J x is a vector of which the i-th component value is x_i for i ∈ J and 0 for i ∉ J. Considering that all vectors are translated by the minus mean vector, the i-th component value of the vector predicted by this method is E_x x_i for i ∉ J.
In our experiments, we estimated the mean vector by x̂ = (x̂_1, x̂_2, ..., x̂_m) calculated as follows:

x̂_i = (1/n_i) Σ_{x ∈ X, x_i ≠ *} x_i,   (2)

where n_i denotes the number of elements in X of which the i-th component is not missing.

2.3 Previous Method
Our method proposed in [6] is a method in which the correlations between every pair of component values of a vector are estimated. One problem of estimating matrix
R is that we are only given a set of partial vectors P_{J_x} x of vector x as a training data set, where J_x depends on each vector x. Our solution for this problem is very simple. We use x̂ calculated by x̂ = P_{J_x}^T P_{J_x} x, that is, we approximate missing values by 0s, which means approximation by the corresponding values of the mean vector. Thus, we estimate the (i, j)-entry value r̂_{i,j} of R by

r̂_{i,j} = (1/|X̂|) Σ_{x̂ ∈ X̂} x̂_i x̂_j,

where X̂ = {x̂ : x belongs to a training set}.
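As a hedged sketch of this estimation step, assuming the training ratings are given as a user-item NumPy array with NaN marking missing entries (our own data layout, not taken from the paper), the per-item mean of Eq. (2) and the zero-filled estimate of R reduce to a few lines:

```python
import numpy as np

def estimate_mean_and_R(ratings):
    """ratings: (num_users, m) array with np.nan for missing entries.
    Returns the mean vector of Eq. (2) and the estimate of R = E[x x^T]
    obtained by filling missing (mean-centered) values with 0."""
    mean = np.nanmean(ratings, axis=0)          # x_hat_i of Eq. (2)
    X_hat = np.nan_to_num(ratings - mean, nan=0.0)
    R_hat = X_hat.T @ X_hat / X_hat.shape[0]
    return mean, R_hat
```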
3 Method Using Projective Restoration Operators
3.1 Problem Formalization and Its Solution
In our previous method, we use X̂ as a training set. This means that we try to find an operator B such that B P_J x is as close to x̂ as possible on average. We want to find B such that B P_J x is close to x, but we cannot obtain x in almost every real-world situation. If we cannot obtain x, we want to find B such that B P_J x is close to some vector that is close to x. Can we find a vector that is closer to x than x̂?
Let S be a d-dimensional subspace of R^m for 0 < d < m. Assume that x is generated by summing two vectors y and ε, where y is a vector in S that is drawn according to a distribution D_y, and ε is a noise vector in R^m of which each component is independently generated according to the same normal distribution with mean 0 and variance σ^2. Assume also that there is no correlation between any component of y and any component of ε. Let P_S denote the projection operator on S. Then,

||x − x̂||^2 = ||y − ŷ||^2 + ||ε − ε̂||^2   (3)
||x − P_S x̂||^2 = ||P_S (y − ŷ)||^2 + ||(ε − ε̂) + (I − P_S) ε̂||^2   (4)

hold. Note that ||y − ŷ|| ≥ ||P_S (y − ŷ)||. When ||y − ŷ|| > ||P_S (y − ŷ)|| and ||(I − P_S) ε̂|| is small enough, ||x − P_S x̂|| is smaller than ||x − x̂||, which means that P_S x̂ is closer to x than x̂. Thus, P_S x̂ is a candidate that is possibly closer to x than x̂.
The problem of finding B such that B P_J x is close to P_S x̂ is formalized as follows. Note that we use x̂ instead of x in estimation.

Problem 2. Given J with |J| = k and a subspace S ⊆ R^m, find a linear operator B_opt that satisfies

B_opt = arg min_B E_x ||P_S x − B P_J x||^2,   (5)

where the minimization is with respect to all possible linear transformations B : R^k → R^m.
Let b_1, ..., b_d be an orthonormal basis of subspace S. Then, P_S is written as P_S = Σ_{i=1}^{d} b_i b_i^T. Note that P_S is a symmetric matrix. By calculations similar to those done when Problem 1 was solved, this problem is solved, and one of the solutions is the following.

Solution 2. B_opt = P_S R P_J^T (P_J R P_J^T)^+ and min_B E_x ||P_S x − B P_J x||^2 = tr(P_S R P_S − B_opt P_J R P_S), where R = E_x x x^T and (P_J R P_J^T)^+ is the Moore-Penrose inverse of matrix P_J R P_J^T.
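Given a projector P_S onto the subspace S (its estimation is described in the next subsection), Solution 2 only changes the operator. The following is an illustrative sketch under the same assumptions as the earlier one, not the authors' code:

```python
import numpy as np

def projective_restoration_predict(R, P_S, mean, J, observed):
    """Predict with B_opt = P_S R P_J^T (P_J R P_J^T)^+ from Solution 2."""
    mean = np.asarray(mean, dtype=float)
    m, k = R.shape[0], len(J)
    P_J = np.zeros((k, m))
    P_J[np.arange(k), J] = 1.0
    B_opt = P_S @ R @ P_J.T @ np.linalg.pinv(P_J @ R @ P_J.T)
    return B_opt @ (np.asarray(observed, dtype=float) - mean[J]) + mean
```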
3.2 How to Find an Appropriate Subspace
Let S be a d-dimensional subspace of R^m for 0 < d < m. Assume that x is generated by summing two vectors y ∈ S and ε, which are drawn according to the distributions that we assumed in the previous subsection. Then, how can we estimate S from a set of x? Let T be an arbitrary d-dimensional subspace of R^m. Then,

E ||x − P_T x||^2 = E ||y − P_T y||^2 + E ||(I − P_T) ε||^2 = E ||y − P_T y||^2 + (m − d) σ^2

holds. Since P_S y = y, this means that S = arg min_T E ||x − P_T x||^2. In practice, we may estimate E ||x − P_T x||^2 from a training set X by (1/|X|) Σ_{x ∈ X} ||x − P_T x||^2. We can calculate the T that minimizes (1/|X|) Σ_{x ∈ X} ||x − P_T x||^2 through SVD. Let A be a matrix of which each row is x^T for x ∈ X. For a d-dimensional subspace T, let A_T denote a d-rank matrix of which each row is (P_T x)^T for x ∈ X. Then,

||A − A_T||_F^2 = Σ_{x ∈ X} ||x − P_T x||^2

holds, where ||·||_F denotes the Frobenius norm. Therefore, we only have to find the T that minimizes ||A − A_T||_F^2, and this can be done by applying SVD to A. Let b_1, ..., b_d denote the right singular vectors of A corresponding to the d largest singular values. Then, the set of vectors {b_1, ..., b_d} is an orthonormal basis of the T that minimizes ||A − A_T||_F^2. Again, we cannot obtain X; thus we make an approximation matrix of A using X̂ = {x̂ : x ∈ X}, and we use SVD for that approximation matrix.
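In practice this amounts to one truncated SVD of the zero-filled, mean-centered rating matrix X̂. A sketch follows; d = 10 matches the experiments below, but both the value and the function interface are our own assumptions:

```python
import numpy as np

def subspace_projector(X_hat, d=10):
    """Return P_S = sum_i b_i b_i^T, where b_1, ..., b_d are the right singular
    vectors of X_hat corresponding to the d largest singular values."""
    _, _, Vt = np.linalg.svd(X_hat, full_matrices=False)
    B = Vt[:d].T          # orthonormal basis of the estimated subspace S (m x d)
    return B @ B.T        # projection operator onto S
```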
3.3 Projection and New Method
Assume that there is no correlation between any pair of component values of a vector representing the user's preference. In this case, the operator B_opt in Solution 2 is P_S P_J^T. We call the method using the operator P_S P_J^T the projection method. For a given P_J x, x is predicted by P_S P_J^T P_J x = P_S x̂ using this method. The projection method has already been used by Sarwar et al. [10]. According to their
experimental results, the projection method outperforms a Pearson r algorithm [9], a standard correlation-based method, when the rate of missing values is large. Without the above assumption, and using the estimation method of R described in Subsection 2.3, we can calculate the operator B_opt in Solution 2. In this paper, "the method using projective restoration operators" refers to this method.
4 Experiments
4.1 Experimental Methodology
In our experiment, we used the "EachMovie" collaborative filtering data set [3]. The data set consists of 2,811,983 numeric ratings for 1,628 movies evaluated by 72,916 users. A user's numeric rating for a movie represents how much the user likes the movie on a six-point scale (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Note that only 2.37% of the user-movie matrix is filled. In our experiments, we only used the ratings of the 2,000 users with the largest number of movie ratings. Note that 17.1% of the user-movie matrix of the data is filled in this case. We divided the data into 10 groups randomly and conducted a 10-fold cross validation. We made the following two types of partitions.
User partition. First, users were divided into 10 groups randomly, and for each user group, we made one data set consisting of all ratings evaluated by its group members. For this partition, we used nine data sets for batch training, in which a covariance matrix R was estimated and an orthonormal basis b_1, ..., b_d was extracted. Then, for each user in the remaining data set, we randomly selected a set of movies for online training from the movies for which the scores are known. Based on the ratings for the selected movies only, predictions were made for the other movies for which the scores are known. Online training sets were randomly selected, so the performance was averaged over 50 runs.
Rating partition. All ratings were divided into 10 groups randomly, which were used as data sets. In this case, nine data sets were used for both batch and online training, and all data in the remaining data set were used for testing.
Note that the dimensionality d of the subspace used in our new method and in the projection method was fixed to 10 in our experiments. The best dimensionality could perhaps be determined using some information criterion, but we do not address this issue here.
4.2 Results
Results for User Partition. Fig. 1 shows the relation between the number of movies used for online training and mean squared error (MSE). The MSE of our previous method, the method using restoration operators, increased when the number of rated movies was more than 60, and was larger than the MSE of the
Fig. 1. Learning curves: MSE versus the number of rated movies for the projective restoration operator, the restoration operator, the projection method, and the correlation-based method

Fig. 2. Left: Recall-precision curves for predictions using 100 rated movies, Right: #(Rated movies)-precision curves at recall 0.1
correlation-based method when the number of rated movies was more than 80. Our new method, the method using projective restoration operators, overcame this shortcoming of our previous method, and its MSE decreased consistently. Note that the correlation-based method used in our experiments is the best-performing one among the three correlation-based methods considered in [6], namely, the method using the item mean. In terms of precision and recall, unfortunately, we could not improve the performance of our method (see Fig. 2). This fact may indicate that we should design our method so as to minimize some value other than the expected squared error in order to improve precision and recall. Note that, in the calculations of precision and recall, we considered the problem of predicting the user's favorite movies, that is, the movies whose rating is at least 0.8, and precision at recall r is calculated by averaging the precisions at recall r over all users.
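As a rough illustration of this evaluation protocol, the following sketch computes precision at a given recall level for one user and averages it over users; the interpolation details and the handling of users without favorite movies are our own assumptions, not taken from the paper.

```python
import numpy as np

def precision_at_recall(scores, ratings, recall_level, favorite=0.8):
    """Precision at a fixed recall for one user: rank test movies by predicted
    score and take the smallest cutoff whose recall of 'favorite' movies
    (true rating >= favorite) reaches recall_level."""
    order = np.argsort(scores)[::-1]
    relevant = np.asarray(ratings)[order] >= favorite
    if relevant.sum() == 0:
        return None                          # user has no favorite movies
    hits = np.cumsum(relevant)
    recall = hits / relevant.sum()
    k = int(np.argmax(recall >= recall_level)) + 1
    return hits[k - 1] / k

def averaged_precision(users, recall_level=0.1):
    """Average precision at the given recall over all users with favorites.
    users: list of (predicted_scores, true_ratings) pairs."""
    vals = [precision_at_recall(s, r, recall_level) for s, r in users]
    vals = [v for v in vals if v is not None]
    return float(np.mean(vals))
```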
Table 1. MSEs (95% confidence intervals) for rating partition
Proj. Res. Ope.: 0.04992 (±0.0003)
Res. Ope.: 0.05544 (±0.0006)
Projection: 0.05683 (±0.0003)
Mean Vector: 0.07729 (±0.0003)
Cor.-based: 0.05469 (±0.0003)
Fig. 3. Recall-precision curves for the projective restoration operator, the restoration operator, the projection method, the mean vector method, and the correlation-based method
The method using projective restoration operators also outperformed the other methods in terms of precision and recall in this case, though the projection method and the correlation-based method were comparable to it at low recalls (see Fig. 3).
5
Concluding Remarks
Our methods using (projective) restoration operators need to calculate the covariance matrix R offline and to keep it in memory in order to realize a quick online response. The size of R depends on the number of items, so our method is not applicable when the number of items is very large. In such a case, preprocessing that filters out unpopular items is necessary. Conversely, our method does not depend much on the number of users, while most methods of the memory-based approach heavily depend on it. The heaviest online task in our method is the calculation of the Moore-Penrose inverse when the number of items rated by a target user is large. One simple countermeasure is to use the simple projection method for users who rated more than t items, where t is a certain threshold.
References 1. Billsus, D., Pazzani, M.: Learning Collaborative Information Filters. Proc. of the 15th International Conference (ICML’98) (1998) 46–54. 2. Breese, J., Heckerman, D., Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proc. of the 14th Annu. Conference on Uncertainty in Artificial Intelligence (1998) 43–52.
3. EachMovie collaborative filtering data set, 1997. research.compaq.com/SRC/eachmovie/. 4. Lee, W.: Collaborative Learning for Recommender Systems. Proc. of the 18th International Conf. on Machine Learning (2001) 314–321. 5. Nakamura, A., Abe, N.: Collaborative Filtering using Weighted Majority Prediction Algorithms. Proc. of 15th International Conference on Machine Learning (1998) 395–403. 6. Nakamura, A., Kudo, M., Tanaka, A.: Collaborative Filtering using Restoration Operators. Accepted to PKDD 2003. 7. Ogawa, H., Oja, E.: Projection Filter, Wiener Filter, and Karhunen-Lo`eve Subspaces in Digital Image Restoration. Journal of Mathematical analysis and applications 114 (1986) 37–51. 8. Resnick, P., Iacovou, N., Suchak, M., Bergstom P., Riedl, J.: GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proc. of CSCW (1994) 175– 186. 9. Shardanand, U., Maes, P.: Social Information Filtering: Algorithms and Automating “Word of Mouth”. Proc. of CHI95 (1995) 210–217. 10. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of Dimensionality Reduction in Recommender System – a Case Study. WebKDD 2000 Web Mining for E-commerce Workshop, 2000.
Discovering Homographs Using N-Partite Graph Clustering Hidekazu Nakawatase1 and Akiko Aizawa2 1
The Graduate University for advanced studies, 2–1–2 Hitotubashi, Chiyoda-ku, Tokyo 101–8430, Japan [email protected] http://mic.ex.nii.ac.jp/homograph/ 2 National Institute of Informatics, 2–1–2 Hitotubashi, Chiyoda-ku, Tokyo 101–8430, Japan [email protected]
Abstract. This paper presents a method for discovering homographs from textual corpora. The proposed method first extracts an N-partite graph expression of word dependencies and then generates near-synonymous word clusters by enumerating and combining maximal complete sub-components of the graph. The homographs are identified as the words that belong to multiple clusters. In our experiment, we applied the method to Japanese newspaper articles and detected 531 homograph candidates, of which 31 were confirmed to be actual homographs.
1
Introduction
The identification of homographs is an important issue for improving performance in information retrieval and accurate text classification. In such applications, the quality of the result is strongly affected by the system's ability to identify the user's intended meaning. The disambiguation process usually relies on an existing handmade dictionary for detecting possible homonymous candidates. Naturally, such systems have difficulty manipulating words (or meanings) that are not registered in standard dictionaries. Based on the above observations, we propose a method for discovering homonymous words by analyzing sets of related words extracted from a text corpus. The features of the proposed approach are as follows: 1. Most previous studies focused on the problem of 'word sense disambiguation' (that is, selecting the most appropriate meaning from candidate lists obtained from a standard dictionary). In contrast, our study primarily focuses on differentiating the meanings of words (in text) without any prerequisite knowledge. 2. We paid special attention to the scalability issue so that a sufficient amount of text could be analyzed to collect representative usage patterns. For this purpose, we adopted a fast enumeration algorithm for graphs developed recently based on a reverse-search algorithm [2]. 3. Our method not only identifies possible candidate
homographs, but also provides sets of examples according to the different usage of the words. The rest of the paper is organized as follows. We propose the principle of the polysemy acquisition based on relations between a synonym and a homograph in Section 2. We then reduce the synonym-clustering problem that is the important part of this acquisition method to an already-known graph problem in Section 3. We explain the experiment done to examine the effectiveness of our method in Section 4 and present the results. We discuss the experimental results and future work in Section 5.
2
Basic Principle for Homograph Acquisition
A word is said to be 'polysemic' when it has multiple meanings, and a group of words are said to be 'synonyms' when they share the same meaning. The complementary relationship between homographs and synonyms is illustrated in Figure 1a, where w_i, s_i, and g_i are individual words, meanings, and groups of synonyms, respectively. If we identify the synonymous groups g_1 = {w_1, w_2} and g_2 = {w_2, w_3}, then w_2 can be logically discovered as a homograph, because w_2 belongs to both g_1 and g_2. How, then, can we identify words with a similar meaning? Referring to the initiating work by Hattori [1], we apply the following important "substitution allowance" principle: synonyms can be substituted mutually while maintaining the syntax structure and the general idea of the sentence. For instance, let us consider the following examples with the two similar words 'pretty' and 'cute' (we originally intended this principle to be applied to Japanese words and phrases; to explain the concept, we show approximate examples in English; the original examples are 'UMAI RYOURI' and 'OISHII RYOURI' for Example 1):
"pretty cat / cute cat" (Example 1)
"pretty good / *cute good" (Example 2)
In Example 1, even if 'pretty' is substituted by 'cute', the meaning is almost the same. Therefore, we assume that the two words share the same meaning provided they are concatenated with particular subsets of words like 'cat'. On the other hand, because the expression "cute good" is meaningless in Example 2, 'cute' cannot always be substituted for 'pretty'. As a result, it is clear that the word 'pretty' has at least two distinct meanings. Here, we understand that the allowability of substitution is a necessary condition for sharing a meaning. However, it is not clear whether it is also a sufficient condition. In this paper, we assume the allowability of substitution to be a necessary and sufficient condition and, instead of dealing with the 'general idea of the sentences', we focus exclusively on the syntax-level allowability of substitution. Based on this assumption, the proposed homograph acquisition method is outlined as follows.
Step 1: Extract clusters of synonyms by applying the allowability of substitution to the contexts obtained from the corpus.
Step 2: If a word is included in two or more synonym groups, it is judged to be a homograph. We will introduce a more formal description of the synonym-clustering method of Step 1 in the next section.
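A small illustrative helper for Step 2 follows; the representation of synonym clusters as plain word sets is our own assumption.

```python
from collections import defaultdict

def homograph_candidates(clusters):
    """Step 2: report every word that belongs to two or more synonym clusters,
    together with the clusters that evidence its different meanings."""
    membership = defaultdict(list)
    for cluster in clusters:
        for word in cluster:
            membership[word].append(cluster)
    return {w: cs for w, cs in membership.items() if len(cs) >= 2}
```

For instance, given the clusters [{"w1", "w2"}, {"w2", "w3"}], it reports "w2" as a candidate homograph.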
Fig. 1. a:Polysemies and Synonyms, b: Graph expression of Example 1
3
Clustering Synonyms in an N-Partite Graph
In this section, we first show that this clustering of synonyms can be formulated and interpreted as a graph problem. As a result, the clustering problem can be converted into a graph problem that has already been solved. Next, we briefly introduce the algorithm used in our experiment. Additionally, we describe the approximation method we designed to acquire true clusters from a graph generated from a small corpus.

3.1 Basic Definitions and Problem Formulation
We show a procedure for making a cluster of similar words, referring to Example 1 above. First, W = {w_1, ..., w_n} is assumed to be a set of words (substitution objects), and C = {s_1, ..., s_m} is assumed to be a set of contexts. Each context s_i is a tuple that consists of a word row a_{i1} a_{i2} ... a_{ij_i} and a substitution position p in the word row (0 ≤ p ≤ j_i; the word to be substituted is put between the p-th and (p+1)-th words). In Example 1, W is {pretty, cute}, and C is {(cat, 0)}. The context length is one in Example 1. The procedure consists of three steps (illustrated using Example 1). 1: Create examples that put every word of W at position p for all contexts of C ("pretty cat", "cute cat" in Example 1). 2: Judge the allowability of each example (all are acceptable in Example 1). 3: Add the words for which substitution is allowed in all contexts to the cluster of synonyms ({pretty, cute} in Example 1).
Then, all acceptable examples can be mapped to an 'N-partite graph' representation of words where all the vertexes of the graph are divided into N distinct groups (Figure 1b). For example, Figure 1b shows a graph representation of Example 1. Note that the vertexes of the graph are divided into two groups, the words for substitution and the context words. No edge exists within the same group, because words at different positions are always considered as different vertexes. In the N-partite graph representation, the 'substitution allowability' property is transformed into the condition that all vertexes included in a certain group are connected with all vertexes of the other groups. Such a particular form of graph is called a complete N-partite graph and is formally defined as follows.
Definition (complete N-partite graph): Let G = (V = V_1 ∪ ... ∪ V_N, E) be a graph. G is a complete N-partite graph when it meets the following two requirements: 1. There are no edges within any vertex subset V_i. 2. An arbitrary vertex of V_i (i = 1..N) in G is connected with all elements of all the vertex sets other than V_i.
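For the bipartite case used later in the experiment, the substitution-allowability condition reduces to a simple completeness check. The sketch below is only illustrative, and the edge representation as (word, context) pairs is our assumption:

```python
def is_complete_bipartite(words, contexts, edges):
    """True when every word can be substituted in every context, i.e. the pair
    (words, contexts) induces a complete bipartite subgraph.
    edges: set of (word, context) pairs observed as acceptable examples."""
    return all((w, c) in edges for w in words for c in contexts)
```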
Fig. 2. Maximal Complete Bipartite Graph
In addition, to examine the substitution allowance condition, we require only the maximal complete N-partite graphs, i.e., those that are not included in any other complete N-partite graph. We explain this with the example in Figure 2. In the figure, G2 is the unique maximal complete bipartite graph. Although G1 = (X, Y) : (A, B) is a complete bipartite graph, it does not satisfy the maximality condition, because G1 is included in G2 = (X, Y, Z) : (A, B, C).

3.2 Algorithm for Enumerating Complete N-Partite Graphs
In the previous subsection, we formulated the clustering of synonyms as a problem of enumerating maximal complete N-partite graphs, for which an efficient algorithm already exists [3]. It is a variation of a reverse-search algorithm [2], and the computational complexity of this algorithm is O(∆^3), where ∆ is the vertex degree of the graph.
(Note: sets of edges are always unique in a complete bipartite graph; we therefore express the graph with the vertex sets (X, Y) and (A, B) as (X, Y) : (A, B).)
3.3 Combining the Extracted N-Partite Graphs for Approximation
Because the size of the target text corpus is limited, there is a possibility that words that should be clustered together are separated into different groups because of an inadequate number of examples. Figure 2 shows this case. Neither X, Y nor Z would be clustered as the maximal graph G2 = (X, Y, Z) : (A, B, C) if the examples corresponding to the edges XC and ZA were not present. Then, the clusters (X, Y) : (A, B) and (Y, Z) : (B, C) become maximal graphs (if G2 existed, these would not be maximal). To compensate for possible deficiencies, we apply the following combining technique to the extracted N-partite graphs. To obtain the approximation clusters concerning a word A, we divide the set of clusters including A into groups whose word sides V_1 are pairwise coprime when A is disregarded. For instance, if the clusters g_1 = (A, B, C) : (R, S, T), g_2 = (A, B, D) : (U, V, W), and g_3 = (A, M, N) : (X, Y, Z), which all contain A, are obtained, we consider g_1 and g_2, which both include B, to be a group with the same meaning. That is, g_1 and g_2 are merged (the sets V_1 of the clusters obtained after such merging are coprime, except for A). On the other hand, we consider g_3, which does not share elements with g_1 and g_2, to be a cluster with a different meaning. This calculation procedure is shown as pseudocode below (in the case of a bipartite graph); see also the Python sketch after the pseudocode. The procedure gradually divides the given clusters into coprime sets; therefore, it is necessary to repeat it until the calculation result converges.
[Calculation procedure for dividing clusters into coprime sets]
A: the word to be checked. Because it is included in all clusters, it is disregarded in the calculation.
g_i = (v_{g_i,1}, v_{g_i,2}): enumerated maximal complete bipartite graphs (clusters).
G = {G_j}: merged cluster set.
Procedure CoprimeSet({g_i})
var G := {G_1}; G_1 := {g_1};
begin
  for (g_i ∈ {g_2, ..., g_n}) begin
    for (G_j ∈ {G_1, ..., G_m} = G) begin
      for (h ∈ G_j) begin
        if (v_{g_i,1} and v_{h,1} are not coprime) begin
          v_{h,1} := v_{h,1} ∪ v_{g_i,1}; v_{h,2} := v_{h,2} ∪ v_{g_i,2};
          goto END;
        end
      end
    end
    Add a new element G_{m+1} to G, and add element g_i to G_{m+1}.
    END:
  end
end.
(Pseudocode of procedure)
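A runnable Python sketch of this merging step is shown below, repeating the pass until convergence as noted above. The data layout, clusters as (word set, context set) pairs, and the function name are our own assumptions:

```python
def merge_clusters(clusters, a):
    """Repeatedly merge clusters whose word sides share an element other than
    the target word a, until the groups are pairwise coprime (except for a).
    clusters: list of (word_set, context_set) pairs, each containing a."""
    groups = [[set(w), set(c)] for w, c in clusters]
    changed = True
    while changed:                                   # repeat until convergence
        changed = False
        merged = []
        for words, contexts in groups:
            for g in merged:
                if (g[0] - {a}) & (words - {a}):     # not coprime: merge in
                    g[0] |= words
                    g[1] |= contexts
                    changed = True
                    break
            else:
                merged.append([set(words), set(contexts)])
        groups = merged
    return groups
```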
4 Experiment and Results
4.1 Initiating a Bipartite Graph
We conducted an experiment to evaluate the effectiveness of this method, which we explain as follows. A bipartite graph was created using the words and concatenation relations of two-word compound nouns included in newspaper articles (all articles of the Mainichi Newspapers in 1994; there were about 6.69 million nouns, single and compound, in total in the corpus, containing about 512,000 unique words; only two-word compound nouns were used in this experiment, about 1.18 million of them containing about 299,000 unique words). These compound words were used because they can be automatically acquired with high accuracy by morphological analysis of the corpus (the Japanese morphological analysis system ChaSen, http://chasen.aist-nara.ac.jp/, was used). Because relations almost always exist between the nouns in a two-word compound noun, edges can be acquired in this way. Each word is a vertex and each concatenation is an edge in this mapping from two-word compound nouns to the graph. For instance, the two-word compound noun AB is analyzed into A and B; therefore, nouns A and B are vertexes, and there is an edge between A and B. Next, we extracted maximal complete bipartite graphs from the given graph by the calculation method based on the reverse-search algorithm mentioned above. We finally merged the resulting graphs using the method described in Section 3.3.
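A sketch of this graph construction step follows; the input format, a list of (first noun, second noun) pairs produced by the morphological analysis, is an assumption:

```python
from collections import defaultdict

def build_bipartite_graph(compounds):
    """Map two-word compound nouns to a bipartite graph: the first and second
    nouns become vertexes in separate groups (the same surface word at different
    positions is a different vertex), and each compound A+B adds an edge."""
    edges = set()
    adjacency = defaultdict(set)
    for a, b in compounds:
        edges.add((a, b))
        adjacency[("first", a)].add(("second", b))
        adjacency[("second", b)].add(("first", a))
    return edges, adjacency
```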
4.2 Cluster Enumeration and Merging (Candidate Detection)
We extracted maximal complete bipartite graphs by using a program based on the above algorithm. The calculation was executed on a PC equipped with an Intel Pentium 4 processor (700 MHz) and required about one hour. The number of initial clusters obtained in this step was 2,700,737. After merging, the number of clusters decreased to 18,261; the number of nouns in the final set was 18,276. Next, when clusters with only one element were excluded, the number of clusters was 698, and the number of nouns included in those clusters was 351. These nouns are necessarily included in two or more clusters. Finally, these 351 words are the candidate homographs.

Table 1. Maximal complete bipartite graph extraction and merging (number of clusters / number of nouns)
Maximal complete bipartite graphs: 2,700,737 / 18,276
Merged clusters: 18,261 / 18,276
Clusters with two or more words: 698 / 351
Table 2. Experiment Result
A: Automatic extraction candidates (raw): 351 words
B: A minus morphological analysis errors: 194 words (B/A = 55.3%)
C: B minus variations in notation (candidates): 182 words (C/A = 51.9%)
D: Homographs: 31 words (D/C = 17.0%)

4.3 Analysis of the Results
We analyze here the results obtained by merging clusters. We were able to generate candidate homographs fully automatically in this experiment. However, incomplete character strings were included among the candidate words because of errors in the morphological analysis. Moreover, the same words written with different notations might be treated as multiple vertexes. Naturally, these errors negatively affect the acquisition of homographs. However, we counted the morphological analysis errors because we wanted to evaluate the effect of extraction from an ideal initial graph. Morphological analysis errors: We manually checked the 351 homograph candidates. As a result, we found word-breaking errors caused by the morphological analysis. The number of homograph candidates was reduced to 194 words by deleting the clusters including such errors. An example of an inadequate cluster is shown as follows.
Variation in notation: Variations in notation are different wordings that mean the same thing. There are variations of the declensional kana ending mark (Examples 4.3.1b) and the character type (Examples 4.3.1c).
Such errors should be eliminated by applying more dedicated normalization techniques at the pre-processing stage. The number of candidates decreased to 182 words by deleting such words. Evaluation of the candidate homographs: Next, we checked whether the multiple clusters corresponding to these 182 candidate words indicated uses in different senses. We confirmed that 31 of these words were homographs. The following examples are two clusters including the homograph 'kinsei' (which has the two meanings "Venus" and "victory"). That is, Example 4.3.2(a) is a cluster including planets and Example 4.3.2(b) is a cluster for victory. The other 151 candidates that are not considered to be homographs mainly correspond to clusters (Example 4.3.3) with minor semantic differences, such as "silicon" as a substance or a chemical versus "silicon" as a material. We have not considered these minor variations as homo-
graphs, and the issues in measuring and discriminating the semantic similarity of the different contexts should be further investigated.
Table 2 shows our experimental results.
5
Discussion and Future Work
The experiment confirmed that this method behaved as expected. That is, examples of words with several meanings were detected in multiple clusters according to the differences in meaning. We also confirmed that the enumeration of maximal complete bipartite graphs could terminate in a short time, even for the graph data of about 70,000 vertexes and about 300,000 edges used in this experiment. This method of acquiring an unregistered meaning of a homograph was not known previously, although some discovery methods for words not registered in a dictionary have been proposed. On the other hand, our method automatically discovers the homographs (candidates) that were previously discovered manually. At the same time, this method can present examples of the same meaning as a cluster. We describe possible future improvements and enhancements of this method. Only the relations acquired from two-word compound nouns were used in this experiment. Therefore, all edges in this graph are treated equally. Our method can be enhanced with respect to this point. Recently, the technology of surface case analysis between given words and phrases and the analysis of the modifications has been studied (i.e., a Japanese parsing system KNP:http://www.kc.t.utokyo.ac.jp/nl-resource/knp-e.html). The case relationship can be acquired from longer compound words if this project is successful. We may then be able to extract homographs from the graph for each case relation.
References 1. Hattori, S.: Iwanami course philosophy XI language.(in Japanese) Iwanami Shoten, Tokyo (1968) 2. Avis, D., Fukuda, K.: Reverse Search for Enumeration. Discrete Appl. Math.65. (1996) 21–45 3. Uno, T.: A Practical Fast Algorithm for Finding Clusters of Huge Networks.(in Japanese) IPSJ SIGNotes Algorithms No. 088-001. Information Processing Society of Japan, Tokyo (2002)
Discovery of Trends and States in Irregular Medical Temporal Data Trong Dung Nguyen, Saori Kawasaki, and Tu Bao Ho Japan Advanced Institute of Science and Technology Tatsunokuchi, Ishikawa 923-1292 Japan
Abstract. Temporal abstraction is known as a powerful approach to data abstraction that converts temporal data into intervals with abstracted values, including trends and states. Most temporal abstraction methods, however, have been developed for regular temporal data, and they cannot be used when temporal data are collected irregularly. In this paper we introduce a temporal abstraction approach for irregular temporal data, inspired by a real-life application to a large database in the hepatitis domain.
1
Introduction
The hepatitis temporal database collected between 1982 and 2001 at the Chiba University Hospital is a large un-cleansed temporal relational database consisting of six tables, of which the biggest has 1.6 million records. Collected during a long period with progress in test equipment, the database also contains inconsistent measurements, many missing values, and a large number of non-unified notations [7]. The hepatitis database was given as a discovery challenge at PKDD 2002 and 2003 (http://www.cs.helsinki.fi/events/eclpkdd/challenge.html). Among the problems posed by the doctors we are interested in the following: 1. Discover the differences in temporal patterns between hepatitis B and C. 2. Evaluate whether laboratory tests can be used to estimate the stage of liver fibrosis. 3. Evaluate whether interferon therapy is effective or not. One of the main approaches to mining medical temporal data is temporal abstraction (TA). The key idea of temporal abstraction is to transform time-stamped points by abstraction into an interval-based representation of data. The common tasks in temporal abstraction are detecting trends and states of some variables (medical tests) from temporal sequences. The TA task can be defined as follows: the input is a set of time-stamped data points (events) and abstraction goals; the output is a set of interval-based, context-specific unified values or patterns (usually qualitative) at a higher level of abstraction. Typical works on temporal abstraction are those presented in [1], [4], [2], [8]. The common point of the above works is that their basic temporal
abstraction methods were developed for short periods and/or regular time-stamp points. The works in [5], [1], [4] relate, respectively, to temporal data of an individual measured on consecutive days in a short period, to temporal data on normal and dependent diabetes measured on consecutive days within two weeks, and to temporal data regularly measured every minute. Generally, detecting trends and characterizing states for such sequences is different from (and easier than) doing these tasks for irregular time-stamp sequences. The problem we face with the hepatitis data is to find trends and states of tests in long and irregular time-stamp sequences. Differently from related work, which finds "states" and "trends" separately, we introduce the notion of "changes of state" to simultaneously characterize trends and states in long-term changed tests, and the notions of "base state" and "peaks" to characterize short-term changed tests, as well as algorithms to detect them.
2
Basic Temporal Abstraction
Each patient is described by 983 temporal sequences corresponding to the 983 hospital tests. As the complexity of learning generally increases with the number of tests under investigation, it is desirable to select a small number of tests. We selected 41 tests from the 983 tests by statistical frequency checks and medical background knowledge:
1. The most frequent tests: GPT, GOT, LDH, ALP, TP, T-BIL, ALB, D-BIL, I-BIL, UA, UN, CRE, LAP, G-GTP, CHE, ZTT, TTT, T-CHO, oudan, nyuubi, youketsu.
2. The highly frequent tests: NA, CL, K.
3. The frequent tests: F-ALB, F-A2.GL, G.GL, F-A/G, F-B.GL, F-A1.G.
4. The low-frequency but significant tests: F-CHO, U-PH, U-GLU, U-RBC, U-PRO, U-BIL, U-SG, U-KET, TG, U-UBG, AMY, and CRP.
We first focus on the 15 most typical tests as suggested by medical experts. These tests can then be divided into two groups depending on whether their values can change in the short term or the long term.
1. Tests with values that can change in the short term: GOT, GPT, TTT, and ZTT. The tests in this group, in particular GOT and GPT, can rapidly change (within several days or weeks) to high or even very high values when liver cells are destroyed by inflammation.
2. Tests with values that can change in the long term: The tests in the second group change slowly (within months or years). The liver has a reserve capacity, so that some products of the liver (T-CHO, CHE, ALB, and TP) do not have low values until the reserve capacity is exhausted (the terminal state of chronic hepatitis, i.e., liver cirrhosis). The two main tendencies of change of tests in this group are:
– Tests with a "going down" trend: T-CHO, CHE, ALB, TP, PLT, WBC, and HGB.
– Tests with a "going up" trend: D-BIL, I-BIL, T-BIL, and ICG-15.
Table 1. The temporal abstraction primitives
<pattern> ::= <state primitive>
<pattern> ::= <state primitive> <relation> | <state primitive> <trend primitive>
<pattern> ::= <state primitive> <relation> <peak>
<pattern> ::= <state primitive> <relation> <state primitive> <relation> | <state primitive> <trend primitive>
Temporal abstraction primitives. Based on visual analysis of various sequences, we determined the following temporal abstraction primitives:
1. State primitives: N (normal), L (low), VL (very low), XL (extreme low), H (high), VH (very high), and XH (extreme high).
2. Trend primitives: S (stable), I (increasing), FI (fast increasing), D (decreasing), and FD (fast decreasing).
3. Peak primitives: P (peaks occurred).
We also determined the following relations between the primitives: > ("change state to"), & ("and"), – ("and then"), / ("majority/minority"; X/Y means that the majority of points are in state X and the minority of points are in state Y). Medical doctors give thresholds for distinguishing the state primitives of tests; for example, those to distinguish the values N, H, VH, XH of TP are 5.5, 6.5, 8.2, 9.2, where (5.5, 6.5) is the normal region. We define four structures of abstraction patterns, as shown in Table 1. Examples of abstracted patterns in a given episode are as follows: "ALB = N" (ALB is in the normal region), "CHE = H–I" (CHE is in the high region and then increasing), "GPT = XH&P" (GPT is extremely high and with peaks), "I-BIL = N>L>N" (I-BIL is in the normal region, then changed to the low region, and finally changed back to the normal region). We developed and used the following procedure to identify typical abstraction patterns:
1. Consider the pattern structures as formulas and the <state primitive>, <trend primitive> and <relation> as their variables. Create all possible candidate abstraction patterns by replacing the <state primitive>, <trend primitive> and <relation> with their possible values.
2. Randomly take a large number of sequences from the datasets, visualize them, and manually match them with the candidate abstraction patterns to see which candidate abstraction pattern each of them matches.
3. Eliminate the candidate abstraction patterns that have no or only a small number of matched sequences.
Figure 1 shows 8 typical possible patterns for short-term changed tests (left) and 21 typical possible patterns for long-term changed tests (right). Several notations will be used below to describe the algorithms for detecting short-term and long-term changed tests.
Fig. 1. Abstraction patterns of short-term and long-term changed tests
Abstraction of short-term changed tests. Our observation and analysis showed that the short-term changed tests, especially GPT and GOT, can go up in some very short period of time and then go back to some "stable" state. We found that the two most representative characteristics of these tests are a "stable" state, called the base state (BS), and the position and value of peaks, where the test suddenly goes up. Based on this observation, we developed an algorithm to find the base state and peaks of a short-term changed test, as shown in Figure 2.
Abstraction of long-term changed tests. Our key idea is to use the "change of state" as the main feature to characterize sequences of the long-term changed tests. The "change of state" contains information on both state and trend, and can compactly characterize the sequence. At the beginning of a sequence, the first data points can be in one of the three states "N", "H", or "L". Either the sequence changes from one state to another state, smoothly or variably (at boundaries), or it remains in its state without changing. Because changes can generally happen in the long term, it is possible to consider the trend of a sequence after a change of state. We developed an algorithm to detect such changes of state and trends of a long-term changed test, as shown in Figure 3.
3
Mining Abstracted Hepatitis Data
This step can be considered as complex temporal abstraction with the use of our visual data mining system D2MS [3] or the commercial data mining system
Clementine to find useful patterns/models from the abstracted data obtained by basic TA.

3.1 Patterns Describing Hepatitis B and C
For problem P1, different rule sets were found by using the program LUPC in system D2MS with different parameters. From the rule set discovered by LUPC that describes hepatitis B and C under the constraints that each rule covers at least 20 cases and has accuracy higher than 80%, we have drawn a number of interesting conclusions.
– The tests ALB, CHE, D-BIL, TP, and ZTT often occur in rules describing types B and C of hepatitis. The tests GPT and GOT are not necessarily the key tests to distinguish types B and C of hepatitis (though they are important for solving other problems).
– There are not many rules with large coverage for type B.
– Rule 32 is simple and interesting, as it confirms that among the four typical short-term changed tests, TTT and ZTT have sensitivity to inflammation but do not have enough specificity to liver inflammation. The rule says that "if ZTT is high but decreasing, we can predict type C with accuracy 83% (±5.1)".
– Rule 29, "IF CHE = N and D-BIL = N THEN Class = C", is also typical for type C, as it covers a large population of the class (173/272, or 63.6%) with accuracy 82.08% (±3.42).
3.2
Patterns Describing the Fibrosis Stages
For problem P2 we found a number of significant rules with D2MS, from which we can draw interesting patterns:
– Rules describing the fibrosis stage F1, except the first one, are typically related to the combinations "GOT = H and GPT = XH and (T-CHO = N or TP = N)", or "T-CHO = N and GOT = H and ZTT = H–I".
– Rules describing the fibrosis stage F3 can be distinguished from those of F1 by the combinations "TP = N/L and (D-BIL = N or CHE = N)", or "GOT = N&P and CHE = N".

3.3 Patterns Describing the Effectiveness of Interferon Therapy
For problem P3, we found rules for the two classes of "non-response" and "response" patients with interferon therapy. It can be observed that many rules for the "non-response" class contain GPT and/or GOT with values "XH&P", "VH&P", "XH", or "H", while many rules for the "response" class contain GPT or GOT with values "N&P" or "H&P". The results allow us to hypothesize that the interferon treatment may have strong effectiveness on peaks (suddenly
Notations used in the temporal abstraction algorithms:
High(S): # points of S in the high region; VeryHigh(S): # points of S in the very high region; ExtremeHigh(S): # points of S in the extreme high region; Low(S): # points of S in the low region; VeryLow(S): # points of S in the very low region; Normal(S): # points of S in the normal region;
Total(S) = High(S) + VeryHigh(S) + ExtremeHigh(S) + Normal(S) + Low(S) + VeryLow(S);
In(S) = Normal(S)/Total(S); Out(S) = (Total(S) − In(S))/Total(S);
Cross(S): # times S crosses the upper and lower boundaries of the normal region;
First(S): state of the first points in S; Last(S): state of the last points in S; State(S): state of S (one of the state primitives); Trend(S): trend of S (one of the trend primitives).

Input: a sequence of values of a test S = {s_1, s_2, ..., s_N} in a given episode.
Result: a base state, a set of peaks PE_i, and an abstracted pattern.
Parameters: NU, HU, VHU, XHU: upper thresholds of the normal, high, very high, and extreme high regions of a test; α (real).
A. Searching for the base state
1. Based on NU, HU, VHU, and XHU, calculate the quantities Normal(S), High(S), VeryHigh(S), and ExtremeHigh(S).
2. Take MV = max{Normal(S), High(S), VeryHigh(S), ExtremeHigh(S)}. If MV/Total(S) > α then BS := the state that attains MV.
3. Else BS := NULL.
B. Searching for peaks
1. For every element s_i of S, if s_i > s_{i−1} and s_i > s_{i+1} then s_i is a local maximum of S.
2. For every element M_i of the set of local maximum points, P_j = M_i will be a peak if one of the following conditions is true, where V(x) is the value of x:
(1) BS = N and V(M_i) > V(VH); (2) BS = H and V(M_i) > V(XH); (3) BS = VH and V(M_i) > 2 * V(XHU); (4) BS = XH and V(M_i) > 4 * V(XHU).
C. Output the basic temporal abstraction pattern
1. If BS = N and there is no peak, then N. 2. If BS = N and there is at least one peak, then N&P. 3. If BS = H and there is no peak, then H. 4. If BS = H and there is at least one peak, then H&P. 5. If BS = VH and there is no peak, then VH. 6. If BS = VH and there is at least one peak, then VH&P. 7. If BS = XH and there is no peak, then XH. 8. If BS = XH and there is at least one peak, then XH&P. 9. If BS = NULL, then Undetermined.
Fig. 2. TA algorithm for short-term changed tests
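A compact Python sketch of steps A and B follows. The exact meaning of V(VH) and V(XH) in the peak conditions and the value of α are not fully specified above, so the threshold interpretation (and ignoring the low regions) is our own assumption:

```python
def base_state_and_peaks(values, NU, HU, VHU, XHU, alpha=0.5):
    """Sketch of Fig. 2, steps A and B: pick the majority upper region as the
    base state, then report local maxima above a base-state-dependent limit."""
    def region(v):
        if v <= NU:
            return "N"
        if v <= HU:
            return "H"
        return "VH" if v <= VHU else "XH"

    counts = {}
    for v in values:
        counts[region(v)] = counts.get(region(v), 0) + 1
    state, n = max(counts.items(), key=lambda kv: kv[1])
    base = state if n / len(values) > alpha else None     # step A

    limits = {"N": VHU, "H": XHU, "VH": 2 * XHU, "XH": 4 * XHU}
    limit = limits.get(base, float("inf"))
    peaks = [i for i in range(1, len(values) - 1)         # step B: local maxima
             if values[i] > values[i - 1] and values[i] > values[i + 1]
             and values[i] > limit]
    return base, peaks
```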
Fig. 3. TA algorithm for long-term changed tests
increasing in a short period) if the base state is normal or high. It can be hypothesized that when the base state is very high or extremely high, the interferon treatment is not clearly effective.
4
Conclusion
We have presented a temporal abstraction approach to mining the temporal hepatitis data. The temporal abstraction approach in our work differs from related temporal abstraction works in two points: irregular time-stamped points and long periods. Differently from those applications, the irregularity in measuring the hepatitis data requires a statistical analysis based on and combined with the experts' opinions, in particular in the determination of episodes. The temporal abstraction approach presented in this paper is carried out in the scope of an ongoing project in collaboration with medical doctors. The issues to be investigated in the next step include the refinement of abstracted patterns (for example, positions of peaks or parameters for abstraction), and the post-processing and interpretation of the obtained complex temporal abstractions.
References 1. Bellazzi, R., Larizza, C., Magni, P., Monntani, S., and Stefanelli, M. (2000). Intelligent analysis of clinic time series: An application in the diabetes mellitus domain, Artificial Intelligence in Medicine, 20, 37–57. 2. Haimowitz, I.J. and Kohane, I.S. (1996). Managing temporal worlds for medical trend diagnosis. Artificial Intelligence in Medicine 8(3), 299–321. 3. Ho, T.B., Nguyen, T.D., Nguyen, D.D., and Kawasaki, S. (2001). Visualization Support for User-Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, Vol. 10, No. 4, 691–713. 4. Horn, W., Miksch, S., Egghart, G., Popow, C., and Paky, F. (1997). Effective data validation of high-frequency data: time-point-, time-interval-, and trend-based methods, Computer in Biology and Medicine, Special Issue: Time-Oriented Systems in Medicine, 27:5, 389–409. 5. Larizza, C., Bellazzi, R., and Riva, A. (1997).“Temporal abstractions for diabetic patients management”, Artificial Intelligence in Medicine, Keravnou, E. et al. (eds.), Proc.AIME-97, 319–330. 6. Lavrac, N., Keravnou, E., and Zupan, B. (1997). Intelligent Data Analysis in Medicine and Pharmacology (Eds.), Kluwer. 7. Motoda, H. (2002). Active Mining: New directions of data mining (Ed.), IOS Press. 8. Shahar, Y. and Musen, M.A. (1997). Knowledge-based temporal abstraction in clinical domains, Artificial Intelligence in Medicine, 8, 267–298.
Creating Abstract Concepts for Classification by Finding Top-N Maximal Weighted Cliques Yoshiaki Okubo and Makoto Haraguchi Division of Electronics and Information Engineering Hokkaido University N-13 W-8, Sapporo 060-8628, JAPAN {yoshiaki, makoto }@db-ei.eng.hokudai.ac.jp
Abstract. This paper presents a method for creating abstract concepts for classification rule mining. We try to find abstract concepts that are useful for the classification in the sense that assuming such a concept can well discriminate a target class and supports data as much as possible. Our task of finding useful concepts is formalized as an optimization problem in which the constraint and the objective function are given by the entropy and the probability of class distributions, respectively. Concepts to be found can be stated in terms of maximal weighted cliques in a graph constructed from the possible distributions. From the graph, as useful abstract concepts, top-N maximal weighted cliques are efficiently extracted with two pruning techniques: branch-and-bound and entropy-based pruning. It is shown that our entropy-based pruning safely prunes only useless cliques by adding distributions in increasing order of their entropy in the process of clique expansion. Preliminary experimental results show that useful concepts can be created in our framework.
1
Introduction
In a practical situation, a Data Mining system extracts a huge number of rules, since the data values in a given database are often too detailed. Users cannot easily understand and analyze them. The notion of data abstraction is useful for reducing the number of rules and making their analysis easier [1,4,5,9]. This paper is concerned with data abstractions for condition attributes in classification rules [2]. A data abstraction is an act of replacing original data values with more abstract ones. For an attribute A, an abstraction of A corresponds to a partition of its domain dom(A), where each cell, referred to as a cluster, can work as an abstract value for the original ones in the cluster. In general, abstracting A into A' causes a loss of information about a target attribute C. Regarding attributes as random variables, this observation is stated in terms of the conditional entropy of C given a condition attribute, that is, H(C|A') ≥ H(C|A). In other words, the discrimination ability for C by assuming A' becomes worse after abstraction. It is, therefore, reasonable to prefer an abstraction that can minimize H(C|A'). The literature [4,5,9,10] has actually adopted this kind of criterion for obtaining the best abstraction. A task of finding
Fig. 1. Clusters minimizing conditional entropy (distributions p1-p4 in the X-Y plane with X + Y = 1, where Pr(p1) = 0.32, Pr(p2) = 0.04, Pr(p3) = 0.32, Pr(p4) = 0.32)
the best abstraction can be formalized as an optimization problem in which the objective function is given by conditional entropy and is to be minimized. Although such a criterion seems reasonable, we observe some undesirable outcomes. According to it, each cluster tends to consist of only values which give similar class distributions. A class distribution, D_a, is a vector of the conditional probabilities of each class given a value a in dom(A), where a value in dom(C) is referred to as a class. The discrimination ability for C obtained by assuming a is evaluated by the entropy of the distribution, H(D_a). The lower H(D_a) is, the richer the ability is. As shown in Figure 1, according to the criterion, an abstraction consisting of three clusters {p1}, {p2, p3} and {p4} is constructed as the optimal one. Although p3 and p4 seem similar, {p2, p3} is preferred from the viewpoint of information loss, since p3 is more similar to p2. However, it would be better to combine them together to form a larger (higher-support) cluster, if it can still provide a sufficient classification ability. Although the information loss is not minimal in such a case, the cluster {p2, p3, p4} will work as a useful abstraction, where p3 and p4 can be considered primary and p2 exceptional according to their probabilities. Actually, the literature [5] has reported that absorbing such exceptional values with relatively low probabilities is quite effective in reducing the number of rules to be extracted, allowing some loss of their discrimination abilities. This assertion suggests a new criterion for good abstractions. That is, allowing a certain degree of information loss, we try to absorb as many exceptional values as possible in order to obtain larger clusters. We consider in this paper abstractions according to this new criterion. It can be formalized as an optimization problem [3] such that

Constraint: H(D_G) ≤ δ and Objective Function: Pr(G) = Σ_{a∈G} Pr(A = a),
where G ⊆ dom(A) is a cluster, D_G the class distribution given G, and δ a threshold to adjust classification ability. Note here that the entropy of distributions is given as a constraint, not as an objective function. Furthermore, we try to find a cluster G, not a grouping of dom(A). Since clusters giving high entropies in a grouping cannot provide confident rules, we are not interested in them. Finding only clusters giving low entropy will be sufficient. An optimal cluster can be obtained by finding a maximum weighted clique satisfying the entropy constraint in a weighted undirected graph G [3]. Each node in G is a class distribution, its weight is assigned as its probability, and two nodes are connected if they are close to each other. The closeness is evaluated by a
relationship among distributions which is defined based on a property of convex functions including the entropy function. As mentioned above, we are interested in clusters providing rich classification abilities. The optimal cluster might be just one of them. Therefore, finding the top-N optimal clusters is more desirable. We present an algorithm for this task. MWCC [8], an efficient algorithm for finding a maximum weighted clique, is used as the basis of our algorithm. It tries to find the top-N clusters in a depth-first manner with two effective pruning techniques, branch-and-bound and entropy-based pruning. The former is also adopted in MWCC and is concerned with the weight of nodes. The latter, newly introduced, is concerned with the entropy of class distributions. It should be noted here that, in general, the entropy of a class distribution changes non-monotonically as a cluster grows. Therefore, careless pruning would discard optimal cliques to be found. We show that expanding cliques by adding distributions in increasing order of their entropy never suffers from such undesirable over-pruning. That is, our entropy-based pruning can safely reject useless cliques. Our preliminary experimentation shows that the extracted clusters correspond to useful and meaningful abstract concepts.
2 Preliminaries
We assume any attribute is categorical. For a relational schema (A_1, ..., A_m), a relational database R is defined as R ⊆ dom(A_1) × ... × dom(A_m). For each tuple t = (a_1, ..., a_m) ∈ R, its i-th component a_i is referred to as t[A_i]. For a database R, we select an attribute C from its relational schema as a target attribute. We say that the class of a tuple t ∈ R is c if t[C] = c. In this paper, we are concerned with classification rules for the target attribute that can discriminate each class by assuming values of certain attributes A_i (≠ C) referred to as condition attributes. That is, they are of the form (A_2 = a_2i) ∧ (A_5 = a_5j) → (C = c). For simplicity, we consider rules with just one condition attribute in the following discussion.1 The probability of a tuple t in R, Pr(t), is given by a uniform distribution, Pr(t) = 1/|R|. Then, each A_i in the relational schema is regarded as a random variable: Pr(A_i = a) = Pr({t | t ∈ R ∧ t[A_i] = a}) = |{t | t ∈ R ∧ t[A_i] = a}| / |R|. For a target attribute C with dom(C) = {c_1, ..., c_n} and a condition attribute A, a distribution defined as D_a^C = (Pr(C = c_1 | A = a), ..., Pr(C = c_n | A = a)) is called the class distribution given a.2 Similarly, for a cluster of attribute values, G ⊆ dom(A), the class distribution given G is defined as D_G^C = (Pr(C = c_1 | A ∈ G), ..., Pr(C = c_n | A ∈ G)).
1 In case of several condition attributes, we can give a similar discussion.
2 A class distribution given several conditions can be considered as (Pr(C = c_1 | A = a ∩ B = b), ..., Pr(C = c_n | A = a ∩ B = b)), for example.
3 Creating Abstract Concepts
We present here a method for creating abstract concepts. They are obtained by finding clusters of condition values that can well discriminate certain classes.

3.1 Entropy of Class Distribution
Let C be a target attribute with n classes c_1, ..., c_n and A be a condition attribute. Assuming a value a in dom(A) as a condition, we can consider n classification rules for C, (A = a) → (C = c_j), where 1 ≤ j ≤ n. The discrimination ability for c_j obtained by assuming a can be measured by the class distribution given a, D_a^C. If the distribution is strongly biased, assuming a is very effective in discriminating a certain class. Conversely, if it is flat, a cannot work well as a condition for our classification. These observations can be stated in terms of the entropy of distributions. The entropy of D_a^C, H(D_a^C), is defined as

H(D_a^C) = − Σ_{j=1}^{n} Pr(C = c_j | A = a) log_2 Pr(C = c_j | A = a).
If the entropy of D_a^C is quite low, the distribution is strongly biased. In order to obtain useful classification rules, therefore, we take only conditions which can give low entropy into consideration. Similarly, for a cluster G ⊆ dom(A), a low value of H(D_G^C) shows that the distribution D_G^C is biased and that assuming G as a condition is effective for the classification, where the entropy H(D_G^C) is defined as

H(D_G^C) = Σ_{a∈G} (Pr(a) / Pr(G)) H(D_a^C).
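To make the two entropy definitions above concrete, both H(D_a^C) and H(D_G^C) can be computed directly from class counts. The following sketch is only an illustration; the data layout and the toy counts are our own assumptions, not taken from the paper.

```python
from math import log2

def class_distribution(counts):
    """Turn a dict {class: count} for one attribute value into probabilities."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def entropy(dist):
    """H(D): entropy of a class distribution, in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def cluster_entropy(value_counts, probs, cluster):
    """H(D_G^C) as defined above: the Pr(a)/Pr(G)-weighted average of the
    per-value entropies H(D_a^C) over all values a in the cluster G."""
    pr_g = sum(probs[a] for a in cluster)
    return sum((probs[a] / pr_g) * entropy(class_distribution(value_counts[a]))
               for a in cluster)

# Hypothetical class counts of a target C with classes c1/c2,
# for three values of a condition attribute A.
value_counts = {"a1": {"c1": 45, "c2": 5},
                "a2": {"c1": 8,  "c2": 2},
                "a3": {"c1": 30, "c2": 10}}
total = sum(sum(c.values()) for c in value_counts.values())
probs = {a: sum(c.values()) / total for a, c in value_counts.items()}

G = ["a1", "a3"]
print(cluster_entropy(value_counts, probs, G), sum(probs[a] for a in G))
```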
A cluster giving a low entropy has high ability to discriminate a class in C.

3.2 Optimal Cluster
Since our task is to find clusters of attribute values that can give class distributions with low entropy, any cluster to be found must satisfy a constraint w.r.t. entropy values. However, although such a cluster can provide a confident rule, it might not support enough data. From this point of view, we formalize our task of finding useful clusters as the following optimization problem [3]:

Constraint: H(D_G^C) ≤ δ
Objective Function (Maximize): Pr(G) = Σ_{a∈G} Pr(A = a),

where δ is a threshold to adjust the admissible confidence of rules.3

3 The literature [5] has reported that a larger cluster is preferable as long as it can provide a rule with sufficient discrimination ability, since such a cluster can reduce the number of confident rules to be extracted. The above formalization is supported by this statement.
A solution to the problem is called an optimal cluster. Finding just one optimal cluster might be considered too restrictive in the sense that there might exist other clusters that satisfy the entropy constraint and support data sufficiently. Since such a cluster can still provide a useful rule, we try to obtain the top-N optimal clusters, where the parameter N is given by users.

3.3 Closeness between Distributions
In general, as a cluster grows, its entropy value changes non-monotonically. This means that even if a cluster does not satisfy the entropy constraint, we cannot prune its expansion safely. However, in order to efficiently find optimal clusters, it would be desirable that such entropy-based pruning is available. In order to realize it, we introduce a closeness relationship among distributions which is defined based on a property of convex functions including the entropy function. Let p_1 = D_{a_1}^C and p_2 = D_{a_2}^C be two distributions such that p_1 = {p_11, ..., p_1n} and p_2 = {p_21, ..., p_2n}, where H(p_i) ≤ δ (i = 1, 2).
Each distribution is represented as a point in an (n − 1)-dimensional space. Consider the (hyper-)surface consisting of the points with the entropy value δ:

f(x_1, ..., x_n) = − Σ_{i=1}^{n} x_i log_2 x_i − δ = 0   (0 ≤ x_i ≤ 1 and Σ_{i=1}^{n} x_i = 1).
Let us consider a perpendicular dropped from p_i to the surface f(x_1, ..., x_n) and assume its foot is p_i^* = (x_1^{*i}, ..., x_n^{*i}). Moreover, consider the tangent (hyper-)plane g_i(x_1, ..., x_n) to f(x_1, ..., x_n) at p_i^*:

g_i(x_1, ..., x_n) = Σ_{j=1}^{n} f_{x_j}(x_1^{*i}, ..., x_n^{*i}) (x_j − x_j^{*i}) = 0,
or
g2 (p11 , . . . , p1n ) × g2 (p21 , . . . , p2n ) ≥ 0.
In a word, if p1 and p2 are on the same side of either tangent plane, they are considered close. Note that we do not require any explicit parameter on distance among distributions. Proposition 1. Let G = {a1 , . . . , am } be a cluster such that H(DaCi ) ≤ δ (1 ≤ C i ≤ m) and for any i and j (i = j), DaCi and DaCj are close. Then H(DG )≤δ holds. The proposition ensure that as long as we combine distributions together which satisfy the entropy constraint and are close each other, the resultant cluster always satisfies the constraint.
Creating Abstract Concepts for Classification
3.4
423
Finding Clusters with Maximal Weighted Clique Search
Top-N optimal clusters can be obtained by finding top-N maximal weighted cliques satisfying the entropy constraint in a weighted undirected graph. Each node in the graph is a class distribution given a value in dom(A) and is assigned a weight as the probability of the value. For any pair of nodes satisfying the entropy constraint, if they are not close, we never make an edge between them. All of the other possible edges are made. Note that any distribution not satisfying the entropy constraint is connected to any distribution. By these edges, satisfying the entropy constraint, a cluster (clique) can include several distributions not satisfying the constraint, if the cluster still satisfies the constraint. As mentioned before, entropy-based pruning is highly desired to be available for our efficient search. We can enjoy it by searching cliques in a certain ordering. Theorem 1. Assume we expand any clique by adding a node in ascending order of entropy. If a clique does not satisfy the entropy constraint, its extensions never satisfy the constraint. That is, entropy-based pruning is safely available. Our algorithm for finding Top-N cliques is based on MWCC [8]. It tries to find cliques with depth-first search with two effective pruning techniques, branchand-bounded and entropy-based pruning. The former is concerned with weight of nodes in the graph and is also adopted in MWCC. The latter is newly introduced. Our algorithm is summarized as follows: 1. The nodes are sorted in ascending order of entropy. 2. The root node in the search tree is the empty set φ. 3. A node Q in the search tree is an ordered set of clique nodes that is associated with a set of candidate nodes RQ for further expansion. 4. For a node Q in the search tree, Q is expanded to Q ∪ {q} associated with RQ ∩ adj(q), where q ∈ RQ is preceded by any node in Q and adj(q) is the set of nodes connected to q directly. 5. Branch-and-Bound Pruning: Assume we already have maximal N -cliques temporary. and Wmin is the minimum weight among them. If w(Q) + w(RQ ) < Wmin , any expansion of Q cannot be included in top-N cliques, where w(S) is the total weight of nodes in S. Stop expanding Q and go to the nearest backtrack point. 6. Entropy-Based Pruning: Q must satisfy H(CQ ) ≤ δ at each step. If for some q in RQ , H(CQ∪{q} ) > δ holds, then Q is not expanded to Q∪{q}. If the condition holds for any q, any expansion of Q cannot satisfy the entropy constraint. Therefore, since Q might be a top-N clique, keep it temporarily and go to the near backtrack point.
4
Preliminary Experimental Results
Our system has been implemented in C on a UNIX-PC (Pentium III-1.2MHz, 512MB memory). We try to create abstract concepts (clusters) for a database, Census-Income Database, in The UCI KDD Archive [6]. The database consists of 199523 tuples with 42 attributes. [Target Attribute: WorkClass]. For two condition attributes Education and Sex, we consider a target attribute WorkClass with 9-classes. That is, ob-
424
Y. Okubo and M. Haraguchi
tained rules are of the form (Education = v) ∧ (Sex = w) → (WorkClass = c). For the class NotInUniverse, the following clusters has been obtained: CEducation = {Children, LessThan1stGrade, 1st − 4thGrade, 5th − 6thGrade, ∗7th − 8thGrade, 9thGrade, 10thGrade, ∗11thGrade, 12thGrade} and CSex = {Male, ∗Female}. ,
where a value with “*” indicates that it has been exceptionally included in the cluster. The cluster CEducation can be interpreted as a concept of “PrimarySecondary Education”. On the other hand, CSex is identical with the original attribute Sex. This means that assuming Sex together with Education as a condition is useless for discriminating NotInUniverse. Thus, obtained clusters can provide us not only a meaningful concept, but also such irrelevance information. [Target Attribute: Marital State.] For two condition attributes Education, Citizenship and a target attribute MaritalState with 7-classes, we have tried to find cliques. For the class MarriedCivilianSpousePresent, we can consider the following clusters: CEducation = {DoctorateDegree, MastersDegree, BachelorsDegree, ProfSchoolDegree, ∗AssociateDegreeAcademicProgram, AssociatesDegreeOccupational} and CCitizenship = {ForeignBornUSCitizenByNaturalization, ForeignBornNotACitizenofUS},
The former can be interpreted as a concept of “Higher Education” and the latter “Foreign Born”. It should be pointed out that the latter cluster cannot be obtained by assuming Citizenship solely. It can be newly found by assuming Citizenship together with Education. Thus, it is expected that another interesting concepts would be obtained by assuming more attributes as conditions.
5
Discussion
Our method is closely related to Conceptual Clustering that is one of important tasks in Machine Learning [11]. Conceptual Clustering is to find clusters of given examples and to generate rules defining (explaining) each cluster. In order to obtain useful definitions (rules), we have to provide an adequate description language. However, it is not easy to provide such language beforehand. The task of conceptual clustering without it corresponds to conventional clustering [7]. Finding our cliques can be actually viewed as clustering of class distributions. In traditional clustering, e.g. K-means and Nearest-Neighbor clustering, a set of data is divided into several clusters (subsets) under a similarity measure. Although some interesting clusters might be found in them, we can never obtain clusters overlapping each other. On the other hand, in general, such overlapping clusters can be found in our Top-N cliques as well as non-overlapping ones. Therefore, our method has a richer ability to find interesting and meaningful concepts compared to traditional clustering frameworks. This is a remarkable advantage of our method. Our experimentation in this paper is still preliminary. In order to show effectiveness of our method more convincingly, we have to make further experimenta-
Creating Abstract Concepts for Classification
425
tion. As shown in our experimental results, interesting concepts might be found by assuming two or more condition attributes. As more condition attributes are assumed, we have a larger weighted graph from which Top-N cliques are found. MWCC the basis of our algorithm has the capability of efficiently dealing with a graph with about 1000 nodes [8]. Additionally, our algorithm can enjoy EntropyBase Pruning in clique search. We therefore expect that it can practically work for more condition attributes and find interesting concepts. Its actual potential and limitation will be verified in further experimentation.
6
Concluding Remarks
We presented in this paper a method of creating abstract concepts for classification rule mining. The task was formalized as an optimization problem and its solutions were obtained by finding top-N maximal weighted cliques. Our preliminary experimental results showed usefulness of our method. Further experimentation would be desired to make this claim more certain. Our method is for creating useful concepts without any source such as a dictionary. Although a dictionary can work as a helpful source for commonlyused concepts, there might in general exist many concepts that are interesting for certain users or purposes but cannot be found there. Revising and customizing such a poor dictionary are surely important tasks in many application fields in which dictionaries are essential. The authors expect that our method would become a basis of such an important framework.
References 1. J. Han and Y. Fu: Attribute-Oriented Induction in Data Mining, Advances in Knowledge Discovery and Data Mining, pp. 399–421, MIT Press, 1996. 2. J. R. Quinlan: C4.5 – Programs for Machine Learning, Morgan Kaufmann, 1993. 3. M. Haraguchi: Concept Learning Based on Optimal Clique Searches, SIG-FAIA202-11, pp.63–66, 2002. 4. M. Haraguchi and Y. Kudoh: Some Criterions for Selecting the Best Data Abstractions, Progress in Discovery Science, LNCS 2281, pp.156–167, 2001. 5. Y. Kudoh and M. Haraguchi and Y. Okubo: Data Abstractions for Decision Tree Induction, Theoretical Computer Science, Vol. 292, pp. 387–416, 2003. 6. S. Hettich and S. D. Bay: The UCI KDD Archive, http://kdd.ics.uci.edu, Dept. of Information and Computer Science, Univ. of California, 1999. 7. A. K. Jain and R. C. Dubes: Algorithms for Clustering Data, Prentice Hall, 1988. 8. Y. Wakai, E. Tomita and M. Wakatsuki: An Efficient Algorithm for Finding a Maximum Weight Clique, Proc. of the 12th Annual Conf. of JSAI, pp. 250–252, 1998 (in Japanese). 9. K. Takabatake: Clustering of Distributions in Summarizing Information Proc. of IBIS’01, pp.309–314, 2001 (in Japanese). 10. Y. Morimoto: Algorithm for mining association rules for binary segmentations of huge categorical databases, Proc. of VLDB’98, pp.380–391, 1998. 11. R. S. Michalski and K. A. Kaufman: Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach, Machine Learning and Data Mining: Methods and Applications, pp.71–112, Wiley, 1997.
Content-Based Scene Change Detection of Video Sequence Using Hierarchical Hidden Markov Model
Jong-Hyun Park1, Soon-Young Park2, Seong-Jun Kang2, and Wan-Hyun Cho3
1 Department of Computer Science, Chonbuk National University, S. Korea
[email protected]
2 School of Information Engineering, Mokpo National University, S. Korea
{sypark, sjkang}@mokpo.ac.kr
3 Department of Statistics, Chonnam National University, S. Korea
[email protected]
Abstract. This paper presents a histogram and moment-based video scene change detection technique using hierarchical Hidden Markov Models(HMMs). The proposed method extracts two types of features from wavelet-transformed images. One is the histogram difference extracted from a low-frequency subband and the other is the normalized directional moment of double wavelet differences computed from high frequency subbands. The video segmentation process consists of two steps. A histogram-based HMM is first used to segment the input video sequence into three categories: shot, cut, and gradual scene changes. In the second stage, a moment-based HMM is used to further segment the gradual changes into fades, dissolves and wipes. The experimental results show that the proposed technique is more effective in partitioning video frames than the threshold-based method.
1 Introduction

Video segmentation into the elementary scenes is the essential process of video indexing, editing and retrieval. Scene change detection is the process of partitioning a video sequence into shots based on the content of the video. Generally, two types of scene changes are considered. One of them is an abrupt transition called a cut while the other is gradual transitions such as fade, dissolve and wipe. Recently, many different video segmentation techniques have been proposed. Most of them use a content-based distance between two consecutive image frames to find a segment boundary. If the frame distance exceeds the threshold value, then a segment boundary is declared. The existing techniques for computing the frame distance for consecutive image frames are the pixel-difference method [1], statistical method [2], histogram-based method [3], DCT-based method [4] and motion-based method [5]. Hierarchical or integrated structures have also been introduced to detect the abrupt shot transition at the first step and the gradual transitions at the next step [6][7][8]. Different approaches which do not need threshold values have been developed by using a statistical model [9]. Boreczky et al. [10] have used a Hidden Markov
Model (HMM) with multiple features such as audio and image to improve the accuracy of video segmentation. In this approach, the states of the HMM consist of various segments of a video. In this paper, we propose a histogram and moment-based video scene change detection using hierarchical HMMs. Histogram-based features are extracted from a low-frequency subband and moment-based features are extracted from the modified double chromatic difference (DCD) [6] of wavelet coefficients in the high-frequency subbands of wavelet-transformed images. Video segmentation is performed hierarchically through a two-stage HMM detection process.
2 Multi-resolution Analysis and Feature Extraction

We consider two types of features for use in video segmentation. We first decompose each image frame into a pyramid structure of subimages with various resolutions, using the multiresolution capability of the discrete wavelet transform [11]. Then histogram-based features are extracted from a low-frequency subband and moment-based features are extracted from the high-frequency subbands of the wavelet-transformed frames. These features are used as the observation values of the HMMs.

2.1 Multi-resolution Analysis Using Wavelet Transform

The two-dimensional wavelet transform of a frame f(x, y) is defined as:
W_f(a, b_x, b_y) = ∫∫ f(x, y) ψ_{a,b_x,b_y}(x, y) dx dy.   (1)

Here, a is a scaling factor, b is the translation, and ψ_{a,b_x,b_y} is the wavelet basis function that is obtained by translating and dilating a single mother wavelet ψ:

ψ_{a,b_x,b_y}(x, y) = (1 / |a|) ψ((x − b_x) / a, (y − b_y) / a).   (2)

After applying the wavelet transform to each frame of a video, we can decompose the image frame into frequency-localized subbands. A low-frequency subband preserves the video information in the form of a smoothed and compressed pattern, and the high-frequency subbands show the directional edge information depending upon each subband.
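For readers who want to reproduce the decomposition step, an off-the-shelf wavelet library can be used. The sketch below relies on PyWavelets with a Haar wavelet and two levels purely for illustration; the paper does not state which wavelet family or how many decomposition levels were used.

```python
import numpy as np
import pywt  # PyWavelets

def decompose(frame, wavelet="haar", levels=2):
    """Multi-level 2-D wavelet decomposition of a grayscale frame.
    Returns the low-frequency approximation subband and, for each level,
    the (horizontal, vertical, diagonal) high-frequency detail subbands."""
    coeffs = pywt.wavedec2(frame.astype(float), wavelet, level=levels)
    return coeffs[0], coeffs[1:]       # coeffs[1:]: coarsest to finest level

frame = np.random.rand(240, 320)       # stand-in for one video frame
approx, details = decompose(frame)
print(approx.shape, [tuple(d.shape for d in lvl) for lvl in details])
```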
2.2 Histogram Feature Extraction

The histogram feature measures the histogram-based distance between adjacent frames of a video sequence. The histogram H_i[·] of the i-th video frame is computed from the low-frequency subband as

H_i[k] = n_k,   0 ≤ k ≤ N − 1.   (3)
Here, k is the index of a histogram bin, with N bins in total, and n_k is the number of pixels in the k-th bin. The histogram feature HD_i of frame i is the absolute bin-wise difference of the histograms of adjacent frames:

HD_i = Σ_k |H_i[k] − H_{i−1}[k]|,   0 ≤ k ≤ N − 1.   (4)
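Equations (3) and (4) amount to a bin-wise L1 distance between the gray-level histograms of the low-frequency subbands of adjacent frames. A minimal NumPy sketch follows; the number of bins and the shared value range are arbitrary choices, not values from the paper.

```python
import numpy as np

def histogram_feature(low_prev, low_curr, n_bins=64):
    """HD_i: absolute bin-wise difference of the histograms of the
    low-frequency subbands of frames i-1 and i (equations (3)-(4))."""
    lo = min(low_prev.min(), low_curr.min())
    hi = max(low_prev.max(), low_curr.max())
    h_prev, _ = np.histogram(low_prev, bins=n_bins, range=(lo, hi))
    h_curr, _ = np.histogram(low_curr, bins=n_bins, range=(lo, hi))
    return int(np.abs(h_curr - h_prev).sum())
```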
2.3 Moment-Based Feature Extraction
Since the wavelet coefficients that lie in the high-frequency subbands of a wavelet-transformed frame represent the edge information with its intrinsic direction, they can be efficiently used to segment scenes of a video that change gradually. The DCD, defined as the accumulation of pixel-wise comparisons between the average intensity of the starting and ending frames of the dissolve and the frame intensity inside the dissolve region, has been used satisfactorily to detect the dissolve boundary [6]. We modify the DCD approach to obtain the double wavelet difference (DWD) in the high-frequency subbands as follows:

DWD_i^V(j, x, y) = T( (W_{i_0}^V(j, x, y) + W_{i_N}^V(j, x, y)) / 2 − W_i^V(j, x, y) ),   (5)

where T(·) is a threshold function, W_i^V(j, x, y) are the wavelet coefficients of the vertical subband at the j-th level of the i-th frame, and i_0 and i_N denote the starting and ending frames of a gradual transition. Similarly, DWD_i^H(j, x, y) and DWD_i^D(j, x, y) are computed from the horizontal and diagonal subbands, respectively. We expect that the above wavelet differences reflect the changing characteristics of fade, dissolve and wipe transitions. In order to quantify the distribution of the double wavelet differences, we compute the central moments of the DWDs. For a coefficient map w(x, y), the central moment of order p + q is defined as [12]
µ_pq = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q w(x, y),   (6)

where x̄ = m_10 / m_00, ȳ = m_01 / m_00 and m_pq = Σ_x Σ_y x^p y^q w(x, y).
Using equation (6), we can compute the 2nd- and 3rd-order normalized directional moments from the vertical, horizontal and diagonal subbands as in Table 1. To make the mathematical notation simple, we denote the vertical variance η_20 and vertical symmetry η_30 extracted from DWD_i^V(j, x, y) as M_i^V(j, 1) and M_i^V(j, 2), respectively. Similarly, the horizontal variance η_02 and horizontal symmetry η_03 extracted from DWD_i^H(j, x, y) are represented as M_i^H(j, 1) and M_i^H(j, 2), respectively. We also denote the diagonal variance η_11, vertical-diagonal variance η_21, and horizontal-diagonal variance η_12 extracted from DWD_i^D(j, x, y) as M_i^D(j, 1), M_i^D(j, 2) and M_i^D(j, 3), respectively.
Table 1. Normalized directional central moments

Vertical Directional Moments
  Vertical Variance:            η_20 = (1 / µ_00^2)   (m_20 − x̄ m_10)
  Vertical Symmetry:            η_30 = (1 / µ_00^2.5) (m_30 − 3 x̄ m_20 + 2 m_10 x̄^2)
Horizontal Directional Moments
  Horizontal Variance:          η_02 = (1 / µ_00^2)   (m_02 − ȳ m_01)
  Horizontal Symmetry:          η_03 = (1 / µ_00^2.5) (m_03 − 3 ȳ m_02 + 2 ȳ^2 m_01)
Diagonal Directional Moments
  Diagonal Variance:            η_11 = (1 / µ_00^2)   (m_11 − ȳ m_10)
  Vertical-Diagonal Variance:   η_21 = (1 / µ_00^2.5) (m_21 − 2 x̄ m_11 − ȳ m_20 + 2 x̄^2 m_01)
  Horizontal-Diagonal Variance: η_12 = (1 / µ_00^2.5) (m_12 − 2 ȳ m_11 − x̄ m_02 + 2 ȳ^2 m_10)
Finally, we use the average of the directional moments as the moment-based feature:

MF_i = (1 / (7 × N)) Σ_{j=1}^{N} [ Σ_{k=1}^{2} M_i^V(j, k) + Σ_{k=1}^{2} M_i^H(j, k) + Σ_{k=1}^{3} M_i^D(j, k) ],   (7)

where N is the number of wavelet decomposition levels.
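Putting equations (5)-(7) together, the moment-based feature for a single decomposition level can be sketched as below. The form of the threshold function T, the use of coefficient magnitudes, and the cut-off value are assumptions made here for illustration; the paper does not specify them.

```python
import numpy as np

def dwd(w_start, w_end, w_curr, cutoff=0.0):
    """Double wavelet difference (5) for one subband.  The magnitude of the
    difference is kept and small values are zeroed out (assumed form of T)."""
    d = np.abs((w_start + w_end) / 2.0 - w_curr)
    return np.where(d > cutoff, d, 0.0)

def central_moment(w, p, q):
    """mu_pq of a non-negative coefficient map w(x, y), as in equation (6)."""
    y, x = np.mgrid[:w.shape[0], :w.shape[1]].astype(float)
    m00, m10, m01 = w.sum(), (x * w).sum(), (y * w).sum()
    if m00 == 0:
        return 0.0
    xb, yb = m10 / m00, m01 / m00
    return ((x - xb) ** p * (y - yb) ** q * w).sum()

def normalized_moment(w, p, q):
    """eta_pq = mu_pq / mu_00^(1 + (p+q)/2), the normalization of Table 1."""
    mu00 = central_moment(w, 0, 0)
    return 0.0 if mu00 == 0 else central_moment(w, p, q) / mu00 ** (1 + (p + q) / 2)

def moment_feature(dwd_v, dwd_h, dwd_d):
    """MF_i for a single level (equation (7) with N = 1): average of the two
    vertical, two horizontal and three diagonal moments of Table 1."""
    ms = [normalized_moment(dwd_v, 2, 0), normalized_moment(dwd_v, 3, 0),
          normalized_moment(dwd_h, 0, 2), normalized_moment(dwd_h, 0, 3),
          normalized_moment(dwd_d, 1, 1), normalized_moment(dwd_d, 2, 1),
          normalized_moment(dwd_d, 1, 2)]
    return sum(ms) / 7.0
```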
3 Shot Boundary Detection by Using the Hierarchical HMMs

Figure 1 illustrates the structure of the proposed hierarchical video segmentation method. Histograms are extracted from the low-frequency subband of each wavelet-transformed image frame and the wavelet differences are extracted from the high-frequency subbands. Figure 2 shows the moment-based features MF for fade, dissolve and wipe. As expected, the MF of a fade has a W shape with two valleys, one around the middle frame of the first shot gradually disappearing (fade out) and one around the middle frame of the second shot gradually appearing (fade in). The MFs of dissolve and wipe transitions have downward and upward parabolic shapes, respectively.
Fig. 1. Structure of a proposed shot boundary detection algorithm
The valley in a dissolve parabola corresponds to the frame whose wavelet coefficients equal the average wavelet coefficients of the starting and ending frames of the dissolve. However, since one part of the first shot is replaced by another part of the second shot during a wipe, the wipe parabola tends to have a convex shape with a peak around the middle frame of the transition. Here, the starting and ending frames of the gradual transitions used for the DWD computation come from the first HMM of the hierarchical HMMs.
Fig. 2. MFs' comparison for fade, dissolve and wipe: (a) fade, (b) dissolve, (c) wipe
Two HMMs are trained by using histogram differences and moment-based features. A histogram-based HMM detects cuts and gradual transitions, and a moment-based HMM detects fades, dissolves and wipes from the first-stage segmentation results.
Fig. 3. Hierarchical HMMs for a video segmentation: (a) histogram-based model, (b) moment-based model
Figure 3(a) shows the structure of the hidden Markov model used in the first step, in which cuts and gradual scene changes are extracted using the histogram-difference feature. The states of the model are a shot, a cut and a gradual scene change. The probabilities of moving from one state to another are shown on the arcs. The shot state can move to any other state, but from a transition state it is only allowable to return to the shot state. The probability P_T is the probability that a cut or a gradual transition occurs, under the simplifying assumption that each transition type is equally likely. The probability 1 − 2P_T is the probability of remaining in the shot state. The probability P_GT is the probability of staying in the gradual transition state and models the duration of a gradual transition; 1 − P_GT is the probability of returning from a gradual transition back to a shot. Since a cut changes the scene rapidly, the model lets a shot make a transition into a cut and then return immediately to a shot. Figure 3(b) shows the hidden Markov model of the second step, which extracts the gradual scene changes (fade, dissolve and wipe) that cannot be distinguished in the first step. The states of this model are a shot, a fade, a dissolve and a wipe, according to the type of scene change. P_T is the transition probability that a fade, a dissolve or a wipe occurs (again assumed equally likely), and 1 − 3P_T is the probability of remaining in a shot.
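As a concrete reading of Figure 3(a), the first-stage model can be written down as a 3-state transition matrix. The numeric values of P_T and P_GT below are placeholders chosen only to make the rows sum to one; they are not values from the paper.

```python
import numpy as np

P_T, P_GT = 0.01, 0.80                   # placeholder probabilities
states = ["SHOT", "CUT", "GRADUAL"]
A = np.array([
    [1 - 2 * P_T, P_T, P_T ],            # SHOT -> SHOT / CUT / GRADUAL
    [1.0,         0.0, 0.0 ],            # a CUT returns immediately to SHOT
    [1 - P_GT,    0.0, P_GT],            # GRADUAL stays with probability P_GT
])
assert np.allclose(A.sum(axis=1), 1.0)
```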
In each step, the parameters of the model are trained with feature vectors obtained from a manual segmentation of scenes. The Baum-Welch algorithm is used iteratively to re-estimate the parameters, and finally, using the estimated parameters, one of shot, cut, fade, dissolve and wipe is allocated to each state. For the segmentation of a video, we first estimate the optimal state sequence by applying the Viterbi algorithm to the input video and then assign the most likely state class to each frame of the video sequence.
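The authors use their own Baum-Welch/Viterbi implementation; purely as an illustration of the same train-then-decode loop, the hmmlearn package can be used as below. The Gaussian emission model, the feature shape and all hyper-parameters here are assumptions, not details from the paper.

```python
import numpy as np
from hmmlearn import hmm

# X: one feature value per frame (e.g. the histogram difference HD_i),
# shaped (n_frames, 1); `lengths` gives the frame count of each training clip.
X = np.random.rand(500, 1)
lengths = [200, 300]

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(X, lengths)             # Baum-Welch re-estimation of the parameters
states = model.predict(X)         # Viterbi: most likely state for each frame
```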
4 Experiment

To test the proposed algorithm, we have collected a video database containing a variety of videos, including news, movies, and music video. We train each HMM using the histogram-difference and moment-based features, respectively, extracted from manually labeled image sequences. After the models are trained, the Viterbi algorithm is used to find the optimal state sequence with maximum posterior probability and the segmentation is carried out by simply mapping the state sequence onto the given model. A histogram-based HMM is first used to segment the input image sequence into three categories: shot, cut, and gradual scene changes. Then a moment-based HMM is used to further segment the gradual changes into fades, dissolves and wipes. To evaluate the efficiency of the proposed scene change detection method, we compare it with a method using simple threshold selection by computing precision and recall. Table 2 shows the results of scene change detection obtained with both the threshold-based method and the HMM-based method; here C, M and F denote correct, missed and false detections. It can be easily seen that the recall and precision of the proposed method are much higher than those of the threshold-based method. Therefore, the proposed HMM-based segmentation method is shown to be efficient in segmenting video scene changes.
Table 2. Scene change detection results using the threshold-based method and the HMM-based method

                      Threshold-based method    HMM-based method
                      C      M      F           C      M      F
Bush video            8      1      1           8      1      0
News video I          14     3      2           15     2      0
News video II         18     2      3           19     1      1
Drama video I         7      3      0           10     0      0
Drama video II        13     3      2           15     1      1
Commercial video      15     3      2           17     1      1
Recall                83.0%                     94.3%
Precision             90.0%                     97.5%
5 Conclusion

In this paper, we proposed a video segmentation method using hierarchical HMMs. The histogram-based HMM is first used to segment the video sequence into shots, cuts and gradual scene changes. Then the moment-based HMM segments the gradual changes into fades, dissolves and wipes. The moment-based features extracted from the wavelet differences in the high-frequency subbands of the wavelet domain have been shown to be efficient for the detection of gradual transitions. We have also shown that the use of hierarchical HMMs is a very promising approach due to their automatic boundary detection characteristics. The experimental results show that the proposed HMM-based method is more effective in segmenting video frames than the threshold-based methods.
References
1. Y. Tonomura, K. Oisuji, A. Akutsu, and Y. Ohba, "Stored Video Handling Techniques," MTT Rev. 5, pp. 60-82, 1993.
2. H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic Partitioning of Full-Motion Video," Multimedia Systems, 1, pp. 10-28, 1993.
3. B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences," Proceedings, Storage and Retrieval for Image and Video Databases, SPIE 2419, pp. 2-13, 1995.
4. H. J. Zhang, C. Y. Low, and S. W. Smoliar, "Video Parsing and Browsing using Compressed Data," Multimedia Tools and Applications, 1, pp. 89-111, 1995.
5. N. V. Patel and I. K. Sethi, "Video Shot Detection and Characterization for Video Databases," Pattern Recognition, 30, pp. 583-592, 1997.
6. J. Yu, G. Bozdagi and S. Harrington, "Feature-based Hierarchical Video Segmentation," IEEE International Conference on Image Processing, Vol. 2, pp. 498-501, 1997.
7. H. H. Yu and W. Wolf, "A Hierarchical Multiresolution Video Shot Transition Detection Scheme," Computer Vision and Image Understanding, Vol. 75, pp. 196-213, July 1999.
8. A. Mittal, L. F. Cheong and L. T. Sing, "Robust Identification of Gradual Shot-Transition Types," IEEE International Conference on Image Processing, Vol. 2, pp. 413-416, 2002.
9. J. S. Boreczky and L. Rowe, "Comparison of Video Shot Boundary Detection Techniques," Proceedings, SPIE '96, 1996.
10. J. S. Boreczky and L. D. Wilcox, "A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, pp. 3741-3744, 1998.
11. C. Wang, K. L. Chan, and S. Z. Li, "Spatial-Frequency Analysis for Color Image Indexing and Retrieval," ICARCV '98, pp. 1461-1465, 1998.
12. R. C. Gonzalez, R. E. Woods, Digital Image Processing, Addison-Wesley Inc., 1992.
An Appraisal of UNIVAUTO - The First Discovery Program to Generate a Scientific Article
Vladimir Pericliev
Institute of Mathematics and Informatics, bl.8, 1113 Sofia, Bulgaria
[email protected]
Abstract. In a companion paper ([14]), I describe UNIVAUTO (UNIVersals AUthoring TOol), a linguistic discovery program that uncovers language universals and can write a report in English on its discoveries. In this contribution, the system is evaluated along a number of parameters that have been suggested in the literature as necessary ingredients of a successful discovery program. These parameters include the novelty, interestingness, plausibility and intelligibility of results, as well as the system’s portability and insightfulness.
1 Introduction
In a companion paper ([14]), I describe UNIVAUTO (UNIVersals AUthoring TOol), a system whose domain of application is linguistics, and in particular, the study of language universals, an important trend in contemporary linguistics. Given as input information about languages presented in terms of feature-values, (eventually) the discoveries of another human agent arising from the same data, as well as some additional data, the program discovers the universals in the data, compares them with the discoveries of the human agent and, if appropriate, generates a report in English on its discoveries. Running UNIVAUTO, with different queries, on the data from a classical paper by Greenberg [4] on word order universals, the system has produced several linguistically valuable texts, one of which was submitted for publication to a refereed linguistic journal without any further human editing (except for the formatting needed to conform to the style-sheet of the journal), and without disclosing the “machine origin” of the article. The article was accepted for publication ([11]). Another of these texts was also published with no post-editing as [13]. Exploring the phonological database UPSID ([8],[9]), the system has generated about 30 pages of text, comprising phonological universals and their support, which is included—as outputted by the system—in [16] (a part of these universals has already been published at the Universals Archive at the University of Konstanz (http://ling.uni-konstanz.de/pages/proj/sprachbau.htm).1 1
Some further discoveries, in which the human user played an appreciable part, will be mentioned in Sect. 3.6.
In this contribution, the system is evaluated along a number of parameters that have been suggested in the literature as necessary ingredients of a successful discovery program. These parameters are the novelty, interestingness, plausibility and intelligibility of results, as well as the system’s portability and insightfulness. The overview of the system in the next section provides the necessary context for the following discussion.
2 Overview of UNIVAUTO
UNIVAUTO operates in the domain of language universals, a branch of linguistic typology studying the common properties (=universals) shared by all or most of the languages of the world.2 Some familiar examples are “All languages have oral vowels” (unconditional, non-statistical universal), “If a language has a dual number, it has a plural” (implicational (or conditional), non-statistical universal), “In most languages, if they have the order Verb-Subject-Object, then the adjective follows the noun” (implicational, statistical universal). UNIVAUTO accepts as input the following, manually prepared, information: 1. A database (=a table), usually comprising a sizable number of languages, described in terms of some properties (feature-value pairs), as well as a list of the abbreviations used in the database. The program also knows their “names”, or what the abbreviations used for feature values stand for. A special value can occur in a database, designating either that the corresponding feature is inapplicable for a language or that the value for that feature is unknown. 2. A human agent’s discoveries, arising from the same database, stated in terms of the used abbreviations. 3. Other information. Aside from these two basic sources of information, the input includes also information on: the origin of database (the full citation of work where the database is given); reference name(s) of database; language families and geographical areas to which the languages in the database belong; etc. The system supports various queries. Thus, the user may require different: (i) logical types of universals (unconditional or implicational with two or more variables), (ii) minimum number of supporting languages, (iii) percentage of validity and (iv) statistical significance. The user can also choose the minimum number of (v) language families and (vi) geographical areas the supporting languages should belong to. 2
For a recent introduction, cf. e.g. [2]. Also, cf. several journals, incl. the authoritative Linguistic Typology, proclaiming as one of its goals the publication of universals register, as well as the electronic Konstanz Universals Archive, containing around 2000 universals from all linguistic levels that have been collected from the published literature.
UNIVAUTO is a large program, comprising two basic modules: one in charge of the discoveries of the program, called UNIV(ersals), and the other in charge of the verbalization of these discoveries, called AU(thoring)TO(ol). UNIV can discover various non-redundant logical patterns (universals), supported in user-specified thresholds of languages, language families and geographical areas, percentage of validity and statistical significance. Importantly, given the discoveries of another, human agent, UNIV employs a diagnostic program to find (eventual) errors in the humanly proposed universals. Currently, the system identifies as problems the following categories: – Restriction Problem: Universals found by human analyst that are below a user-selected threshold of positive evidence and/or percentage of validity and/or statistical significance. – Uncertainty Problem: Universals found by human analyst that tacitly assume a value for some linguistic property which is actually unknown or inapplicable. – Falsity Problem: Universals found by human analyst that are false or logically implied by simpler universals. The discoveries of UNIV fall into two types: (1) a list of new universals, and (2) a list of problems (sub-categorized as above). UNIV assesses the “scientific merit” of its discoveries in order to decide whether to generate a report or not. It uses a natural and simple numeric method: UNIV’s discoveries (novel universals plus problems) are judged worthy of generating a report if they are at least as many in number as the number of the published discoveries of the human agent studying the same database. The authoring module AUTO follows a fixed scenario for its discourse composition, whose basic components are: (1) Statement of title, (2) Introduction of goal, (3) Elaboration of goal, (4) Description of the investigated data and the human discoveries, (5) Explaining the problems in the human discoveries, (6) Statement of the machine discoveries, (7) Conclusion. The details of this scenario, however, will vary in accordance with a number of parameters, related to the specific query to the system and the corresponding discoveries made. We cannot go into details here, and will only mention that for its surface generation, AUTO employs a hybrid approach, using both templates and rules, which are randomly chosen among a set of alternatives in order to ensure intra-textual variability (for details, cf. [14]).
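To make the UNIV module's core task concrete, the basic pattern of scanning a feature-value table of languages for implicational statements that clear user-set support and validity thresholds can be sketched as follows. The data layout, the threshold names and the restriction to two-variable implications are illustrative assumptions, not a description of UNIVAUTO's actual code.

```python
from itertools import permutations

def implicational_universals(table, min_support=5, min_validity=1.0):
    """table: {language: {feature: value}} with None for unknown/inapplicable.
    Yields candidate statements 'if F1 = v1 then F2 = v2' whose support
    (number of languages verifying both sides) and validity (share of
    supporting languages among those with F1 = v1 and F2 known) meet the
    user-chosen thresholds."""
    pairs = {(f, v) for d in table.values() for f, v in d.items() if v is not None}
    for (f1, v1), (f2, v2) in permutations(pairs, 2):
        if f1 == f2:
            continue
        antecedent = [l for l, d in table.items()
                      if d.get(f1) == v1 and d.get(f2) is not None]
        support = [l for l in antecedent if table[l][f2] == v2]
        if antecedent and len(support) >= min_support and \
                len(support) / len(antecedent) >= min_validity:
            yield f1, v1, f2, v2, len(support), len(support) / len(antecedent)
```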
3 Evaluating UNIVAUTO
A common pragmatic criterion for evaluating discovery systems is the publication of their discoveries in the specialized domain literature. According to this criterion, UNIVAUTO performs well: its outputs have found outlet in several linguistic publications ([11], [13], [15], [16] (the latter under submission)). Valdés-Pérez [20] has alternatively characterized machine scientific discovery as the generation of novel, interesting, plausible, and intelligible knowledge, and
has suggested that a successful system should ideally have all these capacities. We have also mentioned as advantageous to discovery systems the features of portability and insightfulness, which were found to be common to four linguistic discovery systems that the author has been involved in ([12]). Below I describe UNIVAUTO along these six dimensions. (Cf. also [1] for an interesting similar discussion concerning basically systems in the domain of mathematics).

3.1 Novelty
UNIVAUTO has so far produced around 60 pages of text, covering about 250 new universals from the fields of word order and phonology. It has found (cf. [11], [13]) that two of the proposed word order universals in the classical article by Greenberg [4] are actually false and that seven others are exceptionless relative to the database investigated rather than statistical, as claimed by Greenberg. Three other of Greenberg’s ordering universals were shown to tacitly assume feature values for some languages which are actually unknown to the database. All these circumstances have remained unnoticed by previous human researchers, and ironically, some of the problematic universals are widely disseminated in the linguistic community (cf. e.g. the complete enumeration of Greenberg’s [4] ordering universals in The Linguistics Encyclopedia, London and N.Y., 1991). Inspecting two further word order databases from Greenberg [4] and Hawkins [6], which are really small 24x4 tables, the system also managed to find patterns that have escaped these authors, considered to be the authorities in the field in the textbook by Croft [2, page 57]. (Cf. also Sect. 3.6.) Similarly, many novel phonological universals were found in the UPSID database in comparison with Maddieson’s [8] findings, as well as some problems in these and other related proposals in the literature (lack of statistical significance and/or low level of validity and/or insufficiently diverse language support). Cf. [16]. Three design properties of the system enhance the chances of finding novel knowledge. The first is the system’s ability to explicitly check its own discoveries against those of a human agent exploring the same data. More generally, this strategy is not impractical in a linguistic discovery system on universals in view of the availability of universals archives, such as the Konstanz Archive mentioned above. The second is the exhaustive search of a combinatorial space that the system performs. Such comprehensive searches of combinatorial spaces, that are furthermore dense with solutions, are known to be very difficult, if not completely beyond the reach of a human investigator, a trite circumstance in computer science (but, unfortunately, not so in many domain sciences as linguistics). As a corollary of the exhaustive search, the system can make meta-scientific claims to the effect that “These are all universals of the studied type (relative to the database)”. The third design property is the ability of the system to handle diverse queries (esp. those concerning different logical types of universals), some of which may not have been seriously posed or pursued before.
3.2 Interestingness
The interestingness of UNIVAUTO's findings is partly derived from the interestingness of the task it automates. Indeed, linguistics has always considered the discovery or falsification of a universal an achievement. From a purely design perspective, the system attempts to enhance the discovery of interesting universals by outputting only the stronger claims and discarding the weaker ones. Thus, if Universal 1 logically implies Universal 2, the first is retained and the second is ignored. E.g. "All languages have stops" implies "If a language has a fricative it also has a stop" and the second claim must therefore be dismissed as a pseudo-universal. (Ironically, this claim has been actually made more than 60 years ago in a celebrated book by Jakobson [7], another linguistic luminary, and has never been refuted.)

3.3 Plausibility
The plausibility of posited universals has been a major concern for UNIVAUTO. Universals are inductive generalizations from an observed sample to all human languages and as such they need substantial corroboration. The system has two principled mechanisms to this end. The first is the mechanism ensuring statistical plausibility, allowing the user to specify a significance threshold for the system's inferences. It is embodied in two diverse methods, the chi-square test and the permutation test,3 which can alternatively be used. The second plausibility mechanism pertains to the need for qualitatively different languages to provide support for a hypothetical universal for it to be outputted by the program. The specific measure of "typological diversity" of the supporting languages is chosen by the user of the system, by selecting the minimum number of language families and geographical areas to which the supporting languages must belong. The plausibility of (eventual) criticisms of a human agent's discoveries is even less problematic. Indeed, one can definitely (and not only plausibly) say when a proposition is false relative to a known database, and that is exactly what the system does.
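For the statistical check, the chi-square variant can be illustrated with SciPy on the 2x2 contingency table of a candidate implication. The counts below are invented for the example, and the permutation test of [18] is not reproduced here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: antecedent holds / does not hold; columns: consequent holds / does not.
contingency = np.array([[28,  2],
                        [15, 25]])       # hypothetical language counts
chi2, p_value, dof, expected = chi2_contingency(contingency)
significant = p_value < 0.05             # user-chosen significance threshold
print(chi2, p_value, significant)
```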
3.4 Intelligibility
With some discovery systems, the user/designer may encounter difficulties in interpreting the program's findings. With other systems, typically those that model previously defined domain-specific problems, and hence systems searching conventional problem spaces, the findings would as a rule be more intelligible. However, intelligibility is a matter of degree and UNIVAUTO seems unique in producing an understandable English text to describe its discoveries (but see also [5]). The following excerpt from [11] will suffice to give an idea of the system's output:
3 The UNIVAUTO permutation test is that presented in [18].
We confirmed the validity of universals [12,13,15–a,15–b,21–a,22– a,27–a]. Universals [16–a,16–b,16–c] are uncertain, rather than indisputably valid in the database investigated, since they assume properties in languages, which are actually marked in the database as “unknown or inapplicable”. . . . Universal [16–a] would hold only if the feature AuxV/VAux is applicable for Berber, Hebrew, and Maori and in these languages the inflected auxiliary precedes the verb. . . . Universal [23–a] is false. It is falsified in Basque, Burmese, Burushaski, Finnish, Japanese, Norwegian, Nubian, and Turkish, in which the proper noun precedes the common noun but in which the noun does not precede the genitive. We found the following previously undiscovered universals in the data. Universal 1. If in a language the adjective precedes the adverb then the main verb precedes the subordinate verb. Examples of this universal are 8 languages: Fulani, Guarani, Hebrew, Malay, Swahili, Thai, Yoruba, and Zapotec. . . .
UNIVAUTO thus both states in English its discoveries (new universals + problems) and the supporting evidence that makes these discoveries plausible/valid. Additionally, it provides a general context into which it places these discoveries (in the introductory parts of the generated text), as well as a summary of the findings (in the conclusion part of the generated text). The readability and self-contained nature of the texts the system normally produces must not be overstated. Some users may prefer to use the output as a "skeleton article" to be subsequently enlarged and edited to fit further stylistic and linguistic needs.4

3.5 Portability
Some discovery systems model general scientific tasks (for induction, classification, explanation, etc.) and would therefore be readily portable to diverse problems in diverse scientific domains. UNIVAUTO is such a system. It mimics the general task of discovery of (logic) patterns from data, and hence would be applicable not only to language universals discovery, where the objects described in the data are languages, but to any database describing any type of objects, be they linguistic or not. This however applies primarily to its discovery module. The text generation module, as it stands, is less flexible and most probably unportable to a domain outside of universals.

3.6 Insightfulness
The degree of formalization discovery programs require may result in our deeper understanding of the tasks modeled, esp. if the sciences from which the task is
4 Summing up the discussion in the last four subsections, it is interesting to note that, from a design perspective, UNIVAUTO turns out to share mechanisms with systems from other domains, giving further credibility to the analysis proposed by Valdés-Pérez [20]: exhaustive search like MECHEM (operating in chemistry, cf. [19]) or KINSHIP (in linguistics, cf. [10]) and survey of the literature like ARROWSMITH (medicine, cf. [17]) for ensuring novelty; preferring logically stronger claims for ensuring interestingness and testing against qualitatively diverse data for ensuring plausibility like the mathematical system Graffiti ([3]), etc.
originally taken are not sufficiently formalized. Another source of insightfulness may be the outcomes of discovery programs, in the case when they make conspicuous some overlooked aspects of the results. Both the implementation and use of UNIVAUTO have triggered a number of linguistically important insights. Some of these are worth mentioning here. First, in the application of the system to linguistic typologies5 it was consistently found that a set of non-statistical (in contrast to statistical) universals exists that describes all and only the actually attested types, whereas previous influential authors ([6]), although strong proponents of exceptionless universals, have claimed them insufficient to do the job. This consistency of the system’s results could not be chance of course, so that it was only a short step finding the explanation. Indeed, a linguistic typology is equivalent to a propositional function, and therefore, as known from propositional logic, for any propositional function there exists a propositional expression that generates it. As a corollary, for any linguistic typology there exists a set of non-statistical universals, describing all and only its attested types ([15]). Secondly, our system found alternative sets of (non-statistical) universals describing the same typology. This gave rise to the problem of choosing among alternatives, which was never recognised before. Since linguists have traditionally given preference to simpler descriptions, the problem shaped to “Find simplest solution(s)”. This turned out not to be difficult, using a minimal set cover mechanism, that was previously implemented for our KINSHIP program ([10]). And, thirdly, exploring the 451 language database UPSID with UNIVAUTO has led to the formulation of a phonological principle to the effect that if Phoneme 1 implies Phoneme 2, then both phonemes share at least one feature and, besides, Phoneme 2 never has more features than Phoneme 1. This formulation was made possible only after the system’s discovery of all universals of this type valid in the database. The subsequent (machine-aided) representation of the phonemes in terms of their feature structure highlighted this statistically significant pattern, holding in 94.5 per cent of the cases ([16]).
4 Conclusion
UNIVAUTO models an important task in linguistics, synthesizing familiar methods from AI and NLP to make discoveries and verbalize these discoveries. The system performs well and is currently being used in the further study of phonological universals. Previous researchers in machine scientific discovery have not seriously considered extending their systems with text generation components, basically because, presumably, their discovery objects are either non-verbally represented in their respective domains or are not sufficiently numerous to merit verbalization.
⁵ A "linguistic typology" states all logically possible types and which of these types are actually attested and which are not.
Acknowledgment. The writing of this paper was partly supported by contract #I-813 with the Bulgarian Ministry of Education and Science.
References
1. Colton, S., Bundy, A., Walsh, T.: On the notion of interestingness in automated mathematical discovery. International Journal of Human-Computer Studies 53(3) (2000) 351-376
2. Croft, W.: Typology and Universals. Cambridge University Press, Cambridge (1990)
3. Fajtlowicz, S.: On conjectures of Graffiti. Discrete Mathematics 72 (1988) 113-118
4. Greenberg, J.H.: Some universals of grammar with particular reference to the order of meaningful elements. In: Greenberg, J.H. (ed.): Universals of Language. MIT Press, Cambridge, Mass. (1966) 73-113
5. Huang, X., Fiedler, A.: Presenting machine-found proofs. CADE-13, Lecture Notes in Computer Science 1104, 221-225
6. Hawkins, J.: Word Order Universals. Academic Press, N.Y. (1983)
7. Jakobson, R.: Kindersprache, Aphasie, und allgemeine Lautgesetze. Almqvist & Wiksell, Uppsala (1941)
8. Maddieson, I.: Patterns of Sounds. Cambridge University Press, Cambridge (1984)
9. Maddieson, I.: Testing the universality of phonological generalizations with a phonetically specified segment database: results and limitations. Phonetica 48 (1991) 193-206
10. Pericliev, V., Valdés-Pérez, R.: Automatic componential analysis of kinship semantics with a proposed structural solution to the problem of multiple models. Anthropological Linguistics 40(2) (1998) 272-317
11. Pericliev, V.: Further implicational universals in Greenberg's data (a computer-generated article). Contrastive Linguistics 24 (1999) 40-51
12. Pericliev, V.: The prospects for machine discovery in linguistics. Foundations of Science 4(4) (1999) 463-482
13. Pericliev, V.: More statistical implicational universals in Greenberg's data (another computer-generated article). Contrastive Linguistics 25(2) (2000) 115-125
14. Pericliev, V.: A linguistic discovery system that verbalises its discoveries. COLING, 19th International Conference on Computational Linguistics, August 24-September 1, Taipei, Taiwan (2002) 1258-1262
15. Pericliev, V.: Economy in formulating typological generalizations. Linguistic Typology 6(1) (2002) 49-68
16. Pericliev, V.: Machine discovery of phonological universals. (2003) (Under submission)
17. Swanson, N.: An interactive system for finding complementary literature: A stimulus to scientific discovery. Artificial Intelligence 91(2) (1997) 183-203
18. Valdés-Pérez, R., Pericliev, V.: Computer enumeration of significant implicational universals of kinship terminology. Cross-Cultural Research: The Journal of Comparative Social Science 33(2) (1999) 162-174
19. Valdés-Pérez, R.: Conjecturing hidden entities via simplicity and conservation laws: Machine discovery in chemistry. Artificial Intelligence 65(2) (1994) 247-280
20. Valdés-Pérez, R.: Principles of human-computer collaboration for knowledge discovery in science. Artificial Intelligence 107 (1999) 335-346
Scilog: A Language for Scientific Processes and Scales Joseph Phillips DePaul University, School of Computer Science, Telecommunications and Information Systems, 243 Wabash Ave., Chicago, IL 60604-2301, USA [email protected]
Abstract. We present Scilog, an experimental knowledge base to facilitate scientific discovery and reasoning. Scilog extends Prolog by supporting (1) dedicated predicates for specifying and querying knowledge about scientific processes, (2) the different scales at which processes may be manifested, and (3) the domains to which values belong. Scilog is meant to invoke more specialized algorithms and to be called by high-level discovery routines. We test Scilog's ability to support such routines with a simple search through the space of geophysical models.
1
Introduction
Computational Scientific Discovery (CSD) differs from the related fields of Machine Learning and Knowledge Discovery in Databases in that prior knowledge is paramount. For example, in CSD there often is at least one large and trusted model that can guide the search for new findings. Additionally, any new knowledge should be in a form that can be readily incorporated into the existing knowledge base (kb). Here, the verb "incorporate into the ... kb" should cover both the narrow computer-controlled memory/hard-disk kb and the larger human-controlled brain/literature kb. General approaches to doing CSD therefore benefit from a general approach to representing and reasoning with scientific knowledge, chiefly knowledge of processes. Both Langley et al. [6] and the Qualitative Reasoning community (e.g. [5]) have identified processes as a fundamental unit of CSD knowledge. Knowledge about processes in turn requires knowledge about their scales and the attribute values that they interrelate. This paper introduces Scilog: an exploratory language and reasoning system to support CSD. Scilog is not meant to compete with qualitative systems like QSIM or quantitative systems like that of Langley et al. Rather, it is an intermediate language and reasoner midway between specific systems like QSIM that compute using dedicated algorithms and higher-level algorithms that enable discovery, visualization, teaching, etc. Scilog is Prolog that is extended to support reasoning about processes, the different scales at which processes are manifested, and domains of attributes
that processes relate. It also supports specialized data-stating and querying predicates, has the ability to read data from tables, and can be told to do simple data consistency checks. Scilog's extensive use and extension of floating-point values forbids its implementation as a Prolog module. Scilog is a language for deduction using a single model. Our conceptualization of science requires that any changes in prediction based on new knowledge be facilitated by explicit high-level model-changing algorithms, not by the underlying computational method (cf. non-monotonic logic). It also disallows combining the individual predictions of several mutually contradictory models into another (perhaps more accurate) one (cf. boosting). With Scilog we have tried to obtain some of the advantages of using an intermediate scientific reasoning system, including uniform access to a variety of specialized algorithms, support for several applications with one kb, and predefined methods to exchange knowledge between domains, scales and processes. Our preliminary results in the field of geophysics suggest that we are on the correct track. We present our test domain in Section 2 and Scilog in Section 3. Section 4 presents experiments. Section 5 discusses and concludes.
2
Test Domain
The "San Andreas discrepancy" was the name for the difference between the calculated and measured relative velocities of the Pacific ("PAC") and North American ("NAM") plates. J. Tuzo Wilson was the first to suggest that the San Andreas fault could be a boundary between PAC and NAM [4]. Early predictions estimated the speed to be about 57 mm/year. However, this disagrees with the speed of 32±5 mm/year measured in the 1960s and with the motion of 36 mm/year (time scale: 13,000 years) inferred from the examination of stream bed offsets [4].
Fig. 1. Coastal California, San Andreas and Basin-and-Range
Currently many geophysicists believe that the discrepancy can be accounted for by several factors. First, of course, is the motion along the San Andreas fault. Second, the “Basin and Range” region to the east of the fault is known to be expanding. Geodetic data between 1981 and 1985 gives the rate as 9 mm/year, most of which is parallel to the direction of PAC-NAM motion [4].
Third, more recent models of global tectonics reduce the relative PAC-NAM rate to 48 mm/year over the past 3-4 million years. Fourth, the remaining portion of the unaccounted-for motion is believed to be associated with the deformation of the region to the west of the fault: portions of it are rising at the rate of 2-3 cm/kyear, creating the Santa Monica mountains in Southern California [1]. However, because much of this region is under the Pacific Ocean, direct measurement of its speed and direction of motion, and of its expansion or contraction, is more difficult.
3 Scilog
3.1 General
Scilog is an outgrowth of the knowledge base used in [7] and is meant to be a common language that is intermediate between low-level computational modules and high-level visualization and discovery modules. While Scilog is too low-level for widespread usage by scientists, it is a testbed for a language that may be embedded in scientific applications. Scilog uses a frame system of <object, attribute, value> tuples. Like scientists, Scilog distinguishes between measurements and predictions. Measured values are to be given in tables, or are to be associated with individual objects or classes with the slot and inherit predicates respectively. The predicates dtree, equation, process class and property are used to predict with decision trees, numeric relations, processes and arbitrary Prolog sentences respectively. The querying predicate lookup just checks measurements, compute just checks predictions, and property (when used as a querying predicate) executes a lookup and then a compute. Scilog also supports contexts, which hold state information (e.g. time and current scale). Scilog values have a primary value (a mean, median or mode), a certainty range, a domain, a scale and optional references to the object and attribute being described. Domains and scales specify the units, legal values and other meta-information for values. Both are described below. Scilog extends native Prolog predicates and functions to use these metadata. The mathematical comparison predicates and functions =, is, >, <, etc. have been extended to convert between units of the same dimension and to fail to unify or operate on values of different dimensions. Additionally, numeric operators compute certainty bars given the certainty bars of their operands. If this results in a non-Gaussian probability distribution then the resulting value is a composite of 32 random samples.
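As a rough illustration of the value metadata just described, the following sketch shows one way unit-aware comparison and simple uncertainty propagation could be handled. It is an invented approximation, not Scilog's C++ implementation: only one dimension (length) is modeled, the certainty range is treated as a symmetric error bar, and addition combines errors under a Gaussian assumption.

```python
import math

# Toy unit table: (dimension, factor to the base unit); assumed for illustration.
UNITS = {"mm": ("length", 0.001), "m": ("length", 1.0), "km": ("length", 1000.0)}

class Value:
    """A value with a primary value, a symmetric certainty range and a unit."""
    def __init__(self, primary, plus_minus, unit):
        self.primary, self.plus_minus, self.unit = primary, plus_minus, unit

    def to_base(self):
        dim, factor = UNITS[self.unit]
        return dim, self.primary * factor, self.plus_minus * factor

    def __eq__(self, other):
        d1, p1, e1 = self.to_base()
        d2, p2, e2 = other.to_base()
        if d1 != d2:
            return False                       # different dimensions never unify
        return abs(p1 - p2) <= e1 + e2         # equal if certainty ranges overlap

    def __add__(self, other):
        d1, p1, e1 = self.to_base()
        d2, p2, e2 = other.to_base()
        if d1 != d2:
            raise TypeError("cannot add values of different dimensions")
        # combine uncertainties in quadrature (a Gaussian assumption, not Scilog's rule)
        return Value(p1 + p2, math.hypot(e1, e2), "m")

print(Value(1500, 100, "mm") == Value(1.6, 0.05, "m"))   # True: the ranges overlap
```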
3.2 Representation of Domains
Suppose we are given details of slip along a fault plane. We hastily convert this to Scilog and type it into our computer. When testing our description, if we are able to compute speeds of less than 0 mm/yr then we know there is an error.
Speed cannot be negative (although velocity may be), according to the definition of the speed's range. Similarly, if we compute a speed in excess of 3 × 10^4 km/s (a tenth of the speed of light) then we are reasonably sure there is an error. Most non-fundamental particles do not go that fast, or equivalently the wave's speed will probably saturate at some slower speed. Domains specify an attribute's legal values (including its datatype), its units (including its dimension), its measuring-instrument-determined precision and the types of objects that can be described. They are used to detect illegal values as they are computed. There are three numeric domain datatypes (integer, floating point and fixed point) and a non-numeric one (concept). Numeric domains have the following upper and lower bounds:
1. range define limit: The range is logically defined to have hard endpoints.
2. system limit: The system being measured has these physical endpoints.
3. detect limit: The measuring instrument cannot detect values outside this range.
4. saturate limit: The system does not exhibit many values outside this range.
5. reliable limit: The instrument cannot reliably detect values outside this range.
6. observed limit: No values were observed beyond this value.
Additionally, floating point domains have length-of-mantissa and exponent fields, and may have up to 32 random sample points; fixed point domains have a delta that tells the difference between consecutive values; and concept domains may specify the class to which all values belong. Concept values may also just specify membership in a class instead of giving a particular instance. This implements the conceptual equivalent of certainty ranges. Three classes of consistency constraints exist: system constraints (saturate ⊆ system ⊆ range define), instrument constraints (reliable ⊆ detect ⊆ range define) and data constraints (observed ⊆ system, detect). Reliable and saturate are "soft" bounds and their semantics is open to interpretation. Detect, system and logical are "hard" bounds fixed by the recording device, domain knowledge and the definitions of attributes. Scilog uses this meta-knowledge for consistency checking (Scilog can check values generated during computation to see if they violate any constraints) and for dimensionality checking/automatic unit conversion (values with different dimensions will fail to unify).
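A hedged sketch of how the nested bound constraints might be checked is given below. The class, field names and example numbers are invented; real Scilog domains also carry units, datatypes and precision information that are omitted here.

```python
class NumericDomain:
    """Toy numeric domain with the nested bounds described above (names invented)."""
    def __init__(self, range_define, system, saturate, detect, reliable, observed):
        self.bounds = {"range_define": range_define, "system": system,
                       "saturate": saturate, "detect": detect,
                       "reliable": reliable, "observed": observed}

    @staticmethod
    def _inside(inner, outer):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    def consistent(self):
        b = self.bounds
        return (self._inside(b["saturate"], b["system"])         # system constraints
                and self._inside(b["system"], b["range_define"])
                and self._inside(b["reliable"], b["detect"])      # instrument constraints
                and self._inside(b["detect"], b["range_define"])
                and self._inside(b["observed"], b["system"])      # data constraints
                and self._inside(b["observed"], b["detect"]))

    def check(self, x):
        """Flag values outside the hard logical bound, e.g. a negative speed."""
        lo, hi = self.bounds["range_define"]
        return lo <= x <= hi

# Invented example domain for a "speed" attribute.
speed = NumericDomain(range_define=(0.0, 3.0e5), system=(0.0, 3.0e4),
                      saturate=(0.0, 1.0e3), detect=(0.0, 1.0e5),
                      reliable=(0.0, 5.0e4), observed=(0.0, 60.0))
print(speed.consistent(), speed.check(-5.0))   # True False
```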
3.3 Representation of Contexts and Scales
There is more to a datum than just its own domain. For example, geophysicists believe that many fault segments are locked in place and only move during earthquakes. If we look at such a segment for one second we will observe a relative velocity of 0 most of the time and some dramatic non-zero value during earthquakes. If, however, we ask what the average velocity is over one century
or another, then we expect more consistent and much lower non-zero values. Contexts differentiate the cases by giving the "instantaneous" velocity's time delta as 1 second and the "average" velocity's time delta as 1 century. Contexts list assertions that pertain to a value besides its domain. Such assertions may include information on measurement methods and instruments, meta-physical assumptions implicit in the data, and the value's scale, which tells the time and temporal resolution. Contexts are given to queries as conjunctions of assertions. These assertions state what may be assumed during one particular query. Contexts are automatically passed to recursive subqueries and match with equivalent contexts and with their generalizations. Scilog can automatically transform values between two scales if it has knowledge about how the scales relate and knowledge on how to transform the value's attribute. The scale knowledge that Scilog needs is the subdivision of a gross-scale value into its finer-scale values (or equivalently between a gross-scale value and a domain describing the subdivided values) and the value-combining function.
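The rescaling machinery can be pictured as the pair (subdivision, combining function). The sketch below is an invented illustration of the second half: finer-scale values are folded into one gross-scale value by an attribute-specific combining function (an average for a velocity, a sum for an extensive quantity); the yearly offsets are made up.

```python
def rescale(fine_values, combine):
    """Combine finer-scale values into one gross-scale value.

    fine_values -- values at the finer scale (e.g. yearly offsets in mm)
    combine     -- the value-combining function for the attribute
    """
    return combine(fine_values)

# Illustration (numbers invented): yearly fault offsets combined into a
# century-scale average velocity; for an extensive attribute we would sum instead.
yearly_offsets_mm = [0.0] * 99 + [3200.0]      # one large earthquake in a century
average = rescale(yearly_offsets_mm, lambda xs: sum(xs) / len(xs))
print(average)                                  # 32.0 mm/year on the century scale
```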
3.4 Representation of Processes
All values, with the exception of the advancing of clock time, are only allowed to change due to the action of processes. In the fault segment example, the fact that the offset between plates changes over time means that at least one process is active. Processes describe how a set of similar phenomena change a system from a before state to an after state. Processes may be decomposed into subprocesses to describe finer-grained state changes. Like Langley et al., we distinguish between abstract and concrete processes. We call abstract processes process classes. They have a name that uniquely identifies them, an object list which gives the classes of objects that are interrelated by the process' effects, a conditions list, and a manifestations list which gives equations and decision trees that relate the objects. Process instances concern one particular event of a process class. They have a name, a class to which they belong, and a list of all objects that they interrelate. Like Langley et al., our process instances may be composed of other processes. The two ways that a process instance may be decomposed are temporally (implemented by serial scales) and structurally (implemented by identifying conceptually simpler subprocesses that act in parallel). Our process "grammar" is therefore more low-level, but is computationally equivalent to Finite State Automata. Process instance knowledge is redundantly stored at different scales and with different subprocesses. This serves two purposes: it aids efficient computation by caching results at scales and subprocesses that scientists say are useful (cf. the utility problem), and it allows for composite processes by combining the effects of small-scale processes into large ones.
3.5 Computing with Domains, Scales, and Processes
Scilog has most of the functionality of a Prolog interpreter (changing operator precedence is not supported) and supports new predicates. To answer queries Scilog tries the internally-supported knowledge sources in the following order: context, property cache, database, frame and inheritance, direct process manifestations, indirect manifestations, equations, decision trees, and lastly, arbitrary Prolog logic. The context and property cache both store information specific to the current query. Scilog uses process classes in two ways. Direct computation uses a single process’ manifestations and conditions. Indirect computation predicts process values by using composite processes to constrain them.
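The ordered search through knowledge sources can be sketched as a simple dispatcher that tries each source in turn and stops at the first answer. The source names follow the order listed above, but the lookup functions, the query format and the toy frame content are assumptions made for illustration, not Scilog's API.

```python
def answer(query, sources):
    """Try each knowledge source in order and return the first answer found.

    sources is an ordered list of (name, lookup_function) pairs; a lookup
    function returns None when it cannot answer the query.
    """
    for name, lookup in sources:
        result = lookup(query)
        if result is not None:
            return name, result
    return None, None

# Toy knowledge sources: only the frame store knows the answer here.
context        = {}
property_cache = {}
frames         = {("earth", "radius_km"): 6371}

sources = [
    ("context",        context.get),
    ("property_cache", property_cache.get),
    ("frame",          frames.get),
]
print(answer(("earth", "radius_km"), sources))   # ('frame', 6371)
```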
4
Experiments
There is, of course, overhead associated with our rich description of values: dimensions must be checked; units must be converted; domains and attributes that result from arithmetic operations must be looked up and, when necessary, computed or invented. Therefore, it is inappropriate to compare Scilog with spreadsheets, which are more concerned with the layout and display of computations, or with tools like Matlab, Mathematica and Maple, which keep only limited datatype information. Rather, we should grade our knowledge base on how well it supports scientific reasoning according to the following goals: (1) uniform access, (2) applications support, and (3) domain, scale and processes support. To accomplish this we have built a simple discovery application on top of Scilog and applied it to an abstraction of a problem found in the literature: that of the "San Andreas Discrepancy". Our discovery application considers models of increasing structural complexity, roughly paralleling the search that geophysicists made to explain the discrepancy in the 1970s and 1980s. The fact that this was a real problem suggests that if Scilog is successful here then it potentially could be helpful in other "real-world" problems. Our models extensively use Scilog's ability to store knowledge in different forms. Table 1 lists the geophysical knowledge that was used, by the form that was employed. We allow for three different process classes: plate-plate relative motion, motion along a fault, and subplate expansion. We use up to five process instances: the super-process PAC-NAM motion and the sub-processes San Andreas motion, Basin and Range expansion, California Coastal motion and the motion of some unspecified Pacific subplate. Data on these process instances exists at one or more of four scales: early 1980s, 1960s, holocene (the past 12,000 years) and quaternary (the past 1 million years), as given in Table 2. Speeds are associated with all process instances. The composite process PAC-NAM has a speed of 48 mm/year. The rate of slip along the San Andreas fault will be taken as 32 ± 5 mm/year in the 1960s and 36 mm/year during the holocene. The rate of expansion of the Basin will be taken to be 9 ± 4 mm/year in the early 1980s and 12 mm/year during the holocene [4]. Finally, the rates
Table 1. Encoding of geophysical knowledge (some equations are online at [3])

Form              What is represented
slot              Earth's radius; San Andreas locations; relations among the Earth, PAC, NAM, San Andreas, Basin and Range, and Coastal Cal.; speeds and directions; relations among motions
inheritance       Shear modulus and max. width info for quakes in Southern California
equations         Resolving motion abs. length and direction into north and east vectors
process class     Tectonic attr. relations: rotational pole, angular speed, speed and direction; quake attr. relations: area, shear modulus, magnitudes and offset
process instance  Objects interrelated by PAC-NAM motion, San Andreas earthquakes, Basin and Range expansion, California coastal motion
Table 2. Scales at which data was stated (times in fractions of a year)

scale         start time   end time   resolution
early 1980s   1981.5       1985.5     1
1960s         1960.5       1970.5     1
holocene      −1,100       1985       100
quaternary    −1 × 10^6    1985       1 × 10^4
of expansion of the California coast and of some other unidentified block will be predicted given the constraints imposed by the other process instances. This reflects the lack of data for the partially submerged coast as well as the admitted ignorance of where some “other” expanding subplate might be. Ideally we would use a general heuristic for grading a scientific model such as the one presented by Phillips [8]. That approach, however, requires that we divide our scientific model into theory, laws and data; and that we give constants telling how much we trust the data versus trust the laws. In this paper we have simplified our analysis by only grading the data given the laws, by generating sets of laws from the structurally simplest model to increasingly more complex ones, and by ignoring the theory component altogether. Valdes-Perez [9] pioneered the approach of generating scientific models from structurally simplest to increasingly complex. This approach, however, has only been applicable to models that are entirely structural and do not, for example, contain floating point numbers that predict to greater or lesser accuracy. We use Scilog’s computational ability to slightly generalize his approach. We rely on Scilog’s facility for processes to compute process instance attributes when they are not explicitly stated in the model. (Some values may not be deducible if there is insufficient information. If this is true then we reject the model and go to the next.) We take all of the predictions of the model and compare them with either the recorded value at the scale or the closest scale’s value. (Relying on the closest scale is necessary because not all values are known at all scales.) Acceptable models both predict all values and have predictions that have certainty ranges that overlap with those of their closest observations.
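The acceptance test described above (every value predicted, and each prediction's certainty range overlapping that of the closest-scale observation) can be sketched as follows. The interval representation and the numbers are invented; they do not reproduce the paper's Table 3.

```python
def overlaps(a, b):
    """True if two (low, high) certainty ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def acceptable(predictions, observations):
    """Accept a model only if it predicts every observed quantity and each
    prediction's certainty range overlaps the closest observation's range.

    Both arguments map a quantity name to a (low, high) interval in mm/year.
    """
    for name, observed in observations.items():
        predicted = predictions.get(name)
        if predicted is None:          # failure to predict rejects the model
            return False
        if not overlaps(predicted, observed):
            return False
    return True

# Invented intervals, loosely inspired by the rates quoted in the text.
obs  = {"san_andreas": (27.0, 37.0), "basin_range": (5.0, 13.0)}
pred = {"san_andreas": (30.0, 40.0), "basin_range": (8.0, 10.0)}
print(acceptable(pred, obs))           # True
```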
Subsequent models are generated by incorporating another process instance's measurements for process instance attributes that have no value explicitly stated in the model. Only if all process instance attribute values are specified do we increase the model's complexity by adding another subprocess to PAC-NAM. This parallels real scientific discovery, where first the San Andreas, then the Basin and Range, and finally Coastal California were used to understand the border between PAC and NAM in California. Scilog's predictions are given in Table 3. Values that are underlined and italicized represent predictions that do not match the observations or are inconsistent with other predictions, question marks represent the failure to predict, and the "-" means that the value is not relevant for the model. Model M0 reflects the beliefs in the late 1960s that only the San Andreas was responsible for all of the relative motion. Subsequent models add the effects of the Basin and Range, the California Coast, and lastly some unseen subplate (labelled "Other"). The experiment must end at M8 because no more data is available.
Table 3. Progression of Geophysical Models of the "San Andreas Discrepancy" (all values are velocities in mm/year)
M5 is the best model. Scilog has successfully reproduced the computations needed to support the modern scientific view that San Andreas motion, Basin-and-Range extension, and Coastal Californian motion all play a part in the dynamics of the PAC-NAM boundary. Let us consider the PAC-NAM northerly velocity of 39 mm/year in more detail. Scilog created a new domain, domain16, that results from the operations that went into making it. For example, it is a floating point domain with only 7 binary digits of mantissa precision. This results from the low resolution of the angular velocity between PAC and NAM of 7.8 × 10^−7 degrees/yr [2] used in the calculation. Additionally, Scilog created an attribute for this value named plate plate motions angular velocity times radius attr times sin of acos of sin of latitude attr times sin of plate plat 2 that (partially) expresses the history of the computation.
To demonstrate Scilog's ability to transfer knowledge between scales, and to show the limitations of its deductive framework, we loaded records of 176 earthquakes from part of the San Andreas in Central California (36.6 N, 121.2 W to 36.3 N, 120.8 W) from 1960 to 1970 at a finer scale (time delta: 1 day). Computed estimates of the average velocity vary with assumptions about their areas, but all are significantly smaller than the lookup value of 32 mm/year. There are at least two reasons for this: (1) not all earthquakes were listed (some were too small) and (2) in general, another process (aseismic creep) also moves faults (especially in Central California). This highlights both Scilog's ability to check consistency between models and data and its limitation to purely monotonic reasoning.
5
Discussion and Conclusion
We can assess our progress towards our goals:
1. Uniform access to a variety of specialized algorithms. These computations utilized knowledge from frames, rules, equations, process classes and process instances.
2. Support for several applications with one kb. The experiment demonstrated support for scientific re-discovery.
3. Predefined methods to exchange knowledge between domains, scales and processes. The experiments showed the system's ability to use and create new domains and attributes, to use process-specific information and to rescale knowledge.
Scilog is a purely deductive system and is limited to monotonic reasoning. In science this might be a good thing: it forces scientists to be as explicit as possible about their models. We have introduced preliminary work on Scilog, an extension of Prolog expressly designed to support scientific deduction. We implemented Scilog in C++ because of its heavy use and extension of floating-point values. Scilog is too low-level a language for scientists, and it is designed for internal computation by applications that do scientific reasoning. Despite its low level, Scilog has shown promise in its ability to support scientific computations through its uniform access, its support for applications, and its ability to manipulate domains, scales and processes. Specific areas for improvement include calling more specialized reasoning systems, and building discovery and reasoning systems on top of it.
References
[1] Collier, Michael: A Land in Motion: California's San Andreas Fault. University of California Press, Berkeley and Los Angeles (1999)
[2] DeMets, C., Gordon, R.G., Argus, D.F., Stein, S.: Current plate motions. Geophys. J. Int. 101, pp. 425-478 (1990)
[3] Jordan, B.: Global Plate Motion Models. http://people.whitman.edu/~jordanbt/platemo.html (2002)
[4] Jordan, T.H., Minster, J.B.: Measuring Crustal Deformation in the American West. Scientific American, August (1988)
[5] Kuipers, B.: Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. MIT Press, Cambridge, Massachusetts (1994)
[6] Langley, P., Sanchez, J., Todorovski, L., Dzeroski, S.: Inducing Process Models from Continuous Data. ICML (2002)
[7] Phillips, J.: Representation Reducing Heuristics for Semi-Automated Scientific Discovery. Ph.D. Thesis, University of Michigan (2000)
[8] Phillips, J.: Towards a Method of Searching a Diverse Theory Space for Scientific Discovery. Discovery Science. Morgan Kaufmann, San Francisco (2001)
[9] Valdes-Perez, R.: Machine discovery in chemistry: new results. Artificial Intelligence 74(1), pp. 191-201 (1995)
Mining Multiple Clustering Data for Knowledge Discovery Thanh Tho Quan, Siu Cheung Hui, and Alvis Fong Nanyang Technological University, School of Computer Engineering, Singapore {PA0218164B,asschui,ascmfong}@ntu.edu.sg
Abstract. Clustering has been widely used for knowledge discovery. In this paper, we propose an effective approach known as Multi-Clustering to mine the data generated from different clustering methods for discovering relationships between clusters of data. In the proposed Multi-Clustering technique, it first generates combined vectors from the multiple clustering data. Then, the distances between the combined vectors are calculated using the Mahalanobis distance. The Agglomerative Hierarchical Clustering method is used to cluster the combined vectors. And finally, relationship vectors that can be used to identify the cluster relationships are generated. To illustrate the technique, we also discuss an application example that uses the proposed Multi-Clustering technique to mine the author clusters and document clusters for identifying the relationships of authors working on research areas. The performance of the proposed technique is also evaluated.
1
Introduction
Clustering [1,2,3] is an effective data mining technique. It divides data into groups of similar objects, called clusters. As clusters are generated based on an individual attribute of a database, it is useful to identify the relationships between different attributes of a database by mining the multiple clustering data. In this paper, we propose a mining technique called Multi-Clustering, which mines cluster data generated from multiple clustering methods to identify relationships between them. To tackle this problem, we need to transform the multiple clustering data into combined vectors, and use a clustering method to cluster the combined vectors in order to derive the relationships (i.e. inter-cluster set relationships and intra-cluster set relationships) among the different clusters. In this paper, the proposed Multi-Clustering technique will be discussed. In addition, we will also discuss an application example to illustrate the proposed technique. Finally, the performance of the proposed technique is evaluated based on F-measure and entropy measurements.
Fig. 1. Multi-Clustering technique (Database → Multiple Clustering Methods → Cluster Set 1 ... Cluster Set n → Multi-Clustering: Vectorization, Distance Evaluation, Vector Clustering, Relationship Generation → cluster relationships → Applications).
2
Multi-clustering Technique
The proposed Multi-Clustering technique is shown in Figure 1. The Multi-Clustering technique consists of the following steps: Vectorization, Distance Evaluation, Vector Clustering and Relationship Generation.
3
Vectorization
Vectorization extracts and represents multiple clustering data as vectors called combined vectors in a multi-dimensional space. Vectorization can be defined formally as follows.
Definition 1 (Cluster Set): Let S = {D1, D2, ..., Dm} be a set of data items. A clustering method CM clusters the data items in S into a set of k clusters (or cluster set) CS = {C1, C2, ..., Ck} such that
- if Ds ∈ Cp and Dt ∈ Cp, where 1 ≤ s, t ≤ m and 1 ≤ p ≤ k, then Ds and Dt are similar;
- if Ds ∈ Cp and Dt ∈ Cq, where 1 ≤ s, t ≤ m, 1 ≤ p, q ≤ k and p ≠ q, then Ds and Dt are dissimilar.
Definition 2 (Cluster Number): Let CS = {C1, C2, ..., Ck} be the cluster set obtained from applying a clustering method CM to a set of data items S = {D1, D2, ..., Dm}. If Di ∈ Cj, where 1 ≤ i ≤ m and 1 ≤ j ≤ k, then j is the cluster number for Di in CS.
Definition 3 (Combined Vector): Let S = {D1, D2, ..., Dm} be a set of data items. Let CS1, CS2, ..., CSn be the cluster sets obtained from applying the clustering methods CM1, CM2, ..., CMn to S. The combined vector for a data item Di, where 1 ≤ i ≤ m, is the vector vi = (d1, d2, ..., dn), where dj, with 1 ≤ j ≤ n, is the cluster number for Di in CSj.
Definition 4 (Vectorization): Let S = {D1 , D2 , . . . , Dm } be a set of data items. Let CS1 , CS2 , . . . , CSn be the cluster sets obtained from applying the clustering methods CM1 , CM2 , . . . , CMn to S. Vectorization generates a set of m vectors V = {v1 , v2 , . . . , vm } from the cluster sets CS1 , CS2 , . . . , CSn where vi , 1 ≤ i ≤ m, is the combined vector for a data item Di in S.
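Definitions 1-4 amount to reading off, for every data item, its cluster number under each clustering method. A minimal sketch of this vectorization step is shown below; the document identifiers and cluster numbers are invented.

```python
def vectorize(cluster_sets):
    """Build combined vectors from n clusterings of the same m data items.

    cluster_sets -- list of n dicts, each mapping item -> cluster number
    Returns a dict mapping item -> tuple of n cluster numbers (the combined vector).
    """
    items = list(cluster_sets[0])
    assert all(set(cs) == set(items) for cs in cluster_sets), \
        "all clusterings must cover the same items"
    return {item: tuple(cs[item] for cs in cluster_sets) for item in items}

# Toy example: two clusterings (e.g. by keyword and by author) of four documents.
keyword_clusters = {"d1": 1, "d2": 1, "d3": 2, "d4": 2}
author_clusters  = {"d1": 3, "d2": 3, "d3": 3, "d4": 1}
print(vectorize([keyword_clusters, author_clusters]))
# {'d1': (1, 3), 'd2': (1, 3), 'd3': (2, 3), 'd4': (2, 1)}
```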
4
Distance Evaluation
To discover relationships from the combined vectors, we can cluster the combined vectors into clusters of similar vectors. However, in order to perform clustering, we need to obtain the distances between the combined vectors. In this research, we adopt the Mahalanobis distance [2] to calculate the vectors' distances, since the Mahalanobis distance can incorporate the correlation of the dimensions of vectors. Mathematically, the correlation of dimensions can be inferred from their covariance [8].
Let V = {v1, v2, ..., vm} be a set of n-dimensional combined vectors. The mean of the ith dimension of V, where 1 ≤ i ≤ n, is defined as
\[ \bar{V}_i = \frac{\sum_{k=1}^{m} v_{ki}}{m} \qquad (1) \]
where v_{ki} is the value of v_k on the ith dimension.
Let V = {v1, v2, ..., vm} be a set of n-dimensional combined vectors. The covariance c(i, j) of the ith dimension and the jth dimension is defined as
\[ c(i, j) = \frac{\sum_{k=1}^{m} (v_{ki} - \bar{V}_i)(v_{kj} - \bar{V}_j)}{m - 1} \qquad (2) \]
The covariance c(i, j) can be used to determine whether the ith and jth dimensions have correlation or not. If c(i, j) = 0, then the ith and jth dimensions have no correlation.
Let V = {v1, v2, ..., vm} be a set of n-dimensional combined vectors. The covariance matrix C_V of the dimensions in V is defined as
\[ C_V = \begin{pmatrix} c(1,1) & c(1,2) & \cdots & c(1,n) \\ c(2,1) & c(2,2) & \cdots & c(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ c(n,1) & c(n,2) & \cdots & c(n,n) \end{pmatrix} \]
Having the covariance matrix defined, we use the Mahalanobis distance to calculate the vectors' distances, which is defined as follows. Let V = {v1, v2, ..., vm} be a set of n-dimensional combined vectors. The Mahalanobis distance between v_i and v_j, where 1 ≤ i, j ≤ m, is calculated as
\[ d_M(i, j) = \sqrt{(v_i - v_j)^T C_V^{-1} (v_i - v_j)} \qquad (3) \]
Note that when we calculate the distances between combined vectors using equation (3), we use the same unit metric for all dimensions. However, since the importance of the clustering methods corresponding to the dimensions is different, it
is clear that values on different dimensions are scaled by different scaling factors if the same unit metric is used on all dimensions. Thus, we need to prove that if the values on the dimensions are scaled by any scaling factors, the Mahalanobis distance evaluation is not affected.
Consider the scaling of dimensions for a set of vectors V = {v1, v2, ..., vm} as a linear transformation process, i.e. a new set of vectors V' = AV is obtained, where A is a weight matrix. In Lemma 1, we show that the Mahalanobis distance is invariant under any such linear transformation.
Lemma 1: The Mahalanobis distance is invariant under any linear transformation V' = AV.
Proof. We have C_{V'} = A C_V A^T. Let v'_i be the transformed vector of v_i, i.e. v'_i = A v_i. We have
\[ d_M(v'_i, v'_j)^2 = (v'_i - v'_j)^T C_{V'}^{-1} (v'_i - v'_j) = (v_i - v_j)^T A^T (A^T)^{-1} C_V^{-1} A^{-1} A (v_i - v_j) = (v_i - v_j)^T C_V^{-1} (v_i - v_j) = d_M(v_i, v_j)^2 \]
The distance of every pair of the combined vectors is calculated and stored as a matrix called the distance matrix.
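For concreteness, the sketch below computes the covariance matrix and the pairwise Mahalanobis distances with NumPy. It follows equations (1)-(3); the use of a pseudo-inverse to guard against a singular covariance matrix is an addition of ours, not something stated in the paper, and the combined vectors are toy data.

```python
import numpy as np

def mahalanobis_matrix(vectors):
    """Pairwise Mahalanobis distances between combined vectors (rows of `vectors`)."""
    X = np.asarray(vectors, dtype=float)
    # np.cov with rowvar=False matches equation (2): (m-1)-normalized covariance.
    C = np.cov(X, rowvar=False)
    C_inv = np.linalg.pinv(C)              # pseudo-inverse guards against singular C
    diffs = X[:, None, :] - X[None, :, :]  # all pairwise difference vectors
    d2 = np.einsum('ijk,kl,ijl->ij', diffs, C_inv, diffs)
    return np.sqrt(np.maximum(d2, 0.0))

combined = [(1, 3), (1, 3), (2, 3), (2, 1)]   # toy combined vectors
D = mahalanobis_matrix(combined)
print(np.round(D, 3))                         # symmetric 4x4 matrix, zero diagonal
```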
5
Clustering the Combined Vectors
To discover the hidden relationships from multiple clustering data, which are represented as combined vectors, we have adopted the Agglomerative Hierarchical Clustering (AHC) [9], one of the most popular agglomerative clustering techniques, to perform a bottom-up clustering process. Using the set of combined vectors and the distance matrix as input, the AHC algorithm generates a set of combined vector clusters. Then, the combined vector clusters are further analyzed to discover the cluster relationships.
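A hedged sketch of this step using SciPy's agglomerative hierarchical clustering on a precomputed distance matrix follows. The linkage method, the number of clusters at which the dendrogram is cut, and the toy distance matrix are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_combined_vectors(distance_matrix, n_clusters, method="average"):
    """Bottom-up (agglomerative) clustering from a precomputed distance matrix;
    returns one cluster label per combined vector."""
    condensed = squareform(distance_matrix, checks=False)   # to condensed form
    Z = linkage(condensed, method=method)                   # build the dendrogram
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# A small symmetric distance matrix (invented numbers) for four combined vectors:
# the first three vectors are close to each other, the fourth is far from all.
D = np.array([[0.0, 0.1, 0.4, 2.0],
              [0.1, 0.0, 0.3, 2.1],
              [0.4, 0.3, 0.0, 1.9],
              [2.0, 2.1, 1.9, 0.0]])
print(cluster_combined_vectors(D, n_clusters=2))   # e.g. [1 1 1 2]
```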
6
Generating Relationships
In Generating Relationships, we identify the knowledge on the relationships of the multiple clustering data from the results of the AHC clustering process on the combined vectors. Two kinds of relationships can be identified: the intra-cluster set relationships, which are the relationships among clusters from the same clustering method, and the inter-cluster set relationships, which are the relationships between clusters from different clustering methods.
Let S be a set of data items. The entropy [10,11] of S is defined as
\[ e(S) = - \sum_{i \in value(S)} p_i \log(p_i) \qquad (4) \]
where p_i is the proportion of items in S that have the value i.
Definition 5 (Purity Set): Let S be a set of data items and TP be a Purity Threshold. S is a purity set if and only if e(S) ≤ TP. According to the experimental results given in [4,11], efficient clustering often results in clusters that have an entropy value of less than 0.4. Based on this result, we set TP to 0.4.
Next, we define the common value. Common values are the values that occur frequently in a set of data items S compared to the number of items. Generally, in a purity set, there are two kinds of values: common values and noisy values. We assume that if we eliminate noisy values from a purity set, then the set obtained is similar to the original set. The Jaccard measure is commonly used to evaluate set similarity [3]. Based on the Jaccard measure, the similarity of two sets is defined as follows:
\[ C_J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \qquad (5) \]
Definition 6 (Common Value): Let S be a set of data items and TC be a Common Value Threshold. A value I is a common value of S if and only if (i) S is a purity set, and (ii) CJ(S, (S \ {I})) ≤ TC, where {I} denotes the subset of items in S that have the value I. TC is set intuitively. Since the similarity between two sets ranges from 0 to 1, TC should be greater than 0.5 to imply that the two sets are "similar". In this research, we set TC to 0.75, which is the average of 0.5 and 1.
Definition 7 (Common Value Set): Let S be a set of data items. We define the common value set SCM(S) of S as the set of all common values in S. A common value set can be an empty set.
Definition 8 (Dimension Set): Let VC = {v1, v2, ..., vm} be a combined vector cluster, where vk is an n-dimensional combined vector, with 1 ≤ k ≤ m. We define the ith dimension set di(VC) of VC, with 1 ≤ i ≤ n, as di(VC) = {v1i, v2i, ..., vmi}, where vki, with 1 ≤ k ≤ m, is the value of vk on the ith dimension.
Definition 9 (Relationship Vector): Let VC = {v1, v2, ..., vm} be a combined vector cluster, where vk is an n-dimensional combined vector, with 1 ≤ k ≤ m. The relationship vector VR of VC is defined as VR = (S1, S2, ..., Sn), where Si = SCM(di(VC)), with 1 ≤ i ≤ n.
From the relationship vectors generated, we can discover the relationships between the multiple clusters. As discussed earlier, there are two kinds of relationships: intra-cluster set relationships and inter-cluster set relationships. We use the relationship vectors to determine the intra-cluster set relationships in Corollary 1.
Corollary 1: Let VR = (S1, S2, ..., Sn) be a relationship vector. Let CM1, CM2, ..., CMn be clustering methods where CMi corresponds to the ith
dimension of VR, with 1 ≤ i ≤ n. Let CS1, CS2, ..., CSn be the cluster sets generated by CM1, CM2, ..., CMn, respectively. Let s(CSi) be the subset of CSi corresponding to the cluster numbers in Si, with 1 ≤ i ≤ n. If |s(CSi)| > 1, then there exist intra-cluster set relationships between the clusters in s(CSi).
We also use the relationship vectors to determine the inter-cluster set relationships in Corollary 2.
Corollary 2: Let VR = (S1, S2, ..., Sn) be a relationship vector. Let CM1, CM2, ..., CMn be clustering methods where CMi corresponds to the ith dimension of VR, with 1 ≤ i ≤ n. Let CS1, CS2, ..., CSn be the cluster sets generated by CM1, CM2, ..., CMn, respectively. Let s(CSi) be the subset of CSi corresponding to the cluster numbers in Si, with 1 ≤ i ≤ n. Let k be the number of non-empty sets in {s(CS1), s(CS2), ..., s(CSn)}. If k > 1, then there exist inter-cluster set relationships between the clusters in {s(CS1), s(CS2), ..., s(CSn)}.
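Putting Definitions 5-9 together, a relationship vector is obtained by taking, for each dimension of a combined-vector cluster, the common-value set of that dimension. The sketch below is our reading of those definitions: the entropy is computed with base-2 logarithms (the paper does not state the base), the Jaccard similarity of S and S \ {I} reduces to |S \ {I}|/|S| because S \ {I} is a subset of S, and the toy cluster is invented.

```python
import math
from collections import Counter

T_P, T_C = 0.4, 0.75          # purity and common-value thresholds from the text

def entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def common_value_set(values):
    """Common values of one dimension set, per Definitions 5-7 (our reading)."""
    if entropy(values) > T_P:                      # not a purity set
        return set()
    commons = set()
    for v in set(values):
        reduced = [x for x in values if x != v]
        jaccard = len(reduced) / len(values)       # |S \ {v}| / |S|
        if jaccard <= T_C:                         # removing v changes S a lot
            commons.add(v)
    return commons

def relationship_vector(vector_cluster):
    """One common-value set per dimension of a combined-vector cluster."""
    return [common_value_set(list(dim)) for dim in zip(*vector_cluster)]

# Toy cluster (dimension 1: keyword cluster numbers, dimension 2: author cluster numbers).
VC = [(1, 3)] * 15 + [(1, 4)]
print(relationship_vector(VC))
# [{1}, {3}]: keyword cluster 1 and author cluster 3 are related; the value 4 is noise.
```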
7
Performance Evaluation
The proposed approach has been implemented and applied to a collection of over 1400 scientific publications in the Information Retrieval domain from 1987-1997 [4,5]. The clustering methods were based on techniques such as KSOM [13], Fuzzy ART [14], co-citation analysis [4], etc., and were applied to the citation database to generate different sets of clusters, including keyword clusters, author clusters, journal clusters, date clusters and organization clusters. Then, we apply the Multi-Clustering technique to the six combinations of the data clusters shown in Table 1.
Table 1. The six combinations of clusters (columns: No., Keyword Clusters, Author Clusters, Journal Clusters, Date Clusters, Organization Clusters; one row per combination 1-6, with "x" marking the cluster sets included in each combination).
To measure the performance, the F-measure, which combines precision and recall [3], is used. We have used four different methods to calculate the similarities between clusters during the implementation of the AHC algorithm when clustering the combined vectors as discussed in Section 5. They are single link, complete link, average link, and Ward’s method [9]. The performance of the Multi-Clustering technique applied to the different combinations of data clusters based on the F-measures is given in Figure 2. High F-measure values imply good performance of the clustering method.
In addition, we also measure the entropy for performance evaluation [10,11]. Using the formula for calculating the entropy given in equation (4), we measure the entropy for each generated cluster. The total entropy is calculated as the sum of entropy for each cluster weighted by the size of each cluster. The performance results of the Multi-Clustering technique applied to the different combinations of data clusters based on entropy are given in Figure 3. Low entropy values imply good performance of the clustering method.
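A hedged sketch of this size-weighted entropy measure is given below; the log base and the toy label assignments are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def total_entropy(clusters):
    """Size-weighted sum of per-cluster entropies (lower is better)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * entropy(c) for c in clusters)

# Toy check: two fairly pure clusters score lower (better) than one mixed cluster.
print(round(total_entropy([["a", "a", "a", "b"], ["c", "c", "c", "c"]]), 3))  # 0.406
print(round(total_entropy([["a", "a", "a", "b", "c", "c", "c", "c"]]), 3))    # 1.406
```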
Fig. 2. Performance evaluation based on F-measure (F-measure, from 0 to 1.2, plotted against combinations 1-6 for the single link, complete link, average link and Ward's methods).
Fig. 3. Performance evaluation based on Entropy (entropy, from 0 to 1.8, plotted against combinations 1-6 for the single link, complete link, average link and Ward's methods).
As can be seen from Figure 2 and Figure 3, the single-link method is quite sensitive to data noise. Moreover, the single-link method has high entropy values and low F-measure values for the different combinations of clusters. Ward's method always obtains the lowest entropy values. However, as Ward's method tends to generate small clusters, in some cases the F-measure values obtained are lower than those of the complete link and average link methods. In addition, the performance results also give us some ideas about the relationships between clusters. Combination 3, which combines keyword and journal clusters, has good performance. This implies that there is a strong relationship between keyword clusters and journal clusters. It is justifiable because scientific journals usually publish papers in specialized research areas. On the contrary, we can also see that combination 5, which combines author and
journal clusters, does not have good performance in terms of F-measure. This can be explained by the fact that most researchers publish papers in a number of journals and conferences, rather than focusing on a few; so the relationships between researchers and journals are not strong, thereby resulting in poor performance.
8
Conclusion
In this paper, we have proposed a new data mining approach called MultiClustering that mines the relationships between clustering data generated from multiple clustering methods. An application example that illustrates the use of the proposed approach to mine the author clusters and document clusters to identify the relationships between authors and research areas has been discussed. Performance analysis based on the domain of scientific publications has also been given. One important advantage of our proposed approach is that it is clustering method-independent. Thus, it does not depend on the clustering method that is used.
References
1. Berkhin, P.: Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software, Inc., 2002.
2. Cios, K.J., Pedrycz, W., Swiniarski, R.W.: Data Mining: Methods for Knowledge Discovery. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
3. Van Rijsbergen, C.: Information Retrieval. Butterworths, London, England, 1979.
4. He, Y., Hui, S.C.: Mining a Web Citation Database for Author Co-citation Analysis. Information Processing and Management, Vol. 38, No. 4, pp. 491-508, 2002.
5. He, Y., Hui, S.C., Fong, A.C.M.: Mining a Web Citation Database for Document Clustering. Applied Artificial Intelligence, Vol. 16, No. 4, pp. 283-302, 2002.
6. Bohm, C., Berchtold, S., Keim: Searching in High-Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases. ACM Computing Surveys, Vol. 33, No. 8, pp. 322-373, 2001.
7. Carkacioglu, A., Vural, F.Y.: Learning Similarity Space. International Conference on Image Processing, pp. 405-408, 2002.
8. Weinberg, S.: Applied Linear Regression. John Wiley and Sons, Chichester, 1985.
9. Everitt, B.: Cluster Analysis. 3rd edn. Edward Arnold, London, 1993.
10. Mitchell, T.M.: Machine Learning. McGraw Hill, United States, 1997.
11. Boley, D.: Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, Vol. 2, No. 4, pp. 325-344, 1998.
12. Zamir, O., Etzioni, O.: Web Document Clustering: a Feasibility Demonstration. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46-54, 1998.
13. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, 2001.
14. Grossberg, S.: The Adaptive Self-Organization of Serial Order in Behavior: Speech, Language and Motor Control. In: Pattern Recognition by Humans and Machines, Vol. I: Speech Perception. Academic Press Inc., 1986.
Bacterium Lingualis – The Web-Based Commonsensical Knowledge Discovery Method Rafal Rzepka1 , Kenji Araki1 , and Koji Tochinai2 1
Hokkaido University, Kita-ku Kita 13-jo Nishi 8-chome, 060-8628 Sapporo, Japan {kabura,araki}@media.eng.hokudai.ac.jp, http://sig.media.eng.hokudai.ac.jp 2 Hokkai-Gakuen University, Toyohira-ku, Asahi-machi 4-1-40 062-8605 Sapporo, Japan [email protected]
Abstract. The Bacterium Lingualis is a knowledge discovery method for commonsensical reasoning based on textual WWW resources. While developing a talking agent without a domain limit, we understood that our system needs an unsupervised reinforcement learning algorithm which could speed up the language and commonsensical knowledge discovery. In this paper we introduce our idea and the results of preliminary experiments.
1
Introduction
Numerous researchers of the last decade have underlined the importance of the relation between human emotions and our reasoning abilities [1,2,3,4], which gave birth to so-called "affective computing". In our approach, the very basic feelings toward the learned elements are borrowed from humans, but the starting point of our method is at a much lower level than Homo sapiens'. As Penrose [6] claims, intelligence may be a fruit of our development based on Darwinian natural selection. The ideas of how to catch an animal in a trap were developed long before humans started describing things in an abstract manner, as in logic or mathematics. Many artificial intelligence researchers agree that bottom-up simplified learning methods are a key to broadening a computer's capabilities, and various algorithms have been developed so far. The most popular ones are inspired biologically, as for example Artificial Neural Networks, genetic algorithms or insect colonies. Their weaknesses differ from one to another, but they are not independent and they need laborious training. "Bacterium Lingualis" has a lot in common with the methods mentioned above, but its differences come from the new possibilities brought by the development of the Internet. When we realized that pure logic is not enough for machines to be rational and that they need all the background that we have [5], it was time to start teaching computers commonsensical knowledge. Unfortunately it seems to be a Sisyphean task, and even projects such as CyC [7] or the global OpenMind [8] are far away from being successful. We claim that full automation of this task is necessary and that we should use as big corpora as possible, since, as we will demonstrate below,
not only the quality but also the number of commonsensical inputs is crucial for learning the laws ruling our world. We "stepped back" in evolution and started creating an insect, to begin learning from the very bottom without forcing it to behave according to Cartesian philosophy. By the Latin "Bacterium Lingualis" (hereafter abbreviated as BL) we mean a kind of web crawler which exploits only the textual level of WWW resources and treats it as its natural environment. We assume that cognition, by which we mean the process or result of recognizing, interpreting, judging, and reasoning, is possible without inputs other than word-level ones, such as haptic or visual inputs [9,10]. Although such data could significantly support our method, a robot which is able to travel from one place to another in order to touch something would cost an enormous amount of money, not to mention the fact that current sensor technology is not ready for such an undertaking. There are several goals we want to achieve with BL. The main one is to make it search for learning examples and learn from them unsupervisedly. For that reason we decided to move back in evolution and initiate a self-developing "computational being" on the simplest level, with as few human factors as possible. We assumed that all human behaviors are driven by one global reason, the pursuit of good feeling, which seemed to us more adequate than simple natural selection. On the basis of the above-mentioned assumption we formulated the "good feeling hypothesis" (hereafter abbreviated as GFH) and we implemented BL with a simple recognition mechanism for negative and positive factors. The GFH determines the motivation for knowledge acquisition, which involves language acquisition, as the living environment of our program is language itself. We imagine a language as a space where its components live together in symbiosis. Its internal correlations are not understandable for BL, and the learning task is to discover them. For exploring such an area we use simple web-mining methods inspired by Heylighen et al.'s work [11]. Most researchers suggest that machines have to be intelligent to mine knowledge for us; we suggest that they have to mine for themselves to be intelligent.
2
Bacterium Lingualis
In order to make their idea clearer, to suggest the simulation of basic instincts, and not to confuse their system with agents working for users, the authors decided to use the concept of an imaginary bacterium, although the rules of the language world (called here the Lingua Environment) should not be considered as strictly corresponding to the biological world in which we live. BL's organism is capable of moving when relocation is needed, of sensing food and enemies, and of excreting what is useless. We also equipped it with enzymes and two kinds of memory, which will be detailed hereunder.
2.1 Lingua Environment
We created the BL’s environment according to ideas proposed by Rzepka et al.[12]. To achieve better uniformity we decided to replace English language
Fig. 1. Bacterium Lingualis (A – Flagellum, B – Positiveness Receptors, C – Concrete and Abstract Knowledge Memory, D – GF Cell)
homepages used in Rzepka's original work with Japanese homepages, since this language seems to have an easier structure for processing, especially because of its particle usage, as Fillmore has suggested in his works [13]. Other reasons will be presented further in this paper. For the experiments we collected almost three million .jp domain homepages with the Larbin robot; then, after filtering out pages without sentences in Japanese and converting them into pure text files, we created a web-based raw corpus consisting of about 2,090,000 documents (approx. 20 GB). No tagging whatsoever was conducted.
2.2 Flagellum
The flagellum symbolizes BL's ability to move inside its environment, by which we mean text mining techniques. For these purposes we used the Namazu indexing and searching system, which has the ability to separate words with spaces in the so-called wakachigaki mode, as Japanese sentences do not contain spaces. This helps BL to recognize what elements the contacted organism (by which we mean a semantic unit such as a text, a sentence, a word cluster, etc.) consists of. The morphological analysis could be done by recognizing similar patterns and by statistical calculations, but we assumed that omitting this level would not harm BL's performance and would shorten the processing time.
2.3 Positiveness Receptors
As we mentioned before, BL is able to automatically determine its emotional reaction to an observed object. We applied a simple mechanism proposed by Rzepka [12] which calculates the so-called Positiveness value retrieved from Internet users' opinions:
\[ Positiveness = \frac{C_{\alpha_1} + C_{\alpha_2} \cdot \gamma}{C_{\beta_1} + C_{\beta_2} \cdot \gamma} \]
where α1 = disliked, α2 = hated, β1 = liked, β2 = loved, and γ = 1.3. The factor γ is there to strengthen the "love" and "hate" opinions. This method helps BL to recognize if an object is very positive (Positiveness = 5), positive (P
= 4), indifferent (neutral) (P = 3), negative (P = 2) or very negative (P = 1), and it can provide common information about what humans feel toward the given object. For instance, if BL contacts the single noun "beer", its reaction is positive; when the "organism" consists of the two elements "cold" and "beer", the receptors send a P5 signal (very positive) to the GF Cell, which will be described further below. In the case of an unusual organism such as the combination of "warm" and "beer", BL receives a strong negative signal. We assumed that this basic emotional information about objects is necessary for BL's self-development, in the same way as living organisms need the ability to determine what helps and what harms their development.
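A small sketch of how the Positiveness ratio could be computed from opinion counts is given below. The hit counts are invented, and the mapping from the raw ratio onto the five reaction levels is not specified in this excerpt, so the thresholds in to_reaction() are purely illustrative.

```python
GAMMA = 1.3   # strengthens the "love"/"hate" opinions, as in the formula above

def positiveness(liked, loved, disliked, hated):
    """Ratio of (weighted) negative to positive opinion counts for an object.

    The counts would come from web search hits for opinion phrases about the
    object; here they are passed in directly as invented numbers.
    """
    return (disliked + hated * GAMMA) / (liked + loved * GAMMA)

def to_reaction(ratio):
    """Invented binning: the fewer negative mentions, the more positive."""
    if ratio < 0.2:  return 5   # very positive
    if ratio < 0.8:  return 4   # positive
    if ratio < 1.25: return 3   # neutral
    if ratio < 5.0:  return 2   # negative
    return 1                    # very negative

# Invented hit counts for "cold beer" vs. "warm beer".
print(to_reaction(positiveness(liked=900, loved=300, disliked=40, hated=10)))   # 5
print(to_reaction(positiveness(liked=30,  loved=5,   disliked=400, hated=90)))  # 1
```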
2.4 Particles as Enzymes
The receptors let BL contact other organisms and start a symbiosis to see what can be learned from them. For example, if BL "contacts" the noun "Sapporo", it can read its most frequent symbionts, that is, the most frequent left and right neighbors of the contacted object. This is done by searching the environment together with particles characteristic of the Japanese language. Their role may be imagined as "grammar enzymes" which help to create semantic chains: "—Sapporo—de—(live, saw, take place...)", "—Sapporo—ni—(go, come, arrive...)", "—Sapporo—to—(Nagoya, compare, Otaru)", "(...known, related, belonging)—u—Sapporo—", "(...nice, nostalgic)—i—Sapporo—", "(...strange, wonderful)—na—Sapporo—". Since causal relationships are crucial for reasoning, several "IF enzymes" were prepared to be combined with the discovered neighbors. This was relatively easy because nouns, verbs and adjectives have the same elastic if-forms in the Japanese language: "konpyuutaa-dattara (if computer)", "tsukattara (if to use)" or "aokattara (if blue)". The last example demonstrates another interesting feature of the Japanese language: many forms do not fit grammatical frames created for Indo-European languages. For instance, "kaeritai" (I want to go home) behaves identically to "nemui" (sleepy). Our further goal is to provoke BL to create its own rules of language while providing it only with the basic tools; therefore "going to Sapporo" is allowed to be treated on the same level as "cold Sapporo" if it leads to the same conclusions. This is also one of the reasons we decided to limit conventional linguistic terminology in our work and replace it with biological terms.
2.5 Concrete and Abstract Knowledge Memory (C)
BL is able to store gained knowledge. Its memory is divided into two coexisting units, the Concrete Knowledge Memory and the Abstract Knowledge Memory. Both are equally important, but only the growth of the latter do we consider to be the system's growth. At this point of the system's development the concrete knowledge stands only for the database of retrieved chains, while the abstract knowledge is a dictionary of automatically categorized groups of objects that frequently appear in similar combinations. This will be explained in the Method section.
2.6 GF Cell (D) and the Role of Affective Reasoning
The logic of human behavior is often very difficult to analyze with mathematical approaches. We assumed that natural language itself should decide the rules for the BL system; however, it must have some inborn initial instincts, as its biological equivalent does. Our Good Feeling Hypothesis (GFH), mentioned in the Introduction, is supposed to realize this task. We presuppose that if every activity of Homo sapiens has always been motivated by the pursuit of "good feeling", then language was also one of the tools for achieving this goal and is based on the same "affective logic". Therefore, the Good Feeling Hypothesis assumes that implementing such a mechanism in a machine could help it to acquire knowledge and language. Following our idea that the GFH, or the defense of GFH, is the reason for every behavior, we input these two simple rules into the GF Cell and made it the default final conclusion of any reasoning, while searching for different "sub-reasons" on the way. Obviously a "good feeling" varies according to individual features, but we discovered that some standards can be retrieved. Since we aim at creating an unsupervised system, these standards are also supposed to play the role of a safety valve. This is possible because the idea of Positiveness is based on the average opinions of homepage creators. It prevents the system from remembering chains like "killing is good" as commonsensical facts. Another purpose of the GF Cell is to get rid of useless objects or mistaken strings that are created during processing. This mechanism will be explained below.

2.7 Basic Method
As we mentioned earlier in this paper, in our approach we want to experiment on the lowest level of language mechanisms. Therefore, the first experiments we conducted were aimed at achieving automatic responses resembling Pavlovian reactions in the biological world. Such responses are needed to identify an object as pleasant, unpleasant or neutral and to provoke suitable behavior of the system, by which we mean the ability to reason on emotional grounds. At this stage, BL uses only a very simple algorithm, mostly for gathering associations and looking up reasons. The main part, which is the subject of this paper, works as follows. First, it measures the Positiveness value of a contacted object. In the beginning it has no syntactic knowledge, but by measuring the Positiveness it is able to recognize whether it is analyzing a verb, a noun, an adverb, a particle, etc., since the "enzymes" link only to specific objects. For example, the Japanese particle DE does not appear after a verb. Although we prefer grouping words by their connections with particles, which bases the categorization on more metaphoric grounds, for the time being we limited the input to nouns only. If, for instance, "Sapporo" is inputted, BL seeks the most frequent input-particle strings to decide on the three most suitable enzymes. In the case of "Sapporo" they are NI (approx. 94,000 hits), DE (approx. 80,000 hits) and KARA (approx. 41,700 hits). For better accuracy this is done with the Perl API for Google. The object "Sapporo" is recorded in the Concrete Knowledge Memory in the NI-DE-KARA category, which is characteristic of places.
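To make the enzyme selection step concrete, here is a small sketch of how the three most suitable particles could be chosen from hit counts. The hard-coded hit table stands in for the web search API (the paper uses the Perl API for Google); it and the small particle inventory are assumptions for illustration only.

```python
# A minimal sketch of the "grammar enzyme" selection. Real BL queries a web
# search engine for the counts; a hard-coded table stands in for the search
# API so the example can run on its own. Counts for to/no/ga are invented.

PARTICLES = ["ni", "de", "kara", "to", "no", "ga"]

FAKE_HITS = {  # stand-in for hit counts of "<noun><particle>" queries
    ("Sapporo", "ni"): 94000,
    ("Sapporo", "de"): 80000,
    ("Sapporo", "kara"): 41700,
    ("Sapporo", "to"): 20000,
    ("Sapporo", "no"): 15000,
    ("Sapporo", "ga"): 9000,
}

def hit_count(noun: str, particle: str) -> int:
    return FAKE_HITS.get((noun, particle), 0)

def select_enzymes(noun: str, n: int = 3):
    """Return the n particles that most frequently follow the input noun."""
    counts = {p: hit_count(noun, p) for p in PARTICLES}
    return sorted(counts, key=counts.get, reverse=True)[:n]

print(select_enzymes("Sapporo"))                      # ['ni', 'de', 'kara']
print("-".join(select_enzymes("Sapporo")).upper())    # 'NI-DE-KARA' category
```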
Fig. 2. The mechanism of a basic enzymatic selection
Then the mining process starts and the neighbors of Sapporo-ni, Sapporo-de and Sapporo-kara are found. Also in this case we limit the search to the three most frequent neighbors. The candidates are taken from the first ten results and their frequencies are measured again. Because there are many mistaken retrievals and choosing by hand would be very laborious, BL uses the Positiveness measure to eliminate mistakes such as "Sapporodearu", where "dearu" means something different than "de aru". We could use the spacing program Kakasi used in Namazu, but we try to keep the usage of external tools to a minimum. Then the next neighbor is searched for. The process is repeated until the last possible neighbor is found. After that, the string is saved in the Concrete Knowledge Memory. If there are other objects remembered in the same category, the inputted one is replaced with every one of them:

Sapporo—enzyme—string_1—string_2—...—string_n
Object_1—enzyme—string_1—string_2—...—string_n
Object_2—enzyme—string_1—string_2—...—string_n
Object_n—enzyme—string_1—string_2—...—string_n

If one of them exists in the Lingua Environment, an abstract string is saved in the Abstract Knowledge Memory:

Sapporo—enzyme—string_1—string_2—...—string_n
Object_n—enzyme—string_1—string_2—...—string_n

creates an abstract chain:

NI-DE-KARA—enzyme—string_1—string_2—...—string_n
We suppose that collecting such abstract rules based on common sense may also be very helpful as support for other systems. The idea of using common parts in expressions to make abstract rules is influenced by Araki et al.'s Inductive Learning [14]. If the analyzed neighbor object does not exist in the Concrete or Abstract Knowledge Memory, BL checks whether it is processable with enzymes, that is, whether it appears with particles which determine that it is an individual object. If not, the object is deleted.
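The sketch below illustrates the abstraction step described above: when two objects of the same enzyme category share a concrete chain, the object slot is replaced by the category label and the result is stored as an abstract chain. The data structures are our own simplification of the Concrete/Abstract Knowledge Memory.

```python
# A simplified sketch of promoting concrete chains to abstract chains.
# A chain is modelled as (object, category, enzyme, tail-of-strings);
# this representation and the sample data are assumptions for the example.
from collections import defaultdict

concrete_memory = [
    # (object, category, enzyme, neighbor strings)
    ("Sapporo", "NI-DE-KARA", "ni", ("iku", "tsuku")),
    ("Nagoya",  "NI-DE-KARA", "ni", ("iku", "tsuku")),
    ("biiru",   "GA-WO-NO",   "wo", ("nomu",)),
]

def abstract_chains(concrete):
    """Replace the object by its category when two objects share a chain."""
    seen = defaultdict(set)           # (category, enzyme, tail) -> objects
    for obj, cat, enzyme, tail in concrete:
        seen[(cat, enzyme, tail)].add(obj)
    return [(cat, enzyme, tail)
            for (cat, enzyme, tail), objs in seen.items() if len(objs) >= 2]

for cat, enzyme, tail in abstract_chains(concrete_memory):
    print(f"{cat}—{enzyme}—" + "—".join(tail))
    # -> NI-DE-KARA—ni—iku—tsuku
```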
3 Experiment and Its Results
For the first test of our system, we made BL search for connotations explaining why the analyzed objects are regarded as positive or negative. A group of 10 students assigned Positiveness values to 90 words picked by the BL system as ones with distinctly bad or good associations. We confirmed that 36.3% of the selected words were evaluated by the humans as neutral, without any emotional connotations. To show that an object's emotional load varies with the situation, we made BL find a reasonable chain of conditions for 5 words that seemed indifferent to 5–7 of the subjects. No word was recognized as neutral by every subject, which shows that the associations of one expression are sometimes positive and other times negative, depending on individual connotations. Discovering examples of conditions or situations for both positive and negative associations was the task of the experiment. Differently from the methods proposed by Heylighen et al., BL does not only count the co-occurrences but actually mines the inputted noun's neighbors further and measures their Positiveness also when the neighbor is a verb or adjective. This is done by a "noga enzyme", which consists of the two particles (V/Adj)-no-ga. Using the same method and the "noga enzyme", BL is able to determine that eiga-o mi-ni iku (to go to the movies) or yasashii (kind) are commonly positive and that uso-o tsuku (to lie) or mendoukusai (troublesome) are distinctly negative. The words that were recognized as neutral were: fun'iki (mood), dashi (dashi soup), jouken (condition), seikaku (personality) and kumiawase (combination). The retrieved pairs of reasons were: "calm atmosphere of a little bar" (+), "atmosphere of irritation before a game" (-); "dashi soup made of sea-cucumber and dried sardines" (+), "dashi soup from today [erroneous result]" (-); "conditions - new building - because it's new" (+), "conditions - changing job - to be told things" (-); "personality - cool - design - homepage" (+), "personality of myself when I can't" (-); "combination of two persons who can't eat garlic" (+), "reason not found - combination of hero and heroine" (-). We can see that there are semi-correct answers, since BL still ignores some particles in the output, and one mistake caused by the fact that BL's Abstract Knowledge lacks a "time" category; we think, however, that at some stage of learning this kind of concept will be developed automatically. The output is not ready to be used by language generation programs, but we think it could be used in common-sense based talking agents such as Rzepka et al.'s GENTA [12] or in real-life robots which have no data about a newly recognized object.
4
Conclusion
We understand that the Bacterium Lingualis is at a very early stage of development, but in our opinion the initial probes seem promising and assure us that neither purely connectionistic nor purely stochastic methods alone will tackle the "knowledge acquisition bottleneck", but rather their combinations. Considering the importance of emotions, which is often neglected in AI research, we equipped the BL system with a simple, automatic emotional information retrieval algorithm,
which helps not only to reason affectively, but also to automate the verification. Developing a brain from a bacterium is certainly a difficult task, but we argue against the thesis that machines always need to simulate Aristotelian logic or learn within the borders set by our grammar rules. We believe that feelings influence every action of a human being and that the accumulation of experiences, which is quite random in most situations, forms our characters. The Internet, with its enormous WWW corpus, is probably the best place where a machine can gain its experience based only on symbols and their occurrences; it also gives us many possibilities not only in the field of commonsensical information retrieval, but also in other areas such as the development of an automatic categorization method, which is one of our future works on Bacterium Lingualis.
References

1. Damasio, A.R.: Descartes' Error: Emotion, Reason, and the Human Brain. Avon, New York (1994)
2. Bates, J.A.: The role of emotion in believable agents. Communications of the ACM 37 (1994) 122–125
3. Pinker, S.: How the Mind Works. W. W. Norton, New York (1997)
4. Picard, R.W., Klein, J.: Computers that Recognise and Respond to User Emotion: Theoretical and Practical Implications. MIT Media Lab Tech Report 538 (to appear)
5. Devlin, K.: Goodbye, Descartes: The End of Logic and the Search for a New Cosmology of the Mind. John Wiley & Sons, Inc. (1997)
6. Penrose, R.: Shadows of the Mind: A Search for the Missing Science of Consciousness. Oxford Univ. Press (1994)
7. Lenat, D.: Common Sense Knowledge Database CYC (1995) http://www.opencyc.org/, http://www.cyc.com/
8. Stork, D.G.: "Open Mind Initiative" (1999) http://openmind.media.mit.edu/
9. Kielkopf, C.F.: The Pictures in the Head of a Man Born Blind. Philosophy and Phenomenological Research 28(4) (1968) 501–513
10. Fletcher, J.F.: Spatial representation in blind children: Development compared to sighted children. Journal of Visual Impairment and Blindness 74 (1980) 381–385
11. Heylighen, F.: Mining Associative Meanings from the Web: from word disambiguation to the global brain. In: Temmerman, R. (ed.): Proceedings of Trends in Special Language and Language Technology. Standaard Publishers, Brussels (2001)
12. Rzepka, R., Araki, K., Tochinai, K.: Is It Out There? The Perspectives of Emotional Information Retrieval from the Internet Resources. Proceedings of the IASTED Artificial Intelligence and Applications Conference, ACTA Press, Malaga (2002) 22–27
13. Fillmore, J.C.: The Case for Case. In: Bach, E., Harms, R.T. (eds.): Universals in Linguistic Theory. Holt, Rinehart & Winston, New York (1968) 1–88
14. Araki, K., Tochinai, K.: Effectiveness of Natural Language Processing Method Using Inductive Learning. Proceedings of the IASTED International Conference Artificial Intelligence and Soft Computing, ACTA Press, Cancun (2001)
Inducing Biological Models from Temporal Gene Expression Data

Kazumi Saito1, Dileep George2, Stephen Bay2, and Jeff Shrager2,3

1 NTT Communication Science Laboratories, 2-4 Hikaridai, Seika, Soraku, Kyoto 619-0237, Japan
[email protected]
2 Computational Learning Laboratory, CSLI, Stanford University, Stanford, California 94305, USA
[email protected], [email protected]
3 Department of Plant Biology, Carnegie Inst. of Washington, Stanford, California 94305, USA
[email protected]
We applied Inductive Process Modeling (Langley et al., in press) to induce biological process models from background knowledge and temporal gene expression data relating to the regulation of bacterial photosynthesis. Labiosa et al. (2003) studied the regulation of all of the genes in the Cyanobacterium Synechocystis sp. 6803. They simulated the natural day/night light cycle in a continuous culture cyclostat, and extracted samples at 2AM, 8AM, 10AM, noon, 2PM, 6PM, and midnight. Whole-cell RNA from these samples was converted to cDNA and hybridized to DNA microarrays, thereby measuring the abundance of RNA transcripts for all the genes in the organism at the selected times. Many of the photosynthesis-related RNAs show low abundance at night and increase rapidly when the sun rises, but these also exhibit an 'M-shaped' pattern with a substantial decrease at noon.

The IPM algorithm initially searches through a space of model structures formed by instantiating a set of generic processes, and then uses second-order gradient descent to fit model parameters (Saito and Nakano, 1997). For the first stage, we developed a set of seven generic biomolecular processes including translation, transcription, degradation, photosynthesis, and up/down regulation of genes. Each process embeds numeric equations in qualitative structures. The equations describe relations among variables, and are cast as algebraic or differential equations. By compiling models composed from these processes into a set of linked equations, we can predict the dynamics of the system. The first stage of IPM produced 288 model structures (when limited to use seven or fewer processes). We then fit the parameters in all of these models. The best fitting model reproduces the general M-shape, and has an excellent quantitative fit to the data (r² = 0.94).
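As a rough illustration of the two-stage approach (structure search followed by parameter fitting), the sketch below enumerates small combinations of generic rate terms and fits their parameters to an observed time series by least squares. The particular rate terms, the use of SciPy, the synthetic data, and the scoring by r² are our own choices for the example; the actual IPM system composes richer qualitative process structures and uses second-order gradient descent.

```python
# A minimal sketch, not the IPM implementation: candidate model structures are
# subsets of simple generic rate terms, d(expr)/dt = sum of chosen terms, and
# parameters are fitted to an observed expression time series.
from itertools import combinations
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

t = np.linspace(0, 24, 25)                      # hours
light = np.sin(np.pi * (t - 6) / 12).clip(0)    # crude day/night light signal
observed = 0.2 + 0.8 * light                    # stand-in "expression" data

# Generic process terms: each maps (x, light, theta) -> dx/dt contribution.
TERMS = {
    "light_activation":    lambda x, L, th: th * L,
    "degradation":         lambda x, L, th: -th * x,
    "basal_transcription": lambda x, L, th: th,
}

def simulate(structure, params):
    def dxdt(x, ti):
        L = np.interp(ti, t, light)
        return sum(TERMS[name](x, L, th) for name, th in zip(structure, params))
    return odeint(dxdt, observed[0], t).ravel()

def fit(structure):
    loss = lambda p: np.sum((simulate(structure, p) - observed) ** 2)
    res = minimize(loss, x0=np.ones(len(structure)), method="Nelder-Mead")
    return res.fun, res.x

best = min((fit(s) + (s,) for r in (1, 2, 3)
            for s in combinations(TERMS, r)), key=lambda z: z[0])
sse, params, structure = best
r2 = 1 - sse / np.sum((observed - observed.mean()) ** 2)
print(structure, params.round(2), round(r2, 3))
```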
There is a long history in Artificial Intelligence of the use of generic processes in modeling complex systems (e.g., Forbus, 1984), and of automatically or semi-automatically discovering such models (e.g., Shrager, 1987). Our approach has close connections with these, and with other more recent efforts on the induction of differential equation models by Todorovski and Džeroski (1997), Bradley et al. (2001), and Koza et al. (2001), which also take advantage of domain knowledge to construct models of dynamical systems. An important advantage of our approach is that, because the models produced by the current method are composed from generic biomolecular processes, they are explanatory; that is, we can go beyond a mere description of observations, to account for them in terms of basic biomolecular processes. However, because the generic processes that we used here have a relatively large grain-size, the models explain aggregate phenomena well, but do not explain the observed abundance of individual genes. A natural way to extend our method is to produce models that encompass a system/subsystem decomposition. Such models would have many more parameters, and would require additional search constraints.

Acknowledgments. This work was supported by the NASA Biomolecular Systems Research Program and by NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. We thank Lonnie Chrisman for initial models, Pat Langley for guidance, and Andrew Pohorille for useful discussions.
References

Bradley, E., Easley, M., & Stolle, R. (2001). Reasoning about nonlinear system identification. Artificial Intelligence, 133, 139–188.
Forbus, K. (1984). Qualitative process theory. Artificial Intelligence, 24, 85–168.
Koza, J., Mydlowec, W., Lanza, G., Yu, J., & Keane, M. (2001). Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing, 6, 434–445.
Labiosa, R., Arrigo, K., Grossman, A., Reddy, T. E., & Shrager, J. (2003). Diurnal variations in pathways of photosynthetic carbon fixation in a freshwater cyanobacterium. Presented at the European Geophysical Society Meeting.
Langley, P., George, D., Bay, S., & Saito, K. (in press). Robust induction of process models from time-series data. Proceedings of the Twentieth International Conference on Machine Learning.
Saito, K., & Nakano, R. (1997). Law discovery using neural networks. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (pp. 1078–1083). Yokohama: Morgan Kaufmann.
Shrager, J. (1987). Theory change via view application in instructionless learning. Machine Learning, 2, 247–276.
Todorovski, L., & Džeroski, S. (1997). Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 376–384). San Francisco: Morgan Kaufmann.
Knowledge Discovery on Chemical Reactivity from Experimental Reaction Information

Hiroko Satoh1,2 and Tadashi Nakata2

1 Artificial Intelligence Systems Division, National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda, Tokyo 101-8430, Japan
[email protected]
2 Synthetic Organic Chemistry Laboratory, RIKEN, Hirosawa 2-1, Wako, Saitama 351-0198, Japan
[email protected]
Abstract. A knowledge discovery approach from chemical information that focuses on negative information contained in positive data is described. Reported experimental chemical reactions are classified into reaction groups according to similarities in physicochemical features with a self-organizing mapping (SOM) method. In one of the reaction groups, the functional groups of the reactants are divided into two categories according to whether or not they reacted in the experiments. The classes of the functional groups are used for the derivation of knowledge on chemical reactivity and condition intensity. The approach is demonstrated with a model dataset.
1
Introduction
Experimental data are essential information in natural science, especially in chemistry, biology, and physics, and in general the contents and quality of experimental data vary according to the character of the data and the purposes for which they were collected. When utilizing experimental information for knowledge discovery, a methodology that meets the needs of the research purpose is necessary, and the strategy for knowledge discovery depends on the contents and characteristics of the data. In chemistry, experimental chemical reaction data are essential for synthetic studies, and a large amount of chemical reaction data has been reported, which we can find in technical journals, books, and databases. Recently, chemical reaction databases have been utilized as information sources for synthetic design systems, e.g. AIPHOS [1] and WODCA [2], and reaction prediction systems, e.g. EROS6.0 [3] and SOPHIA [4,5]. There have also been studies on reaction classification, knowledge discovery, and the construction of prediction models utilizing chemical reaction databases [6-9]. A characteristic of chemical reaction data is that most of them are positive, namely, they are chemical reactions that actually occurred. Reactions that did not occur, called negative data, are difficult to find in publications or databases, because negative data are so difficult to demonstrate that they consequently are not made public and remain confined to the in-house data of each laboratory.
However, negative data are said to be a clue for the success of synthetic studies just as the positive data are. If negative data are taken into account in chemical knowledge discovery, more useful knowledge in chemistry can be derived. We have therefore developed a new approach that makes the best possible use of the currently available positive chemical reaction data, thereby obtaining a kind of negative information in the chemical sense from the positive data and deriving knowledge on chemical reactivity and condition intensity. The approach is demonstrated with a model dataset consisting of six typical oxygen functional groups in reductive reactions automatically collected from a database with a self-organizing mapping (SOM) method.
2
Methods
2.1
A Concept
Our new concept for knowledge discovery for chemical reactions formally simulates the way a chemist thinks; an outline is shown in Fig. 1. A chemist memorizes information from synthetic experiments, literature, books, and databases, classifies and organizes it according to similarities in several features, and then rules are discovered and theories are constructed through detailed analyses. Our approach uses reaction databases as the memorized data and systematically classifies them according to similarities in the factors controlling the reactions; these factors are computed from physicochemical attributes concerning the chemical reactions, such as reactive sites, sites that did not react, reagents, catalysts, solvents, temperature, electronic features, structural features, stereochemical features, changes of these features, and their interactions. The results of the classification are used for the derivation of rules and knowledge, and for the construction of models describing chemical reactions. We have reported some studies toward the final goal of reaction prediction according to this concept [7–12], and the knowledge derivation in the current article also follows the concept.
Fig. 1. Outline of our concept and plan for chemical reaction prediction
2.2
Negative Information in Positive Data
Currently available reaction databases consist almost entirely of positive data, and our methodology focuses on the negative information contained in the positive data. The negative information is defined as the sites that did not react in a reaction that actually occurred. Fig. 2 shows an example of reaction data, where stereochemical information is not represented. The chemical structures on the left and right sides are a reactant and a product, respectively. The text-type description "9-BBN, pinene" gives the reaction conditions. The partial structures denoted by a gray circle were changed during the reaction and are called reaction sites; the remaining partial structures did not react under the condition of "9-BBN, pinene". They are defined as negative information in the positive reaction data and are used for the knowledge derivation.
Fig. 2. Negative Information in the Positive Data
2.3
Procedures of the Knowledge Derivation
Fig. 3 shows the procedure of knowledge derivation on reactivity and condition intensity, consisting of the classification of reaction data, the dividing of functional groups based on the negative information, and the knowledge derivation. The analyses for the knowledge derivation should be carried out on a dataset of reactions under similar conditions, because reactive sites vary according to the conditions, and therefore a discussion of reactive sites without consideration of differences in conditions makes no sense. A reaction condition consists of various contributions of many factors, including reagents, catalysts, solvents, temperature, and pressure, and consequently has high diversity and complexity, which makes it difficult to classify reactions based on similarities of conditions. Our new similarity measure of conditions hence considers conditions to be similar when they give similar transformations of reactants to products. Similarities in physicochemical transformations were used in the classification, that is, reactions were classified according to the transformation of physicochemical parameters, allowing a numerical analysis of reactions based on physicochemical attributes. A self-organizing mapping (SOM) method is appropriate for the analysis of non-linear data and was used for the classification. The knowledge derivation was carried out for one of the reaction groups resulting from the classification.
3 Results

3.1 Reaction Classification with SOM
The reaction classification was executed in the same way as described in a previous paper [7], and only a brief description is given here. 131 chemical reactions arbitrarily
Fig. 3. Block Diagram of Knowledge Derivation on Reactivity and Condition Intensity
selected from a SYNLIB database [13] were classified according to similarities in six physicochemical parameters, namely σ charge, π charge, σ residual electronegativity, π residual electronegativity, polarizability, and pKa values, with the SOM method of Kohonen [14]. A TUT-SOM system [15] was used for the execution of the SOM.
Fig. 4. A Kohonen Map from Reaction Classification

Fig. 4 shows the obtained Kohonen map, where neurons are labeled according to similarities in reaction types; its actual shape is a torus. The reaction group included in reaction type "a", consisting of 15 reductive reactions (Fig. 5), was used in the next step.
3.2
Dividing Functional Groups
We focused on the six oxygen functional groups shown in Fig. 6, for which knowledge on reactivity was derived. The functional groups in the 15 reductive reactions were divided into two categories according to whether they reacted or not, as shown in Table 1. The reaction conditions are listed in the middle column, and the functional groups that reacted and did not react under the corresponding condition are listed in the left and right columns, respectively. The oxygen atom in each functional group is represented by a gray circle with its label.
Fig. 5. A Dataset: Reaction Data of One of the Reaction Groups
Fig. 6. A Dataset of Functional Groups
3.3
Construction of a Correlation Table
Using this table, a correlation table between the functional groups and the reaction conditions was constructed, as shown in Table 2. When an oxygen atom of a functional group reacted in every reactant under the corresponding condition, the corresponding cell is marked 1. When an oxygen atom of a functional group reacted in some reactant structures but did not react in the others under the corresponding condition, the cell is marked 2. When an oxygen atom of a functional group did not react in any reactant under the corresponding condition, the cell is marked 3. When there is no data, the cell is marked 0.
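For illustration, the following sketch builds such a correlation table from a list of (functional group, condition, reacted?) observations; the record format and the tiny example data (including the reagent names) are assumptions made for the example, not data from the paper.

```python
# A minimal sketch of building the correlation table. Cell codes follow the
# text: 1 = reacted in every reactant, 2 = mixed, 3 = never reacted, 0 = no data.
from collections import defaultdict

# (functional group, condition, reacted?) observations -- illustrative only.
observations = [
    ("O1", "LiAlH4", True), ("O1", "LiAlH4", True),
    ("O2", "LiAlH4", True), ("O2", "LiAlH4", False),
    ("O3", "NaBH4", False),
]

def correlation_table(obs):
    outcomes = defaultdict(set)
    for group, cond, reacted in obs:
        outcomes[(group, cond)].add(reacted)
    groups = sorted({g for g, _, _ in obs})
    conds = sorted({c for _, c, _ in obs})
    code = {frozenset({True}): 1, frozenset({True, False}): 2,
            frozenset({False}): 3}
    return {g: {c: code.get(frozenset(outcomes.get((g, c), set())), 0)
                for c in conds} for g in groups}

table = correlation_table(observations)
print(table)  # {'O1': {'LiAlH4': 1, 'NaBH4': 0}, 'O2': {'LiAlH4': 2, ...}, ...}
```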
3.4 Deriving Knowledge
Based on the information in the correlation table, knowledge on reactivity and condition intensity was derived, as follows.
Table 1. Results of dividing the functional groups in the reactant structures.
Table 2. A correlation table between the functional groups and the reaction conditions.
Chemical Reactivity. The reactivity of all oxygen atoms in the functional groups was compared under the same conditions, corresponding to the cells marked 1, 2, and 3 in the correlation table. For example, if an oxygen atom of functional group O1 reacted and one of O2 did not react under the same condition, then the oxygen atom of O1 was judged to have higher reactivity than that of O2. If contradictory experimental data were observed under the same condition, the judgment was a draw. Fig. 7(a) shows the results of the comparison, and Fig. 7(b) shows the functional groups ordered according to reactivity. The results are reasonable in a chemical sense.
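A small sketch of this pairwise comparison is given below: for every condition, each group that reacted is judged more reactive than each group that did not, contradictory evidence yields a draw, and the remaining judgments are used to order the groups. The treatment of "mixed" cells and the simple win-counting used to produce the final order are our own simplifications for the example.

```python
# A minimal sketch of deriving a reactivity order from the correlation table
# built above (cell codes: 1 reacted, 2 mixed, 3 did not react, 0 no data).
from collections import Counter

def pairwise_judgments(table):
    """Return (higher, lower) pairs; contradictory pairs cancel out to a draw."""
    votes = Counter()
    conds = {c for row in table.values() for c in row}
    for cond in conds:
        reacted = [g for g, row in table.items() if row.get(cond) in (1, 2)]
        inert = [g for g, row in table.items() if row.get(cond) in (2, 3)]
        for hi in reacted:
            for lo in inert:
                if hi != lo:
                    votes[(hi, lo)] += 1
    return {pair: n for pair, n in votes.items()
            if n > votes.get((pair[1], pair[0]), 0)}   # drop draws

def reactivity_order(table):
    wins = Counter()
    for hi, _ in pairwise_judgments(table):
        wins[hi] += 1
    return sorted(table, key=lambda g: wins[g], reverse=True)

# With a toy table: O1 reacted under condition "A", O2 and O3 did not.
example = {"O1": {"A": 1}, "O2": {"A": 3}, "O3": {"A": 3}}
print(reactivity_order(example))   # ['O1', 'O2', 'O3'] -- O1 most reactive
```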
Fig. 7. Knowledge on Reactivity from Database
Condition Intensity. The reaction conditions were sorted using two types of judgment. One was that a reaction condition under which a less reactive functional group reacted was judged to be stronger (judgment 1). The other was that a reaction condition under which a more reactive functional group did not react was judged to be milder (judgment 2). The results from the two types of judgment are shown in Fig. 8. Both types of judgment show the same tendency, and the order is reasonable in a chemical sense, agreeing with what chemists generally expect.
Fig. 8. Knowledge on Condition Intensity from a Database
4
Conclusion
A new approach to chemical knowledge discovery on reactivity and condition intensity from experimental reaction information, focusing on the negative information contained in positive data, has been described. An execution for a dataset of reactions involving oxygen functional groups demonstrated that the approach gives reasonable results. Chemical information has a high potential for providing useful knowledge and rules for molecular and synthetic design, and for the prediction of reactivity and biological activity. The described methodology will make it possible to bring out useful information from experimental databases that differ in quality and
contents. An application of the methodology to a large amount of data could give more practically useful knowledge in chemistry, which will be reported elsewhere.
References

[1] Funatsu, K., Sasaki, S.: Computer-Assisted Synthesis Design and Reaction Prediction System AIPHOS. Tetrahedron Comput. Method. 1, 27 (1988)
[2] Gasteiger, J., Ihlenfeldt, W. D.: A collection of computer methods for synthesis design and reaction prediction. Recl. Trav. Chim. Pays-Bas 111, 270 (1992)
[3] Röse, P., Gasteiger, J.: Automated derivation of reaction rules for the EROS 6.0 system for reaction prediction. Anal. Chim. Acta 235, 163 (1990)
[4] Satoh, H., Funatsu, K.: SOPHIA, a Knowledge Base-Guided Reaction Prediction System Utilizing a Knowledge Base Derived from a Reaction Database. J. Chem. Inf. Comput. Sci. 35, 34 (1995)
[5] Satoh, H., Funatsu, K.: Further Development of a Reaction Generator in the SOPHIA System for Organic Reaction Prediction. Knowledge-Guided Addition of Suitable Atoms and/or Atomic Groups to Product Skeleton. J. Chem. Inf. Comput. Sci. 36, 173 (1996)
[6] Chen, L., Gasteiger, J.: Organic Reactions Classified by Neural Networks: Michael Additions, Friedel-Crafts Alkylations by Alkenes, and Related Reactions. Angew. Chem. 108, 844 (1996); Angew. Chem. Int. Ed. Engl. 35, 763 (1996)
[7] Satoh, H., Sacher, O., Nakata, T., Chen, L., Gasteiger, J., Funatsu, K.: Classification of Organic Reactions: Similarity of Reactions Based on Changes in the Electronic Features of Oxygen Atoms at the Reaction Sites. J. Chem. Inf. Comput. Sci. 38, 210 (1998)
[8] Satoh, H., Itono, S., Funatsu, K., Takano, K., Nakata, T.: A Novel Method for Characterization of Three-dimensional Reaction Fields Based on Electrostatic and Steric Interactions toward the Goal of Quantitative Analysis and Understanding of Organic Reactions. J. Chem. Inf. Comput. Sci. 39, 671 (1999)
[9] Satoh, H., Funatsu, K., Takano, K., Nakata, T.: Classification and Prediction of Reagents' Roles by FRAU System with Self-organizing Neural Network Model. Bull. Chem. Soc. Jpn. 73, 1955 (2000)
[10] Satoh, H., Koshino, H., Funatsu, K., Nakata, T.: Novel Canonical Coding Method for Representation of Three-dimensional Structures. J. Chem. Inf. Comput. Sci. 40, 622 (2000)
[11] Satoh, H., Koshino, H., Funatsu, K., Nakata, T.: Representation of Configurations by CAST Coding Method. J. Chem. Inf. Comput. Sci. 41, 1106 (2001)
[12] Satoh, H., Koshino, H., Nakata, T.: Extended CAST Coding Method for Exact Search of Stereochemical Structures. J. Comput. Aided Chem. 3, 48 (2002)
[13] Distributed Chemical Graphics, Inc.
[14] Kohonen, T.: Self-organized Formation of Topologically Correct Feature Maps. Biol. Cybern. 43, 59 (1982)
[15] Laboratory of Prof. Kimito Funatsu, Toyohashi University of Technology
A Method of Extracting Related Words Using Standardized Mutual Information

Tomohiko Sugimachi, Akira Ishino, Masayuki Takeda, and Fumihiro Matsuo

Department of Informatics, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka, 812-8581, Japan
{t-sugi, ishino, takeda, matsuo}@i.kyushu-u.ac.jp
Abstract. Techniques for the automatic extraction of related words are of great importance in many applications such as query expansion and automatic thesaurus construction. In this paper, a method of extracting related words is proposed, based on statistical information about the co-occurrences of words in huge corpora. Mutual information is one such statistical measure and has been used mainly in natural language processing applications. A drawback, however, is that mutual information depends strongly on the frequencies of words. To overcome this difficulty, we propose a normalized deviation of mutual information as a new measure. We also reveal a correspondence between word ambiguity and related words using word relation graphs constructed with this measure.
1
Introduction
Extraction of related words is important in many applications such as query expansion [1] and automatic thesaurus construction [2]. Those applications require related word lexicons and are implemented with hand-built lexical resources such as WordNet [3]. However, hand-built lexical resources have some problems: coverage is poor in rapidly changing domains, constructing them is costly, and frequent updates are difficult. Therefore, we discuss word relatedness, which is fundamental to automatically constructing such lexicons. Some previous work has been done in this field [4]. In this paper, a method of extracting related words is proposed, based on statistical information about the co-occurrences of words in huge corpora. Mutual information is one such statistical measure and has been used mainly in natural language processing applications [5][6]. A drawback, however, is that mutual information depends strongly on the frequencies of words. To overcome this difficulty, we propose a normalized deviation of mutual information as a new measure. Specifically, letting I(w1, w2) be the mutual information of words w1 and w2, we propose as the new measure the normalized deviation of I(w1, w2) among pairs of words with nearly the same frequencies as w1 and w2. Intuitively, we attend to word pairs whose mutual information is larger than that of pairs of similarly frequent words, and thereby try to reduce the effect of word frequency on mutual information.
The FQ value [7] is one of the previous approaches to this problem. The FQ value is the product of mutual information and co-occurrence probability; its basic idea comes from the mathematical implications of tf-idf. In that paper, the FQ value was applied to text categorization and keyword extraction, and a weighted FQ value depending on word frequency was also considered. Keyword extraction is a problem close to related word extraction, and statistical information about the co-occurrences of words is also used in that area; the use of a co-occurrence graph [8] and of the χ2 measure [9] has been proposed.

We also reveal a correspondence between word ambiguity and related words using word relation graphs constructed with this measure. There is other research on recognizing word polysemy [10]. In that method, the "noun and/or noun" relationship is used as the co-occurrence relation, a word relation graph is made from the numbers of co-occurrences of words, and word polysemy is recognized using the connected components of the graph. In our method, the word relation graph is made from the related words extracted by the normalized deviation of mutual information, and recognizing word polysemy is based on the same idea of using the connected components of the graph.

The remainder of this paper is organized as follows. In Section 2, we give the definition of mutual information and propose a measure to quantify the strength of the relation between two words based on it. In Section 3, we evaluate our method. In Section 4, we discuss an application to the word sense disambiguation problem using word relation graphs. We conclude in Section 5.
2 Extraction of Related Words Using Mutual Information

2.1 Mutual Information
It is considered that the frequency of co-occurrence of two words in documents reflects some kind of relationship between these words. For example, "database" and "retrieval" often co-occur in documents in the field of information retrieval. But sometimes two frequent words co-occur with high probability even when they have no relation. This leads to the simple idea of comparing the probability of two words co-occurring with the product of their respective probabilities. Based on this idea, the mutual information of two words has been used as a means of quantifying the strength of the relation between them [5]. The mutual information between two words w1 and w2, denoted by I(w1, w2), is defined by:

I(w1, w2) = log ( P(w1, w2) / (P(w1) P(w2)) ),

where P(w) is the probability of w occurring in a document, and P(w1, w2) is the probability of w1 and w2 co-occurring. To approximate these probabilities we use the relative frequencies within a large collection of documents.
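As a concrete reference, the following sketch estimates these probabilities by relative document frequencies and computes I(w1, w2); the tiny in-memory corpus is of course only a stand-in for the large document collection used in the paper.

```python
# A minimal sketch of estimating mutual information from document frequencies.
# P(w) and P(w1, w2) are approximated by relative document frequencies.
import math

documents = [
    {"database", "retrieval", "index"},
    {"database", "retrieval", "query"},
    {"retrieval", "ranking"},
    {"soccer", "goal"},
]

def doc_prob(words, docs):
    """Fraction of documents containing all the given words."""
    hits = sum(1 for d in docs if all(w in d for w in words))
    return hits / len(docs)

def mutual_information(w1, w2, docs):
    p1, p2 = doc_prob([w1], docs), doc_prob([w2], docs)
    p12 = doc_prob([w1, w2], docs)
    if p12 == 0 or p1 == 0 or p2 == 0:
        return float("-inf")      # undefined / no co-occurrence
    return math.log(p12 / (p1 * p2))

print(mutual_information("database", "retrieval", documents))  # > 0: related
print(mutual_information("database", "soccer", documents))     # -inf here
```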
2.2 Standardized Mutual Information
Mutual information I(w1, w2) is a useful measure for quantifying the strength of the "relation" between two words w1, w2. It is, however, widely known that this measure is biased towards word frequency. That is, the value of I(w1, w2) grows larger when the frequencies of w1 and w2 are lower, as shown in Fig. 1. We also examined mutual information on a random text (Fig. 3). The ranking of words in this random text follows only Zipf's law, which states that the frequency of a word w is inversely proportional to the rank of w; it is empirically known that natural language texts obey Zipf's law. On the random text, the mutual information is also biased towards word frequency. This means that this tendency of mutual information does not depend on word semantics or on the kind of corpus but only on word frequency. This causes a problem when extracting the related words of a given word using an appropriate threshold value θ: most of the extracted words are low-frequency words, and middle-frequency words are rarely extracted, although middle-frequency words play an important role in applications such as information retrieval.

We propose a way of standardizing the values of mutual information. Let N be the maximum frequency of a word, and let f1, . . . , fr be a sequence of positive integers arranged in increasing order satisfying f1 = 1 and fr = N + 1. The sequence partitions the words into r groups according to their frequencies; that is, a word w belongs to the i-th group if its frequency is in the i-th interval [fi, fi+1 − 1]. This partitions the set of word pairs into r(r + 1)/2 groups. The idea is to standardize I(w1, w2) over the group to which (w1, w2) belongs. The resulting new measure, denoted by Z(w1, w2), is defined as follows:

Z(w1, w2) = ( I(w1, w2) − µi,j ) / σi,j,

where i and j are, respectively, the group numbers of the words w1 and w2, and µi,j and σi,j are, respectively, the average and the standard deviation of I(w1, w2) over the group to which the word pair (w1, w2) belongs. We substitute Z(w1, w2) for I(w1, w2) in the definition of the word relation graph. The word relation graph Gθ(w) of w and the set Wθ(w) of related words of w are re-defined accordingly.
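The standardization can be computed in a few lines once the mutual information values are available; the sketch below uses logarithmically spaced frequency bins, as in the experiments, with the binning details and toy data chosen freely for illustration.

```python
# A minimal sketch of the standardized (Z-score) mutual information.
# `mi` maps a word pair to I(w1, w2) and `freq` maps a word to its frequency;
# both would come from corpus statistics in practice.
import math
from collections import defaultdict
from statistics import mean, pstdev

def freq_group(f, num_groups=100, max_freq=10**6):
    """Index of a logarithmically spaced frequency interval (illustrative)."""
    return min(int(math.log(f, max_freq ** (1 / num_groups))), num_groups - 1)

def standardized_mi(mi, freq):
    groups = defaultdict(list)
    for (w1, w2), value in mi.items():
        key = tuple(sorted((freq_group(freq[w1]), freq_group(freq[w2]))))
        groups[key].append(value)
    stats = {k: (mean(v), pstdev(v) or 1.0) for k, v in groups.items()}
    z = {}
    for (w1, w2), value in mi.items():
        key = tuple(sorted((freq_group(freq[w1]), freq_group(freq[w2]))))
        mu, sigma = stats[key]
        z[(w1, w2)] = (value - mu) / sigma
    return z

# Toy usage: pairs falling in the same frequency-group bucket are standardized together.
freq = {"ftp": 650, "linux": 699, "file": 20501, "internet": 20400}
mi = {("ftp", "linux"): 5.9, ("ftp", "file"): 4.0, ("internet", "file"): 3.5}
print(standardized_mi(mi, freq))
```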
3 Evaluation

3.1 Corpus
We tested our method of extracting related words using INSPEC data, a widely-known database of scientific and technical literature. The dataset we used is the collection of abstracts from the documents concerning Computer and Control Technology issued from 1969 to 2001. It consists of 1,865,281 documents containing 121,707,164 total word occurrences and 393,293 different words. It would be ideal to deal with all kinds of words, but computing the relational ratio between every pair of words is obviously impractical, so we dealt with the most
Fig. 1. Average of mutual information.
Fig. 2. Variance of mutual information.
Fig. 3. Average of mutual information of random text.
frequent 10,000 words. They cover about 96.0% of the total word occurrences in the documents.

3.2 Related Word Extraction
First, we compared the related words extracted using the standard score Z(w1, w2) with those extracted simply using the mutual information. The frequency range was partitioned into 100 intervals so that the logarithms of the boundaries fi were equally spaced; the most frequent 30 words were excluded as stop words so as to avoid intervals containing few words. Fig. 1 and Fig. 2 are the scatter diagrams of the average and the deviation of I(w1, w2) within narrow frequency ranges, respectively.
Table 1 lists the 20 most frequent of the extracted related words of "ftp". Both measures extracted words which have an adequate relationship to "ftp", but with mutual information many low-frequency words were extracted with high scores. With the standard score Z(ftp, w), the extraction was only lightly affected by word frequency. Also, highly related words such as "file" were extracted with a high score even though they have low mutual information because they are high-frequency words. The extraction for other words showed the same tendency as for "ftp". Therefore, Z(w1, w2) is more effective than the mutual information for extracting related words.

Table 1. Related words of "ftp".

  w            f(w)   I(ftp, w)  |  w           f(w)   Z(ftp, w)
  internet     20400  5.30       |  file        20501  4.03
  web          15786  4.92       |  internet    20400  4.67
  mail          6749  5.77       |  web         15786  4.23
  servers       6591  4.91       |  mail         6749  4.92
  ip            5230  5.61       |  servers      6591  4.15
  www           4261  6.11       |  ip           5230  4.86
  tcp           2986  6.41       |  www          4261  5.24
  news          2378  5.09       |  tcp          2986  5.53
  html          1905  5.56       |  freely       2484  3.98
  http          1499  7.09       |  news         2378  4.28
  archives      1239  5.52       |  html         1905  4.73
  bulletin       960  4.96       |  archive      1610  3.85
  mosaic         951  5.56       |  http         1499  6.18
  downloaded     824  6.26       |  archives     1239  4.49
  downloading    751  5.32       |  mosaic        951  4.24
  proxy          737  5.19       |  downloaded    824  5.28
  linux          699  5.91       |  linux         699  4.50
  anonymous      657  9.11       |  anonymous     657  8.36
  download       655  6.52       |  download      655  5.14
  mailing        643  5.68       |  mailing       643  4.11
4 Application
In this section, we propose an application of word relation graphs to word polysemy.

4.1 Word Relation Graph
Let θ be a fixed threshold value. A word relation graph is an undirected graph such that the vertices are the words and an edge exists between two words w1 and w2 if and only if I(w1 , w2 ) ≥ θ. The word relation graph of a word w, denoted by Gθ (w), is its subgraph induced by limiting the vertices to those
directly connected to w, together with w itself. We call all vertices but w in Gθ(w) the related words of w. The set of related words of a word w is denoted by Wθ(w). Fig. 4 displays the word relation graph of the word "scripts" for θ = 3.8 from the INSPEC database. Based on the word relation graph of a word w, we can partition the related words of w into two groups: the related words of w that are connected only to w, and the rest. The words in the former group seem to have a weak relation to w compared to the words in the latter. Discarding the words in the former group possibly improves the precision of related word extraction. Remark that the word relation graph of a word is connected by definition. Removing the vertex w and the edges connected to w often divides the graph Gθ(w) into more than one connected component. We hypothesize that in such cases the connected components correspond, respectively, to the multiple senses of w. In the following subsection, we discuss an application to the word sense disambiguation problem.
Fig. 4. Word relation graph of “scripts” for θ = 3.8.
Fig. 5. Word relation graph of "explorer" for θ = 4.2.

4.2 Word Polysemy with Cut-Vertex
A vertex of an undirected graph is said to be a cut-vertex if its removal increases the number of connected components. The word relation graph Gθ(w) of a word w is connected by definition, and we are interested in the cases where the vertex of w is a cut-vertex of Gθ(w), that is, where the components are connected to each other only via the word w. For example, see the word relation graph of "air" for θ = 4.0 displayed in Fig. 6. We can see that the vertex "air" is a cut-vertex of this graph, and its removal divides the graph into two connected components. In this way, the related words of "air" are partitioned into two groups. It seems that one group is the category concerning "gas" and the other the category concerning "aerial". The division of the word relation graph thus appears to coincide well with the polysemy of "air". Generalizing this observation, we hypothesize that when a word w is a cut-vertex of its word relation graph for some threshold θ, the connected components resulting from the removal of the vertex w correspond to the multiple senses of w.
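The construction of the word relation graph and the cut-vertex test can be sketched as follows; the networkx library and the toy scores are our own choices for illustration, not part of the paper's implementation.

```python
# A minimal sketch: build the word relation graph of a word w from pairwise
# scores (Z or I) and a threshold, then check whether w is a cut-vertex and,
# if so, report the connected components as candidate senses.
import networkx as nx

scores = {  # illustrative scores only
    ("air", "gas"): 4.5, ("air", "oxygen"): 4.3, ("gas", "oxygen"): 4.2,
    ("air", "traffic"): 4.4, ("air", "aircraft"): 4.6, ("traffic", "aircraft"): 4.1,
}

def word_relation_graph(w, scores, theta):
    g = nx.Graph((a, b) for (a, b), s in scores.items() if s >= theta)
    nodes = set(g.neighbors(w)) | {w} if w in g else {w}
    return g.subgraph(nodes).copy()

def candidate_senses(w, scores, theta):
    g = word_relation_graph(w, scores, theta)
    rest = g.copy()
    rest.remove_node(w)
    components = list(nx.connected_components(rest))
    is_cut = len(components) > 1          # w was a cut-vertex of G_theta(w)
    return is_cut, components

print(candidate_senses("air", scores, theta=4.0))
# (True, [{'gas', 'oxygen'}, {'traffic', 'aircraft'}])
```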
Fig. 6. The word relation graph of “air” for θ = 4.0, in which the vertex “air” is a cut-vertex which partitions the related words into two groups.
5
Conclusion
This paper has presented a method of extracting related words from a huge corpus, based on a standardized mutual information computed from statistical information about the co-occurrences of words. We have shown that mutual information depends mainly on the frequencies of words, and that this tendency does not depend on the kind of corpus. To address this problem, we proposed using a standardized mutual information; intuitively, we attend to pairs of words which have an especially large mutual information compared with similarly frequent words. As a result, we could reduce the effect of word frequency on related word extraction. We also considered the word relation graph. We revealed a correspondence between word ambiguity and the division of the word relation graph, and evaluated it. The word relation graph has the potential to be effective for application to the word sense disambiguation problem. In this paper, we did not stem any words. The word relation graph of "scripts" (Fig. 4) was not divided, but if "scripts" was converted into "script", the graph was divided. We also did not use an n-gram model for the corpus. Using stemming and an n-gram model would divide some graphs more adequately.
References

1. Ellen M. Voorhees, On expanding query vectors with lexically related words. Proceedings of the Second Text Retrieval Conference, pp. 223–231, 1994.
2. Y. Jing and B. Croft, An association thesaurus for information retrieval. Proceedings of RIAO, pp. 146–160, 1994.
3. Christiane Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
4. Dekang Lin and Patrick Pantel, DIRT - Discovery of Inference Rules from Text. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2001, pp. 323–328, 2001.
5. Kenneth Ward Church and Patrick Hanks, Word association norms, mutual information, and lexicography. Computational Linguistics, Vol. 16, No. 1, pp. 22–29, 1990.
6. Dunning, T., Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61–74, 1993.
7. Akiko Aizawa, The Feature Quantity: An Information Theoretic Perspective of Tfidf-like Measures. Proceedings of ACM SIGIR 2000, pp. 104–111, 2000.
8. Yukio Ohsawa, Nels E. Benson and Masahiko Tachida, KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor. Proceedings of the IEEE Advanced Digital Library Conference, pp. 12–18, 1999.
9. Yutaka Matsuo and Mitsuru Ishizuka, Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. Proceedings of the 16th International FLAIRS Conference, pp. 392–396, 2003.
10. Dominic Widdows and Beate Dorow, A Graph Model for Unsupervised Lexical Acquisition. 19th International Conference on Computational Linguistics, pp. 1093–1099, 2002.
Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Masayuki Takeda1,2, Shunsuke Inenaga1,2, Hideo Bannai3, Ayumi Shinohara1,2, and Setsuo Arikawa1

1 Department of Informatics, Kyushu University 33, Fukuoka 812-8581, Japan
2 PRESTO, Japan Science and Technology Corporation (JST)
{takeda, s-ine, ayumi, arikawa}@i.kyushu-u.ac.jp
3 Human Genome Center, University of Tokyo, Tokyo 108-8639, Japan
[email protected]
Abstract. The classificatory power of a pattern is measured by how well it separates two given sets of strings. This paper gives practical algorithms to find the fixed/variable-length-don't-care pattern (FVLDC pattern) and the approximate FVLDC pattern which are most classificatory for two given string sets. We also present algorithms to discover the best window-accumulated FVLDC pattern and window-accumulated approximate FVLDC pattern. All of our new algorithms run in a practical amount of time by means of suitable pruning heuristics and fast pattern matching techniques.
1
Introduction
String pattern discovery is central to the recent trend of knowledge discovery from computational datasets, since such data are usually stored as strings. Especially, the optimization problem of finding a pattern that appears most frequently in the set of positive examples and least frequently in the set of negative examples is of great importance. To obtain useful knowledge from given datasets, we began with possibly the most basic and simple pattern class, the substring pattern class (known from the work on BONSAI [8]), for which the problem can be solved in linear time [3]. In many applications, however, it is necessary to consider a more flexible and expressive pattern class, for example in the field of bioinformatics, since biological functions are retained between sequences even if they are slightly different. In fact, the function may be dependent on two regions which are some distance apart in the sequence, but are close in the three dimensional structure that the sequence assumes. To this end, we considered the subsequence pattern class [3], and then the variable-length-don't-care pattern class (VLDC pattern class) [5]. An example of a VLDC pattern is ★a★abb★ with a, b ∈ Σ, where the variable length don't care symbol ★ matches any string. The VLDC pattern class is a generalization of the substring and subsequence pattern classes. In this paper, we further consider mismatches on the constant segments of VLDC patterns. Firstly, we consider replacing a character with a fixed length don't care symbol ◦ that matches any single character. It yields a pattern such as
★a★a◦b★ for the running example. Such a pattern is called a fixed/variable-length-don't-care pattern (FVLDC pattern), and its class is named the FVLDC pattern class. Pursuing a pattern class of even more expressive power, we secondly apply an approximate matching measure to the constant segments of FVLDC patterns. An approximate FVLDC pattern is a pair ⟨q, k⟩ where q is an FVLDC pattern and k is a threshold for the number of mismatches in the constant segments of q. The approximate FVLDC pattern class no doubt has a great expressive power, but at the same time it includes quite many patterns without classificatory power, in the sense of being too general and matching even most negative examples. Typically, an approximate FVLDC pattern could match almost all long texts over a small alphabet. The same problem occurred for the subsequence pattern class in the first place, but its window-accumulated version, called the episode pattern class, has overcome this difficulty [4]. This paper considers the window-accumulated FVLDC pattern class as well as the window-accumulated approximate FVLDC pattern class. We show that not only do they possess a remarkable expressive power, but they also include many patterns with a very good classificatory power. The main result of this paper consists in new practical algorithms to find a best pattern that separates two given string datasets, for all of the FVLDC pattern class, the approximate FVLDC pattern class, the window-accumulated FVLDC pattern class, and the window-accumulated approximate FVLDC pattern class. Each algorithm runs in a reasonable amount of time, thanks to suitable pruning heuristics and pattern matching techniques. Interested readers are referred to our previous work [3,4,5] for the overall idea of our project, and to the technical report [9] for more details of this work.
2
Preliminaries
Let N be the set of non-negative integers. Let Σ be a finite alphabet. An element of Σ∗ is called a string. The length of a string w is the number of characters in w and is denoted by |w|. The empty string is denoted by ε, that is, |ε| = 0. Strings x, y, and z are said to be a prefix, substring, and suffix of string w = xyz, respectively. The substring of a string w that begins at position i and ends at position j is denoted by w[i : j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i : j] = ε for j < i. The reversal of a string w is denoted by wR. For a set S ⊆ Σ∗ of strings, the number of strings in S is denoted by |S| and the total length of the strings in S is denoted by ‖S‖.

Let ★ be a special symbol called the variable length don't care, matching any string in Σ∗. A string over Σ ∪ {★} is a variable-length-don't-care pattern (VLDC pattern in short). For example, ★a★ab★ba★ is a VLDC pattern with a, b ∈ Σ. We say a VLDC pattern q matches a string w if w can be obtained by replacing the ★'s in q with some strings. In the running example, the VLDC pattern ★a★ab★ba★ matches the string abababbbaa with the ★'s replaced by ab, b, b and a, respectively. The size of p, denoted by size(p), is the length of p excluding all ★'s. Thus,
size(p) = 5 and |p| = 9 for p = ★a★ab★ba★. We remark that size(p) is the minimum length of the strings p matches.

A pattern class over Σ is a pair (Π, L) consisting of a set Π of descriptions over some finite alphabet, called patterns, and a function L that maps a pattern π ∈ Π to its language L(π) ⊆ Σ∗. A pattern π ∈ Π is said to match a string w ∈ Σ∗ if w belongs to the language L(π).

Let good be a function from Π × 2^{Σ∗} × 2^{Σ∗} to the real numbers. The problem we consider is: Given two sets S, T ⊆ Σ∗ of strings, find a pattern π ∈ Π that maximizes the score good(π, S, T). Intuitively, the score good(π, S, T) expresses the "goodness" of π in the sense of distinguishing S from T. The definition of good varies with applications. For example, the χ2 values, entropy information gain, and Gini index are often used. Essentially, these statistical measures are defined by the number of strings that satisfy the rule specified by π. Any of the above-mentioned measures can be expressed by the following form: good(π, S, T) = f(xπ, yπ, |S|, |T|), where xπ = |S ∩ L(π)| and yπ = |T ∩ L(π)|. When S and T are fixed, xmax = |S| and ymax = |T| are regarded as constants. On this assumption, we abbreviate the notation of the function to f(x, y). In the sequel, we assume that f is conic [3,4,5] and can be evaluated in constant time. Let F(x, y) = max{f(x, y), f(x, 0), f(0, y), f(0, 0)}. The following lemma derives from the conicality of the function f, on which our pruning heuristics are based.

Lemma 1 ([3]). For any (x, y), (x′, y′) ∈ [0, xmax] × [0, ymax], if x ≤ x′ and y ≤ y′, then f(x, y) ≤ F(x′, y′).
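To make the role of Lemma 1 concrete, here is a sketch of its branch-and-bound use: since extending a pattern can only shrink the sets of matched strings, F(x_p, y_p) bounds the score of every extension of p, and a branch whose bound cannot beat the current best can be discarded. The scoring function used below (entropy-based information gain, one of the measures the paper mentions) and the search skeleton are illustrative choices, not the paper's exact procedure.

```python
# A minimal sketch of the pruning principle behind Lemma 1.
# f(x, y) is a conic score; F(x, y) = max of f at the four "corners" is an
# upper bound on f(x', y') for any x' <= x, y' <= y.
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def make_f(x_max, y_max):
    base = entropy(x_max / (x_max + y_max))
    def f(x, y):                      # information gain of the split by the pattern
        n, m = x + y, (x_max - x) + (y_max - y)
        total = x_max + y_max
        inside = entropy(x / n) if n else 0.0
        outside = entropy((x_max - x) / m) if m else 0.0
        return base - (n / total) * inside - (m / total) * outside
    return f

def upper_bound(f, x, y):
    return max(f(x, y), f(x, 0), f(0, y), f(0, 0))

def should_prune(f, x_p, y_p, best_score):
    """Prune pattern p (and all its extensions) if its bound cannot win."""
    return upper_bound(f, x_p, y_p) <= best_score

f = make_f(x_max=100, y_max=100)
print(round(f(60, 10), 3))                              # score: 60 pos / 10 neg matched
print(should_prune(f, x_p=5, y_p=5, best_score=0.3))    # True: hopeless branch
```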
3
Allowing Mismatches by Don’t Cares
A VLDC pattern requires that all of its constant segments occur within a string in the specified order, with variable-length gaps. To obtain a more expressive pattern class, we first introduce the don't care symbol ◦ that matches any single character of Σ. We have two purposes: one is to allow mismatches in the constant segments of a VLDC pattern; the other is to realize a variety of gap symbols. An s-times repetition of ◦ works as a fixed-length don't care that matches any string of length s over Σ. Also, a VLDC followed by an s-times repetition of ◦ expresses a lower-bounded-length gap symbol that matches any string of length at least s. A string in Π = (Σ ∪ {★, ◦})∗ is called a fixed/variable-length don't-care pattern (an FVLDC pattern), and (Π, L) is the FVLDC pattern class. The length and the size of an FVLDC pattern are defined similarly to those of a VLDC pattern, regarding ◦ as a character of Σ.

Definition 1 (Finding best FVLDC pattern according to f).
Input: Two sets S, T ⊆ Σ∗ of strings.
Output: An FVLDC pattern p ∈ Π that maximizes the score f(xp, yp), where xp = |S ∩ L(p)| and yp = |T ∩ L(p)|.
Note that any FVLDC pattern p of size greater than ℓ matches no strings in S ∪ T, and thus we have xp = 0 and yp = 0. The maximum possible length of p is 2·size(p) + 1, since we ignore patterns with two or more consecutive ★'s, and therefore we can restrict the length of the patterns to be examined against S and T to 2ℓ + 1.

Lemma 2 (Search space for best FVLDC pattern). The best FVLDC pattern for S and T can be found in Π(ℓ) = {p ∈ Π | |p| ≤ 2ℓ + 1}, where ℓ is the maximum length of the strings in S ∪ T.

We can prune the search space according to the following lemma.

Lemma 3 (Pruning lemma for FVLDC pattern). Let p be any FVLDC pattern in Π. Then, f(xpq, ypq) ≤ F(xp, yp) for every FVLDC pattern q ∈ Π.

The following is a sub-problem of Definition 1, which should be solved quickly.

Definition 2 (Counting matched FVLDC patterns).
Input: A set S ⊆ Σ∗ of strings and an FVLDC pattern p ∈ Π.
Output: The cardinality of the set S ∩ L(p).

The minimum DFA that accepts L(p) for an FVLDC pattern p has an exponential number of states. However, there is a nondeterministic finite automaton (NFA) with only m + 1 states that accepts the same language. As a practical solution, we apply bit-parallel simulation to this NFA. We use (m + 1)-bit integers to simulate, in constant time per character of the input strings, all the state transitions in parallel. When m + 1 is not greater than the computer word length, say 32 or 64, the algorithm runs in O(‖S‖) time after O(|p||Σ|)-time preprocessing of p.
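The following is a sketch of this (m + 1)-state NFA simulation using Python integers as bit vectors; it is our own illustrative reconstruction (the paper only outlines the technique), and it assumes, as the paper does, that patterns contain no two consecutive ★'s.

```python
# A sketch of bit-parallel matching for an FVLDC pattern over Σ ∪ {★, ◦}.
# Bit i of the state vector means: p[1..i] matches the text read so far.
# Python ints serve as unbounded bit vectors.

def compile_pattern(p, alphabet):
    m = len(p)
    star = 0                        # bit mask of ★ positions (1-based bits)
    table = {c: 0 for c in alphabet}
    for i, sym in enumerate(p, start=1):
        if sym == "★":
            star |= 1 << i
        elif sym == "◦":
            for c in alphabet:
                table[c] |= 1 << i  # ◦ matches any single character
        else:
            table[sym] = table.get(sym, 0) | (1 << i)
    return m, star, table

def eps_closure(d, star):
    while True:                     # an active bit i enables a following ★ at i+1
        nd = d | ((d << 1) & star)
        if nd == d:
            return d
        d = nd

def matches(p, text, alphabet):
    m, star, table = compile_pattern(p, alphabet)
    d = eps_closure(1, star)        # state 0 active, plus leading ★'s
    for c in text:
        d = ((d << 1) & (table.get(c, 0) | star)) | (d & star)
        d = eps_closure(d, star)
    return bool(d & (1 << m))

sigma = {"a", "b"}
print(matches("★a★a◦b★", "abababbbaa", sigma))  # True
print(matches("★a★a◦b★", "bb", sigma))          # False
```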
4 Finding Best Approximate FVLDC Patterns
Let (Π, L) be the FVLDC pattern class, and let δ : Σ∗ × Σ∗ → N ∪ {∞} be the well-known Hamming distance measure [2]. A pattern p ∈ Π is said to approximately match a string w within a distance k if there is a string w′ ∈ L(p) such that δ(w, w′) ≤ k. For any ⟨p, k⟩ ∈ Π × N, let Lδ(p, k) = {w ∈ Σ∗ | ∃w′ ∈ L(p) such that δ(w, w′) ≤ k}. Then, the pair (Π × N, Lδ) is a new pattern class derived from (Π, L) with δ. Let us call the elements of Π × N the approximate FVLDC patterns. For a fixed k ∈ N, an ordered pair ⟨p, k⟩ with p ∈ Π is called a k-approximate FVLDC pattern.
Definition 3 (Finding best approximate FVLDC pattern according to f).
Input: Two sets S, T ⊆ Σ∗ of strings, and a non-negative integer kmax.
Output: An approximate FVLDC pattern π = ⟨q, k⟩ ∈ Π × [0, kmax] that maximizes the score f(xπ, yπ), where xπ = |S ∩ Lδ(π)| and yπ = |T ∩ Lδ(π)|.
We here have to find the best combination of a pattern q and an error level k.
Lemma 4 (Search space for best approximate FVLDC pattern). The best approximate FVLDC pattern for S, T ⊆ Σ∗ and kmax can be found in Π(ℓ) × [0, kmax], where Π(ℓ) is the same as in Lemma 2.
We have two pruning techniques for the problem of Definition 3. The first one is as follows.
Definition 4 (Computing best error level according to f).
Input: Two sets S, T ⊆ Σ∗ of strings, an FVLDC pattern q ∈ Π, and a non-negative integer kmax.
Output: An integer k ∈ [0, kmax] that maximizes the score f(xπ, yπ) for π = ⟨q, k⟩, where xπ = |S ∩ Lδ(π)| and yπ = |T ∩ Lδ(π)|.
For an FVLDC pattern q ∈ Π and a string u ∈ Σ∗, we define the distance between q and u by Distδ(q, u) = min{δ(w, u) | w ∈ L(q)}. If there is no such w, let Distδ(q, u) = ∞. For q ∈ Π and S ⊆ Σ∗, let ∆(q, S) = {Distδ(q, u) | u ∈ S}. Then, the best error level of a pattern q ∈ Π for given S, T ⊆ Σ∗ can be found in the set ∆(q, S ∪ T) ∩ [0, kmax].
Lemma 5. For an FVLDC pattern q ∈ Π and a string w ∈ Σ∗, Distδ(q, w) can be computed in O(|q||w|) time.
Proof. Directly from the results of Myers and Miller [6], in which regular expressions are treated instead of FVLDC patterns.
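Lemma 5 can be realized, for instance, by the following O(|q||w|) dynamic program; the paper derives the bound from Myers and Miller, so this particular formulation, together with the '*'/'o' ASCII stand-ins for ⋆ and ◦, is only our own sketch.

```python
INF = float('inf')

def hamming_dist_to_pattern(q, w, star='*', dontcare='o'):
    """min{ delta(v, w) : v in L(q) }: the fewest constant positions of q that
    must mismatch for q to match w exactly; INF if no v in L(q) has length |w|."""
    n = len(w)
    # cur[j] = cheapest way for the processed prefix of q to generate w[:j]
    cur = [0] + [INF] * n
    for sym in q:
        if sym == star:
            # '*' may absorb any run of characters of w at no cost
            for j in range(1, n + 1):
                cur[j] = min(cur[j], cur[j - 1])
        else:
            nxt = [INF] * (n + 1)
            for j in range(1, n + 1):
                if cur[j - 1] < INF:
                    cost = 0 if (sym == dontcare or sym == w[j - 1]) else 1
                    nxt[j] = cur[j - 1] + cost
            cur = nxt
    return cur[n]
```

Collecting these distances over S ∪ T yields the candidate set ∆(q, S ∪ T) for the best error level of Definition 4.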
Lemma 6 (Pruning lemma 1 for approximate FVLDC pattern). Let p be any FVLDC pattern in Π and π = ⟨p, kmax⟩. Then, f(xτ, yτ) ≤ F(xπ, yπ) for every approximate FVLDC pattern τ = ⟨pq, k⟩ such that q ∈ Π and k ∈ [0, kmax].
The second approach for pruning the search space is quite simple. We repeatedly execute a procedure that finds the best k-approximate FVLDC pattern ⟨p, k⟩ for S and T, in increasing order of k = 0, 1, . . . , kmax. It is possible to prune the search space Π(ℓ) × {k} by:
Lemma 7 (Pruning lemma 2 for approximate FVLDC pattern). Let k ∈ [0, kmax], let p be any FVLDC pattern in Π, and let π = ⟨p, k⟩. Then, f(xτ, yτ) ≤ F(xπ, yπ) for every k-approximate FVLDC pattern τ = ⟨pq, k⟩ with q ∈ Π.
The following is a sub-problem of Definition 3, which should be solved quickly.
Definition 5 (Counting matched approximate FVLDC patterns).
Input: A set S ⊆ Σ∗ of strings and an approximate FVLDC pattern ⟨p, k⟩.
Output: The cardinality of the set S ∩ Lδ(p, k).
We developed an efficient algorithm to solve the above sub-problem. Although we omit the details due to lack of space, it performs a diagonal-wise bit-parallel simulation (see [7]) of an NFA that recognizes the language Lδ(p, k) of an approximate FVLDC pattern ⟨p, k⟩. The NFA being simulated has (m+1)(k+1)
states (m = size(p)), but (m − k + 1)(k + 1) bits are enough. If the (m − k + 1)(k + 1)-bit representation fits in a single computer word, it runs in time linear in the total length of the strings in S after O(|p||Σ|)-time preprocessing of ⟨p, k⟩. The algorithm was inspired by the work of Baeza-Yates and Navarro [1], which aims at approximate substring pattern matching where the Levenshtein distance, not the Hamming distance, is used as the distance measure δ. Although their algorithm could easily be extended to approximate FVLDC pattern matching if the Levenshtein distance measure were used, a new development is actually necessary to cope with the Hamming distance.
5 Extension to Window-Accumulated Patterns
For any pattern class (Π, L), we introduce a window whose size (width) limits the length of a pattern occurrence within a string. A pattern p ∈ Π is said to occur in a string w within a window of size h if w has a substring of length at most h that the pattern p matches. For any pair ⟨p, h⟩ ∈ Π × N, let
L̂(p, h) = {w ∈ Σ∗ | p occurs in w within a window of size h},
and let L̂(p, ∞) = L(p) for convenience. The pair (Π × N, L̂) is a new pattern class derived from (Π, L). We call the elements of Π × N the window-accumulated patterns for (Π, L).
Definition 6 (Finding the best window-accumulated pattern in (Π, L) according to f).
Input: Two sets S, T ⊆ Σ∗ of strings.
Output: A window-accumulated pattern π = ⟨q, h⟩ ∈ Π × N that maximizes the score f(xπ, yπ), where xπ = |S ∩ L̂(π)| and yπ = |T ∩ L̂(π)|.
We stress that h is not given beforehand, and hence we have to find the best combination of a pattern q and a window width h. The search space is thus Π × N, not Π. The following is a sub-problem of Definition 6, which should be solved quickly.
Definition 7 (Computing best window size for (Π, L) according to f).
Input: Two sets S, T ⊆ Σ∗ of strings and a pattern q ∈ Π.
Output: An integer h ∈ N that maximizes the score f(x⟨q,h⟩, y⟨q,h⟩), where x⟨q,h⟩ = |S ∩ L̂(q, h)| and y⟨q,h⟩ = |T ∩ L̂(q, h)|.
For a pattern q ∈ Π and for a string u ∈ Σ∗, we define the minimum window size of q for u by θq,u = min{h ∈ N | u ∈ L̂(q, h)}. If there is no such value, let θq,u = ∞. For any q ∈ Π and any S ⊆ Σ∗, let Θ(q, S) = {θq,u | u ∈ S}. The best window size of q ∈ Π for S, T ⊆ Σ∗ can be found in Θ(q, S ∪ T).
Lemma 8 (Search space for best window-accumulated pattern). The best window-accumulated pattern in (Π, L) for S and T can be found in {⟨q, h⟩ | q ∈ Π and h ∈ Θ(q, S ∪ T)}.
We emphasize that the above discussion holds for the window-accumulated version of any pattern class (Π, L). However, the complexity of computing the minimum window size depends on (Π, L).
5.1 Window-Accumulated FVLDC Patterns
This section is devoted to finding the best window-accumulated FVLDC pattern from two given sets S, T of strings.
Lemma 9 (Search space for best window-accumulated FVLDC pattern). The best window-accumulated FVLDC pattern for S and T can be found in {⟨q, h⟩ | q ∈ Π(ℓ) and h ∈ Θ(q, S ∪ T)}, where Π(ℓ) is the same as in Lemma 2.
Lemma 10 (Pruning lemma for window-accumulated FVLDC pattern). Let p be an FVLDC pattern and π = ⟨p, ∞⟩. Then, f(xτ, yτ) ≤ F(xπ, yπ) for every window-accumulated FVLDC pattern τ = ⟨pq, h⟩ such that q ∈ Π and h ∈ N.
Lemma 11. The minimum window size θq,w of an FVLDC pattern q for a string w ∈ Σ∗ can be computed in O(|q||w|) time.
Proof. By a standard dynamic programming approach.
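One way to read the "standard dynamic programming approach" of Lemma 11 is the following O(|q||w|) sketch, which keeps, for every end position of w, the latest start position from which q matches the enclosed substring exactly; the formulation and identifiers are ours, with '*' and 'o' again standing in for ⋆ and ◦.

```python
def min_window_size(q, w, star='*', dontcare='o'):
    """theta_{q,w}: length of the shortest substring of w that q matches,
    or None if q occurs nowhere in w.  Runs in O(|q||w|) time."""
    n = len(w)
    NONE = -1
    # cur[j] = largest i such that w[i:j] is generated by the processed
    #          prefix of q (NONE if impossible); the empty prefix starts anywhere
    cur = list(range(n + 1))
    for sym in q:
        if sym == star:
            # '*' may absorb zero or more characters at no cost; scanning j in
            # increasing order with a running maximum realizes multi-character gaps
            for j in range(1, n + 1):
                cur[j] = max(cur[j], cur[j - 1])
        else:
            nxt = [NONE] * (n + 1)
            for j in range(1, n + 1):
                if cur[j - 1] != NONE and (sym == dontcare or sym == w[j - 1]):
                    nxt[j] = cur[j - 1]
            cur = nxt
    widths = [j - cur[j] for j in range(n + 1) if cur[j] != NONE]
    return min(widths) if widths else None
```

For instance, min_window_size('a*b', 'xaxbx') returns 3, the width of the window 'axb'.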
The dynamic programming method is, however, relatively slow in practice. An alternative way to solve the problem is to build from a given FVLDC pattern q two NFAs accepting L(q) and L(q^R) (the reversal of q), which we call the forward and backward NFAs, respectively. An occurrence of a pattern q that starts at position i and ends at position j of a string w, denoted by (i, j), is said to be minimal if no proper substring of w[i : j] contains an occurrence of q. Let (i1, j1), . . . , (ir, jr) be the sequence of the minimal occurrences of q in w satisfying i1 < · · · < ir. We run the forward NFA over w starting at the first character to determine the value j1. When j1 is found, we use the backward NFA going backward starting at the j1-th character in order to determine the value i1. After i1 is determined, we again use the forward NFA going forward starting at the (i1 + 1)-th character to find the value j2. Continuing in this fashion, we can determine all of the minimal occurrences of q in w. The minimum window size is obtained as the minimum among the widths of the minimal occurrences. We simulate the two NFAs over a given string based on the bit-parallelism mentioned in Section 3 when size(q) + 1 does not exceed the computer word length. Although the running time of this method is O(r|w|) = O(|w|²) in the worst case, it performs well in practice compared with the above-mentioned dynamic-programming-based method, since the number r of minimal occurrences of q is usually not large.
5.2 Window-Accumulated Approximate FVLDC Patterns
The search space for the best window-accumulated approximate FVLDC pattern is Π × [0, kmax ] × N . A reasonable approach would be to compute the best pattern in Π × {k} × N for each k = 0, 1, . . . , kmax , and then choose the best one among them. We have only to consider finding the best window-accumulated k-approximate FVLDC pattern for a fixed k.
Lemma 12 (Search space for best window-accumulated k-approximate FVLDC pattern). Let k be a fixed non-negative integer. The best window-accumulated k-approximate FVLDC pattern for S, T can be found in {⟨q, k, h⟩ | q ∈ Π(ℓ) and h ∈ Θ(⟨q, k⟩, S ∪ T)}, where Π(ℓ) is the same as in Lemma 2.
Lemma 13 (Pruning lemma for window-accumulated k-approximate FVLDC pattern). Let k be a fixed non-negative integer. Let p be an FVLDC pattern, and let π = ⟨p, k, ∞⟩. Then, f(xτ, yτ) ≤ F(xπ, yπ) for every window-accumulated k-approximate FVLDC pattern τ = ⟨pq, k, h⟩ such that q ∈ Π and h ∈ N.
Lemma 14. Let k be a fixed non-negative integer. The minimum window size θ⟨q,k⟩,w of a k-approximate FVLDC pattern ⟨q, k⟩ for a string w ∈ Σ∗ can be computed in O(k|q||w|) time.
Proof. A straightforward extension of the dynamic programming method for Lemma 11.
In practice, we again adopt the two-NFA-based approach. It is possible to simulate the NFA for an approximate FVLDC pattern ⟨q, k⟩ over a string w using (m − k + 1)(k + 1)-bit integers in linear time, where m = size(q). The running time is therefore O(r|w|) if (m − k + 1)(k + 1) does not exceed the computer word length.
References
1. R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.
2. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York, 1997.
3. M. Hirao, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm to find the best subsequence patterns. In Proc. Discovery Science 2000, volume 1967 of LNAI, pages 141–154. Springer-Verlag, 2000.
4. M. Hirao, S. Inenaga, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm to find the best episode patterns. In Proc. Discovery Science 2001, volume 2226 of LNAI, pages 435–440. Springer-Verlag, 2001.
5. S. Inenaga, H. Bannai, A. Shinohara, M. Takeda, and S. Arikawa. Discovering best variable-length-don't-care patterns. In Proc. Discovery Science 2002, volume 2534 of LNCS, pages 86–97. Springer-Verlag, 2002.
6. E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5–37, 1989.
7. G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, Cambridge, 2002.
8. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa. Knowledge acquisition from amino acid sequences by machine learning system BONSAI. Trans. of Information Processing Society of Japan, 35(10):2009–2018, 1994.
9. M. Takeda, S. Inenaga, H. Bannai, A. Shinohara, and S. Arikawa. Discovering most classificatory patterns for very expressive pattern classes. Technical Report DOI-TR-CS-219, Department of Informatics, Kyushu University, 2003.
Mining Interesting Patterns Using Estimated Frequencies from Subpatterns and Superpatterns
Yukiko Yoshida, Yuiko Ohta, Ken'ichi Kobayashi, and Nobuhiro Yugami
Fujitsu Laboratories, Ltd., 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki, Kanagawa 211-8588, Japan
{y-yoshida,yuiko,kenichi,yugami}@jp.fujitsu.com
Abstract. In knowledge discovery in databases, the number of discovered patterns is often too large for humans to understand, so filtering out less important ones is needed. For this purpose, a number of interestingness measures for patterns have been introduced; conventional ones evaluate a pattern by how much its actual frequency exceeds the values predicted from its subpatterns. These measures may assign high scores not only to a pattern consisting of a set of strongly correlated items but also to its subpatterns, and in many cases it is unnecessary to select all of these subpatterns as interesting. To reduce this redundancy, we propose a new approach to evaluating the interestingness of patterns. We use a measure of interestingness which evaluates how much the actual frequency of a pattern exceeds the frequencies predicted not only from its subpatterns but also from its superpatterns. Thanks to the added estimation from superpatterns, our measure can filter out redundant subpatterns more powerfully than conventional measures. We discuss the effectiveness of our interestingness measure through a set of experimental results.
1 Introduction
Discovering frequent patterns in databases is one of the major tasks in data mining, and a number of methods such as Apriori [1] have been developed for it. However, the number of discovered patterns is often too large for humans to understand, so filtering out unimportant ones is needed. One of the major approaches is to evaluate the interestingness of patterns, and a number of measures of interestingness have been proposed, as reviewed in [7,9,14]. These measures are conventionally based on a comparison of the frequency of a pattern to those of its subpatterns. For example, the Gini index and χ² are based on a correlation between two subpatterns divided from a pattern [14]. Dong and Li [5] introduced a neighbourhood-based interestingness in which the distance between two patterns is calculated using the frequencies of their subpatterns. Hussain et al. [8] proposed a measure of relative interestingness in which a pattern is separated into a novel part and a well-known part. Jaroszewicz and Simovici [10] estimated the frequency of a pattern from those of all its subpatterns using the maximum entropy method to evaluate the pattern's interestingness.
Table 1. An example of a database.
T1: {A, B, C, D, E}
T2: {D, E}
T3: {E}
T4: {A, B, C, D}
T5: {D, E}
T6: {A, B, C}

Table 2. Frequent patterns with the minimum support three.
pattern   frequency   pattern     frequency
{A}       3           {A, B}      3
{B}       3           {A, C}      3
{C}       3           {B, C}      3
{D}       4           {D, E}      3
{E}       4           {A, B, C}   3
Calders and Goethals [4] proposed a method to detect redundant patterns by checking whether the exact frequency of a pattern can be derived from those of its subpatterns. These approaches, however, may give high scores not only to a pattern consisting of a set of strongly correlated items but also to its subpatterns. In many cases, it is unnecessary to select all of these subpatterns as interesting. To reduce this redundancy, we propose a new approach to evaluating the interestingness of patterns. Our approach is to estimate the frequency of a pattern from not only its subpatterns but also its superpatterns. Thanks to the added estimation from superpatterns, our measure can filter out redundant subpatterns more powerfully than conventional measures. In the rest of this paper, we first review preliminary notions for the evaluation of interestingness of patterns, introduce our approach and interestingness measure, and discuss some extensions to other types of patterns. We then show an empirical evaluation of our interestingness measure, and finally give our conclusions.
2 Framework
2.1 Preliminaries
We discuss the interestingness of patterns based on the framework of standard frequent itemset mining. A database consists of a set of transactions, and each transaction is a set of items. We assume each item appears at most once in a transaction. A pattern s is a set of items, and its frequency f(s) is the number of transactions which include all the elements of s. A pattern s is frequent if its frequency is equal to or higher than a given positive number, the minimum support. If a pattern s is a proper subset of another pattern t, then s is a subpattern of t and t is a superpattern of s. Table 1 is an example of a database of transactions. This database consists of six transactions, and each transaction consists of one to five items from {A, B, C, D, E}. Table 2 shows the list of frequent patterns with the minimum support three.
2.2 The Measure of Interestingness
Let s be a pattern to estimate, s− a non-empty subpattern of s, s \ s− the pattern of items that are included in s but not included in s− . Assuming the
independence between s− and s \ s−, the estimated frequency of s from s− can be calculated as follows:

f̂(s|s−) = f(s−) · f(s \ s−) / N,

where N is the number of transactions in the database. If the actual frequency f(s) of s is similar to these estimated values, s can be regarded as a trivial conclusion from these subpatterns. On the other hand, if f(s) is much higher than these estimated values, then s is unexpected from these subpatterns. The pattern is worth knowing, that is, interesting in comparison with its subpatterns. In Table 1, for example, the pattern {A, B} has two subpatterns {A} and {B}, and all three patterns appear three times in the database. In this case, the estimated frequencies of the pattern {A, B} are calculated as f̂({A, B}|{A}) = f̂({A, B}|{B}) = 3 · 3/6 = 1.5, and the actual frequency f({A, B}) = 3 is twice as large as these estimated values. Therefore, {A, B} is seemingly interesting. However, since the items A, B and C always appear simultaneously in the database, the pattern {A, B, C} should be regarded as more important. The pattern {A, B} only partially represents the relation among A, B and C, and is less interesting than the pattern {A, B, C}. Evaluation using only subpatterns tends to rate as interesting not only a pattern consisting of a set of strongly correlated items but also its subpatterns, and thus yields too many interesting patterns to understand. To solve this problem, we then consider the estimation of the frequency of a pattern s from a superpattern s+, which is defined as follows:

f̂(s|s+) = f(s+) · N / f(s+ \ s),

and regard a pattern as interesting if f(s) is much higher than the estimated values from both its subpatterns and its superpatterns. In the above example, the estimation of the frequency of the pattern {A, B} from its superpattern {A, B, C} is f̂({A, B}|{A, B, C}) = 3 · 6/3 = 6, which means the actual frequency of {A, B} is smaller than the estimated value, and then the pattern {A, B} can be rejected as less interesting. It is promising as a measure of interestingness to combine the evaluations from subpatterns and superpatterns. So, we define our interestingness measure of a pattern s as follows:

Isub+super(s) = (1/π) [ min_{s− ∈ S−} arctan( f(s) / f̂(s|s−) ) + min_{s+ ∈ S+} arctan( f(s) / f̂(s|s+) ) ],

where S− and S+ are certain sets of subpatterns and superpatterns of s, respectively. We choose the minimum value among the set of subpatterns, which is based on the idea that, if there is at least one subpattern from which the estimated frequency is similar to the actual frequency, the pattern can be induced
after all using that subpattern among the set of subpatterns. The same goes for the set of superpatterns. We use the arctangent of the ratio of the actual frequency to the estimated value, instead of the ratio itself, to limit the range of interestingness values. The factor 1/π is for normalisation. In fact, there are various ways to combine the evaluations from subpatterns and superpatterns other than taking the sum of the minimum values as in our definition. We will evaluate the performance of some variations, such as taking the product or a weighted average of these values, as future work.
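To make the definition concrete, here is a small sketch of how Isub+super could be computed for itemset patterns over the database of Table 1; the identifiers are ours, and, following the experimental setup of Section 3, S− and S+ are simply passed in by the caller (the text only requires them to be "certain sets" of sub- and superpatterns).

```python
from math import atan, pi

DB = [{'A', 'B', 'C', 'D', 'E'}, {'D', 'E'}, {'E'},
      {'A', 'B', 'C', 'D'}, {'D', 'E'}, {'A', 'B', 'C'}]   # Table 1
N = len(DB)

def freq(pattern):
    """f(s): number of transactions that include all items of the pattern."""
    return sum(1 for t in DB if pattern <= t)

def est_from_sub(s, s_minus):
    """f^(s | s-) = f(s-) * f(s \\ s-) / N, under the independence assumption."""
    return freq(s_minus) * freq(s - s_minus) / N

def est_from_super(s, s_plus):
    """f^(s | s+) = f(s+) * N / f(s+ \\ s)."""
    return freq(s_plus) * N / freq(s_plus - s)

def i_sub_super(s, sub_list, super_list):
    """Isub+super(s) = (1/pi) [ min over S- of arctan(f(s)/f^(s|s-))
                              + min over S+ of arctan(f(s)/f^(s|s+)) ]."""
    fs = freq(s)
    term_sub = min(atan(fs / est_from_sub(s, sm)) for sm in sub_list)
    term_super = min(atan(fs / est_from_super(s, sp)) for sp in super_list)
    return (term_sub + term_super) / pi

# {A, B} is estimated at 1.5 from either subpattern but at 6 from {A, B, C},
# so the superpattern term keeps its score low, as argued in the text.
score_ab = i_sub_super({'A', 'B'}, [{'A'}, {'B'}], [{'A', 'B', 'C'}])
```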
2.3 Extensions to Other Types of Patterns
We have discussed the interestingness of patterns based on frequent itemset mining. However, our approach can be easily extended to other types of patterns by modifying the estimation of frequencies of patterns. In sequential pattern mining [11,13], for example, each transaction in a database is an ordered set of items, a pattern is also an ordered set of items, and its frequency is the number of transactions in which all the items of the pattern appear in the exact order. By considering the number of possible positions of items, we can estimate the frequency of a pattern s with the independence assumption as follows:

f̂(s|s−) = ( n(s−)! · n(s \ s−)! / n(s)! ) · f(s−) · f(s \ s−) / N,

f̂(s|s+) = ( n(s+)! / (n(s)! · n(s+ \ s)!) ) · f(s+) · N / f(s+ \ s),

where s− and s+ are a subpattern and a superpattern, respectively, and n(s) is the number of items in s.
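Under these assumptions, the sequential-pattern estimates differ from the itemset ones only by the factorial correction for the number of interleavings; the following small sketch, with our own naming, spells out the arithmetic.

```python
from math import factorial

def est_seq_from_sub(f_sub, f_rest, n_sub, n_rest, N):
    """f^(s | s-) = (n(s-)! * n(s \\ s-)! / n(s)!) * f(s-) * f(s \\ s-) / N,
    with n(s) = n(s-) + n(s \\ s-)."""
    return (factorial(n_sub) * factorial(n_rest) / factorial(n_sub + n_rest)
            * f_sub * f_rest / N)

def est_seq_from_super(f_super, f_rest, n_s, n_rest, N):
    """f^(s | s+) = (n(s+)! / (n(s)! * n(s+ \\ s)!)) * f(s+) * N / f(s+ \\ s),
    with n(s+) = n(s) + n(s+ \\ s)."""
    return (factorial(n_s + n_rest) / (factorial(n_s) * factorial(n_rest))
            * f_super * N / f_rest)
```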
3 Experimental Evaluation
In this section we present an experimental evaluation of our interestingness measure. In this evaluation, the interestingness of a pattern of length n was calculated by the use of its subpatterns of length (n − 1) and superpatterns of length (n + 1). To evaluate the effect of the main characteristic of our interestingness measure, namely the addition of estimation from superpatterns, we compared it against a conventional measure of interestingness, which uses estimation only from subpatterns:

Isub(s) = (2/π) min_{s− ∈ S−} arctan( f(s) / f̂(s|s−) ).

For simplicity, we call our measure the subpattern+superpattern (or sub+super) measure, and the measure for comparison the subpattern (or sub) measure.
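For completeness, this baseline can be written analogously, reusing freq and est_from_sub from the earlier sketch (again the names are ours):

```python
def i_sub(s, sub_list):
    """Isub(s) = (2/pi) * min over S- of arctan(f(s)/f^(s|s-))."""
    return 2 / pi * min(atan(freq(s) / est_from_sub(s, sm)) for sm in sub_list)
```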
Table 3. Databases used for our experiments.
dataset        num of trans   (avr) length of a trans   min support   max length of patterns   num of freq patterns
mushroom       8,124          23                        0.05          8                        1,740,884
soybean        307            36                        0.3           8                        1,691,042
zoo            101            17                        0.05          8                        334,531
internet-ads   3,279          13.76 (avr)               0.025         8                        11,144
entree         4,160          10.91 (avr)               0.005         8                        37,166
dna            2,000          46.61 (avr)               0.05          5                        20,630

Table 4. The top 10 interesting patterns in the dna database: (A) the frequency of the i-th pattern; (B) the cumulative frequency up to the i-th pattern (the number of transactions which include at least one of the 1st through i-th patterns).
sub+super measure:
    pattern                       (A)    (B)
 1. {A088,C3}                     276    276
 2. {A082,A084,A089,A090,C2}      129    405
 3. {A087,C3}                     259    664
 4. {A083,A086}                   101    720
 5. {A073,A076,A084,A089,C2}      107    798
 6. {A013,A073,A084,A089,C2}      107    846
 7. {A037,A073,A084,A089,C2}      103    867
 8. {A071,A092,A099,A104,C1}      105    964
 9. {A057,A066}                   136    1,035
10. {A043,A073,A084,A089,C2}      110    1,045
sub measure:
    pattern                       (A)    (B)
 1. {A104,C1}                     398    398
 2. {A099,A104,C1}                287    398
 3. {A092,A099,A104,C1}           287    398
 4. {A092,C1}                     463    464
 5. {A092,A104,C1}                397    464
 6. {A092,A099,C1}                322    464
 7. {A084,C2}                     484    948
 8. {A099,C1}                     322    948
 9. {A082,A084,C2}                392    948
10. {A082,C2}                     392    948

The steps of filtering interesting patterns using one of these measures are: 1) extracting frequent patterns which satisfy a certain minimum support from the test database; 2) calculating the interestingness of each frequent pattern based on the measure; and 3) selecting the top n interesting patterns. We performed our experiments on the databases shown in Table 3. The mushroom, soybean, zoo, and internet-ads databases were obtained from the UCI Machine Learning Repository [2]. The entree database was generated from the Entree Chicago Recommendation Data [6] in the UCI KDD Archive [3]. The dna database was from the StatLog [12] datasets. The original database consists of instances of 180 binary indicator variables, and we converted it to a set of transactions by listing, for each instance, the variables whose indicators were '1'. For each database, the minimum support was chosen so that the number of frequent patterns came within the range between 10⁴ and 10⁷.
Table 4 shows the top 10 interesting patterns and their frequencies in the dna database under the sub+super and sub measures, respectively. Under the sub measure, quite similar patterns were selected, and an inspection of their frequencies showed that these patterns were simply subsets of strongly correlated items. For example, the first pattern {A104, C1} was included in 398 transactions
(out of 2,000 transactions in total) in the database. However, about 70% (= 287/398) of these transactions also included the second pattern {A099, A104, C1}, which was a superpattern of the first pattern. Moreover, all of these 287 transactions included the third pattern {A092, A099, A104, C1}, which was a superpattern of the first and second patterns. That caused the cumulative frequency to stop increasing; to gain as much information as possible from the database using a small number of patterns, the first pattern was less interesting and the second one was completely redundant compared with the third one. We found a similar redundancy for other groups of patterns such as the 7th, 9th and 10th patterns ({A084, C2}, {A082, A084, C2} and {A082, C2}). In contrast, under the sub+super measure, a single pattern {A082, A084, A089, A090, C2} was selected instead of {A084, C2}, {A082, A084, C2} and {A082, C2}. Also, a single pattern {A071, A092, A099, A104, C1} was selected instead of {A104, C1}, {A099, A104, C1}, {A092, A099, A104, C1}, and so on. The sub+super measure preferred patterns with completely different items, while the sub measure selected similar subpatterns repeatedly. As a result, the sub+super measure could describe far more information in the database than the sub measure.
To show the above discussion quantitatively, we introduced the cover rate in the database D for a selection R of a given number of interesting patterns under a measure m as follows:

r(m, R, D) = Σ_{t∈D} n( ⋃_{s∈R, s⊆t} s ) / Σ_{t∈D} n(t),

where n(s) is the number of items of a pattern or transaction s. If the cover rate is considerably high for a selection of a small number of patterns, which means that the selection can describe various aspects of the database, then the measure of interestingness can be regarded as very effective.
The graphs in Figure 1 show the cover rates in the databases under the two measures. The bold and dotted lines denote the results under the sub+super and sub measures, respectively. The graphs show that selections of interesting patterns under the sub+super measure achieved higher cover rates than those under the sub measure. In the mushroom database, for example, the top 200 interesting patterns under the sub+super measure covered 65 percent of (the items of) the transactions in the database, while those under the sub measure covered only 8.5 percent.
As shown in Table 5, there was a tendency for comparatively long patterns to be selected under the sub+super measure, and comparatively short patterns under the sub measure. In general, shorter patterns can be matched with more items in the transactions than longer patterns can. Nevertheless, the sub+super measure achieved higher cover rates than the sub measure. This implies that the sub+super measure could reject redundant subpatterns, and that this outweighed the difficulty of matching long patterns with the transactions in the database, while the sub measure selected a lot of redundant patterns. As a result, the sub+super measure could reduce the number of patterns which were required to cover a certain portion of the transactions in the database.
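The cover rate itself is straightforward to compute; the following sketch, with patterns and transactions represented as Python sets and our own naming, mirrors the definition of r(m, R, D) above.

```python
def cover_rate(selected_patterns, database):
    """Fraction of the items of the transactions in D that are covered by at
    least one selected pattern occurring in the respective transaction."""
    covered = 0
    for t in database:
        hit = set()
        for s in selected_patterns:
            if s <= t:           # the pattern occurs in this transaction
                hit |= s
        covered += len(hit)
    total = sum(len(t) for t in database)
    return covered / total if total else 0.0
```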
Fig. 1. The cover rates in the databases: The bold lines denote the sub+super measure and the dotted lines denote the sub measure.
Table 5. The average lengths of the top 200 interesting patterns.
name           sub+super   sub
mushroom       7.49        3.63
soybean        6.98        4.53
zoo            6.53        3.43
internet-ads   4.63        3.31
entree         4.97        3.01
dna            3.68        3.52
4 Conclusions
We proposed a new approach to evaluating the interestingness of patterns in order to filter out uninteresting ones. Our approach is to estimate the frequency of a pattern from both its subpatterns and its superpatterns, and it evaluates the interestingness by how much the actual frequency exceeds the estimated values. We discussed our approach mainly for frequent itemset mining, but it can easily be extended to other types of databases and patterns, such as sequential pattern mining, by modifying only the frequency estimation from subpatterns and superpatterns. An empirical evaluation showed the ability of our approach to reject redundant subpatterns of other interesting patterns and to provide a small selection of interesting patterns which can cover a large portion of the transactions. Our approach is, therefore, effective for describing various aspects of the database.
References
1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases (VLDB), 1994.
2. C. Blake and C. Merz. UCI repository of machine learning databases, 1998. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.
3. R. Burke. Entree Chicago recommendation data, 2000. University of California, Irvine, Department of Information and Computer Science, Irvine, CA 92697.
4. T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proc. of the 13th European Conference on Machine Learning / the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2002.
5. G. Dong and J. Li. Interestingness of discovered association rules in terms of neighborhood-based unexpectedness. In Proc. of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 1998.
6. S. Hettich and S. Bay. The UCI KDD archive, 1999. University of California, Irvine, Dept. of Information and Computer Sciences. [http://kdd.ics.uci.edu]
7. R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic Publishers, 2001.
8. F. Hussain, H. Liu, and H. Lu. Relative measure for mining interesting rules. In Proc. of PKDD-2000 Workshop on Knowledge Management Theory and Applications, 2000.
9. S. Jaroszewicz and D. A. Simovici. A general measure of rule interestingness. In Proc. of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2001.
10. S. Jaroszewicz and D. A. Simovici. Pruning redundant association rules using maximum entropy principle. In Proc. of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.
11. M. Joshi, G. Karypis, and V. Kumar. A universal formulation of sequential patterns. In Proc. of the KDD-2001 Workshop on Temporal Data Mining, 2001.
12. D. Michie, D. Spiegelhalter, and C. Taylor. The StatLog datasets, 1994. Esprit Project 5170 StatLog (1991-94). [http://www.ncc.up.pt/liacc/ML/statlog/]
13. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), 1996.
14. P.-N. Tan and V. Kumar. Interestingness measures for association patterns: A perspective. Technical Report TR00-036, Department of Computer Science, University of Minnesota, 2000.
Author Index
Hui, Siu Cheung 76, 452
Aizawa, Akiko 402 Akutsu, Tatsuya 114 Araki, Kenji 460 Arikawa, Setsuo 486 Arimura, Hiroki 47 Arrigo, Kevin R. 141 Asai, Tatsuya 47
Inenaga, Shunsuke 486 Inui, Kentaro 180 Inui, Takashi 180 Ishino, Akira 478 Jung, Jason J. 320 Jung, Kyung-Yong 320
Bannai, Hideo 486 Bay, Stephen 141, 468 Beuster, Gerd 283 Bothorel, Cécile 62
Kamishima, Toshihiro 194 Kang, Seong-Jun 426 Katoh, Naoki 208 Kawasaki, Saori 410 Kitagawa, Genshiro 21 Kobal, Ivan 87 Kobayashi, Ken’ichi 494 Kudo, Mineichi 393 Kudo, Yasuo 377 Kumazawa, Itsuo 306 Kunifuji, Susumu 385
Changshui, Zhang 328 Chevalier, Karine 62 Cho, Wan-Hyun 426 Corruble, Vincent 62 Custers, Bart 291 Dai, Honghua 153 Davis, John 100 Do, Tien Dung 76 Džeroski, Sašo 87, 297 Eiter, Thomas 1
Fanany, Mohamad Ivan 306 Fletcher, Alistair 100 Fong, Alvis 76, 452 Fujiki, Jun 194 Fukagawa, Daiji 114 Furbach, Ulrich 283 Furukawa, Koichi 269
Geamsakul, Warodom 128 George, Dileep 141, 468 Grigoriev, Peter A. 311 Gross-Hardt, Margret 283 Hamuro, Yukinobu 208 Han, Kyungsook 336, 352 Hang, Xiaoshu 153 Haraguchi, Makoto 418 Harao, Masateru 166 Hirata, Kouichi 166 Ho, Tu Bao 410
Langley, Pat 141 Lee, Dongkyu 336 Lee, Jung-Hyun 320 Lee, Sanghoon 344 Lim, Daeho 352 Ljubič, Peter 297 Maeda, Ken-ichi 360 Makino, Kazuhisa 1 Maloberti, Jérôme 220 Matsuda, Takashi 128 Matsumoto, Yuji 180 Matsuo, Fumihiro 478 Mielikäinen, Taneli 233 Momma, Atsuhito 269 Monakhov, Oleg 245 Monakhova, Emilia 245 Motoda, Hiroshi 128 Murai, Tetsuya 377 Murata, Tsuyoshi 369 Nagazumi, Ryosuke 166 Nakada, Toyohisa 385 Nakamura, Atsuyoshi 393
Nakano, Shin-ichi 47 Nakata, Tadashi 470 Nakawatase, Hidekazu 402 Nguyen, Trong Dung 410 Oh, Kyung-whan 344 Ohta, Yuiko 494 Okubo, Yoshiaki 418 Park, Jong-Hyun 426 Park, Soon-Young 426 Pericliev, Vladimir 434 Phillips, Joseph 442 Quan, Thanh Tho 452 Rzepka, Rafal 460
Takano, Akihiko 33 Takeda, Masayuki 478, 486 Tanabe, Kazuhiko 393 Tanaka, Akira 393 Tao, Ban 328 Thomas, Bernd 283 Tishby, Naftali 45 Tochinai, Koji 460 Todorovski, Ljupčo 87, 297 Uno, Takeaki 47, 256
Vaupotič, Janja 87
Washio, Takashi 128
Saito, Kazumi 141, 468 Sakama, Chiaki 360 Sato, Yoshiharu 377 Satoh, Hiroko 470 Satoh, Ken 256 Shimazu, Keiko 269 Shinohara, Ayumi 486 Shrager, Jeff 468 Sugimachi, Tomohiko 478 Suzuki, Einoshin 220
Yada, Katsutoshi 208 Yang, Jihoon 344 Yevtushenko, Serhiy A. 311 Yoshida, Tetsuya 128 Yoshida, Yukiko 494 Yugami, Nobuhiro 494 Zeugmann, Thomas 46 Zhongbao, Kou 328 Zmazek, Boris 87