Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5946
James F. Peters Andrzej Skowron (Eds.)
Transactions on Rough Sets XI
Editors-in-Chief James F. Peters University of Manitoba Department of Electrical and Computer Engineering Winnipeg, Manitoba, R3T 5V6, Canada E-mail:
[email protected] Andrzej Skowron Warsaw University Institute of Mathematics Banacha 2, 02-097, Warsaw, Poland E-mail:
[email protected]
Library of Congress Control Number: 2009943065
CR Subject Classification (1998): F.4.1, F.1.1, H.2.8, I.5, I.4, I.2
ISSN 0302-9743 (Lecture Notes in Computer Science)
ISSN 1861-2059 (Transactions on Rough Sets)
ISBN-10 3-642-11478-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-11478-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12837151 06/3180 543210
Preface
Volume XI of the Transactions on Rough Sets (TRS) provides evidence of further growth in the rough set landscape, both in terms of its foundations and its applications. This volume documents a number of research streams that were either directly or indirectly initiated by the seminal work on rough sets by Zdzislaw Pawlak (1926-2006)¹. Evidence of the growth of various rough set-based research streams can be found in the rough set database². This volume contains articles introducing advances in the foundations and applications of rough sets. These advances include: a calculus of attribute-value pairs useful in mining numerical data, definability and coalescence of approximations, a variable consistency generalization approach to bagging controlled by measures of consistency, classical and dominance-based rough sets in the search for genes, judgement about satisfiability under incomplete information, irreducible descriptive sets of attributes for information systems useful in the design of concurrent data models, the computational theory of perceptions (CTP), its characteristics and its relation to fuzzy granulation, methods and algorithms of the NetTRS system, a recursive version of the Apriori algorithm designed for parallel processing, and a decision table reduction method based on fuzzy rough sets. The editors and authors of this volume extend their gratitude to the reviewers of the articles in this volume, to Alfred Hofmann, Ursula Barth, Christine Reiss and the LNCS staff at Springer for their support in making this volume of the TRS possible. The editors of this volume were supported by the Ministry of Science and Higher Education of the Republic of Poland, research grants N N516 368334 and N N516 077837, the Natural Sciences and Engineering Research Council of Canada (NSERC) research grant 185986, the Canadian Network of Excellence (NCE), and a Canadian Arthritis Network (CAN) grant SRI-BIO-05.

October 2009
James F. Peters
Andrzej Skowron

¹ See, e.g., Peters, J.F., Skowron, A.: Zdzislaw Pawlak: Life and Work. Transactions on Rough Sets V (2006) 1-24; Pawlak, Z.: A Treatise on Rough Sets. Transactions on Rough Sets IV (2006) 1-17. See also Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177 (2007) 3-27; Pawlak, Z., Skowron, A.: Rough sets: Some extensions. Information Sciences 177 (2007) 28-40; Pawlak, Z., Skowron, A.: Rough sets and Boolean reasoning. Information Sciences 177 (2007) 41-73.
² http://rsds.wsiz.rzeszow.pl/rsds.php
LNCS Transactions on Rough Sets
The Transactions on Rough Sets has as its principal aim the fostering of professional exchanges between scientists and practitioners who are interested in the foundations and applications of rough sets. Topics include foundations and applications of rough sets as well as foundations and applications of hybrid methods combining rough sets with other approaches important for the development of intelligent systems. The journal includes high-quality research articles accepted for publication on the basis of thorough peer reviews. Dissertations and monographs up to 250 pages that include new research results can also be considered as regular papers. Extended and revised versions of selected papers from conferences can also be included in regular or special issues of the journal.

Editors-in-Chief: James F. Peters, Andrzej Skowron
Managing Editor: Sheela Ramanna
Technical Editor: Marcin Szczuka
Editorial Board
Mohua Banerjee, Jan Bazan, Gianpiero Cattaneo, Mihir K. Chakraborty, Davide Ciucci, Chris Cornelis, Ivo Düntsch, Anna Gomolińska, Salvatore Greco, Jerzy W. Grzymala-Busse, Masahiro Inuiguchi, Jouni Järvinen, Richard Jensen, Bożena Kostek, Churn-Jung Liau, Pawan Lingras, Victor Marek, Mikhail Moshkov, Hung Son Nguyen, Ewa Orlowska, Sankar K. Pal, Lech Polkowski, Henri Prade, Sheela Ramanna, Roman Słowiński, Jerzy Stefanowski, Jaroslaw Stepaniuk, Zbigniew Suraj, Marcin Szczuka, Dominik Ślęzak, Roman Świniarski, Shusaku Tsumoto, Guoyin Wang, Marcin Wolski, Wei-Zhi Wu, Yiyu Yao, Ning Zhong, Wojciech Ziarko
Table of Contents
Mining Numerical Data – A Rough Set Approach
   Jerzy W. Grzymala-Busse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Definability and Other Properties of Approximations for Generalized Indiscernibility Relations
   Jerzy W. Grzymala-Busse and Wojciech Rząsa . . . . . . . . . . . . . . . . . 14

Variable Consistency Bagging Ensembles
   Jerzy Błaszczyński, Roman Słowiński, and Jerzy Stefanowski . . . . . . . . 40

Classical and Dominance-Based Rough Sets in the Search for Genes under Balancing Selection
   Krzysztof A. Cyran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Satisfiability Judgement under Incomplete Information
   Anna Gomolińska . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

Irreducible Descriptive Sets of Attributes for Information Systems
   Mikhail Moshkov, Andrzej Skowron, and Zbigniew Suraj . . . . . . . . . . . 92

Computational Theory Perception (CTP), Rough-Fuzzy Uncertainty Analysis and Mining in Bioinformatics and Web Intelligence: A Unified Framework
   Sankar K. Pal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Decision Rule-Based Data Models Using TRS and NetTRS – Methods and Algorithms
   Marek Sikora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

A Distributed Decision Rules Calculation Using Apriori Algorithm
   Tomasz Strąkowski and Henryk Rybiński . . . . . . . . . . . . . . . . . . . . 161

Decision Table Reduction in KDD: Fuzzy Rough Based Approach
   Eric Tsang and Suyun Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Mining Numerical Data – A Rough Set Approach Jerzy W. Grzymala-Busse Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA and Institute of Computer Science Polish Academy of Sciences, 01-237 Warsaw, Poland
[email protected], http://lightning.eecs.ku.edu/index.html Abstract. We present an approach to mining numerical data based on rough set theory using calculus of attribute-value blocks. An algorithm implementing these ideas, called MLEM2, induces high quality rules in terms of both simplicity (number of rules and total number of conditions) and accuracy. MLEM2 induces rules not only from complete data sets but also from data with missing attribute values, with or without numerical attributes. Additionally, we present experimental results on a comparison of three commonly used discretization techniques: equal interval width, equal interval frequency and minimal class entropy (all three methods were combined with the LEM2 rule induction algorithm) with MLEM2. Our conclusion is that even though MLEM2 was most frequently a winner, the differences between all four data mining methods are statistically insignificant.
1 Introduction
For knowledge acquisition (or data mining) from data with numerical attributes special techniques are applied [13]. Most frequently, an additional step, taken before the main step of rule induction or decision tree generation and called discretization, is used. In this preliminary step numerical data are converted into symbolic data or, more precisely, the domain of a numerical attribute is partitioned into intervals. Many discretization techniques, using principles such as equal interval width, equal interval frequency, minimal class entropy, minimum description length, clustering, etc., were explored, e.g., in [1,2,3,5,6,8,9,10,20,23,24,25,26] and [29]. Discretization algorithms which operate on the set of all attributes and which do not use information about the decision (concept membership) are called unsupervised, as opposed to supervised, where the decision is taken into account [9]. Methods processing the entire attribute set are called global, while methods working on one attribute at a time are called local [8]. In all of these methods discretization is a preprocessing step and is undertaken before the main process of knowledge acquisition. Another possibility is to discretize numerical attributes during the process of knowledge acquisition. Examples of such methods are MLEM2 [14] and MODLEM [21,31,32] for rule induction and C4.5 [30] and CART [4] for decision tree
generation. These algorithms deal with original, numerical data and the process of knowledge acquisition and discretization are conducted at the same time. The MLEM2 algorithm produces better rule sets, in terms of both simplicity and accuracy, than clustering methods [15]. However, discretization is an art rather than a science, and for a specific data set it is advantageous to use as many discretization algorithms as possible and then select the best approach. In this paper we present the MLEM2 algorithm, one of the most successful approaches to mining numerical data. This algorithm uses rough set theory and calculus of attribute-value pair blocks. A similar approach is represented by MODLEM. Both MLEM2 and MODLEM algorithms are outgrowths of the LEM2 algorithm. However, in MODLEM the most essential part of selecting the best attribute-value pair is conducted using entropy or Laplacian conditions, while in MLEM2 this selection uses the most relevance condition, just like in the original LEM2. Additionally, we present experimental results on a comparison of three commonly used discretization techniques: equal interval width, equal interval frequency and minimal class entropy (all three methods combined with the LEM2 rule induction algorithm) with MLEM2. Our conclusion is that even though MLEM2 was most frequently a winner, the differences between all four data mining methods are statistically insignificant. A preliminary version of this paper was presented at the International Conference of Rough Sets and Emerging Intelligent Systems Paradigms, Warsaw, Poland, June 28–30, 2007 [19].
2 Discretization Methods
For a numerical attribute a with an interval [a, b] as a range, a partition of the range into n intervals {[a0 , a1 ), [a1 , a2 ), ..., [an−2 , an−1 ), [an−1 , an ]}, where a0 = a, an = b, and ai < ai+1 for i = 0, 1, ..., n − 1, defines discretization of a. The numbers a1 , a2 ,..., an−1 are called cut-points. The simplest and commonly used discretization methods are local methods called Equal Interval Width and Equal Frequency per Interval [8,13]. Another local discretization method [10] is called a Minimal Class Entropy. The conditional entropy, defined by a cut-point q that splits the set U of all cases into two sets, S1 and S2 is defined as follows E(q, U ) =
(|S1| / |U|) E(S1) + (|S2| / |U|) E(S2),
where E(S) is the entropy of S and |X| denotes the cardinality of the set X. The cut-point q for which the conditional entropy E(q, U ) has the smallest value is selected as the best cut-point. If k intervals are required, the procedure is applied recursively k − 1 times. Let q1 and q2 be the best cut-points for sets S1 and S2 ,
respectively. If E(q1, S1) > E(q2, S2) we select q1 as the next cut-point; if not, we select q2. In our experiments three local methods of discretization, Equal Interval Width, Equal Interval Frequency and Minimal Class Entropy, were converted to global methods using the approach to globalization presented in [8]. First, we discretize all attributes, one at a time, selecting the best cut-point for every attribute. If the level of consistency is sufficient, the process is completed. If not, we further discretize, selecting an attribute a for which the following expression has the largest value

Ma = ( Σ_{B ∈ {a}*} (|B| / |U|) E(B) ) / |{a}*|.

In all six discretization methods discussed in this paper, the stopping condition was the level of consistency [8], based on rough set theory introduced by Z. Pawlak in [27]. Let U denote the set of all cases of the data set. Let P denote a nonempty subset of the set of all variables, i.e., attributes and a decision. Obviously, set P defines an equivalence relation ℘ on U, where two cases x and y from U belong to the same equivalence class of ℘ if and only if both x and y are characterized by the same values of each variable from P. The set of all equivalence classes of ℘, i.e., a partition on U, will be denoted by P*. Equivalence classes of ℘ are called elementary sets of P. Any finite union of elementary sets of P is called a definable set in P. Let X be any subset of U. In general, X is not a definable set in P. However, the set X may be approximated by two definable sets in P; the first one is called a lower approximation of X in P, denoted by PX and defined as follows

∪{Y ∈ P* | Y ⊆ X}.

The second set is called an upper approximation of X in P, denoted by PX and defined as follows

∪{Y ∈ P* | Y ∩ X ≠ ∅}.

The lower approximation of X in P is the greatest definable set in P contained in X. The upper approximation of X in P is the least definable set in P containing X. A rough set of X is the family of all subsets of U having the same lower and the same upper approximations of X. A level of consistency [8], denoted Lc, is defined as follows

Lc = ( Σ_{X ∈ {d}*} |AX| ) / |U|,

where A denotes the set of all attributes and AX is the lower approximation of X with respect to A. Practically, the requested level of consistency for discretization is 100%, i.e., we want the discretized data set to be consistent.
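The cut-point selection just described is easy to prototype. The following Python sketch is our illustration (not code from the paper): it computes the conditional entropy E(q, U) for the candidate cut-points of a single numerical attribute and returns the best one; the sample data are the Cholesterol values and Stroke decisions of the decision table used later in Section 3.

# A minimal sketch of Minimal Class Entropy cut-point selection (illustrative only)
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy E(S) of a list of decision values
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(q, values, labels):
    # E(q, U) = |S1|/|U| * E(S1) + |S2|/|U| * E(S2) for a cut-point q
    s1 = [d for v, d in zip(values, labels) if v < q]
    s2 = [d for v, d in zip(values, labels) if v >= q]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def best_cut_point(values, labels):
    # candidate cut-points are midpoints of consecutive distinct sorted values
    vals = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
    return min(cuts, key=lambda q: conditional_entropy(q, values, labels))

print(best_cut_point([180, 240, 280, 240, 280, 320],
                     ['no', 'yes', 'yes', 'no', 'no', 'yes']))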
3 MLEM2
The MLEM2 algorithm is a part of the LERS (Learning from Examples based on Rough Sets) data mining system. Rough set theory was initiated by Z. Pawlak
[27,28]. LERS uses two different approaches to rule induction: one is used in machine learning, the other in knowledge acquisition. In machine learning, or more specifically, in learning from examples (cases), the usual task is to learn the smallest set of minimal rules describing the concept. To accomplish this goal, LERS uses two algorithms: LEM1 and LEM2 (LEM1 and LEM2 stand for Learning from Examples Module, version 1 and 2, respectively) [7,11,12]. The LEM2 algorithm is based on the idea of an attribute-value pair block. For an attribute-value pair (a, v) = t, a block of t, denoted by [t], is the set of all cases from U that have value v for attribute a. For a set T of attribute-value pairs, the intersection of blocks for all t from T will be denoted by [T]. Let B be a nonempty lower or upper approximation of a concept represented by a decision-value pair (d, w). Set B depends on a set T of attribute-value pairs t = (a, v) if and only if

∅ ≠ [T] = ∩{[t] | t ∈ T} ⊆ B.
Set T is a minimal complex of B if and only if B depends on T and no proper subset T′ of T exists such that B depends on T′. Let 𝒯 be a nonempty collection of nonempty sets of attribute-value pairs. Then 𝒯 is a local covering of B if and only if the following conditions are satisfied:
– each member T of 𝒯 is a minimal complex of B,
– ∪{[T] | T ∈ 𝒯} = B, and
– 𝒯 is minimal, i.e., 𝒯 has the smallest possible number of members.
The user may select an option of LEM2 with or without taking into account attribute priorities. The procedure LEM2 with attribute priorities is presented below. The option without taking priorities into account differs from the one presented below only in the selection of a pair t ∈ T(G) in the inner WHILE loop: when LEM2 does not take attribute priorities into account, the first criterion is ignored. In our experiments all attribute priorities were equal to each other.

Procedure LEM2
(input: a set B; output: a single local covering 𝒯 of set B);
begin
  G := B;
  𝒯 := ∅;
  while G ≠ ∅
  begin
    T := ∅;
    T(G) := {t | [t] ∩ G ≠ ∅};
    while T = ∅ or [T] ⊈ B
    begin
      select a pair t ∈ T(G) with the highest attribute priority;
      if a tie occurs, select a pair t ∈ T(G) such that |[t] ∩ G| is maximum;
      if another tie occurs, select a pair t ∈ T(G) with the smallest cardinality of [t];
      if a further tie occurs, select the first pair;
      T := T ∪ {t};
      G := [t] ∩ G;
      T(G) := {t | [t] ∩ G ≠ ∅};
      T(G) := T(G) − T;
    end {while};
    for each t ∈ T do
      if [T − {t}] ⊆ B then T := T − {t};
    𝒯 := 𝒯 ∪ {T};
    G := B − ∪{[T] | T ∈ 𝒯};
  end {while};
  for each T ∈ 𝒯 do
    if ∪{[S] | S ∈ 𝒯 − {T}} = B then 𝒯 := 𝒯 − {T};
end {procedure}.
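The procedure above translates almost directly into executable form. The following Python sketch is an illustration only (it is not the LERS implementation and it omits attribute priorities, which were equal in our experiments); blocks are assumed to be precomputed as a dictionary from attribute-value pairs to sets of cases, and B must be a definable set, e.g., a lower or upper approximation of a concept.

def block_intersection(T, blocks):
    # [T]: intersection of the blocks [t] of all attribute-value pairs t in T
    sets = [blocks[t] for t in T]
    return set.intersection(*sets) if sets else set()

def lem2(blocks, B):
    # LEM2 sketch; returns a list of minimal complexes (the local covering 𝒯)
    B = set(B)
    covering = []
    G = set(B)
    while G:
        T = []                                    # candidate minimal complex
        while not T or not block_intersection(T, blocks) <= B:
            candidates = [t for t, blk in blocks.items() if (blk & G) and t not in T]
            # criteria: maximal |[t] ∩ G|, then smallest |[t]|, then the first pair
            t = max(candidates, key=lambda p: (len(blocks[p] & G), -len(blocks[p])))
            T.append(t)
            G = blocks[t] & G
        for t in list(T):                         # drop redundant pairs
            if len(T) > 1 and block_intersection([u for u in T if u != t], blocks) <= B:
                T.remove(t)
        covering.append(T)
        G = B - set().union(*(block_intersection(S, blocks) for S in covering))
    for T in list(covering):                      # drop redundant minimal complexes
        others = [S for S in covering if S is not T]
        if others and set().union(*(block_intersection(S, blocks) for S in others)) == B:
            covering.remove(T)
    return covering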
For a set X, |X| denotes the cardinality of X. Rules induced from raw, training data are used for classification of unseen, testing data. The classification system of LERS is a modification of the bucket brigade algorithm. The decision to which concept a case belongs is made on the basis of three factors: strength, specificity, and support. They are defined as follows: strength is the total number of cases correctly classified by the rule during training; specificity is the total number of attribute-value pairs on the left-hand side of the rule, so matching rules with a larger number of attribute-value pairs are considered more specific; the third factor, support, is defined as the sum of scores of all matching rules from the concept. The concept C for which the support (i.e., the sum of all products of strength and specificity, for all rules matching the case) is the largest is a winner, and the case is classified as being a member of C. MLEM2, a modified version of LEM2, categorizes all attributes into two categories: numerical attributes and symbolic attributes. For numerical attributes MLEM2 computes blocks in a different way than for symbolic attributes. First, it sorts all values of a numerical attribute. Then it computes cutpoints as averages of any two consecutive values of the sorted list. For each cutpoint x MLEM2 creates two blocks: the first block contains all cases for which values of the numerical attribute are smaller than x, the second block contains the remaining cases, i.e., all cases for which values of the numerical attribute are larger than x. The search space of MLEM2 is the set of all blocks computed this way, together with blocks defined by symbolic attributes. Starting from that point, rule induction in MLEM2 is conducted the same way as in LEM2. Let us illustrate the MLEM2 algorithm using the following example from Table 1. Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U. In Table 1, U = {1, 2, ..., 6}. Independent variables are called attributes and a dependent variable is called a
Table 1. An example of the decision table

          Attributes               Decision
Case      Gender     Cholesterol   Stroke
1         man        180           no
2         man        240           yes
3         man        280           yes
4         woman      240           no
5         woman      280           no
6         woman      320           yes
decision and is denoted by d. The set of all attributes will be denoted by A. In Table 1, A = {Gender, Cholesterol}. Any decision table defines a function ρ that maps the direct product of U and A into the set of all values. For example, in Table 1, ρ(1, Gender) = man. The decision table from Table 1 is consistent, i.e., there are no conflicting cases in which all attribute values are identical yet the decision values are different. Subsets of U with the same decision value are called concepts. In Table 1 there are two concepts: {1, 4, 5} and {2, 3, 6}. Table 1 contains one numerical attribute (Cholesterol). The sorted list of values of Cholesterol is 180, 240, 280, 320. The corresponding cutpoints are: 210, 260, 300. Since our decision table is consistent, the input sets to be applied to MLEM2 are concepts. The search space for MLEM2 is the set of all blocks for all possible attribute-value pairs (a, v) = t. For Table 1, the attribute-value pair blocks are
[(Gender, man)] = {1, 2, 3},
[(Gender, woman)] = {4, 5, 6},
[(Cholesterol, 180..210)] = {1},
[(Cholesterol, 210..320)] = {2, 3, 4, 5, 6},
[(Cholesterol, 180..260)] = {1, 2, 4},
[(Cholesterol, 260..320)] = {3, 5, 6},
[(Cholesterol, 180..300)] = {1, 2, 3, 4, 5},
[(Cholesterol, 300..320)] = {6}.
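The construction of these numerical blocks can be sketched as follows (illustrative code, not the authors' implementation); running it on the Cholesterol column of Table 1 reproduces the six Cholesterol blocks listed above.

# Cutpoints are midpoints of consecutive sorted values; each cutpoint yields two blocks.
def numerical_blocks(name, values):
    # values: dict case -> numerical value; returns {(name, "lo..hi"): set of cases}
    sorted_vals = sorted(set(values.values()))
    lo, hi = sorted_vals[0], sorted_vals[-1]
    cuts = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
    blocks = {}
    for c in cuts:
        blocks[(name, f"{lo:g}..{c:g}")] = {x for x, v in values.items() if v < c}
        blocks[(name, f"{c:g}..{hi:g}")] = {x for x, v in values.items() if v > c}
    return blocks

cholesterol = {1: 180, 2: 240, 3: 280, 4: 240, 5: 280, 6: 320}
for pair, block in numerical_blocks("Cholesterol", cholesterol).items():
    print(pair, sorted(block))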
Let us start running MLEM2 for the concept {1, 4, 5}. Thus, initially this concept is equal to B (and to G). The set T(G) is equal to {(Gender, man), (Gender, woman), (Cholesterol, 180..210), (Cholesterol, 210..320), (Cholesterol, 180..260), (Cholesterol, 260..320), (Cholesterol, 180..300)}. For the attribute-value pair (Cholesterol, 180..300) from T(G) the value |[(attribute, value)] ∩ G| is maximum. Thus we select our first attribute-value pair t = (Cholesterol, 180..300). Since [(Cholesterol, 180..300)] ⊈ B, we have to perform the next iteration of the inner WHILE loop. This time T(G) =
{(Gender, man), (Gender, woman), (Cholesterol, 180..210), (Cholesterol, 210..320), (Cholesterol, 180..260), (Cholesterol, 260..320)}. For three attribute-value pairs from T(G): (Gender, woman), (Cholesterol, 210..320) and (Cholesterol, 180..260), the value of |[(attribute, value)] ∩ G| is maximum (and equal to two). The second criterion, the smallest cardinality of [(attribute, value)], indicates (Gender, woman) and (Cholesterol, 180..260) (in both cases that cardinality is equal to three). The last criterion, "first pair", selects (Gender, woman). Moreover, the new T = {(Cholesterol, 180..300), (Gender, woman)} and the new G is equal to {4, 5}. Since [T] = [(Cholesterol, 180..300)] ∩ [(Gender, woman)] = {4, 5} ⊆ B, the first minimal complex is computed. Furthermore, we cannot drop either of these two attribute-value pairs, so 𝒯 = {T}, and the new G is equal to B − {4, 5} = {1}. During the second iteration of the outer WHILE loop, the next minimal complex T is identified as {(Cholesterol, 180..210)}, so 𝒯 = {{(Cholesterol, 180..300), (Gender, woman)}, {(Cholesterol, 180..210)}} and G = ∅. The remaining rule set, for the concept {2, 3, 6}, is induced in a similar manner. Eventually, the rules in the LERS format (every rule is equipped with three numbers: the total number of attribute-value pairs on the left-hand side of the rule, the total number of examples correctly classified by the rule during training, and the total number of training cases matching the left-hand side of the rule) are:
2, 2, 2
(Gender, woman) & (Cholesterol, 180..300) -> (Stroke, no)
1, 1, 1
(Cholesterol, 180..210) -> (Stroke, no)
2, 2, 2
(Gender, man) & (Cholesterol, 210..320) -> (Stroke, yes)
1, 1, 1
(Cholesterol, 300..320) -> (Stroke, yes)
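The classification scheme described earlier (strength, specificity, and support) can be illustrated on these four rules. The sketch below is our own illustration: the rules are encoded by hand with their specificity and strength taken from the LERS triples above, and the support of a concept is the sum of strength times specificity over the rules matching a case.

# Illustrative LERS-style classification (assumed encoding, not the LERS system itself)
rules = [  # (conditions, decision, specificity, strength)
    ({"Gender": "woman", "Cholesterol": (180, 300)}, "no",  2, 2),
    ({"Cholesterol": (180, 210)},                    "no",  1, 1),
    ({"Gender": "man", "Cholesterol": (210, 320)},   "yes", 2, 2),
    ({"Cholesterol": (300, 320)},                    "yes", 1, 1),
]

def matches(conditions, case):
    for attr, cond in conditions.items():
        v = case[attr]
        if isinstance(cond, tuple):
            if not (cond[0] <= v <= cond[1]):
                return False
        elif v != cond:
            return False
    return True

def classify(case):
    support = {}
    for conditions, decision, spec, strength in rules:
        if matches(conditions, case):
            support[decision] = support.get(decision, 0) + strength * spec
    return max(support, key=support.get) if support else None

print(classify({"Gender": "woman", "Cholesterol": 250}))   # -> 'no'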
4 Numerical and Incomplete Data
Input data for data mining are frequently affected by missing attribute values. In other words, the corresponding function ρ is incompletely specified (partial). A decision table with an incompletely specified function ρ will be called incompletely specified, or incomplete. Though four different interpretations of missing attribute values were studied [18], in this paper, for simplicity, we will consider only two: lost values (the values that were recorded but currently are unavailable) and "do not care" conditions (the original values were irrelevant). For the rest of the paper we will assume that all decision values are specified, i.e., they are not missing. Also, we will assume that all missing attribute values are denoted either by "?" or by "*": lost values will be denoted by "?" and "do not care" conditions by "*". Additionally, we will assume that for each case at least one attribute value is specified.
Incomplete decision tables are described by characteristic relations instead of indiscernibility relations. Also, elementary blocks are replaced by characteristic sets, see, e.g., [16,17,18]. An example of an incomplete table is presented in Table 2.

Table 2. An example of the incomplete decision table

          Attributes               Decision
Case      Gender     Cholesterol   Stroke
1         ?          180           no
2         man        *             yes
3         man        280           yes
4         woman      240           no
5         woman      ?             no
6         woman      320           yes
For incomplete decision tables the definition of a block of an attribute-value pair must be modified. If for an attribute a there exists a case x such that ρ(x, a) =?, i.e., the corresponding value is lost, then the case x is not included in the block [(a, v)] for any value v of attribute a. If for an attribute a there exists a case x such that the corresponding value is a ”do not care” condition, i.e., ρ(x, a) = ∗, then the corresponding case x should be included in blocks [(a, v)] for all values v of attribute a. This modification of the definition of the block of attribute-value pair is consistent with the interpretation of missing attribute values, lost and ”do not care” condition. Numerical attributes should be treated in a little bit different way as symbolic attributes. First, for computing characteristic sets, numerical attributes should be considered as symbolic. For example, for Table 2 the blocks of attribute-value pairs are: [(Gender, man)] = {2, 3}, [(Gender, woman)] = {4, 5, 6}, [(Cholesterol, 180)] = {1, 2}, [(Cholesterol, 240)] = {2, 4}, [(Cholesterol, 280)] = {2, 3}, [(Cholesterol, 320)] = {2, 6}. The characteristic set KB (x) is the intersection of blocks of attribute-value pairs (a, v) for all attributes a from B for which ρ(x, a) is specified and ρ(x, a) = v. The characteristic sets KB (x) for Table 2 and B = A are: KA (1) = U ∩ {1, 2} = {1, 2}, KA (2) = {2, 3} ∩ U = {2, 3}, KA (3) = {2, 3} ∩ {2, 3} = {2, 3}, KA (4) = {4, 5, 6} ∩ {2, 4} = {4},
KA(5) = {4, 5, 6} ∩ U = {4, 5, 6},
KA(6) = {4, 5, 6} ∩ {2, 6} = {6}.
For incompletely specified decision tables lower and upper approximations may be defined in a few different ways [16,17,18]. We will quote only one type of approximations for incomplete decision tables, called concept approximations. A concept B-lower approximation of the concept X is defined as follows:
BX = ∪{KB(x) | x ∈ X, KB(x) ⊆ X}.
A concept B-upper approximation of the concept X is defined as follows:
BX = ∪{KB(x) | x ∈ X, KB(x) ∩ X ≠ ∅} = ∪{KB(x) | x ∈ X}.
For Table 2, the concept lower approximations are
A{1, 4, 5} = {4}, A{2, 3, 6} = {2, 3, 6},
and the concept upper approximations are
A{1, 4, 5} = {1, 2, 4, 5, 6}, A{2, 3, 6} = {2, 3, 6}.
For inducing rules from data with numerical attributes, blocks of attribute-value pairs are defined differently than in computing characteristic sets. Blocks of attribute-value pairs for numerical attributes are computed in a similar way as for complete data, but for every cutpoint the corresponding blocks are computed taking into account the interpretation of missing attribute values. Thus,
[(Gender, man)] = {2, 3},
[(Gender, woman)] = {4, 5, 6},
[(Cholesterol, 180..210)] = {1, 2},
[(Cholesterol, 210..320)] = {2, 3, 4, 6},
[(Cholesterol, 180..260)] = {1, 2, 4},
[(Cholesterol, 260..320)] = {2, 3, 6},
[(Cholesterol, 180..300)] = {1, 2, 3, 4},
[(Cholesterol, 300..320)] = {2, 6}.
Using the MLEM2 algorithm, the following rules are induced:
certain rule set (induced from the concept lower approximations):
2, 1, 1
(Gender, woman) & (Cholesterol, 180..260) -> (Stroke, no)
1, 3, 3
(Cholesterol, 260..320) -> (Stroke, yes)
Table 3. Data sets

Data set     Number of cases     Number of attributes     Number of concepts
Bank         66                  5                        2
Bupa         345                 6                        2
Glass        214                 9                        6
Globe        150                 4                        3
Image        210                 19                       7
Iris         150                 4                        3
Wine         178                 13                       3
possible rule set (induced from the concept upper approximations):
1, 2, 3
(Gender, woman) -> (Stroke, no)
1, 1, 3
(Cholesterol, 180..260) -> (Stroke, no)
1, 3, 3
(Cholesterol, 260..320) -> (Stroke, yes)
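For completeness, the block and characteristic-set computations used above for Table 2 can be sketched as follows (illustrative code only, with "?" treated as a lost value and "*" as a "do not care" condition); it reproduces the characteristic sets KA(1), ..., KA(6) listed in this section.

LOST, DONT_CARE = "?", "*"

table = {  # case -> {attribute: value}; Table 2 of this section
    1: {"Gender": LOST,    "Cholesterol": 180},
    2: {"Gender": "man",   "Cholesterol": DONT_CARE},
    3: {"Gender": "man",   "Cholesterol": 280},
    4: {"Gender": "woman", "Cholesterol": 240},
    5: {"Gender": "woman", "Cholesterol": LOST},
    6: {"Gender": "woman", "Cholesterol": 320},
}

def blocks_of(table, attribute):
    # [(a, v)]: "?" cases are excluded from every block, "*" cases are in every block
    values = {row[attribute] for row in table.values()} - {LOST, DONT_CARE}
    return {v: {x for x, row in table.items()
                if row[attribute] == v or row[attribute] == DONT_CARE}
            for v in values}

def characteristic_set(table, x, blocks):
    # K_A(x): intersection of blocks [(a, rho(x, a))] over the specified attributes
    K = set(table)
    for a, v in table[x].items():
        if v not in (LOST, DONT_CARE):
            K &= blocks[a][v]
    return K

blocks = {a: blocks_of(table, a) for a in ("Gender", "Cholesterol")}
print({x: sorted(characteristic_set(table, x, blocks)) for x in table})
# as in the text: K_A(4) = {4}, K_A(5) = {4, 5, 6}, K_A(6) = {6}, ...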
5 Experiments
Our experiments, aimed at a comparison of three commonly used discretization techniques with MLEM2, were conducted on seven data sets, summarized in Table 3. All of these data sets, with the exception of bank and globe, are available at the University of California at Irvine Machine Learning Repository. The bank data set is a well-known data set used by E. Altman to predict bankruptcy of companies. The globe data set describes global warming and was presented in [22]. The following three discretization methods were used in our experiments:
– Equal Interval Width method, combined with the LEM2 rule induction algorithm, coded as 11,
– Equal Frequency per Interval method, combined with the LEM2 rule induction algorithm, coded as 12,
– Minimal Class Entropy method, combined with the LEM2 rule induction algorithm, coded as 13.
All discretization methods were applied with the level of consistency equal to 100%. For every discretized data set, except bank and globe, ten-fold cross-validation was used to determine the error rate, where rule sets were induced using the LEM2 algorithm [7,12]. The remaining two data sets, bank and globe, were subjected to the leave-one-out validation method because of their small size. Results of the experiments are presented in Table 4.
Table 4. Results of validation

                         Error rate
Data set     11         12         13         MLEM2
Bank         9.09%      3.03%      4.55%      4.55%
Bupa         33.33%     39.71%     44.06%     34.49%
Glass        32.71%     35.05%     41.59%     29.44%
Globe        69.70%     54.05%     72.73%     63.64%
Image        20.48%     20.48%     52.86%     17.14%
Iris         5.33%      10.67%     9.33%      4.67%
Wine         11.24%     6.18%      2.81%      11.24%

6 Conclusions
We demonstrated that both rough set theory and calculus of attribute-value pair blocks are useful tools for data mining from numerical data. The same idea of an attribute-value pair block may be used in the process of data mining not only for computing elementary sets (for complete data sets) but also for rule induction. The MLEM2 algorithm induces rules from raw data with numerical attributes, without any prior discretization, and MLEM2 provides the same results as LEM2 for data with all symbolic attributes. As follows from Table 4, even though MLEM2 was most frequently a winner, using the Wilcoxon matched-pairs signed rank test we may conclude that the differences between all four data mining methods are statistically insignificant. Thus, for a specific data set with numerical attributes the best approach to discretization should be selected on a case by case basis.
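The statistical claim above can be checked directly from Table 4. The snippet below is our illustration (the paper does not provide code) and assumes SciPy is available; it runs the Wilcoxon matched-pairs signed rank test comparing the MLEM2 error rates against each of the three discretization methods.

from scipy.stats import wilcoxon   # assumed dependency, not used in the paper

error = {                                  # columns of Table 4: 11, 12, 13, MLEM2
    "Bank":  (9.09,  3.03,  4.55,  4.55),
    "Bupa":  (33.33, 39.71, 44.06, 34.49),
    "Glass": (32.71, 35.05, 41.59, 29.44),
    "Globe": (69.70, 54.05, 72.73, 63.64),
    "Image": (20.48, 20.48, 52.86, 17.14),
    "Iris":  (5.33,  10.67, 9.33,  4.67),
    "Wine":  (11.24, 6.18,  2.81,  11.24),
}
mlem2 = [row[3] for row in error.values()]
for i, method in enumerate(("11", "12", "13")):
    other = [row[i] for row in error.values()]
    stat, p = wilcoxon(mlem2, other)
    print(f"MLEM2 vs {method}: p = {p:.3f}")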
References
1. Bajcar, S., Grzymala-Busse, J.W., Hippe, Z.S.: A comparison of six discretization algorithms used for prediction of melanoma. In: Proc. of the Eleventh International Symposium on Intelligent Information Systems, IIS 2002, Sopot, Poland, pp. 3–12. Physica-Verlag (2002)
2. Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, pp. 315–319 (2000)
3. Biba, M., Esposito, F., Ferilli, S., Mauro, N.D., Basile, T.M.A.: Unsupervised discretization using kernel density estimation. In: Proc. of the 20th Int. Conf. on AI, Hyderabad, India, pp. 696–701 (2007)
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth & Brooks, Monterey (1984)
5. Catlett, J.: On changing continuous attributes into ordered discrete attributes. In: Kodratoff, Y. (ed.) EWSL 1991. LNCS (LNAI), vol. 482, pp. 164–178. Springer, Heidelberg (1991)
6. Chan, C.C., Batur, C., Srinivasan, A.: Determination of quantization intervals in rule based model for dynamic systems. In: Proc. of the IEEE Conference on Systems, Man, and Cybernetics, Charlottesville, VA, pp. 1719–1723 (1991)
7. Chan, C.C., Grzymala-Busse, J.W.: On the attribute redundancy and the learning programs ID3, PRISM, and LEM2. Department of Computer Science, University of Kansas, TR-91-14, December 1991, 20 p. (1991)
8. Chmielewski, M.R., Grzymala-Busse, J.W.: Global discretization of continuous attributes as preprocessing for machine learning. Int. Journal of Approximate Reasoning 15, 319–331 (1996)
9. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proc. of the 12th Int. Conf. on Machine Learning, Tahoe City, CA, July 9–12, pp. 194–202 (1995)
10. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. of the 13th Int. Joint Conference on AI, Chambery, France, pp. 1022–1027 (1993)
11. Grzymala-Busse, J.W.: LERS—A system for learning from examples based on rough sets. In: Slowinski, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, pp. 3–18. Kluwer Academic Publishers, Dordrecht (1992)
12. Grzymala-Busse, J.W.: A new version of the rule induction system LERS. Fundamenta Informaticae 31, 27–39 (1997)
13. Grzymala-Busse, J.W.: Discretization of numerical attributes. In: Klösgen, W., Zytkow, J. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 218–225. Oxford University Press, New York (2002)
14. Grzymala-Busse, J.W.: MLEM2: A new algorithm for rule induction from imperfect data. In: Proc. of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2002, Annecy, France, pp. 243–250 (2002)
15. Grzymala-Busse, J.W.: A comparison of three strategies to rule induction from data with numerical attributes. In: Proc. of the Int. Workshop on Rough Sets in Knowledge Discovery (RSKD 2003), in conjunction with the European Joint Conferences on Theory and Practice of Software, Warsaw, pp. 132–140 (2003)
16. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Workshop Notes, Foundations and New Directions of Data Mining, in conjunction with the 3rd International Conference on Data Mining, Melbourne, FL, pp. 56–63 (2003)
17. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004)
18. Grzymala-Busse, J.W.: Incomplete data and generalization of indiscernibility relation, definability, and approximations. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005)
19. Grzymala-Busse, J.W.: Mining numerical data—A rough set approach. In: Kryszkiewicz, M., Peters, J.F., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 12–21. Springer, Heidelberg (2007)
20. Grzymala-Busse, J.W., Stefanowski, J.: Discretization of numerical attributes by direct use of the rule induction algorithm LEM2 with interval extension. In: Proc. of the Sixth Symposium on Intelligent Information Systems (IIS 1997), Zakopane, Poland, pp. 149–158 (1997)
21. Grzymala-Busse, J.W., Stefanowski, J.: Three discretization methods for rule induction. Int. Journal of Intelligent Systems 16, 29–38 (2001)
22. Gunn, J.D., Grzymala-Busse, J.W.: Global temperature stability by rule induction: An interdisciplinary bridge. Human Ecology 22, 59–81 (1994)
23. Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. of the 10th National Conf. on AI, San Jose, CA, pp. 123–128 (1992)
24. Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous features. In: Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, pp. 114–119 (1996)
25. Liu, H., Hussain, F., Tan, C.L., Dash, M.: Discretization: An enabling technique. Data Mining and Knowledge Discovery 6, 393–423 (2002)
26. Nguyen, H.S., Nguyen, S.H.: Discretization methods for data mining. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery, pp. 451–482. Physica, Heidelberg (1998)
27. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
28. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
29. Pensa, R.G., Leschi, C., Besson, J., Boulicaut, J.F.: Assessment of discretization techniques for relevant pattern discovery from gene expression data. In: Proc. of the 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics, pp. 24–30 (2004)
30. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
31. Stefanowski, J.: Handling continuous attributes in discovery of strong decision rules. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 394–401. Springer, Heidelberg (1998)
32. Stefanowski, J.: Algorithms of Decision Rule Induction in Data Mining. Poznan University of Technology Press, Poznan (2001)
Definability and Other Properties of Approximations for Generalized Indiscernibility Relations Jerzy W. Grzymala-Busse1,2 and Wojciech Rząsa3 1
Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA 2 Institute of Computer Science, Polish Academy of Sciences, 01–237 Warsaw, Poland
[email protected] 3 Department of Computer Science, University of Rzeszow, 35–310 Rzeszow, Poland
[email protected]
Abstract. In this paper we consider a generalization of the indiscernibility relation, i.e., a relation R that is not necessarily reflexive, symmetric, or transitive. On the basis of granules, defined by R, we introduce the idea of definability. We study 28 basic definitions of approximations; two of these approximations are introduced here for the first time. Furthermore, we introduce 8 additional new approximations. Our main objective is to study definability and coalescence of approximations. We study definability of all 28 basic approximations for reflexive, symmetric, and transitive relations. In particular, for reflexive relations, the set of 28 approximations is reduced, in general, to a set of 16 approximations.
1 Introduction
One of the basic ideas of rough set theory, introduced by Z. Pawlak in 1982, is the indiscernibility relation [17, 18] defined on the finite and nonempty set U called the universe. An ordered pair (U, R), where R denotes a binary relation, is called an approximation space. This idea is essential for rough sets. The set U represents cases that are characterized by the relation R. If R is an equivalence relation then it is called an indiscernibility relation. Two cases being in the relation R are indiscernible or indistinguishable. The original idea of the approximation space was soon generalized. In 1983 W. Zakowski redefined the approximation space as a pair (U, Π), where Π was a covering of the universe U [33]. Another example of the generalization of the original idea of approximation space was a tolerance approximation space, presented by A. Skowron and J. Stepaniuk in 1996 [21]. The tolerance approximation space was introduced as a four-tuple (U, I, ν, P), where U denotes the universe, I : U → 2U represents uncertainty, ν : 2U × 2U → [0, 1] is a vague inclusion, and P : I(U) → {0, 1} is yet another function. In 1989 T.Y. Lin introduced a
neighborhood system, yet another generalization of the approximation space [13]. A few other authors [19, 22, 27–29] considered some extensions of the approximation space. Some extensions of the approximation space were considered in papers dealing with data sets with missing attribute values [1–5, 10, 11, 24, 25]. In this paper we will discuss a generalization of the indiscernibility relation, an arbitrary binary relation R. Such relation R does not need to be reflexive, symmetric, or transitive. Our main objective was to study the definability and coalescence of approximations of any subset X of the universe U . A definable set is a union of granules, defined by R, that are also known as R-successor or R-predecessor sets or as neighborhoods. In this paper 28 definitions of approximations are discussed. These definitions were motivated by a variety of reasons. Some of the original approximations do not satisfy, in general, the inclusion property (Section 3, Properties 1a, 1b), hence modified approximations were introduced. In [5] two most accurate approximations were defined. In this paper, using duality (Section 3, Properties 8a, 8b), we define two extra approximations for the first time. Additionally, we show that it is possible to define additional 8 approximations, thus reaching 36 approximations altogether. Such generalizations of the indiscernibility relation have immediate application to data mining (machine learning) from incomplete data sets. In these applications the binary relation R, called the characteristic relation and describing such data, is reflexive. For reflexive relations the system of 28 approximations is reduced to 16 approximations. Note that some of these 16 approximations are not useful for data mining from incomplete data [2–5, 8, 9]). A preliminary version of this paper was presented at the IEEE Symposium on Foundations of Computational Intelligence (FOCI’2007), Honolulu, Hawaii, April 1–5, 2007, [6].
2 Basic Definitions
First we will introduce the basic granules (or neighborhoods), defined by a relation R. Such granules are called here R-successor and R-predecessor sets. In this paper R is a generalization of the indiscernibility relation. The relation R, in general, does not need to be reflexive, symmetric, or transitive, while the indiscernibility relation is an equivalence relation. Let U be a finite nonempty set, called the universe, let R be a binary relation on U , and let x be a member of U . The R-successor set of x, denoted by Rs (x), is defined as follows Rs (x) = {y | xRy}. The R-predecessor set of x, denoted by Rp (x), is defined as follows Rp (x) = {y | yRx}. R-successor and R-predecessor sets are used to form larger sets that are called R-successor and R-predecessor definable.
Let X be a subset of U. A set X is R-successor definable if and only if X = ∅ or there exists a subset Y of U such that X = ∪{Rs(y) | y ∈ Y}. A set X is R-predecessor definable if and only if X = ∅ or there exists a subset Y of U such that X = ∪{Rp(y) | y ∈ Y}. Note that definability is described differently in [19, 22]. It is not difficult to recognize that a set X is R-successor definable if and only if X = ∪{Rs(x) | Rs(x) ⊆ X}, while a set X is R-predecessor definable if and only if X = ∪{Rp(x) | Rp(x) ⊆ X}. It will be convenient to define a few useful maps with some interesting properties. Let U be a finite nonempty set and let f : 2U → 2U be a map. A map f is called increasing if and only if for any subsets X and Y of U
X ⊆ Y ⇒ f(X) ⊆ f(Y).
Theorem 1. Let X and Y be subsets of U and let f : 2U → 2U be increasing. Then f(X ∪ Y) ⊇ f(X) ∪ f(Y) and f(X ∩ Y) ⊆ f(X) ∩ f(Y).
Proof. An easy proof is based on the observation that if both sets f(X) and f(Y) are subsets of f(X ∪ Y) (since the map f is increasing), then the union of f(X) and f(Y) is also a subset of f(X ∪ Y). By analogy, since f(X ∩ Y) is a subset of both f(X) and f(Y), then f(X ∩ Y) is also a subset of f(X) ∩ f(Y).
Again, let U be a finite and nonempty set and let f : 2U → 2U be a map. A map f is non-decreasing if and only if there do not exist subsets X and Y of U such that X ⊂ Y and f(X) ⊃ f(Y). Let U be a finite nonempty set and let f : 2U → 2U and g : 2U → 2U be maps defined on the power set of U. Maps f and g will be called dual if for any subset X of U the sets f(X) and g(¬X) are complementary. The symbol ¬X means the complement of the set X.
Theorem 2. For a finite and nonempty set U and subsets X, Y and Z of U, if sets X and Y are complementary then sets X ∪ Z and Y − Z are complementary.
The proof is based on De Morgan's laws.
Let U be a finite nonempty set and let f : 2U → 2U. The map f is called idempotent if and only if f(f(X)) = f(X) for any subset X of U. A mixed idempotency property is defined in the following way. Let U be a finite nonempty set and let f : 2U → 2U and g : 2U → 2U be maps defined on the power set of U. A pair (f, g) has the Mixed Idempotency Property if and only if f(g(X)) = g(X) for any subset X of U.
Theorem 3. Let U be a finite nonempty set and let f : 2U → 2U and g : 2U → 2U be dual maps. Then:
(a) f(X) ⊆ X for any X ⊆ U if and only if X ⊆ g(X) for any X ⊆ U,
(b) for any X ⊆ U, f(X) ⊆ X if and only if ¬X ⊆ g(¬X),
(c) f(∅) = ∅ if and only if g(U) = U,
(d) the map f is increasing if and only if the map g is increasing,
(e) the map f is non-decreasing if and only if the map g is non-decreasing,
(f) for any X, Y ⊆ U, f(X ∪ Y) = f(X) ∪ f(Y) if and only if g(X ∩ Y) = g(X) ∩ g(Y),
(g) for any X, Y ⊆ U, f(X ∪ Y) ⊇ f(X) ∪ f(Y) if and only if g(X ∩ Y) ⊆ g(X) ∩ g(Y),
(h) f is idempotent if and only if g is idempotent,
(i) the pair (f, g) has the Mixed Idempotency Property if and only if the pair (g, f) has the same property.
Proof. For properties (a)–(i) only sufficient conditions will be proved. Proofs for the necessary conditions may be obtained by replacing all symbols f with g, ∪ with ∩, ⊆ with ⊇ and ⊂ with ⊃, respectively, and by replacing symbols ¬X with X in the proof of property (b) and symbols ∅ with U in the proof of property (c). For (a) let us observe that f(X) ⊆ X for any subset X of U if and only if ¬X ⊆ ¬f(X) = g(¬X), so ¬X ⊆ g(¬X) for any subset ¬X of U. For the proof of (b), if f(X) ⊆ X then ¬X ⊆ g(¬X), see the proof for (a). A brief proof for (c) is based on De Morgan's laws: f(∅) = ¬g(¬∅) = ¬g(U) = ∅. For (d) let us assume that X and Y are subsets of U such that X ⊆ Y; then ¬Y ⊆ ¬X. If the map f is increasing then f(¬Y) ⊆ f(¬X), so ¬f(¬Y) ⊇ ¬f(¬X), or g(Y) ⊇ g(X).
For (e) let us assume that X and Y are subsets of U such that X ⊂ Y; then also ¬Y ⊂ ¬X. If the map f is non-decreasing then ∼(f(X) ⊃ f(Y)) and, applying the same property to ¬Y ⊂ ¬X, also ∼(f(¬X) ⊂ f(¬Y)). Thus ∼(¬f(¬X) ⊃ ¬f(¬Y)), or ∼(g(X) ⊃ g(Y)). For (f) and (g) the proofs are very similar. First, g(X ∩ Y) = ¬f(¬(X ∩ Y)) = ¬f((¬X) ∪ (¬Y)). Then, by the assumed property of f, ¬f((¬X) ∪ (¬Y)) = ¬(f(¬X) ∪ f(¬Y)) = g(X) ∩ g(Y) for (f), or ¬f((¬X) ∪ (¬Y)) ⊆ ¬(f(¬X) ∪ f(¬Y)) = g(X) ∩ g(Y) for (g). For (h) first observe that g(g(X)) = g(¬f(¬X)) = ¬f(f(¬X)). The last set, by idempotency of f, is equal to ¬f(¬X) = g(X). For (i) let us assume that the pair (f, g) has the Mixed Idempotency Property. Then g(f(X)) = ¬f(¬f(X)) = ¬f(g(¬X)) since f and g are dual. The set ¬f(g(¬X)) is equal to ¬g(¬X) since the pair (f, g) has the Mixed Idempotency Property, and ¬g(¬X) = f(X) by duality.
Theorem 4. Let U be a finite nonempty set, let f : 2U → 2U and g : 2U → 2U be maps defined on the power set of U, and let F, G be maps dual to f and g, respectively. If f(X) ⊆ g(X) for any subset X of U then F(X) ⊇ G(X) for any subset X of U.
Proof. Let X be a subset of U. Then G(X) = U − g(¬X) by the definition of dual maps. Let x be an element of G(X). Then x ∉ g(¬X). Additionally, f(¬X) ⊆ g(¬X), hence x ∉ f(¬X), i.e., x ∈ F(X).
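A small computational illustration of the granules introduced at the beginning of this section may be helpful. The sketch below is ours, not part of the paper: it computes R-successor and R-predecessor sets and tests R-successor definability; the relation used is the one that appears in the example of Section 5.

def successor_set(R, x):
    # Rs(x) = {y | x R y}
    return {y for (a, y) in R if a == x}

def predecessor_set(R, x):
    # Rp(x) = {y | y R x}
    return {a for (a, y) in R if y == x}

def r_successor_definable(U, R, X):
    # X is R-successor definable iff X is empty or X = union of all Rs(x) contained in X
    union = set()
    for x in U:
        s = successor_set(R, x)
        if s and s <= X:
            union |= s
    return not X or union == X

U = {1, 2, 3, 4, 5}
R = {(1, 1), (3, 3), (3, 4), (4, 3), (4, 4)}     # relation from the example in Section 5
print(successor_set(R, 3), predecessor_set(R, 4))  # {3, 4} {3, 4}
print(r_successor_definable(U, R, {1}))            # True:  {1} = Rs(1)
print(r_successor_definable(U, R, {1, 2}))         # False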
3 Set Approximations in the Pawlak Space
Let (U, R) be an approximation space, where R is an equivalence relation. Let R be a family of R-definable sets (we may ignore adding successor or predecessor since R is symmetric). A pair (U, R) is a topological space, called the Pawlak space, where R is the family of all open and closed sets [17]. Let us recall that Z. Pawlak defined lower and upper approximations [17, 18], denoted by appr(X) and appr(X), in the following way:
appr(X) = ∪{[x]R | x ∈ U and [x]R ⊆ X} (the lower approximation),
appr(X) = ∪{[x]R | x ∈ U and [x]R ∩ X ≠ ∅} (the upper approximation),
where [x]R denotes the equivalence class containing an element x of U. The maps appr and appr are the operations of interior and closure in the topology defined by R-definable sets. As observed by Z. Pawlak [18], the same maps appr and appr may be defined using different formulas:
appr(X) = {x ∈ U | [x]R ⊆ X} and appr(X) = {x ∈ U | [x]R ∩ X ≠ ∅}.
These approximations have the following properties. For any X, Y ⊆ U:
1. (a) appr(X) ⊆ X (inclusion property for the lower approximation),
   (b) X ⊆ appr(X) (inclusion property for the upper approximation),
2. (a) appr(∅) = ∅,
   (b) appr(∅) = ∅,
3. (a) appr(U) = U,
   (b) appr(U) = U,
4. (a) X ⊆ Y ⇒ appr(X) ⊆ appr(Y) (monotonicity of the lower approximation),
   (b) X ⊆ Y ⇒ appr(X) ⊆ appr(Y) (monotonicity of the upper approximation),
5. (a) appr(X ∪ Y) ⊇ appr(X) ∪ appr(Y),
   (b) appr(X ∪ Y) = appr(X) ∪ appr(Y),
6. (a) appr(X ∩ Y) = appr(X) ∩ appr(Y),
   (b) appr(X ∩ Y) ⊆ appr(X) ∩ appr(Y),
7. (a) appr(appr(X)) = appr(X) = appr(appr(X)),
   (b) appr(appr(X)) = appr(X) = appr(appr(X)),
8. (a) appr(X) = ¬appr(¬X), duality property, (b) appr(X) = ¬appr(¬X), duality property. Remark. Due to Theorem 3 we may observe that all Properties 1a–8b, for dual approximations, can be grouped into the following sets {1(a), 1(b)}, {2(a), 3(b)}, {2(b), 3(a)}, {4(a), 4(b)}, {5(a), 6(b)}, {5(b), 6(a)}, {7(a), 7(b)}, {8(a), 8(b)}. In all of these eight sets the first property holds if and only if the second property holds. A problem is whether conditions 1–8 uniquely define maps appr : 2U → 2U and appr : 2U → 2U in the approximation space (U, R). We assume that definitions of both appr and appr should be constructed only from using information about R, i.e., we can test, for any x ∈ U , whether it belongs to a lower and upper approximation only on the basis of its membership to some equivalence class of R. Obviously, for any relation R different from the set of all ordered pairs from U and from the set of all pairs (x, x), where x ∈ U the answer is negative, since we may define an equivalence relation S such that R is a proper subset of S and for any x ∈ U a membership of x in an equivalence class of S can be decided on the basis of a membership of x in an equivalence relation of R (any equivalence class of R is a subset of some equivalence class of S).
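For a quick illustration of the two equivalent formulations above, the following sketch (ours, with a hypothetical partition) computes Pawlak's lower and upper approximations directly from the equivalence classes of R.

def lower_approximation(partition, X):
    # union of the equivalence classes of R entirely contained in X
    return set().union(*(E for E in partition if E <= X))

def upper_approximation(partition, X):
    # union of the equivalence classes of R that intersect X
    return set().union(*(E for E in partition if E & X))

partition = [{1, 2}, {3}, {4, 5, 6}]      # hypothetical U/R for U = {1, ..., 6}
X = {1, 3, 4}
print(lower_approximation(partition, X))  # {3}
print(upper_approximation(partition, X))  # {1, 2, 3, 4, 5, 6}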
4 Subset, Singleton and Concept Approximations
In this paper we will discuss only nonparametric approximations. For approximations depending on some parameters see [21, 32]. In this and the following sections we will consider a variety of approximations for which we will test which of
Properties 1a–8b from the previous section are satisfied. Proofs will be restricted only for R-successor sets, since the corresponding proofs for R-predecessor sets can be obtained from the former proofs by replacing the relation R by the converse relation R−1 and using the following equality: Rs (x) = Rp−1 (x), for any relation R and element x of U . Unless it is openly stated, we will not assume special properties for the relation R. For an approximation space (U, R) and X ⊆ U , Xscov will be defined as follows ∪{Rs (x) | x ∈ U } ∩ X. By analogy, the set Xpcov will be defined as ∪{Rp (x) | x ∈ U } ∩ X. Let (U, R) be an approximation space. Let X be a subset of U . The R-subset successor lower approximation of X, denoted by apprsubset (X), is defined as s follows ∪ {Rs (x) | x ∈ U and Rs (x) ⊆ X}. The subset successor lower approximations were introduced in [1, 2]. The R-subset predecessor lower approximation of X, denoted by apprsubset (X), p is defined as follows ∪ {Rp (x) | x ∈ U and Rp (x) ⊆ X}. The subset predecessor lower approximations were studied in [22]. The R-subset successor upper approximation of X, denoted by apprsubset (X), s is defined as follows ∪ {Rs (x) | x ∈ U and Rs (x) ∩ X = ∅}. The subset successor upper approximations were introduced in [1, 2]. The R-subset predecessor upper approximation of X, denoted by apprsubset (X), p is defined as follows ∪ {Rp (x) | x ∈ U and Rp (x) ∩ X = ∅}. The subset predecessor upper approximations were studied in [22]. Sets apprsubset (X) and appr subset (X) are R-successor definable, while sets s s subset apprsubset (X) and appr (X) are R-predecessor definable for any approxip p mation space (U, R), see. e.g., [1, 3]. R-subset successor (predecessor) lower approximations of X have the following Properties: 1a, 2a, the generalized Property 3a, i.e., apprsubset (U ) = Uscov s (apprsubset (U ) = Upcov ), 4a, 5a and the first equality of 7a. p Proof. Proofs for Properties 1a, 2a, the generalized Property 3a and 4a will be skipped since they are elementary. Property 5a follows from Property 4a, as it was shown in Theorem 1, Section 2.
For the proof of the first part of Property 7a, i.e., idempotency of the map apprsubset s apprsubset (appr subset (X)) = appr subset (X), s s s let us observe that from Property 1a and 4a we may conclude that apprsubset (appr subset (X)) ⊆ apprsubset (X). For the proof of the reverse inclus s s subset sion let x ∈ apprs (X). Thus, there exists some set Rs (y) ⊆ X such that x ∈ Rs (y). The set Rs (y) is a subset of apprsubset (X), from the definition of s subset subset apprsubset . Hence x ∈ appr (appr (X)). s s s R-subset successor upper approximations of X have the following Properties: a generalized Property 1b, i.e., apprsubset (X) ⊇ Xscov (appr subset (X) ⊇ Xpcov ), 2b, s p a generalized Property 3b, i.e., apprsubset (U ) = Uscov (apprsubset (U ) = Upcov ), s p 4b, 5b and 6b. Proof. Proofs for a generalized Property 1b, 2b, a generalized Property 3b and 4b will be skipped since they are elementary. The inclusion apprsubset (X ∪ Y ) ⊇ s apprsubset (X) ∪ apprsubset (Y ) in Property 5b follows from Property 4b, as it was s s explained in Theorem 1, Section 2. To show the reverse inclusion, i.e., appr(X ∪ Y ) ⊆ appr(X) ∪ appr(Y ) let us consider any element x ∈ appr(X ∪ Y ). There exists Rs (y), such that x ∈ Rs (y) and Rs (y)∩(X ∪Y ) = ∅. Hence x is a member of the set apprsubset (X) s subset ∪ apprs (Y ). To show Property 6b it is enough to apply Theorem 1 and Property 4b. The R-singleton successor lower approximation of X, denoted by appr singleton s (X), is defined as follows {x ∈ U | Rs (x) ⊆ X}. The singleton successor lower approximations were studied in many papers, see, e.g., [1, 2, 10, 11, 13–15, 20, 22–26, 28–31]. The R-singleton predecessor lower approximation of X, denoted by apprsingleton (X), is defined as follows p {x ∈ U | Rp (x) ⊆ X}. The singleton predecessor lower approximations were studied in [22]. The R-singleton successor upper approximation of X, denoted by appr singleton s (X), is defined as follows {x ∈ U | Rs (x) ∩ X = ∅}. The singleton successor upper approximations, like singleton successor lower approximations, were also studied in many papers, e.g., [1, 2, 10, 11, 22– 26, 28–31].
The R-singleton predecessor upper approximation of X, denoted by apprsingleton (X), is defined as follows p {x ∈ U | Rp (x) ∩ X = ∅}. The singleton predecessor upper approximations were introduced in [22]. In general, for any approximation space (U,R), sets apprsingleton (X) and s apprsingleton (X) are neither R-successor definable nor R-predecessor definable, p while set apprsingleton (X) is R-predecessor definable and apprsingleton (X) is s p R-successor definable, see, e.g. [1, 3, 16]. R-singleton successor (predecessor) lower approximations of X have the following Properties: 3a, 4a, 5a, 6a and 8a [29]. R-singleton successor (predecessor) upper approximations of X have the following Properties: 2b, 4b, 5b, 6b and 8b [29]. The R-concept successor lower approximation of X, denoted by apprconcept s (X), is defined as follows ∪ {Rs (x) | x ∈ X and Rs (x) ⊆ X}. The concept successor lower approximations were introduced in [1, 2]. The R-concept predecessor lower approximation of X, denoted by apprconcept p (X), is defined as follows ∪ {Rp (x) | x ∈ X and Rp (x) ⊆ X}. The concept predecessor lower approximations were introduced, for the first time, in [6]. The R-concept successor upper approximation of X, denoted by apprconcept s (X), is defined as follows ∪ {Rs (x) | x ∈ X and Rs (x) ∩ X = ∅} The concept successor upper approximations were studied in [1, 2, 15]. The R-concept predecessor upper approximation of X, denoted by apprconcept p (X), is defined as follows ∪ {Rp (x) | x ∈ X and Rp (x) ∩ X = ∅} The concept predecessor upper approximations were studied in [22]. Sets apprconcept (X) and appr concept (X) are R-successor definable, while sets s s concept apprconcept (X) and appr (X) are R-predecessor definable for any approxip p mation space (U, R), see, e.g., [1, 3]. R-concept successor (predecessor) lower approximations of X have the following Properties: 1a, 2a, generalized 3a, i.e., apprconcept (U ) = Uscov s concept cov (apprp (U ) = Up ), 4a and 5a. Proof. For a concept successor (predecessor) lower approximation proofs of Properties 1a, 2a, generalized Property 3a and 4a are elementary and will be ignored.
Moreover, proofs for Properties 5a and 7a are almost the same as for subset lower approximations. Note that in the proof of 7a, besides $R_s(y) \subseteq X$ and $x \in R_s(y)$, we additionally know that $y \in X$, but it does not affect the proof.

R-concept successor (predecessor) upper approximations of X have the following Properties: 2b, generalized Property 3b, i.e., $\overline{appr}_s^{concept}(U) = U_s^{cov}$ ($\overline{appr}_p^{concept}(U) = U_p^{cov}$), 4b and 6b.

Proof. Proofs of Property 2b, generalized Property 3b, and 4b are elementary and are omitted. The proof of Property 6b is a consequence of Theorem 1.
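The definitions recalled above are directly computable. The following minimal Python sketch (not part of the original paper; all names are illustrative) shows how the subset, singleton and concept successor approximations could be computed for a finite universe and a binary relation given as a set of pairs.

def successor_sets(U, R):
    # R-successor sets R_s(x) = {y : (x, y) in R}, for every x in U
    Rs = {x: set() for x in U}
    for x, y in R:
        Rs[x].add(y)
    return Rs

def subset_lower(U, Rs, X):
    # R-subset successor lower approximation: union of granules R_s(y) contained in X
    return set().union(*(Rs[y] for y in U if Rs[y] <= X))

def subset_upper(U, Rs, X):
    # R-subset successor upper approximation: union of granules R_s(y) intersecting X
    return set().union(*(Rs[y] for y in U if Rs[y] & X))

def singleton_lower(U, Rs, X):
    # R-singleton successor lower approximation: {x : R_s(x) is a subset of X}
    return {x for x in U if Rs[x] <= X}

def singleton_upper(U, Rs, X):
    # R-singleton successor upper approximation: {x : R_s(x) intersects X}
    return {x for x in U if Rs[x] & X}

def concept_lower(U, Rs, X):
    # R-concept successor lower approximation: union of R_s(x), x in X, with R_s(x) contained in X
    return set().union(*(Rs[x] for x in X if Rs[x] <= X))

def concept_upper(U, Rs, X):
    # R-concept successor upper approximation: union of R_s(x), x in X, with R_s(x) intersecting X
    return set().union(*(Rs[x] for x in X if Rs[x] & X))

The predecessor variants can be obtained from the same functions by first inverting the relation, e.g., successor_sets(U, {(y, x) for (x, y) in R}).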
5 Modified Singleton Approximations
Definability and duality of lower and upper approximations of a subset X of the universe U are basic properties of rough approximations defined for the indiscernibility relation originally formulated by Z. Pawlak [17, 18].

Inclusion between the set and its approximations (Properties 1a and 1b) is worth some attention. For a reflexive relation R, all subset, singleton, and concept predecessor (successor) lower and upper approximations satisfy Properties 1a and 1b. However, for a non-reflexive relation R, in general, if the family of sets $\{R_s(x) \mid x \in U\}$ ($\{R_p(x) \mid x \in U\}$) is not a covering of U, we have $X \not\subseteq \overline{appr}_s^{subset}(X)$ (and $X \not\subseteq \overline{appr}_p^{subset}(X)$). For R-subset successor (predecessor) upper approximations we have $X_s^{cov} \subseteq \overline{appr}_s^{subset}(X)$ (and $X_p^{cov} \subseteq \overline{appr}_p^{subset}(X)$). On the other hand, for R-concept successor (predecessor) upper approximations, in general, not only $X \not\subseteq \overline{appr}_s^{concept}(X)$ (and $X \not\subseteq \overline{appr}_p^{concept}(X)$) but also $X_s^{cov} \not\subseteq \overline{appr}_s^{concept}(X)$ (and $X_p^{cov} \not\subseteq \overline{appr}_p^{concept}(X)$), as follows from the following example.

Example. Let U = {1, 2}, X = {1}, R = {(1, 2), (2, 1)}. Then the family of the two sets $R_s(1)$ and $R_s(2)$ is a covering of U, $X_s^{cov} = \{1\}$ and $\overline{appr}_s^{concept}(X) = \emptyset$.

For R-singleton successor (predecessor) approximations the following situations may happen, no matter whether R is symmetric or transitive: $\underline{appr}_s^{singleton}(X) \not\subseteq X$ (and $\underline{appr}_p^{singleton}(X) \not\subseteq X$), $X \not\subseteq \overline{appr}_s^{singleton}(X)$ (and $X \not\subseteq \overline{appr}_p^{singleton}(X)$),
and even $\overline{appr}_s^{singleton}(X) \subset X \subset \underline{appr}_s^{singleton}(X)$. The following example shows all three situations for a symmetric and transitive relation R.

Example. Let U = {1, 2, 3, 4, 5}, X = {1, 2}, R = {(1, 1), (3, 3), (3, 4), (4, 3), (4, 4)}. Then $\underline{appr}_s^{singleton}(X) = \{1, 2, 5\}$ and $\overline{appr}_s^{singleton}(X) = \{1\}$, so that $\underline{appr}_s^{singleton}(X) \not\subseteq X$ and $X \not\subseteq \overline{appr}_s^{singleton}(X)$, and $\overline{appr}_s^{singleton}(X) \subset X \subset \underline{appr}_s^{singleton}(X)$.

To avoid the situation described by the last inclusion for singleton approximations, the following modifications of the corresponding definitions were introduced.

The R-modified singleton successor lower approximation of X, denoted by $\underline{appr}_s^{modsingleton}(X)$, is defined as follows: $\{x \in U \mid R_s(x) \subseteq X \text{ and } R_s(x) \neq \emptyset\}$. The R-modified singleton predecessor lower approximation of X, denoted by $\underline{appr}_p^{modsingleton}(X)$, is defined as follows: $\{x \in U \mid R_p(x) \subseteq X \text{ and } R_p(x) \neq \emptyset\}$. The R-modified singleton successor upper approximation of X, denoted by $\overline{appr}_s^{modsingleton}(X)$, is defined as follows: $\{x \in U \mid R_s(x) \cap X \neq \emptyset \text{ or } R_s(x) = \emptyset\}$. The R-modified singleton predecessor upper approximation of X, denoted by $\overline{appr}_p^{modsingleton}(X)$, is defined as follows: $\{x \in U \mid R_p(x) \cap X \neq \emptyset \text{ or } R_p(x) = \emptyset\}$. These four approximations were introduced, for the first time, in [6].

In an arbitrary approximation space, R-modified singleton successor (predecessor) lower and upper approximations have Properties 8a and 8b.

Proof. Since $\underline{appr}_s^{modsingleton}(X) = \underline{appr}_s^{singleton}(X) - \{x \in U : R_s(x) = \emptyset\}$ and $\overline{appr}_s^{modsingleton}(X) = \overline{appr}_s^{singleton}(X) \cup \{x \in U : R_s(x) = \emptyset\}$, Properties 8a and 8b for R-modified singleton successor (predecessor) lower and upper approximations follow from Theorem 2 and from duality of R-singleton successor (predecessor) lower and upper approximations [29].

R-modified singleton successor (predecessor) lower approximations have the following Properties: 2a, 4a, 5a and 6a.
Proof. Proofs of Properties 2a and 4a are elementary and will be skipped. Property 5a and a part of 6a, namely $\underline{appr}_s^{modsingleton}(X \cap Y) \subseteq \underline{appr}_s^{modsingleton}(X) \cap \underline{appr}_s^{modsingleton}(Y)$, are simple consequences of Theorem 1. The second part of the proof for 6a, i.e., the reverse inclusion, is almost identical with the proof of this property for an R-singleton successor lower approximation (compare with [29]). Here we assume that the set $R_s(y)$ is nonempty, but it does not change the proof.

R-modified singleton successor (predecessor) upper approximations have the following Properties: 3b, 4b, 5b and 6b.

Proof. The proof follows immediately from Theorem 3, since we proved that the maps $\underline{appr}_s^{modsingleton}$ and $\overline{appr}_s^{modsingleton}$ are dual.

In general, for any approximation space (U, R), the sets $\underline{appr}_s^{modsingleton}(X)$, $\underline{appr}_p^{modsingleton}(X)$, $\overline{appr}_s^{modsingleton}(X)$ and $\overline{appr}_p^{modsingleton}(X)$ are neither R-successor definable nor R-predecessor definable.
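A small self-contained sketch (again with illustrative names, not taken from the paper) of the modified singleton successor approximations, checked on the example with the symmetric and transitive relation R used above:

def mod_singleton_lower(U, Rs, X):
    # {x : R_s(x) is a nonempty subset of X}
    return {x for x in U if Rs[x] and Rs[x] <= X}

def mod_singleton_upper(U, Rs, X):
    # {x : R_s(x) intersects X, or R_s(x) is empty}
    return {x for x in U if (Rs[x] & X) or not Rs[x]}

U = {1, 2, 3, 4, 5}
X = {1, 2}
Rs = {1: {1}, 2: set(), 3: {3, 4}, 4: {3, 4}, 5: set()}   # from R = {(1,1),(3,3),(3,4),(4,3),(4,4)}
print(mod_singleton_lower(U, Rs, X))   # {1}: objects 2 and 5 (empty granules) are excluded
print(mod_singleton_upper(U, Rs, X))   # {1, 2, 5}: objects with empty granules are included

In contrast to the plain singleton approximations from the example above, the modified lower approximation is now contained in X and the modified upper approximation contains X.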
6 Largest Lower and Smallest Upper Approximations
Properties 1a, 3a, 6a and the first equality of 7a indicate that a lower approximation, in the Pawlak space, is an interior operation [12]. Properties 1b, 2b, 5b and the first equality of 7b indicate that an upper approximation, in the Pawlak space, is a closure operation [12].

For any relation R, the R-subset successor (predecessor) lower approximation of X is the largest R-successor (R-predecessor) definable set contained in X. This follows directly from the definition. On the other hand, if R is not both reflexive and transitive, then the family of R-successor (R-predecessor) definable sets is not necessarily a topology, and then none of the upper approximations of X defined so far needs to be the smallest R-successor (R-predecessor) definable set containing X. This was observed, for the first time, in [5]. In that paper it was also shown that a smallest R-successor definable set need not be unique.

Any R-smallest successor upper approximation, denoted by $\overline{appr}_s^{smallest}(X)$, is defined as an R-successor definable set with the smallest cardinality containing X. An R-smallest successor upper approximation does not need to be unique. An R-smallest predecessor upper approximation, denoted by $\overline{appr}_p^{smallest}(X)$, is defined as an R-predecessor definable set with the smallest cardinality containing X. Likewise, an R-smallest predecessor upper approximation does not need to be unique.

Let $\overline{appr}_s^{smallest}$ ($\overline{appr}_p^{smallest}$) be a map that for any subset X of U determines one of the possible R-smallest successor (predecessor) upper approximations of X. Additionally, if $X_s^{cov} \neq X$ ($X_p^{cov} \neq X$), then we assume that $X_s^{cov} \subseteq \overline{appr}_s^{smallest}(X)$ ($X_p^{cov} \subseteq \overline{appr}_p^{smallest}(X)$). Such a map $\overline{appr}_s^{smallest}$ ($\overline{appr}_p^{smallest}$) has the following properties [7]:
1. $\overline{appr}_s^{smallest}(\emptyset) = \emptyset$ ($\overline{appr}_p^{smallest}(\emptyset) = \emptyset$),
2. $\overline{appr}_s^{smallest}(U) = U_s^{cov}$ ($\overline{appr}_p^{smallest}(U) = U_p^{cov}$),
3. the map $\overline{appr}_s^{smallest}$ ($\overline{appr}_p^{smallest}$) is non-decreasing,
4. if for any X ⊆ U there exists exactly one $\overline{appr}_s^{smallest}(X)$ ($\overline{appr}_p^{smallest}(X)$), then for any subset Y of U
$\overline{appr}_s^{smallest}(X \cup Y) \subseteq \overline{appr}_s^{smallest}(X) \cup \overline{appr}_s^{smallest}(Y)$
($\overline{appr}_p^{smallest}(X \cup Y) \subseteq \overline{appr}_p^{smallest}(X) \cup \overline{appr}_p^{smallest}(Y)$),
otherwise
$card(\overline{appr}_s^{smallest}(X \cup Y)) \leq card(\overline{appr}_s^{smallest}(X)) + card(\overline{appr}_s^{smallest}(Y - X))$
($card(\overline{appr}_p^{smallest}(X \cup Y)) \leq card(\overline{appr}_p^{smallest}(X)) + card(\overline{appr}_p^{smallest}(Y - X))$),
5. $\overline{appr}_s^{smallest}(\overline{appr}_s^{smallest}(X)) = \overline{appr}_s^{smallest}(X) = \underline{appr}_s^{subset}(\overline{appr}_s^{smallest}(X))$
($\overline{appr}_p^{smallest}(\overline{appr}_p^{smallest}(X)) = \overline{appr}_p^{smallest}(X) = \underline{appr}_p^{subset}(\overline{appr}_p^{smallest}(X))$),
6. X is R-successor (predecessor) definable if and only if $X = \overline{appr}_s^{smallest}(X)$ ($X = \overline{appr}_p^{smallest}(X)$).

Proof. Properties 1 and 2 follow directly from the definition of the smallest upper approximation. For the proof of Property 3, let us suppose that the map $\overline{appr}_s^{smallest}$ is not non-decreasing. Then there exist X, Y ⊆ U such that X ⊂ Y and $\overline{appr}_s^{smallest}(X) \supset \overline{appr}_s^{smallest}(Y)$. For the map defined for any subset Z of U in the following way:
$appr'(Z) = \overline{appr}_s^{smallest}(Z)$ if $Z \neq X$, and $appr'(Z) = \overline{appr}_s^{smallest}(Y)$ if $Z = X$,
we have $\overline{appr}_s^{smallest}(X) \supset appr'(X)$ and $appr'(X)$ is R-successor definable, a contradiction.
For the proof of Property 4, let us assume that $\overline{appr}_s^{smallest}(X)$ is uniquely determined for any X ⊆ U. For all subsets Y and Z of U with X = Y ∪ Z, the set $\overline{appr}_s^{smallest}(Y) \cup \overline{appr}_s^{smallest}(Z)$ is R-successor definable and $X_s^{cov} \subseteq \overline{appr}_s^{smallest}(Y) \cup \overline{appr}_s^{smallest}(Z)$. Let us suppose that for subsets X, Y, and
Z of U, with X = Y ∪ Z, the first inclusion of 4 does not hold. Due to the fact that there exists exactly one $\overline{appr}_s^{smallest}(X)$, we have
$card(\overline{appr}_s^{smallest}(Y \cup Z)) > card(\overline{appr}_s^{smallest}(Y) \cup \overline{appr}_s^{smallest}(Z))$,
a contradiction with the assumption that $\overline{appr}_s^{smallest}(X)$ is the R-smallest successor upper approximation of X.

In particular, the set X = Y ∪ Z may be presented as a union of two disjoint sets, e.g., X = Y ∪ (Z − Y). Then the inequality of 4 for $card(\overline{appr}_s^{smallest}(Y \cup Z))$ has the following form:
$card(\overline{appr}_s^{smallest}(Y \cup Z)) \leq card(\overline{appr}_s^{smallest}(Y) \cup \overline{appr}_s^{smallest}(Z - Y)) \leq card(\overline{appr}_s^{smallest}(Y)) + card(\overline{appr}_s^{smallest}(Z - Y))$.
If the set $\overline{appr}_s^{smallest}(X)$ is not unique for some X ⊆ U, then by a similar argument as in the first part of the proof of 4,
$card(\overline{appr}_s^{smallest}(Y \cup Z)) \leq card(\overline{appr}_s^{smallest}(Y) \cup \overline{appr}_s^{smallest}(Z - Y))$,
hence
$card(\overline{appr}_s^{smallest}(Y \cup Z)) \leq card(\overline{appr}_s^{smallest}(Y)) + card(\overline{appr}_s^{smallest}(Z - Y))$.
Property 5 follows from the following facts: $\overline{appr}_s^{smallest}(X)$ is R-successor definable, $\underline{appr}_s^{subset}(X)$ is the largest R-successor lower approximation, and $\overline{appr}_s^{smallest}(X)$ is the smallest R-successor upper approximation.

For the proof of Property 6, let us suppose that X is R-successor definable, i.e., X is equal to a union of R-successor sets. This union is the R-smallest successor upper approximation of X. The converse is true since any R-smallest upper approximation of X is R-successor definable.

Definitions of the smallest successor and predecessor approximations are not constructive in the sense that (except for brute force) there is no indication of how to determine these sets. Note that the definitions of the other approximations indicate how to compute the corresponding sets. Moreover, the definitions of the smallest successor and predecessor approximations are the only definitions of approximations that refer to cardinalities.
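Since the definition is not constructive, a brute-force search over unions of successor granules is one obvious (exponential) way to determine a smallest upper approximation for small universes. The sketch below is only illustrative (names are not from the paper, and it assumes that X is covered by the R-successor sets); it is consistent with the NP-hardness remark in Section 9.

from itertools import combinations

def smallest_successor_upper(U, Rs, X):
    # Returns one R-successor definable set of smallest cardinality containing X,
    # or None if no union of successor granules contains X (X not covered).
    granules = [Rs[x] for x in U]
    best = None
    for k in range(len(granules) + 1):
        for combo in combinations(granules, k):
            candidate = set().union(*combo)
            if X <= candidate and (best is None or len(candidate) < len(best)):
                best = candidate
    return best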
7 Dual Approximations
As it was shown in [29], singleton approximations are dual for any relation R. In Section 5 we proved that modified singleton approximations are also dual. On the other hand, it was shown in [29] that if R is not an equivalence relation, then subset approximations are not dual. Moreover, concept approximations are not dual either, unless R is reflexive and transitive [6].
All approximations discussed in this section are dual to some approximations from Sections 4 and 6. By Theorem 3, properties of dual approximations follow from the corresponding properties of the original approximations.

Two additional approximations were defined in [29]. The first approximation, denoted by $\underline{appr}_s^{dualsubset}(X)$, was defined by $\neg(\overline{appr}_s^{subset}(\neg X))$, while the second one, denoted by $\overline{appr}_s^{dualsubset}(X)$, was defined by $\neg(\underline{appr}_s^{subset}(\neg X))$. These approximations are called the R-dual subset successor lower and R-dual subset successor upper approximations, respectively. Obviously, we may define as well the R-dual subset predecessor lower approximation $\neg(\overline{appr}_p^{subset}(\neg X))$ and the R-dual subset predecessor upper approximation $\neg(\underline{appr}_p^{subset}(\neg X))$.

R-dual subset successor (predecessor) lower approximations have the following Properties: 3a, 4a, 5a, 6a. R-dual subset successor (predecessor) upper approximations have the following Properties: 1b, 3b, 4b, 6b, 7b.

By analogy we may define dual concept approximations. Namely, the R-dual concept successor lower approximation of X, denoted by $\underline{appr}_s^{dualconcept}(X)$, is defined by $\neg(\overline{appr}_s^{concept}(\neg X))$. The R-dual concept successor upper approximation of X, denoted by $\overline{appr}_s^{dualconcept}(X)$, is defined by $\neg(\underline{appr}_s^{concept}(\neg X))$. The set denoted by $\underline{appr}_p^{dualconcept}(X)$ and defined by the formula $\neg(\overline{appr}_p^{concept}(\neg X))$ will be called an R-dual concept predecessor lower approximation, while the set $\overline{appr}_p^{dualconcept}(X)$ defined by the formula $\neg(\underline{appr}_p^{concept}(\neg X))$ will be called an R-dual concept predecessor upper approximation. These four R-dual concept approximations were introduced in [6].

R-dual concept successor (predecessor) lower approximations have the following Properties: 3a, 4a, 5a.
The R-dual concept successor (predecessor) upper approximations have the following Properties: 1b, 3b, 4b, 6b.

Again, by analogy we may define dual approximations for the smallest upper approximations. The set denoted by $\underline{appr}_s^{dualsmallest}(X)$ and defined by $\neg(\overline{appr}_s^{smallest}(\neg X))$ will be called an R-dual smallest successor lower approximation of X, while the set denoted by $\underline{appr}_p^{dualsmallest}(X)$ and defined by $\neg(\overline{appr}_p^{smallest}(\neg X))$ will be called an R-dual smallest predecessor lower approximation of X. These two approximations are introduced in this work for the first time.

The R-dual smallest successor (predecessor) lower approximations have the following Properties: 3a, the first equality of 7a, and the map $\underline{appr}_s^{dualsmallest}$ is non-decreasing.
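All dual approximations in this section share the same complement-based construction, which can be expressed generically. The sketch below uses illustrative names and assumes U and X are finite sets; the commented examples refer to the approximation functions sketched earlier in these notes, not to any implementation by the authors.

def dual_of(approx, U, Rs, X):
    # Dual approximation: the complement of the given approximation of the complement of X.
    return U - approx(U, Rs, U - X)

# e.g., the R-dual subset successor lower approximation of X:
#   dual_of(subset_upper, U, Rs, X)
# and the R-dual concept successor upper approximation of X:
#   dual_of(concept_lower, U, Rs, X)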
8 Approximations with Mixed Idempotency
Smallest upper approximations, introduced in Section 6, and subset lower approximations are the only approximations discussed so far that satisfy the Mixed Idempotency Property, i.e.,

$\underline{appr}_s(X) = \overline{appr}_s(\underline{appr}_s(X))$  $(\underline{appr}_p(X) = \overline{appr}_p(\underline{appr}_p(X)))$   (1)

and

$\overline{appr}_s(X) = \underline{appr}_s(\overline{appr}_s(X))$  $(\overline{appr}_p(X) = \underline{appr}_p(\overline{appr}_p(X)))$.   (2)

For lower and upper approximations that satisfy conditions (1) and (2), for any subset X of U, both approximations of X are fixed points. For $\underline{appr}^{subset}$ and $\overline{appr}^{smallest}$, definable sets (successor or predecessor) are fixed points. For the following upper approximations definable sets are fixed points (and they are computable in polynomial time).

An upper approximation, denoted by $appr_s^{subset\text{-}concept}(X)$ and defined as follows:
$appr_s^{subset}(X) \cup \bigcup\{R_s(x) \mid x \in X - appr_s^{subset}(X) \text{ and } R_s(x) \cap X \neq \emptyset\}$,
will be called the R-subset-concept successor upper approximation of X. An upper approximation, denoted by $appr_p^{subset\text{-}concept}(X)$ and defined as follows:
$appr_p^{subset}(X) \cup \bigcup\{R_p(x) \mid x \in X - appr_p^{subset}(X) \text{ and } R_p(x) \cap X \neq \emptyset\}$,
will be called the R-subset-concept predecessor upper approximation of X. An upper approximation, denoted by $appr_s^{subset\text{-}subset}(X)$ and defined as follows:
$appr_s^{subset}(X) \cup \bigcup\{R_s(x) \mid x \in U - appr_s^{subset}(X) \text{ and } R_s(x) \cap X \neq \emptyset\}$,
will be called the R-subset-subset successor upper approximation of X.
An upper approximation, denoted by $appr_p^{subset\text{-}subset}(X)$ and defined as follows:
$appr_p^{subset}(X) \cup \bigcup\{R_p(x) \mid x \in U - appr_p^{subset}(X) \text{ and } R_p(x) \cap X \neq \emptyset\}$,
will be called the R-subset-subset predecessor upper approximation of X.

Any of these four upper approximations, together with $\underline{appr}_s^{subset}$ (or $\underline{appr}_p^{subset}$, respectively), satisfies the Mixed Idempotency Property (both second equalities of 7a and 7b). Upper approximations presented in this section do not preserve many of the Properties listed as 1–8 in Section 3.

Approximations $appr_s^{subset\text{-}concept}(X)$ and $appr_p^{subset\text{-}concept}(X)$ have only Properties 2b and 7b. For these approximations Properties 4b, 5b, and 6b do not hold even if the relation R is reflexive. For the approximation $appr_s^{subset\text{-}concept}(X)$ this can be shown by the following example. Let U = {1, 2, 3, 4}, R = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (3, 3), (4, 4)}. Then $R_s(1) = U$, $R_s(2) = \{1, 2\}$, $R_s(3) = \{1, 3\}$, $R_s(4) = \{4\}$. Property 4b is not satisfied for the sets X = {1} and Y = {1, 2}, Property 5b is not satisfied for the sets X = {1} and Y = {2}, and Property 6b is not satisfied for the sets X = {1, 2} and Y = {1, 3}.

For the approximations $appr_s^{subset\text{-}subset}(X)$ and $appr_p^{subset\text{-}subset}(X)$, generalized Properties 1b and 3b, as well as Properties 2b and 7b, hold. By analogy with Section 7, for these four upper approximations we may define corresponding dual lower approximations. Thus in this section we introduced 8 new approximations.
9 Coalescence of Rough Approximations
Finding the $\overline{appr}_s^{smallest}(X)$ and $\overline{appr}_p^{smallest}(X)$ approximations is NP-hard. In real-life applications such approximations should be replaced by other approximations that are easier to compute. Therefore it is important to know under which conditions approximations are equal or comparable.

Figures 1–8 show relations between lower and upper approximations. Sets in the same box are identical. Arrows indicate inclusions between the corresponding sets. Information about successor (predecessor) definability is added as well. For a symmetric relation R, the corresponding sets $R_s$ and $R_p$ are identical, so the corresponding successor (predecessor) approximations are also identical. For the symmetric relation R the corresponding approximations have double subscripts, s and p. Some inclusions between subset, concept and singleton approximations, discussed in this section, were previously presented in [1, 2, 22, 24, 29].

Remark. For any approximation space, relations between successor approximations are the same as for predecessor approximations. Therefore, we will prove only properties for successor approximations. Due to Theorem 4, we restrict our attention to proofs for lower approximations and skip proofs for dual upper approximations.
Fig. 1. Inclusions for lower approximations and an approximation space with an arbitrary relation
Proofs for equalities and inclusions from Figures 1 and 2. The inclusion $\underline{appr}_p^{concept}(X) \subseteq \underline{appr}_p^{subset}(X)$ follows from the definitions of both approximations. Indeed, $\underline{appr}_p^{concept}(X)$ is the union of some subsets of U that are included in $\underline{appr}_p^{subset}(X)$.

The inclusion $\underline{appr}_p^{modsingleton}(X) \subseteq \underline{appr}_p^{singleton}(X)$ follows directly from the definitions of both approximations. For the proof of the inclusion $\underline{appr}_p^{singleton}(X) \subseteq \underline{appr}_s^{dualconcept}(X)$, let x be an element of U. If $x \in \underline{appr}_p^{singleton}(X)$, then $R_p(x) \subseteq X$. Thus for any y ∈ U such that $x \in R_s(y)$ we have $y \notin \neg X$. Hence $x \notin \overline{appr}_s^{concept}(\neg X)$ and $x \in \underline{appr}_s^{dualconcept}(X)$.

The inclusion $\underline{appr}_s^{dualsubset}(X) \subseteq \underline{appr}_s^{dualconcept}(X)$ follows from $\overline{appr}_s^{concept}(X) \subseteq \overline{appr}_s^{subset}(X)$ and from duality of the corresponding lower and upper approximations (Theorem 4). Similarly, $\underline{appr}_s^{dualsubset}(X) \subseteq \underline{appr}_s^{dualsmallest}(X)$ follows from $\overline{appr}_s^{smallest}(X) \subseteq \overline{appr}_s^{subset}(X)$ and from duality of the corresponding lower and upper approximations (Theorem 4).
Fig. 2. Inclusions for upper approximations and an approximation space with an arbitrary relation
Remaining equalities and inclusions from Figures 1 and 2 follow from the Remark of this section and from transitivity of inclusion.

Proofs for equalities and inclusions from Figures 3 and 4. The proof of $\underline{appr}_s^{subset}(X) = \underline{appr}_s^{concept}(X)$ follows from the fact that if R is reflexive, then for any $x \notin X$ we have $R_s(x) \not\subseteq X$. The equality of $\underline{appr}_s^{modsingleton}(X)$, $\underline{appr}_s^{singleton}(X)$ and $\underline{appr}_p^{dualconcept}(X)$ will be shown by showing the equality of the respective upper approximations. The equality of $\overline{appr}_s^{modsingleton}(X)$ and $\overline{appr}_s^{singleton}(X)$ follows from their definitions, since for a reflexive relation R we have $R_s(x) \neq \emptyset$ for any x ∈ U. The proof of the equality of $\overline{appr}_s^{singleton}(X)$ and $\overline{appr}_p^{concept}(X)$ is in [22] (formula 5).
Fig. 3. Inclusions for lower approximations and an approximation space with a reflexive relation

Fig. 4. Inclusions for upper approximations and an approximation space with a reflexive relation
For the proof of $\underline{appr}_s^{singleton}(X) \subseteq \underline{appr}_s^{subset}(X)$ it is enough to observe that, since R is reflexive, any element x is a member of $R_s(x)$. Hence for any element $x \in \underline{appr}_s^{singleton}(X)$ we have $R_s(x) \subseteq X$, so $x \in \underline{appr}_s^{subset}(X)$. The proof of $\underline{appr}_s^{dualsubset}(X) \subseteq \underline{appr}_s^{singleton}(X)$ follows from the fact that the inclusion $\overline{appr}_s^{singleton}(X) \subseteq \overline{appr}_s^{subset}(X)$ is true (for the proof see [22] (formula 15) and [29] (Theorem 7)). Remaining equalities and inclusions from Figures 3 and 4 follow from Figures 1 and 2, the Remark of this section, and transitivity of inclusion.

Proofs for equalities and inclusions from Figures 5 and 6. All equalities and inclusions from Figures 5 and 6 were proved for a reflexive R or follow from the fact that R is symmetric.

Proofs for equalities and inclusions from Figures 7 and 8. Since R is reflexive (Figures 3 and 4), it is enough to show that $\underline{appr}_s^{concept}(X) \subseteq \underline{appr}_s^{singleton}(X)$ and $\underline{appr}_s^{dualsmallest}(X) = \underline{appr}_s^{dualconcept}(X)$.
Fig. 5. Inclusions for lower approximations and an approximation space with a reflexive and symmetric relation

Fig. 6. Inclusions for upper approximations and an approximation space with a reflexive and symmetric relation
Fig. 7. Inclusions for lower approximations and an approximation space with a reflexive and transitive relation

Fig. 8. Inclusions for upper approximations and an approximation space with a reflexive and transitive relation
For the proof of $\underline{appr}_s^{concept}(X) \subseteq \underline{appr}_s^{singleton}(X)$, let $x \in \underline{appr}_s^{concept}(X)$. Hence there exists $y \in X$ such that $R_s(y) \subseteq X$ and $x \in R_s(y)$. Since R is transitive, for any $z \in R_s(x)$ also $z \in R_s(y)$. It is clear then that $R_s(x) \subseteq X$. Hence $x \in \underline{appr}_s^{singleton}(X)$.
Table 1. Conditions for definability

Approximation | R-successor def. | R-predecessor def.
$\underline{appr}_s^{singleton}(X)$ | r∧t | r∧s∧t
$\underline{appr}_p^{singleton}(X)$ | r∧s∧t | r∧t
$\overline{appr}_s^{singleton}(X)$ | s | any
$\overline{appr}_p^{singleton}(X)$ | any | s
$\underline{appr}_s^{modsingleton}(X)$ | r∧t ∨ s∧t | s∧t
$\underline{appr}_p^{modsingleton}(X)$ | s∧t | r∧t ∨ s∧t
$\overline{appr}_s^{modsingleton}(X)$ | r∧s | r
$\overline{appr}_p^{modsingleton}(X)$ | r | r∧s
$\underline{appr}_s^{subset}(X)$ | any | s
$\underline{appr}_p^{subset}(X)$ | s | any
$\overline{appr}_s^{subset}(X)$ | any | s
$\overline{appr}_p^{subset}(X)$ | s | any
$\underline{appr}_s^{dualsubset}(X)$ | r∧s∧t | r∧t
$\underline{appr}_p^{dualsubset}(X)$ | r∧t | r∧s∧t
$\overline{appr}_s^{dualsubset}(X)$ | r∧s∧t | r∧t
$\overline{appr}_p^{dualsubset}(X)$ | r∧t | r∧s∧t
$\underline{appr}_s^{concept}(X)$ | any | s
$\underline{appr}_p^{concept}(X)$ | s | any
$\overline{appr}_s^{concept}(X)$ | any | s
$\overline{appr}_p^{concept}(X)$ | s | any
$\underline{appr}_s^{dualconcept}(X)$ | r∧s∧t | r∧t
$\underline{appr}_p^{dualconcept}(X)$ | r∧t | r∧s∧t
$\overline{appr}_s^{dualconcept}(X)$ | r∧s∧t | r∧t
$\overline{appr}_p^{dualconcept}(X)$ | r∧t | r∧s∧t
$\overline{appr}_s^{smallest}(X)$ | any | s
$\overline{appr}_p^{smallest}(X)$ | s | any
$\underline{appr}_s^{dualsmallest}(X)$ | r∧s∧t | r∧t
$\underline{appr}_p^{dualsmallest}(X)$ | r∧t | r∧s∧t
Instead of showing that $\underline{appr}_s^{dualsmallest}(X) = \underline{appr}_s^{dualconcept}(X)$, we will first show that $\overline{appr}_s^{smallest}(X) = \overline{appr}_s^{concept}(X)$. In our proof we will first show that if $R_s(x) \subseteq \overline{appr}_s^{smallest}(X)$ then $R_s(x) \subseteq \overline{appr}_s^{concept}(X)$, for any x ∈ U. To do so, we will show that if $R_s(x) \subseteq \overline{appr}_s^{smallest}(X)$ then $x \in X$ or $R_s(x) = \bigcup\{R_s(y) \mid y \in R_s(x) \cap X\}$. For all these cases, taking into account that
$R_s(x) \subseteq \overline{appr}_s^{smallest}(X) \Rightarrow R_s(x) \cap X \neq \emptyset$,
our proof is completed. Is it possible that $R_s(x) \subseteq \overline{appr}_s^{smallest}(X)$ and $x \notin X$? Since R is transitive, $R_s(y) \subseteq R_s(x)$ for any $y \in R_s(x)$. Additionally, due to the fact that R is reflexive, $\bigcup\{R_s(y) \mid y \in R_s(x) \cap X\} \cap X = R_s(x) \cap X$. On the other hand, $\bigcup\{R_s(y) \mid y \in R_s(x) \cap X\} \cap \neg X = R_s(x) \cap \neg X$, since $R_s(x) \subseteq \overline{appr}_s^{smallest}(X)$. Therefore no family of R-successor definable sets covering at least the same elements of X as $R_s(x)$ can cover fewer elements of ¬X. Thus, our assumption that $R_s(x) \subseteq \overline{appr}_s^{smallest}(X)$ and $x \notin X$ implies that $R_s(x) = \bigcup\{R_s(y) \mid y \in R_s(x) \cap X\}$, and this observation ends the proof of $\overline{appr}_s^{smallest}(X) \subseteq \overline{appr}_s^{concept}(X)$.

For the proof of the converse inclusion, let us take an element $x \in \neg X$ such that $x \in \overline{appr}_s^{concept}(X)$. We will show that $x \in \overline{appr}_s^{smallest}(X)$. Let $\mathcal{R}_s = \{R_s(y) \mid y \in X \text{ and } x \in R_s(y)\}$ and let $Y = \{y \mid R_s(y) \in \mathcal{R}_s\}$. R is reflexive and transitive. Thus, $R_s(z)$ is a minimal R-successor definable set containing z for any z ∈ U, and for any R-successor definable set V contained in X we have $V \cap Y = \emptyset$. Therefore there is no family of R-successor definable sets that covers the set Y and does not contain x.

Table 1 summarizes the conditions for R-successor and R-predecessor definability. The following notation is used: r denotes reflexivity, s denotes symmetry, t denotes transitivity, and "any" denotes the lack of constraints on the relation R needed to guarantee the given kind of definability of an arbitrary subset of the universe.
10 Conclusions
In this paper we studied 28 approximations defined for any binary relation R on a universe U, where R is not necessarily reflexive, symmetric or transitive. Additionally, we showed that it is possible to define 8 additional approximations. Our main focus was on coalescence and definability of lower and upper approximations of a subset X of U. We checked which approximations of X are, in general, definable. We discussed the special cases of R being reflexive, reflexive and symmetric, and reflexive and transitive.

Note that these special cases have immediate applications to data mining (or machine learning). Indeed, if a data set contains missing attribute values in the form of both lost values (missing attribute values that were, e.g., erased; they were given in the past but currently are not available) and "do not care" conditions (missing attribute values that can be replaced by any attribute value, e.g., the respondent refused to give an answer), then the corresponding characteristic relation [1–4, 7] is reflexive. If the data set contains some missing attribute values and all of them are "do not care" conditions, the corresponding characteristic relation is reflexive and symmetric. Finally, if all missing attribute values in the data set are lost values, then the corresponding characteristic relation is reflexive and transitive.
Acknowledgements

This research has been partially supported by the Ministry of Science and Higher Education of the Republic of Poland, grant N N206 408834.
References 1. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Proc. Foundations and New Directions of Data Mining, the 3rd International Conference on Data Mining, pp. 56–63 (2003) 2. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala´ Busse, J.W., Kostek, B.z., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004) 3. Grzymala-Busse, J.W.: Three approaches to missing attribute values– A rough set perspective. In: Proc. Workshop on Foundation of Data Mining, within the Fourth IEEE International Conference on Data Mining, pp. 55–62 (2004) 4. Grzymala-Busse, J.W.: Incomplete data and generalization of indiscernibility re´ ezak, D., Wang, G., Szczuka, M.S., lation, definability, and approximations. In: Sl D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005) 5. Grzymala-Busse, J.W., Rzasa, W.: Local and global approximations for incomplete data. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowi´ nski, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 244–253. Springer, Heidelberg (2006) 6. Grzymala-Busse, J.W., Rzasa, W.: Definability of approximations for a generalization of the indiscernibility relation. In: Proceedings of the IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007), Honolulu, Hawaii, pp. 65–72 (2007) 7. Grzymala-Busse, J.W., Rzasa, W.: Approximation Space and LEM2-like Algorithms for Computing Local Coverings. Accepted to Fundamenta Informaticae (2008) 8. Grzymala-Busse, J.W., Santoso, S.: Experiments on data with three interpretations of missing attribute values—A rough set approach. In: Proc. IIS 2006 International Conference on Intelligent Information Systems, New Trends in Intelligent Information Processing and WEB Mining, pp. 143–152. Springer, Heidelberg (2006) 9. Grzymala-Busse, J.W., Siddhaye, S.: Rough set approaches to rule induction from incomplete data. In: Proc. IPMU 2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, vol. 2, pp. 923–930 (2004) 10. Kryszkiewicz, M.: Rough set approach to incomplete information systems. In: Proc. Second Annual Joint Conference on Information Sciences, pp. 194–197 (1995) 11. Kryszkiewicz, M.: Rules in incomplete information systems. Information Sciences 113, 271–292 (1999) 12. Kuratowski, K.: Introduction to Set Theory and Topology, PWN, Warszawa (in Polish) (1977) 13. Lin, T.T.: Neighborhood systems and approximation in database and knowledge base systems. In: Fourth International Symposium on Methodologies of Intelligent Systems, pp. 75–86 (1989)
14. Lin, T.Y.: Chinese Wall security policy—An aggressive model. In: Proc. Fifth Aerospace Computer Security Application Conference, pp. 286–293 (1989) 15. Lin, T.Y.: Topological and fuzzy rough sets. In: Slowinski, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory, pp. 287–304. Kluwer Academic Publishers, Dordrecht (1992) 16. Liu, G., Zhu, W.: Approximations in rough sets versus granular computing for coverings. In: RSCTC 2008, the Sixth International Conference on Rough Sets and Current Trends in Computing, Akron, OH, October 23–25 (2008); A presentation at the Panel on Theories of Approximation, 18 p. 17. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982) 18. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991) 19. Pomykala, J.A.: On definability in the nondeterministic information system. Bulletin of the Polish Academy of Science Mathematics 36, 193–210 (1988) 20. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.) Handbook of Applications and Advances of the Rough Sets Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992) 21. Skowron, A., Stepaniuk, J.: Tolerance approximation space. Fundamenta Informaticae 27, 245–253 (1996) 22. Slowinski, R., Vanderpooten, D.: A generalized definition of rough approximations based on similarity. IEEE Transactions on Knowledge and Data Engineering 12, 331–336 (2000) 23. Stefanowski, J.: Algorithms of Decision Rule Induction in Data Mining. Poznan University of Technology Press, Poznan (2001) 24. Stefanowski, J., Tsoukias, A.: On the extension of rough sets under incomplete information. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 73–81. Springer, Heidelberg (1999) 25. Stefanowski, J., Tsoukias, A.: Incomplete information tables and rough classification. Computational Intelligence 17, 545–566 (2001) 26. Wang, G.: Extension of rough set under incomplete information systems. In: Proc. IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1098–1103 (2002) 27. Wybraniec-Skardowska, U.: On a generalization of approximation space. Bulletin of the Polish Academy of Sciences. Mathematics 37, 51–62 (1989) 28. Yao, Y.Y.: Two views of the theory of rough sets in finite universes. International J. of Approximate Reasoning 15, 291–317 (1996) 29. Yao, Y.Y.: Relational interpretations of neighborhood operators and rough set appro ximation operators. Information Sciences 111, 239–259 (1998) 30. Yao, Y.Y.: On the generalizing rough set theory. In: Proc. 9th Int. Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pp. 44–51 (2003) 31. Yao, Y.Y., Lin, T.Y.: Generalization of rough sets using modal logics. Intelligent Automation and Soft Computing 2, 103–119 (1996) 32. Ziarko, W.: Variable Precision Rough Set Model. Journal of Computer System Sciences 46, 39–59 (1993) 33. Zakowski, W.: Approximations in the space (U, Π). Demonstratio Mathematica 16, 761–769 (1983)
Variable Consistency Bagging Ensembles

Jerzy Błaszczyński¹, Roman Słowiński¹,², and Jerzy Stefanowski¹

¹ Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland
² Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland
{jurek.blaszczynski,roman.slowinski,jerzy.stefanowski}@cs.put.poznan.pl
Abstract. In this paper we claim that the classification performance of a bagging classifier can be improved by drawing into the bootstrap samples objects that are more consistent with their assignment to decision classes. We propose a variable consistency generalization of the bagging scheme where such sampling is controlled by two types of consistency measures: rough membership and a monotonic measure. The usefulness of this proposal is experimentally confirmed with various rule and tree base classifiers. The results of experiments show that variable consistency bagging improves classification accuracy on inconsistent data.
1 Introduction
In the last decade, a growing interest has been noticed in integrating several base classifiers into one classifier in order to increase classification accuracy. Such classifiers are known as multiple classifiers, ensembles of classifiers or committees [15,31]. Ensembles of classifiers usually perform better than their component classifiers used independently. Previous theoretical research (see, e.g., its summary in [10,15]) clearly indicated that combining several classifiers is effective only if there is a substantial level of disagreement among them, i.e., if they make errors independently with respect to one another. Thus, a necessary condition for efficient integration is diversification of the component base classifiers. Several methods have been proposed to obtain diverse base classifiers inside an ensemble of classifiers, e.g., by changing the distributions of examples in the learning set, manipulating the input features, or applying different learning algorithms to the same data – for comprehensive reviews see again [15,31]. The best known methods are bagging and boosting, which modify the set of objects by sampling or weighting particular objects and use the same learning algorithm to create base classifiers.

Multiple classifiers have also attracted the interest of rough set researchers; however, they were mainly focused on rather simple and partly evident solutions, such as applying various sets of attributes, e.g., reducts, in the context of rule induction [29,22], or using rough set based rule classifiers inside the framework of some ensembles, see e.g. [25,24]. One can also notice this kind of inspiration in constructing hierarchical classifiers using specific features [18].

In this study, we consider one of the basic concepts of rough sets, i.e., a measure of consistency of objects. The research question is whether it is possible, while
constructing multiple classifiers, to consider the consistency of objects. Our research hypothesis is that not all objects in the training data may be equally useful for inducing accurate classifiers. Usually noisy or inconsistent objects are a source of difficulties that may lead to overfitting of standard, single classifiers and decrease their classification performance. Pruning techniques in machine learning and variable consistency [3,4] or variable precision [33] generalizations of rough sets are applied to reduce this effect.

As a multiple classifier we choose bagging, mainly because it is easier to generalize according to our needs than boosting. Moreover, it has already been successfully studied with rule induction algorithms [25]. This is the first motivation of our current research. The other motivation results from analysing some related modifications of bagging, such as Breiman's proposals of Random Forests [7] or Pasting Small Votes [6].

Let us remark that the main idea of the standard version of the bagging method [5] is quite simple and appealing – the ensemble consists of several classifiers induced by the same learning algorithm over several different distributions of input objects, and the outputs of the base classifiers are aggregated by equal weight voting. The base classifiers used in bagging are expected to have sufficiently high predictive accuracy apart from being diversified [7]. The key issue in standard bagging concerns bootstrap sampling – drawing many different bootstrap samples by uniformly sampling with replacement. So, each object is assigned the same probability of being sampled. In this context the research question is whether the consistency of objects is worth taking into account while sampling training objects to the bootstrap samples used in bagging. Giving more chance to select consistent objects (i.e., objects which certainly belong to a class) and decreasing the chance of selecting border or noisy ones may lead to creating more accurate and diversified base classifiers in the bagging scheme. Of course, such modifications should be addressed to data that are sufficiently inconsistent.

Following these motivations, let us remark that rough set approaches can provide useful information about the consistency of an assignment of an object to a given class. This consistency results from granules (classes) defined by a given relation (e.g. indiscernibility, dominance) and from the dependency between granules and decision categories. Objects that belong to the lower approximation of a class are consistent; otherwise they are inconsistent. The consistency of objects can be measured by consistency measures, such as the rough membership function. The rough membership function introduced by Ziarko et al. is used in variable precision and variable consistency generalizations of rough sets [3,33]. Yet other monotonic measures of consistency were introduced by Błaszczyński et al. [4]. These measures allow one to define variable consistency rough set approaches that preserve basic properties of rough sets and also simplify the construction of classifiers.

The main aim of our study is to propose a new generalization of the bagging scheme, called variable consistency bagging, where the sampling of objects is controlled by the rough membership or a monotonic consistency measure. That is why we call this approach variable consistency bagging (VC-bagging). We will
consider rule and tree base classifiers learned by different algorithms, including those adapted to the dominance-based rough set approach. Another aim is to evaluate experimentally the usefulness of variable consistency bagging on data sets characterized by different levels of consistency.

The paper is organized as follows. In the next section, we recall the bagging scheme and present sampling algorithms based on consistency measures. In Section 3, we describe variable consistency bagging and the learning algorithms used to create base classifiers. In the following Section 4, results of experiments are presented. We conclude by giving remarks and recommendations for applications of the presented techniques.
2 Consistency Sampling Algorithms

2.1 Bagging Scheme
The Bagging approach (an acronym for Bootstrap aggregating) was introduced by Breiman [5]. It aggregates by voting classifiers generated from different bootstrap samples. A bootstrap sample is obtained by uniformly sampling with replacement objects from the training set. Let the training set consist of m objects. Each sample contains n ≤ m objects (usually it has the same size as the original set); however, some objects do not appear in it, while others may appear more than once. The same probability 1/m of being sampled is assigned to each object. The probability of an object being selected at least once is $1 - (1 - 1/m)^m$. For a large m, this is about 1 − 1/e. Each bootstrap sample contains, on average, 63.2% unique objects from the training set.

Given the parameter T, which is the number of repetitions, T bootstrap samples $S_1, S_2, \ldots, S_T$ are generated. From each sample $S_i$ a classifier $C_i$ is induced by the same learning algorithm, and the final classifier $C^*$ is formed by aggregating the T classifiers. A final classification of object x is built by a uniform voting scheme on $C_1, C_2, \ldots, C_T$, i.e., x is assigned to the class predicted most often by these base classifiers, with ties broken arbitrarily. The approach is presented briefly below. For more details see [5].

input: LS – learning set; T – number of bootstrap samples; LA – learning algorithm
output: C* – classifier
begin
  for i = 1 to T do
  begin
    S_i := bootstrap sample from LS;  {sample with replacement}
    C_i := LA(S_i);  {generate a base classifier}
  end;  {end for}
  $C^*(x) = \arg\max_{y \in K_j} \sum_{i=1}^{T} (C_i(x) = y)$  {the most often predicted class}
end
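The scheme above can be written down directly. The following minimal Python sketch (illustrative names, not the authors' implementation) treats the learning algorithm LA as any function that maps a training sample to a classifier, i.e., to a callable returning a class label.

import random
from collections import Counter

def bagging_train(LS, T, LA):
    # LS: list of training objects; T: number of bootstrap samples; LA: learning algorithm
    m = len(LS)
    classifiers = []
    for _ in range(T):
        S_i = [random.choice(LS) for _ in range(m)]   # bootstrap sample, uniform with replacement
        classifiers.append(LA(S_i))                   # base classifier C_i induced from S_i
    return classifiers

def bagging_classify(classifiers, x):
    # equal-weight voting: return the class predicted most often by the base classifiers
    votes = Counter(C(x) for C in classifiers)
    return votes.most_common(1)[0][0]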
Experimental results show a significant improvement of the classification accuracy, in particular when using decision tree classifiers. An improvement is also observed when using rule classifiers, as shown in [25]. However, the choice of a base classifier is not indifferent. According to Breiman [5], what makes a
base classifier suitable is its instability. A base classifier is unstable when small changes in the learning set cause major changes in the classifier. For instance, decision tree and rule classifiers are unstable, while k-Nearest Neighbor classifiers are not. For a more theoretical discussion on the justification of the problem "why bagging works" the reader is referred to [5,15].

Since bagging is an effective and open framework, several researchers have proposed its variants, some of which have turned out to have lower classification error than the original version proposed by Breiman. Some of them are summarized in [15]. Let us mention only Breiman's proposals of Random Forests [7] and Pasting Small Votes [6], or the integration of bootstrap sampling with feature selection [16,26].
2.2 Variable Consistency Sampling
The goal of variable consistency sampling is to increase the predictive accuracy of bagged classifiers by using additional information reflecting the consistency of objects, which can easily be applied to the training data. The resulting bagged classifiers are trained on bootstrap samples slightly shifted towards more consistent objects.

In general, the above idea is partly related to some earlier proposals of changing probability distributions while constructing bagging-inspired ensembles. In particular, Breiman proposed a variant called Pasting Small Votes [6]. Its main motivation was to handle massive data, which do not fit into the computer operating memory, by sampling much smaller bootstraps. Drawing objects to these small samples is a key point of the modification. In Ivotes objects are sampled sequentially, and at each step the probability of selecting a given object is modified by its importance. This importance is estimated at each step by the accuracy of a new base classifier. Importance sampling is known to provide better results than the simple bagging scheme. Variable consistency sampling could be seen as similar to the above importance sampling because it is also based on a modification of the probability distribution. However, the difference between these two approaches lies in the fact that consistency is evaluated in the pre-processing, before learning of the base classifiers is performed. Our expectation is that drawing bootstrap samples from a distribution that reflects their consistency will not decrease the diversity of the samples.

The reader familiar with ensemble classifiers can notice other solutions for taking into account the misclassification of training objects by base classifiers. In boosting, more focus is given to objects that are difficult to classify by the iteratively extended set of base classifiers. We argue that these ensembles are based on a different principle of stepwise adding classifiers and using the accuracy from the previous step of learning while changing the weights of objects. In typical bagging there is no such information and all bootstrap samples are independent.

Our other observation is that estimating the role of objects in the pre-processing of training data is more similar to previous works on improving rough set approaches by variable precision models or some other works on edited k-nearest neighbor classifiers, see e.g. the review in [30]. For instance, in the IBL family of
algorithms proposed by Aha et al. in [1], the IBL3 version, which kept the objects most useful for correct classification and removed noisy or borderline examples, was more accurate than the IBL2 version, which focused on difficult examples from the border between classes. Stefanowski et al. also observed in [28] a similar performance of another variant of the nearest neighbor cleaning rule in a specific approach to pre-processing imbalanced data.

Let us comment more precisely on the meaning of consistency. Object x is consistent if a base classifier trained on a sample that includes this object is able to re-classify this object correctly. Otherwise, object x is inconsistent. Remark that lower approximations of sets include consistent objects only. Calculating consistency measures of objects is sufficient to detect consistent objects in the pre-processing, and thus learning is not required.

In variable consistency sampling, the procedure of bootstrap sampling is different than in the bagging scheme described in Section 2.1. The rest of the bagging scheme remains unchanged. When sampling with replacement from the training set is performed, a measure of consistency c(x) is calculated for each object x from the training set. A consistent object x has c(x) = 1, while inconsistent objects have 0 ≤ c(x) < 1. The consistency measure is used to tune the probability of object x being sampled to a bootstrap sample, e.g., by calculating the product of c(x) and 1/m. Thus, objects that are inconsistent have a decreased probability of being sampled. Objects that are more consistent (i.e., have a higher value of a consistency measure) are more likely to appear in the bootstrap sample. Different measures of consistency may result in different probabilities of an inconsistent object x being sampled.
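The sampling step described above amounts to a weighted bootstrap in which the probability of drawing object x is proportional to c(x) · 1/m, i.e., proportional to c(x). A minimal sketch (illustrative names; it assumes the consistency values c(x) have already been computed) could look as follows.

import random

def vc_bootstrap(LS, c, n=None):
    # LS: list of training objects; c: list of consistency values c(x) in [0, 1], aligned with LS.
    # Returns a bootstrap sample of size n (by default the size of LS), drawn with replacement
    # with probabilities proportional to c(x).
    n = n if n is not None else len(LS)
    indices = random.choices(range(len(LS)), weights=c, k=n)
    return [LS[i] for i in indices]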
2.3 Consistency Measures
To present the consistency measures, we first recall basic notions from rough set theory [19], namely the relations used to compare objects and the elementary classes (also called atoms or granules) defined by these relations. Consideration of the indiscernibility relation is meaningful when the set of attributes A is composed of regular attributes only. The indiscernibility relation makes a partition of the universe U into disjoint blocks of objects that have the same description and are considered indiscernible. Let $V_{a_i}$ be the value set of attribute $a_i$ and $f : U \times A \to V_{a_i}$ be a total function such that $f(x, a_i) \in V_{a_i}$. The indiscernibility relation $I_P$ is defined for a subset of attributes $P \subseteq A$ as
$I_P = \{(x, y) \in U \times U : f(x, a_i) = f(y, a_i) \text{ for all } a_i \in P\}$.   (1)
If an object x belongs to a class of the relation $I_P$ in which all objects are assigned to the same decision class, it is consistent. The indiscernibility relation is not the only possible relation between objects. When attributes from A have preference-ordered value sets, they are called criteria. In order to make meaningful classification decisions on criteria, one has to consider the dominance relation instead of the indiscernibility relation. The resulting approach, called the Dominance-based Rough Set Approach (DRSA), has been presented in [11,21]. The dominance relation makes a partition of the universe U
into granules being dominance cones. The dominance relation $D_P$ is defined for criteria $P \subseteq A$ as
$D_P = \{(x, y) \in U \times U : f(x, a_i) \succeq f(y, a_i) \text{ for all } a_i \in P\}$,   (2)
where $f(x, a_i) \succeq f(y, a_i)$ means "x is at least as good as y w.r.t. criterion $a_i$". The dominance relation $D_P$ is a partial preorder (i.e., it is reflexive and transitive). For each object x two dominance cones are defined: the cone $D_P^+(x)$ composed of all objects that dominate x, and the cone $D_P^-(x)$ composed of all objects that are dominated by x. While in the indiscernibility-based rough set approach the decision classes $X_i$, i = 1, ..., n, are not necessarily ordered, in DRSA they are ordered, such that if i < j, then class $X_i$ is considered to be worse than $X_j$. In DRSA, unions of decision classes are approximated: upward unions $X_i^{\geq} = \bigcup_{t \geq i} X_t$, i = 2, ..., n, and downward unions $X_i^{\leq} = \bigcup_{t \leq i} X_t$, i = 1, ..., n − 1. Considering the above concepts allows us to handle another kind of inconsistency of object description, manifested by violation of the dominance principle (i.e., object x having not worse evaluations on criteria than y cannot be assigned to a worse class than y), which is not discovered by the standard rough sets with the indiscernibility relation.

To simplify notation, we define $E_P(x)$ as a granule defined for object x by the indiscernibility or dominance relation. For the same reason, consistency measures are presented here in the context of sets X of objects, being either a given class $X_i$, or a union $X_i^{\geq}$ or $X_i^{\leq}$. In fact, this is not ambiguous if we specify that $D_P^+(x)$ is used when $X_i^{\geq}$ are considered and that $D_P^-(x)$ is used when $X_i^{\leq}$ are considered.

The rough membership consistency measure was introduced in [32]. It is used to control positive regions in the Variable Precision Rough Set (VPRS) model [33]. Rough membership of $x \in U$ to $X \subseteq U$ is defined for $P \subseteq A$ as
$\mu_X^P(x) = \dfrac{|E_P(x) \cap X|}{|E_P(x)|}$,   (3)
Rough membership captures the ratio of objects belonging both to the granule $E_P(x)$ and to the considered set X, to all objects belonging to the granule $E_P(x)$. This measure is an estimate of the conditional probability $Pr(x \in X \mid x \in E_P(x))$. Measure $\epsilon_X^P(x)$ was applied in monotonic variable consistency rough set approaches [4]. For $P \subseteq A$, $X \subseteq U$, $\neg X = U - X$, $x \in U$, it is defined as
$\epsilon_X^P(x) = \dfrac{|E_P(x) \cap \neg X|}{|\neg X|}$.   (4)
In the numerator of (4) there is the number of objects in U that do not belong to set X and belong to the granule $E_P(x)$; in the denominator, the number of objects in U that do not belong to set X. The ratio $\epsilon_X^P(x)$ is an estimate of the conditional probability $Pr(x \in E_P(x) \mid x \in \neg X)$, called also a catch-all likelihood.

To use measures $\mu_X^P(x)$ and $\epsilon_X^P(x)$ in consistency sampling we need to transform them into a measure c(x) defined for a given object x, a fixed set of attributes P and a fixed set of objects X as
$c(x) = \mu_X^P(x)$  or  $c(x) = 1 - \epsilon_X^P(x)$.   (5)
For DRSA, the higher of the values of consistency c(x) calculated for the unions $X^{\geq}$ and $X^{\leq}$ is taken.
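For the indiscernibility case, measures (3)–(5) can be computed directly from the granules $E_P(x)$. The sketch below (illustrative names; a data set is represented as a dictionary mapping object identifiers to dictionaries of attribute values) is one possible reading of these definitions and does not cover the dominance-cone variant.

def granule(data, P, x):
    # E_P(x): objects indiscernible from x on the attributes in P
    return {y for y in data if all(data[y][a] == data[x][a] for a in P)}

def mu(data, P, X, x):
    # rough membership (3): |E_P(x) ∩ X| / |E_P(x)|
    E = granule(data, P, x)
    return len(E & X) / len(E)

def epsilon(data, P, X, x):
    # monotonic consistency measure (4): |E_P(x) ∩ ¬X| / |¬X|
    E = granule(data, P, x)
    not_X = set(data) - X
    return len(E & not_X) / len(not_X) if not_X else 0.0

def consistency(data, P, X, x, use_mu=True):
    # the two variants of c(x) from (5)
    return mu(data, P, X, x) if use_mu else 1.0 - epsilon(data, P, X, x)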
3 Experimental Setup for Variable Consistency Bagging with Rules and Trees Base Classifiers
The key issue in variable consistency bagging is the modification of the probability of sampling objects into bootstraps with regard to either the rough membership μ or the consistency measure ε. To evaluate whether this modification may improve the classification performance of bagging, we planned an experiment where standard bagging is compared against the two variants of sampling, called μ bagging and ε bagging, and against a single classifier. In all of the compared bagging versions, base classifiers are learned using the same learning algorithms tuned with the same set of parameters. The same concerns learning single classifiers.

First, we chose the ModLEM algorithm [23], as it can induce a minimal set of rules from rough approximations, it may lead to efficient classifiers, and it has already been successfully applied inside a few multiple classifiers [25]. Its implementation was run with the following parameters: evaluation measures based on entropy information gain, no pruning of rules, and application of the classification strategy for solving ambiguous, multiple or partial matches proposed by Grzymala-Busse et al. [13]. The Ripper algorithm [8] was also considered, to study the performance of another popular rule induction algorithm based on different principles. We also used it with no pruning of rules. Additionally, for data where attributes have preference-ordered scales we studied the DomLEM algorithm, specific for inducing rules in DRSA [12]. Also in this case no pruning of rules is made and a classification strategy suitable for preference-ordered classes is applied. Finally, the comparison was also extended to using Quinlan's C4.5 algorithm to generate unpruned decision tree base classifiers, as bagging is known to work efficiently for decision trees. Predictions of base classifiers were always aggregated into the final classification decision by majority voting.

The classification accuracy was the main evaluation criterion and it was estimated by 10-fold stratified cross-validation repeated 10 times. We evaluated performance for seven data sets listed in Table 1. They come mainly from the UCI repository [2], with one exception, acl, from our own clinical applications. We were able to find orders of preference for attributes of three of these data sets: car, denbosch and windsor. Four of the considered sets (car, bupa, ecoli, pima) were originally too consistent (i.e., they had too high a quality of classification), so we modified them to make them more inconsistent. Three data sets (bupa, ecoli, pima) included real-valued attributes, so they were discretized by a local approach based on minimizing entropy. Moreover, we also used a reduced set of attributes to decrease the level of consistency. The final characteristics of the data are summarized in Table 1. The data sets that were modified are marked with 1.
1 Modified data set.
Table 1. Characteristic of data sets

data set  | # objects | # attributes | # classes | quality of classification
acl       | 140  | 6  | 2 | 0.88
bupa¹     | 345  | 6  | 2 | 0.72
car¹      | 1296 | 6  | 4 | 0.61
denbosch  | 119  | 8  | 2 | 0.9
ecoli¹    | 336  | 7  | 8 | 0.41
pima¹     | 768  | 8  | 2 | 0.79
windsor   | 546  | 10 | 4 | 0.35

4 Results of Experiments
Table 2 summarizes the consistency of bootstrap samples obtained by bagging and VC-bagging with measures μ and ε. For each data set, the average percentage of inconsistent objects and the average consistency of a sample (calculated over 1000 samples) are given. The average consistency of a sample is calculated as the average value of the given consistency measure.

Table 2. Consistency of bootstrap samples resulting from bagging and VC-bagging

data set   type of sampling   % inconsistent objects   average consistency
acl        bagging            12.161                   0.943
           μ bagging          7.7                      0.966
           ε bagging          11.987                   0.997
bupa       bagging            28.265                   0.873
           μ bagging          20.244                   0.916
           ε bagging          28.1                     0.996
car        bagging            38.96                    0.95
           μ bagging          36.601                   0.958
           ε bagging          38.298                   0.988
denbosch   bagging            10.175                   0.994
           μ bagging          9.8                      0.994
           ε bagging          10.023                   0.997
ecoli      bagging            59.617                   0.799
           μ bagging          52.041                   0.873
           ε bagging          59.472                   0.992
pima       bagging            20.783                   0.907
           μ bagging          14.389                   0.941
           ε bagging          20.766                   0.999
windsor    bagging            65.183                   0.921
           μ bagging          63.143                   0.929
           ε bagging          64.637                   0.979
The results show that VC-bagging produces more consistent samples (i.e., samples with fewer inconsistent objects and a higher average value of object consistency in the sample). However, the magnitude of this effect depends on the
Table 3. Classification accuracy resulting from 10 x 10-fold cross-validation of a single classifier and an ensemble of 50 classifiers resulting from standard bagging and VC-bagging. The rank of the result for the same data set and classifier is given in brackets.

data set  classifier  single           bagging          μ bagging        ε bagging
acl       C4.5        84.64±0.92 (4)   85.21±1.1 (1)    84.71±1.1 (3)    84.79±0.9 (2)
          ModLEM      86.93±0.96 (4)   88.21±0.86 (1)   87.36±1.6 (3)    88.14±0.73 (2)
          Ripper      85.79±1.4 (2)    85.64±0.81 (3)   86.07±0.66 (1)   85.57±0.62 (4)
bupa      C4.5        66.67±0.94 (4)   70.35±0.93 (3)   71.01±0.53 (1)   70.55±0.96 (2)
          ModLEM      68.93±1.0 (4)    69.28±0.85 (2.5) 71.07±1 (1)      69.28±1.0 (2.5)
          Ripper      65.22±1.4 (4)    67.22±1.1 (3)    71.77±1.0 (1)    67.3±1.0 (2)
car       C4.5        78±0.42 (2.5)    77.97±0.46 (4)   79.76±0.42 (1)   78±0.46 (2.5)
          ModLEM      69.64±0.33 (3)   69.44±0.2 (4)    69.88±0.31 (1)   69.75±0.21 (2)
          Ripper      69.96±0.23 (4)   70.86±0.39 (3)   72.56±0.37 (1)   71.94±0.44 (2)
          DomLEM      81.69±0.17 (4)   82.32±0.17 (3)   82.5±0.18 (2)    82.55±0.29 (1)
denbosch  C4.5        80.92±2.3 (4)    85.46±1.2 (3)    85.63±1.4 (1.5)  85.63±1.7 (1.5)
          ModLEM      79.33±2.3 (4)    85.04±1.6 (3)    85.55±1.0 (1)    85.46±1.1 (2)
          Ripper      81.93±2.3 (4)    86.8±1.5 (2)     86.72±1.2 (3)    87.23±1.3 (1)
          DomLEM      83.95±1.27 (4)   86.77±1.6 (1)    86.6±1.2 (2)     86.53±1.3 (3)
ecoli     C4.5        81.4±0.48 (4)    81.37±0.6 (3)    81.52±0.57 (1)   81.46±0.58 (2)
          ModLEM      49.23±0.58 (2)   49.08±0.64 (3)   68.96±0.87 (1)   49.05±0.76 (4)
          Ripper      77.59±0.61 (4)   78.72±0.68 (3)   81.31±0.51 (1)   78.75±0.65 (2)
pima      C4.5        72.21±1.1 (4)    72.88±1.2 (1)    72.8±1.1 (2)     72.5±1.2 (3)
          ModLEM      72.98±0.76 (4)   73.95±0.7 (3)    75.01±0.47 (1)   74.02±0.75 (2)
          Ripper      71.3±0.84 (4)    72.64±0.71 (3)   73.84±0.6 (1)    72.83±0.61 (2)
windsor   C4.5        45.53±1.5 (4)    48±1.2 (2)       48.15±0.85 (1)   47.91±0.74 (3)
          ModLEM      41.03±1.6 (4)    42.89±1.4 (2)    42.77±1.3 (3)    43.81±1.1 (1)
          Ripper      40.05±1.4 (4)    49.16±1.7 (1)    48.24±1.4 (2)    46.78±1.3 (3)
          DomLEM      50.11±0.75 (4)   53.1±0.68 (3)    53.37±1.0 (2)    53.68±0.92 (1)
average rank          3.729            2.521            1.562            2.188
measure used and the data set. It follows that μ bagging leads to more consistent samples, while ε bagging is less restrictive about introducing inconsistent objects into a sample. The value of c(x) involving the ε measure is usually higher than c(x) involving the μ measure. As follows from formula (4), ε relates the number of inconsistent objects to the whole number of objects in the data set that may cause inconsistencies. On the other hand, from (3), μ measures inconsistency more locally: it relates the number of consistent objects in the granule to the number of objects in the granule. A significant increase of consistency in variable consistency samples is visible for data sets like bupa, car, ecoli and pima. For the other data sets (acl, denbosch, windsor) this effect is not visible. Moreover, the consistency in a sample is high for these data sets even when the standard bagging scheme is applied. To explore this point, it is useful to look at the quality of classification of the whole data set shown in Table 1. The less consistent data sets are those for which variable consistency sampling worked better. Naturally, this will also be reflected in the results of VC-bagging. Another issue is
the size of the data sets used in the experiment. Two data sets, acl and denbosch, are considerably smaller than the others, which can affect the results. The experimental comparison of the classification performance of the standard bagging scheme against VC-bagging is summarized in Table 3. We measured the average classification accuracy and its standard deviation. Each classifier used in the experiment is an ensemble of 50 base classifiers. One can notice that the results presented in Table 3 show that the two VC-bagging variants do well against standard bagging. VC-bagging with the μ measure (μ bagging) almost always improves the results. Generally, the application of VC-bagging never decreased the predictive accuracy of the compared classifiers, with the single exception of ModLEM on the ecoli data set. VC-bagging gives less visible improvements for the data sets acl, denbosch and windsor. This can be explained by the smallest increase of consistency in samples observed for these data sets. One can also notice that ε bagging seems to give better results than the other bagging techniques when it is used with DomLEM; however, this point needs further research. Application of DomLEM significantly increased the results for the car and windsor data sets. To compare more formally the performance of all classifiers on multiple data sets, we follow the statistical approach proposed in [9]. More precisely, we apply the Friedman test, which tests the null hypothesis that all classifiers perform equally well. It uses the ranks of each classifier on each of the data sets. In this test, we compare the performance of the single classifier, bagging, μ bagging and ε bagging with different base classifiers. Following the principles of the Friedman test, it is assumed that the results of base classifiers on the same data set are independent. We are aware that this assumption is not completely fulfilled in our case. However, we claim that the compared base classifiers are substantially different (and in this sense independent). In our experiment, we use one tree-based classifier and three rule classifiers. Each of the rule classifiers employs a different strategy of rule induction. The Friedman statistic for the data in Table 3 is 35.82, which exceeds the critical value 7.82 (for confidence level 0.05). We can reject the null hypothesis. To compare the performance of the classifiers, we use a post-hoc analysis and calculate the critical difference (CD) according to the Nemenyi statistic [17]. In our case, for confidence level 0.05, CD = 0.95. Classifiers that have a difference in average ranks higher than CD are
Fig. 1. Critical difference (CD = 0.957) diagram for the data from Table 3: the average ranks of the single classifier, standard bagging, μ bagging and ε bagging are marked on an axis ranging from 4 to 1
significantly different. We present the results of the post-hoc analysis in Figure 1. Average ranks are marked on the bottom line. Groups of classifiers that are not significantly different are connected. Single classifiers are significantly worse than any bagging variant. Among the bagging variants, μ bagging is significantly better than standard bagging. ε bagging also leads to a lower value of the average rank than standard bagging; however, this difference turns out not to be significant in the post-hoc analysis.
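For reference, the computations behind the reported Friedman statistic and the critical difference can be reproduced from a matrix of ranks (one row per data-set/classifier combination, one column per compared method) with the standard formulas; the sketch below is only an illustration, where q_0.05 ≈ 2.569 for four compared methods is taken from published critical-value tables.

import math

def friedman_statistic(ranks):
    # ranks: list of rows, each a list of k ranks (one per compared method)
    n, k = len(ranks), len(ranks[0])
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg

def nemenyi_cd(k, n, q_alpha=2.569):
    # critical difference for the Nemenyi post-hoc test
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# With k = 4 methods and n = 24 rows (Table 3), nemenyi_cd(4, 24) is about 0.957,
# and friedman_statistic applied to the rank columns gives about 35.8.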
5 Conclusions
In this study, we have considered the question of whether the consistency of objects should be taken into account while drawing training objects into the bootstrap samples used in bagging. We claim that increasing the chance of selecting more consistent learning objects and reducing the probability of selecting too inconsistent objects leads to more accurate and still sufficiently diversified base classifiers. The main methodological contribution of this paper is the proposal of variable consistency bagging, where such sampling is controlled by the rough membership or monotonic consistency measures. The statistical analysis of experimental results, discussed in the previous section, has shown that VC-bagging performed better than standard bagging, although only μ bagging is significantly better. One may also notice that the improvement of classification performance depends on the quality of classification of the original data set; better performance of VC-bagging could thus be observed for more inconsistent data sets. To sum up, the proposed approach does not decrease the accuracy of classification for consistent data sets, while it improves the accuracy for inconsistent data sets. Moreover, its key concept is highly consistent with the principles of rough set theory and can be easily implemented without much additional computational cost. It can thus be considered an out-of-the-box method to improve the predictions of classifiers.
Acknowledgments The authors wish to acknowledge financial support from the Ministry of Science and Higher Education, grant N N519 314435.
References 1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning Journal 6, 37–66 (1991) 2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html 3. Błaszczyński, J., Greco, S., Słowiński, R., Szeląg, M.: On Variable Consistency Dominance-based Rough Set Approaches. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 191–202. Springer, Heidelberg (2006)
4. Błaszczyński, J., Greco, S., Słowiński, R., Szeląg, M.: Monotonic Variable Consistency Rough Set Approaches. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Śl¸ezak, D. (eds.) RSKT 2007. LNCS (LNAI), vol. 4481, pp. 126–133. Springer, Heidelberg (2007) 5. Breiman, L.: Bagging predictors. Machine Learning Journal 24(2), 123–140 (1996) 6. Breiman, L.: Pasting small votes for classification in large databases and on-line. Machine Learning Journal 36, 85–103 (1999) 7. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001) 8. Cohen, W.W.: Fast effective rule induction. In: Proc. of the 12th Int. Conference on Machine Learning, pp. 115–123 (1995) 9. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006) 10. Dietrich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000) 11. Greco, S., Matarazzo, B., Słowiński, R.: Rough sets theory for multicriteria decision analysis. European Journal of Operational Research 129, 1–47 (2001) 12. Greco, S., Matarazzo, B., Słowiński, R., Stefanowski, J.: An algorithm for induction of decision rules consistent with dominance principle. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 304–313. Springer, Heidelberg (2001) 13. Grzymala-Busse, J.W.: Managing uncertainty in machine learning from examples. In: Proc. 3rd Int. Symp. in Intelligent Systems, pp. 70–84 (1994) 14. Grzymala-Busse, J.W., Stefanowski, J.: Three approaches to numerical attribute discretization for rule induction. International Journal of Intelligent Systems 16(1), 29–38 (2001) 15. Kuncheva, L.: Combining Pattern Classifiers. In: Methods and Algorithms, Wiley, Chichester (2004) 16. Latinne, P., Debeir, O., Decaestecker, Ch.: Different ways of weakening decision trees and their impact on classification accuracy of decision tree combination. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, p. 200. Springer, Heidelberg (2000) 17. Nemenyi, P.B.: Distribution free multiple comparison. Ph.D. Thesis, Princenton Univeristy (1963) 18. Hoa, N.S., Nguyen, T.T., Son, N.H.: Rough sets approach to sunspot classification problem. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3642, pp. 263–272. Springer, Heidelberg (2005) 19. Pawlak, Z.: Rough sets. International Journal of Information & Computer Sciences 11, 341–356 (1982) 20. Quinlan, J.R.: Bagging, boosting and C4.5. In: Proc. of the 13th National Conference on Artificial Intelligence, pp. 725–730 (1996) 21. Słowiński, R., Greco, S., Matarazzo, B.: Rough set based decision support. In: Burke, E.K., Kendall, G. (eds.) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques, pp. 475–527. Springer, Heidelberg (2005) 22. Slezak, D.: Approximate entropy reducts. Fundamenta Informaticae 53(3/4), 365– 387 (2002) 23. Stefanowski, J.: The rough set based rule induction technique for classification problems. In: Proc. of 6th European Conference on Intelligent Techniques and Soft Computing. EUFIT 1998, pp. 109–113 (1998)
24. Stefanowski, J.: Multiple and hybrid classifiers. In: Polkowski, L. (ed.) Formal Methods and Intelligent Techniques in Control, Decision Making, Multimedia and Robotics, Post-Proceedings of 2nd Int. Conference, Warszawa, pp. 174–188 (2001) 25. Stefanowski, J.: The bagging and n2-classifiers based on rules induced by MODLEM. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 488–497. Springer, Heidelberg (2004) 26. Stefanowski, J., Kaczmarek, M.: Integrating attribute selection to improve accuracy of bagging classifiers. In: Proc. of the AI-METH 2004 Conference - Recent Developments in Artificial Intelligence Methods, Gliwice, pp. 263–268 (2004) 27. Stefanowski, J., Nowaczyk, S.: An experimental study of using rule induction in combiner multiple classifier. International Journal of Computational Intelligence Research 3(4), 335–342 (2007) 28. Stefanowski, J., Wilk, S.: Improving Rule Based Classifiers Induced by MODLEM by Selective Pre-processing of Imbalanced Data. In: Proc. of the RSKD Workshop at ECML/PKDD, Warsaw, pp. 54–65 (2007) 29. Suraj, Z., Gayar Neamat, E., Delimata, P.: A Rough Set Approach to Multiple Classifier Systems. Fundamenta Informaticae 72(1-3), 393–406 (2006) 30. Wilson, D.R., Martinez, T.: Reduction techniques for instance-based learning algorithms. Machine Learning Journal 38, 257–286 (2000) 31. Valentini, G., Masuli, F.: Ensembles of Learning Machines. In: Marinaro, M., Tagliaferri, R. (eds.) WIRN 2002. LNCS, vol. 2486, pp. 3–19. Springer, Heidelberg (2002) 32. Wong, S.K.M., Ziarko, W.: Comparison of the probabilistic approximate classification and the fuzzy set model. Fuzzy Sets and Systems 21, 357–362 (1987) 33. Ziarko, W.: Variable precision rough sets model. Journal of Computer and Systems Sciences 46(1), 39–59 (1993)
Classical and Dominance-Based Rough Sets in the Search for Genes under Balancing Selection
Krzysztof A. Cyran
Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]
Abstract. Since the time of Kimura's theory of neutral evolution at the molecular level, the search for genes under natural selection has been one of the crucial problems in population genetics. Quite a number of statistical tests have been designed for it; however, the interpretation of the results is often hard due to the existence of extra-selective factors, such as population growth, migration and recombination. The author, in his earlier work, proposed the idea of a multi-null hypotheses methodology applied to testing for selection in the ATM, RECQL, WRN and BLM genes, the foursome implicated in human familial cancer. However, because of the high computational effort required for estimating the critical values under nonclassical null hypotheses, the mentioned strategy is not an appropriate tool for selection screening. The current article presents a novel, rough set based methodology, helpful in the interpretation of the outcomes of tests applied only against classical nulls. The author considers for this purpose both the classical and the dominance-based rough set frameworks. Neither of the rough set based methods requires long-lasting simulations and, as shown in the paper, both give reliable results. The advantage of the dominance-based approach over the classical one is a more natural treatment of statistical test outcomes, resulting in better generalization without the necessity of manually incorporating domain-dependent reasoning into the knowledge processing. However, in testing, this gain in generalization proved to come at the price of a slight loss of accuracy. Keywords: classical rough sets approach, dominance-based rough sets approach, natural selection, balancing selection, ATM, BLM, RECQL, WRN, neutrality tests.
1 Introduction
According to Kimura's neutral model of evolution [1], the majority of genetic variation at the molecular level is caused by selectively neutral forces. These include genetic drift and silent mutations. Although the role of selection is reduced as compared to the selection-driven model of evolution, it is obvious that some mutations must be deleterious and some are selectively positive (the ASPM locus, which is a major contributor to brain size in primates [2,3],
is a well-known example). Other examples of research on the detection of natural selection can be found in [4,5,6,7]. There exists one more kind of natural selection, called balancing selection, whose detection with the use of rough set methods is described in this paper. The search for balancing selection is a crucial problem in contemporary population genetics due to the fact that such selection is associated with serious genetic disorders. Many statistical tests [8,9,10,11] have been proposed for the detection of such a selection. Yet, the interpretation of the outcomes of the tests is not a trivial task, because factors like population growth, migration and recombination can produce similar test results [12]. The author in his earlier work (published in part in [13] and in part unpublished) proposed the idea of a multi-null hypotheses methodology used for the detection of balancing selection in genes implicated in human familial cancer. One of the genes was ATM (ataxia-telangiectasia mutated) and the three others were DNA helicases involved in DNA repair. The names of the helicases are RECQL, WRN (Werner's syndrome, see [14]) and BLM (Bloom's syndrome, see [15]). However, the methodology proposed earlier (even when formalized) is not appropriate as a screening tool, because long-lasting simulations are required for computing the critical values of the tests under nonclassical null hypotheses. To avoid the need for extensive computer simulations, the author proposed an artificial intelligence based methodology. In particular, rough set theory was studied in the context of its applicability to the aforementioned problem. As a result of this research, the author presents in the current paper the application of two rough set approaches: the classical [16,17] and the dominance-based [18,19,20] models. Instead of trying to incorporate all feasible demography and recombination based parameters into null hypotheses, the newly proposed methodology relies on the assumption that extra-selective factors influence different neutrality tests in different ways. Therefore it should be possible to detect the selection signal from a battery of neutrality tests even using classical null hypotheses. For this purpose an expert knowledge acquisition method should be used, where the expert knowledge can be obtained from the multi-null hypotheses technique applied to some small set of genes. The decision algorithm obtained in this way can subsequently be applied to the search for selection in genes which were not the subject of the study with the use of the multi-null hypotheses methodology. Since the critical values of neutrality tests for classical null hypotheses are known, the aforementioned strategy does not require time-consuming computer simulations. The use of classical rough set theory was dictated by the fact that test outcomes are naturally discretized to a few values only - the approach was presented in [21]. The current paper, which is an extended and corrected version of that article, also deals with dominance-based rough sets. The dominance-based approach has not yet been considered in this context, neither in [21] nor elsewhere,
and a comparison of the applicability of the two approaches to this particular case study is made here for the first time.
2 Genetic Data
Single nucleotide polymorphism (SNP) data, taken from the intronic regions of the target genes, were used as the genetic material for this study. The SNPs form so-called haplotypes of the mentioned loci. A number of interesting problems concerning these genes were addressed by the author and his co-workers, including the question of selection signatures and other haplotype-related problems [13,21,22,23,24]. The first gene analyzed is ataxia-telangiectasia mutated (ATM) [25,26,27,28,29]. The product of this gene is a large protein implicated in the response to DNA damage and regulation of the cell cycle. The other three genes are the human helicases [30] RECQL [24], Bloom's syndrome (BLM) [31,32,33,34] and Werner's syndrome (WRN) [35,36,37,38]. The products of these three genes are enzymes involved in various types of DNA repair, including direct repair, mismatch repair and nucleotide excision repair. The locations of the genes are as follows. The ATM gene is located in the human chromosomal region 11q22-q23 and spans 184 kb of genomic DNA. The WRN locus spans 186 kb at 8p12-p11.2 and its intron-exon structure includes 35 exons, with the coding sequence beginning in the second exon. The RECQL locus contains 15 exons, spans 180 kb and is located at 12p12-p11, whereas BLM is mapped to 15q26.1, spans 154 kb, and is composed of 22 exons. For this study, blood samples were taken and genotyped from residents of Houston, TX. The blood donors belonged to four major ethnic groups: Caucasians, Asians, Hispanics, and African-Americans, and therefore they are representative of the human population.
3 Methodology
To detect departures from the neutral model, the author used the following tests: Tajima's (1989) T (for uniformity, the author follows here the nomenclature of Fu [9] and Wall [11]), Fu and Li's (1993) F∗ and D∗, Wall's (1999) Q and B, Kelly's (1997) ZnS and Strobeck's S test. The definitions of these tests can be found in the original works of their inventors and, in a summarized form, in the article by Cyran et al. (2004) [13]. Tests such as McDonald and Kreitman's (1991) [39], Akashi's (1995) [40], Nielsen and Weinreich's (1999) [41], as well as Hudson, Kreitman and Aguade's (1987) HKA test [42] were excluded because of the form of genetic data they require. The haplotypes for particular loci were inferred and their frequencies were estimated by using the Expectation-Maximization algorithm [43,44,45]. The rough set methods are used to simplify the search for balancing selection. Such a selection (if present) should be reflected by statistically significant departures from the null of Tajima's and Fu's tests towards positive values. Since not all such departures are indeed caused by balancing selection
[12], but can be the result of factors such as population change over time, migration between subpopulations and recombination, a wide battery of tests was used. The problem of interpreting their combinations was solved with the use of rough set methods. The Classical Rough Set Approach (CRSA) and the Dominance-based Rough Set Approach (DRSA) were applied for this purpose. The decision table contained test results treated as conditional attributes and a decision about balancing selection treated as the decision attribute. The author was able to determine the value of this decision attribute for a given combination of conditional attributes, based on previous studies and heavy computer simulations. The goal of this work is to compare the two methodologies used for automatic interpretation of the battery of tests. Using either of them, the interpretation can be done without time-consuming simulations. In order to find the required set of tests which is informative about the problem, the notion of a relative reduct with respect to the decision attribute was applied. In the case of classical rough sets, in order to obtain decision rules as simple as possible, relative value reducts were also used for particular elements of the universe. To study the generalization properties and to estimate the decision error, the jack-knife cross-validation technique was used, which is known to generate an unbiased estimate of the decision error. The search for reducts and value reducts and the rule generation were performed by software written by the author, based on discernibility matrices and minimization of positively defined Boolean functions. The search for reducts and the rule generation process in DRSA were conducted with the use of the 4eMka System - a rule system for multicriteria decision support integrating the dominance relation with rough approximation [46,47]. This software is available at the web page of the Laboratory of Intelligent Decision Support Systems, Institute of Computing Science, Poznan University of Technology (Poznan 2000, http://www-idss.cs.put.poznan.pl/).
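The reduct search described above can be illustrated by a brute-force sketch over a discernibility matrix relative to the decision attribute: the minimal attribute subsets that intersect every non-empty matrix entry are the relative reducts. This is a didactic reconstruction (feasible for seven attributes), not the author's software, and the data-layout conventions below are assumptions.

from itertools import combinations

def relative_reducts(rows, attrs):
    # rows: list of (condition_dict, decision); attrs: list of attribute names
    entries = []
    for (c1, d1), (c2, d2) in combinations(rows, 2):
        if d1 != d2:                                       # only pairs with different decisions matter
            diff = frozenset(a for a in attrs if c1[a] != c2[a])
            if diff:
                entries.append(diff)
    covers = lambda subset: all(entry & subset for entry in entries)
    reducts = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            s = set(subset)
            if covers(s) and not any(set(red) <= s for red in reducts):
                reducts.append(subset)                     # minimal hitting sets = relative reducts
    return reducts

# Example call with test outcomes coded as 'NS'/'S'/'SS':
# relative_reducts(rows, ["D*", "B", "Q", "T", "S", "ZnS", "F*"])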
4 Results and Discussion
The results of the tests T, D∗, F∗, S, Q, B and ZnS are given in Table 1 together with the decision attribute denoting the evidence of balancing selection based on computer simulations. The values of the tests are: non-significant (NS) when p > 0.05, significant (S) if 0.01 < p < 0.05, and strongly significant (SS) when p < 0.01. The last column indicates the evidence or no evidence of balancing selection, based on the detailed analysis according to the multi-null methodology. The CRSA-based analysis of Decision Table 1 revealed that there exist two relative reducts: RED1 = {D∗, T, ZnS} and RED2 = {D∗, T, F∗}. It is clearly visible that the core is composed of tests D∗ and T, whereas tests ZnS and F∗ can in general be chosen arbitrarily. However, since both Fu's tests F∗ and D∗ belong to the same family and their outcomes can be strongly correlated, it seems advantageous to choose Kelly's ZnS instead of the F∗ test.
Table 1. Statistical tests results for the classical null hypothesis. Adopted from the article [21].

                    D∗   B    Q    T    S    ZnS   F∗   Balancing selection
ATM     AfAm        S    NS   NS   S    NS   NS    S    Yes
        Cauc        S    NS   NS   SS   SS   S     SS   Yes
        Asian       NS   NS   NS   S    NS   S     NS   Yes
        Hispanic    S    NS   NS   SS   NS   S     S    Yes
RECQL   AfAm        NS   NS   NS   SS   NS   NS    NS   Yes
        Cauc        S    NS   NS   SS   NS   NS    SS   Yes
        Asian       NS   S    S    S    NS   S     NS   Yes
        Hispanic    S    NS   NS   SS   NS   NS    S    Yes
WRN     AfAm        NS   NS   NS   NS   NS   NS    NS   No
        Cauc        S    NS   NS   NS   NS   NS    NS   No
        Asian       S    NS   NS   NS   NS   NS    NS   No
        Hispanic    NS   NS   NS   NS   NS   NS    NS   No
BLM     AfAm        NS   NS   NS   NS   NS   NS    NS   No
        Cauc        NS   NS   NS   S    NS   NS    S    No
        Asian       NS   NS   NS   NS   NS   NS    NS   No
        Hispanic    NS   NS   NS   NS   NS   NS    NS   No
Probably ZnS outcomes are (at least theoretically) less correlated with the outcomes of test D∗, which belongs to the core and is therefore required in each reduct. In the DRSA only one reduct was found and, interestingly, it was the one preferred by geneticists, i.e. RED1. Decision Table 1 with the set of conditional attributes reduced to the set RED1 is presented in Table 2.

Table 2. The Decision Table, in which the set of tests is reduced to relative reduct RED1. Adopted from [21].

                    D∗   T    ZnS   Balancing selection
ATM     AfAm        S    S    NS    Yes
        Cauc        S    SS   S     Yes
        Asian       NS   S    S     Yes
        Hispanic    S    SS   S     Yes
RECQL   AfAm        NS   SS   NS    Yes
        Cauc        S    SS   NS    Yes
        Asian       NS   S    S     Yes
        Hispanic    S    SS   NS    Yes
WRN     AfAm        NS   NS   NS    No
        Cauc        S    NS   NS    No
        Asian       S    NS   NS    No
        Hispanic    NS   NS   NS    No
BLM     AfAm        NS   NS   NS    No
        Cauc        NS   S    NS    No
        Asian       NS   NS   NS    No
        Hispanic    NS   NS   NS    No
After the reduction of the informative tests to the set RED1 = {D∗, T, ZnS}, the problem of coverage of the (discrete) space generated by these statistics was considered. Since the reduct was the same for classical and dominance-based rough sets, the coverage of the space by the examples included in the training set was identical in both cases. The results are presented in Table 3. The domain of each test outcome (coordinate) is composed of three values: SS (strong statistical significance, p < 0.01), S (statistical significance, 0.01 < p < 0.05), and NS (no significance, p > 0.05). A given point in the space is assigned to: Sel (evidence of balancing selection), NSel (no evidence of balancing selection), or an empty cell (point not covered by the training data). The assignment is done based on the raw training data with the conditional part reduced to the relative reduct RED1. Note that the fraction of points covered by training examples is only 30%. The next step, available only in the case of classical rough sets, was the computation of the relative value reducts for particular decision rules in Decision Table 2. The new Decision Table with relative value reducts used is presented in Table 4. This table is the basis for the Classical Rough Sets (CRS) Decision Algorithm 1.

CRS Algorithm 1, adopted from [21]

BALANCING_SELECTION If: T = SS or (T = S and D* = S) or ZnS = S
NO_SELECTION If: T = NS or (T = S and D* = NS and ZnS = NS)

Table 3. The discrete space of three tests: D∗, T and ZnS. Adopted from [21].

                 T = SS                     T = S                      T = NS
   D∗       ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS
   SS
   S                Sel    Sel                       Sel                       NSel
   NS                      Sel                Sel    NSel                      NSel
Certainly, this algorithm is simplified and more general compared to the algorithm that corresponds to Decision Table 2. The increase in generality can be observed by comparison of Table 5 with Table 3. In Table 5 the domain of each test outcome (coordinate) is also composed of three values: SS (strong statistical significance, p < 0.01), S (statistical significance, 0.01 < p < 0.05), and NS (no significance, p > 0.05). A given point in the space is assigned to Sel or NSel (with the meaning identical to that in Table 3), or "-", having the meaning of a contradiction between evidence and no evidence of balancing selection. The coverage of points is based on the number of points which are classified with the use of CRS Algorithm 1. Observe that the fraction of points covered by this algorithm is 74%; however, since 11% are classified as both with and without selection, only 63% of the points can really be treated as covered.
Table 4. The set of tests with relative value reducts used in classical rough sets. Adopted from [21].

                    D∗   T    ZnS   Balancing selection
ATM     AfAm        S    S          Yes
        Cauc             SS         Yes
        Asian                 S     Yes
        Hispanic         SS         Yes
RECQL   AfAm             SS         Yes
        Cauc             SS         Yes
        Asian                 S     Yes
        Hispanic         SS         Yes
WRN     AfAm             NS         No
        Cauc             NS         No
        Asian            NS         No
        Hispanic         NS         No
BLM     AfAm             NS         No
        Cauc        NS   S    NS    No
        Asian            NS         No
        Hispanic         NS         No
In the case of the classical approach, CRS Algorithm 1 is the final result of a purely automatic knowledge processing technique. It can be further improved manually by supplying it with additional information concerning the domain under study. But such a solution is not elegant. The elegant solution uses dominance-based rough sets, capable of automatic knowledge processing for domains with ordered results (or preferences). It is clearly true that if a balancing selection is determined by the statistical significance of a given test, then such selection is even more probable when the outcome of this test is strongly statistically significant. Application of dominance-based rough sets results in the Dominance Rough Set (DRS) Algorithm, presented below.

DRS Algorithm

at least BALANCING_SELECTION If: T >= SS or (T >= S and D* >= S) or ZnS >= S
at most NO_SELECTION If: T <= NS or (T <= S and D* <= NS and ZnS <= NS)
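The two rule sets can be transcribed directly into executable form, with the ordering NS < S < SS made explicit for the dominance-based variant; the sketch below is only an illustration of the published rules, not software from the study, and its function names are hypothetical.

LEVEL = {"NS": 0, "S": 1, "SS": 2}

def crs_algorithm_1(T, D_star, ZnS):
    # exact-value rules of CRS Algorithm 1; returns 'Sel', 'NSel', '-' or None (not covered)
    sel = T == "SS" or (T == "S" and D_star == "S") or ZnS == "S"
    nosel = T == "NS" or (T == "S" and D_star == "NS" and ZnS == "NS")
    if sel and nosel:
        return "-"
    return "Sel" if sel else "NSel" if nosel else None

def drs_algorithm(T, D_star, ZnS):
    # 'at least' / 'at most' rules of the DRS Algorithm
    t, d, z = LEVEL[T], LEVEL[D_star], LEVEL[ZnS]
    sel = t >= LEVEL["SS"] or (t >= LEVEL["S"] and d >= LEVEL["S"]) or z >= LEVEL["S"]
    nosel = t <= LEVEL["NS"] or (t <= LEVEL["S"] and d <= LEVEL["NS"] and z <= LEVEL["NS"])
    if sel and nosel:
        return "-"
    return "Sel" if sel else "NSel" if nosel else None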
Table 5. The discrete space of three tests: D∗, T and ZnS basing on the simplified Decision Algorithm 1 for classical rough sets. Adopted from [21].

                 T = SS                     T = S                      T = NS
   D∗       ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS
   SS       Sel     Sel    Sel                                          -      NSel
   S        Sel     Sel    Sel        Sel     Sel    Sel                -      NSel
   NS       Sel     Sel    Sel                Sel    NSel               -      NSel
Table 6. The discrete space of three tests: D∗, T and ZnS basing on the DRS Algorithm

                 T = SS                     T = S                      T = NS
   D∗       ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS
   SS       Sel     Sel    Sel        Sel     Sel    Sel        -       -      NSel
   S        Sel     Sel    Sel        Sel     Sel    Sel        -       -      NSel
   NS       Sel     Sel    Sel        Sel     Sel    NSel       -       -      NSel
The coverage of the discrete space of test results corresponding to the DRS Algorithm is presented in Table 6. As in the previous tables of this type, the domain of each test outcome (coordinate) is composed of three values: SS (strong statistical significance, p < 0.01), S (statistical significance, 0.01 < p < 0.05), and NS (no significance, p > 0.05). A given point in the space is assigned to Sel, NSel or "-" (with the meaning identical to that of Table 5). This table shows that all points are covered by the DRS Algorithm, yet since 22% are designated as contradictions, 78% of the points in the space are really recognizable. Manually incorporating the domain knowledge into classical rough set based inference results in the following change. Instead of the equalities in CRS Algorithm 1, one should use inequalities in the generalized version, referred to as CRS Algorithm 2. Such an inequality means that the given test has at least the value of statistical significance shown to the right of the inequality sign, but it can obviously also be more significant. In other words, the main difference of CRS Algorithm 2, as compared to CRS Algorithm 1, is that instead of formulas of the type testoutcome = S it uses formulas of the type testoutcome >= S in rules for Selection, meaning that the test outcome is at least significant (and perhaps strongly significant), and testoutcome <= S in rules for NoSelection. But this is exactly what the dominance-based rough set approach offers as a standard. CRS Algorithm 2 also deals with the problem of contradictions in inferred decisions. If the inference leads to a contradiction, the algorithm avoids it by generating no decision about balancing selection in the gene under study. The problem of covering the points of the discrete space generated by the three tests by CRS Algorithm 2 is presented in Table 7.

Algorithm 2, adopted from [21]

BALANCING_SELECTION := False; NO_DECISION := False;
If T >= SS or (T >= S and D* >= S) or ZnS >= S then BALANCING_SELECTION := True;
If T <= NS or (T <= S and D* <= NS and ZnS <= NS) then
   If BALANCING_SELECTION then NO_DECISION := True
   else BALANCING_SELECTION := False;
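The displayed pseudocode translates almost line by line into a runnable form, for example (an illustrative transcription only):

def crs_algorithm_2(T, D_star, ZnS):
    # dominance-style conditions with an explicit no-decision outcome when
    # the two rule groups contradict each other
    level = {"NS": 0, "S": 1, "SS": 2}
    t, d, z = level[T], level[D_star], level[ZnS]
    balancing_selection = False
    no_decision = False
    if t >= level["SS"] or (t >= level["S"] and d >= level["S"]) or z >= level["S"]:
        balancing_selection = True
    if t <= level["NS"] or (t <= level["S"] and d <= level["NS"] and z <= level["NS"]):
        if balancing_selection:
            no_decision = True
        else:
            balancing_selection = False
    return None if no_decision else balancing_selection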
The meaning of the symbols in Table 7 is identical to that in Tables 3, 5 and 6. There is, however, one difference in notation. If a symbol is in parentheses, it means that the point is not assigned to the given value automatically; rather, domain-specific reasoning is used. In our case it states that selection is even more probable for a given test showing strong significance (SS) when the automatic classical rough set based knowledge acquisition indicated such selection for this test being just significant (S), with the values of the other tests unchanged. Observe that the table is essentially (i.e., except for the existence of parentheses) identical to Table 6. However, CRS Algorithm 2, being the basis for Table 7, had to be obtained manually (as indicated by the points with parentheses). This gives an impression of what must be changed manually in order to obtain the solutions generated automatically by dominance-based rough sets. In the case of larger decision tables this could be extremely difficult. Therefore it is fairer to compare the results of automatic knowledge processing by classical and dominance-based rough sets. In other words, a fair comparison of CRSA and DRSA should be performed using CRS Algorithm 1 and the DRS Algorithm (or, equivalently, by comparing Table 5 with Table 6). Such a comparison clearly reveals the better generalization achieved with the use of DRSA.

Table 7. The discrete space of three tests: D∗, T and ZnS basing on the CRS Algorithm 2. Adopted from [21].
                 T = SS                     T = S                      T = NS
   D∗       ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS     ZnS=SS  ZnS=S  ZnS=NS
   SS       Sel     Sel    Sel        (Sel)   (Sel)  (Sel)      (-)     -      NSel
   S        Sel     Sel    Sel        Sel     Sel    Sel        (-)     -      NSel
   NS       Sel     Sel    Sel        (Sel)   Sel    NSel       (-)     -      NSel
As was stated, the comparison of Table 5 with Table 6 shows the degree of generalization obtained automatically with the use of classical and dominance-based rough sets, respectively. However, as will be seen, such an increase in generalization for dominance-based rough sets is obtained at the price of slightly worse classification results on test examples as compared to classical rough sets with manual knowledge acquisition. One can argue once more that it is not fair to compare completely automatic knowledge processing done by dominance-based rough set methods with human-tuned methods based on classical rough sets. This is definitely true, yet such a comparison shows that the sole application of the dominance relation, as proposed in the dominance-based rough set approach, can in some cases lead to suboptimal solutions, because they are worse (in some aspects at least) than solutions based on the classical indiscernibility relation coupled with domain-specific tuning of the resulting rules. The analysis of this suboptimality led the author to the formulation of a weaker version of the dominance-based rough sets, named the quasi-dominance rough set approach (QDRSA), taking advantage of both classical and dominance-based rough sets. The presentation
of this new approach is beyond the scope of the paper, but it should be noted that quasi-dominance rough sets can outperform dominance-based rough sets only if the attributes of a decision table are discrete. The elegant way of dealing with continuous attributes in DRSA is not transferable to QDRSA. To study the quality of generalization, jack-knife cross-validation was used, which is known to be a method of unbiased estimation of the decision error of any classifier. The classical jack-knife strategy uses all-but-one examples for training, and testing is done on the excluded example. After iterating this procedure N times (where N is the number of training facts), the average number of decision errors in the separate runs constitutes an unbiased estimate of the decision error. Yet, in the case considered in the paper, such a strategy could give too optimistic results. This is because the training facts describe one gene in four different populations. Therefore, these examples are too dependent, and even after excluding one of them some knowledge about it is passed to the classifier by the examples concerning the same gene. That is why, to be rigorous about the conclusions, all four examples concerning one particular gene were excluded in each iteration. Concerning the presentation of the cross-validation results in [21], the author must admit that the results presented there for CRSA were too optimistic, because 18.8% of the testing facts, which actually were neither classified as a selection nor as a lack of it, were reported as classified correctly. Having a chance to correct that error, the author would only like to point out that the relatively large decrease in the number of training examples (25%), which was the result of the assumed strategy, could easily produce too pessimistic estimates of the decision error. With that remark, let us report the correct cross-validation results. For CRSA the decision error estimate was 12.5%. The results for DRSA are worse in the sense that all examples not classified in the CRSA framework were classified erroneously in DRSA. The relatively large decision errors indicate that, despite some potential of rule-based artificial intelligence methods like CRSA or DRSA, after the initial screening stage the interpretation of the neutrality test results should be followed by the thorough multi-null methodology for the loci proposed as candidates. The development of the multi-null hypotheses methodology is still in progress, but since it does not involve rough sets, reporting the stages of this development is irrelevant to the goals of the current paper.
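The leave-one-gene-out protocol described above can be sketched as the following loop; the classifier-building step is left abstract, and the structure is an illustration of the evaluation scheme rather than the author's implementation.

def leave_one_gene_out(examples, build_classifier):
    # examples: list of dicts with keys 'gene', 'tests' (dict of test outcomes)
    # and 'selection' (True/False); build_classifier: callable taking a training
    # list and returning a predict(tests) function. Returns the error estimate.
    genes = sorted({e["gene"] for e in examples})
    errors = 0
    for gene in genes:
        train = [e for e in examples if e["gene"] != gene]   # drop all four populations of the gene
        test = [e for e in examples if e["gene"] == gene]
        predict = build_classifier(train)
        errors += sum(predict(e["tests"]) != e["selection"] for e in test)
    return errors / len(examples)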
5 Conclusion
The article presents a comparison of the CRSA and DRSA methodologies in a case study. A battery of neutrality tests is used as the conditions in a decision table analyzed with the use of the indiscernibility (CRSA) and dominance (DRSA) relations. The advantage of DRSA over CRSA lies in its larger generalization and natural automatic processing of the dependencies present in sets of ordered attributes. However, the quality of classification in tests is worse for DRSA as compared to CRSA with domain-specific, human-aided knowledge processing. An analysis of the reasons for such behavior of DRSA led the author to the formulation
of a weaker form of it, called the quasi-dominance rough set approach, which is to be published. QDRSA cannot be applied to continuous attributes as easily as DRSA. However, in the case of problems with naturally discrete values of attributes (like the one studied here), QDRSA can achieve a level of automatic generalization characteristic of DRSA with a quality of recognition not worse than that of CRSA assisted by a domain expert. Acknowledgments. The scientific work was financed by the Ministry of Science and Higher Education in Poland from funds for supporting science in 2008-2010, as research project number N N519 319035. The author would also like to thank Prof. M. Kimmel from Rice University in Houston for long discussions and advice concerning the research on the detection of natural selection at the molecular level using statistical neutrality tests.
References 1. Kimura, M.: The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge (1983) 2. Zhang, J.: Evolution of the Human ASPM Gene, a Major Determinant of Brain Size. Genetics 165, 2063–2070 (2003) 3. Evans, P.D., Anderson, J.R., Vallender, E.J., Gilbert, S.L., Malcom, Ch.M., et al.: Adaptive Evolution of ASPM, a Major Determinant of Cerebral Cortical Size in Humans. Human Molecular Genetics 13, 489–494 (2004) 4. Bamshad, M.J., Mummidi, S., Gonzalez, E., Ahuja, S.S., Dunn, D.M., et al.: A strong signature of balancing selection in the 5’ cis-regulatory region of CCR5. Proc. Nat. Acad. Sci. USA 99(16), 10539–10544 (2002) 5. Gilad, Y., Rosenberg, S., Przeworski, M., Lancet, D., Skorecki, K.: Evidence for positive selection and population structure at the human MAO-A gene. Proc. Natl. Acad. Sci. USA 99, 862–867 (2002) 6. Toomajian, C., Kreitman, M.: Sequence Variation and Haplotype Structure at the Human HFE Locus. Genetics 161, 1609–1623 (2002) 7. Wooding, S.P., Watkins, W.S., Bamshad, M.J., Dunn, D.M., Weiss, R.B., Jorde, L.B.: DNA sequence variation in a 3.7-kb noncoding sequence 5’ of the CYP1A2 Gene: Implications for Human Population History and Natural Selection. Am. J. Hum. Genet. 71, 528–542 (2002) 8. Fu, Y.X., Li, W.H.: Statistical Tests of Neutrality of Mutations. Genetics 133, 693–709 (1993) 9. Fu, Y.X.: Statistical Tests of Neutrality of Mutations Against Population Growth, Hitchhiking and Background Selection. Genetics 147, 915–925 (1997) 10. Kelly, J.K.: A Test of Neutrality Based on Interlocus Associations. Genetics 146, 1197–1206 (1997) 11. Wall, J.D.: Recombination and the Power of Statistical Tests of Neutrality. Genet. Res. 74, 65–79 (1999) 12. Nielsen, R.: Statistical Tests of Selective Neutrality in the Age of Genomics. Heredity 86, 641–647 (2001) 13. Cyran, K.A., Pola˜ nska, J., Kimmel, M.: Testing for Signatures of Natural Selection at Molecular Genes Level. J. Med. Inf. Techn. 8, 31–39 (2004)
14. Dhillon, K.K., Sidorova, J., Saintigny, Y., Poot, M., Gollahon, K., Rabinovitch, P.S., Mon-nat Jr., R.J.: Functional Role of the Werner Syndrome RecQ Helicase in Human Fibroblasts. Aging Cell 6, 53–61 (2007) 15. Karmakar, P., Seki, M., Kanamori, M., Hashiguchi, K., Ohtsuki, M., Murata, E., Inoue, E., Tada, S., Lan, L., Yasui, A., Enomoto, T.: BLM is an Early Responder to DNA Double-strand Breaks. Biochem. Biophys. Res. Commun. 348, 62–69 (2006) 16. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11(5), 341–356 (1982) 17. Pawlak, Z.: Rough sets: theoretical aspects of reasoning about data. Kluwer Academic, Dordrecht (1991) 18. Greco, S., Matarazzo, B., Sowi˜ nski, R.: Rough sets theory for multicriteria decision analysis. European Journal of Operational Research 129(1), 1–47 (2001) 19. Greco, S., Matarazzo, B., Sowi˜ nski, R.: Multicriteria classification by dominancebased rough set approach. In: Kloesgen, W., Zytkow, J. (eds.) Handbook of Data Mining and Knowledge Discovery. Oxford University Press, New York (2002) 20. Sowi˜ nski, R., Greco, S., Matarazzo, B.: Rough set based decision support. In: Burke, E.K., Kendall, G. (eds.) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques, ch. 16, pp. 475–527. Springer, New York (2005) 21. Cyran, K.A.: Rough Sets in the Interpretation of Statistical Tests Outcomes for Genes under Hypothetical Balancing Selection. In: Kryszkiewicz, M., Peters, J.F., Rybi´ nski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 716– 725. Springer, Heidelberg (2007) 22. Bonnen, P.E., Story, M.D., Ashorn, C.L., Buchholz, T.A., Weil, M.M., Nelson, D.L.: Haplotypes at ATM identify coding-sequence variation and indicate a region of extensive linkage disequilibrium. Am. J. Hum. Genet. 67, 1437–1451 (2000) 23. Bonnen, P.E., Wang, P.J., Kimmel, M., Chakraborty, R., Nelson, D.L.: Haplotype and linkage disequilibrium architecture for human cancer-associated genes. Genome Res. 12, 1846–1853 (2002) 24. Trikka, D., Fang, Z., Renwick, A., Jones, S.H., Chakraborty, R., et al.: Complex SNP-based haplotypes in three human helicases: implications for cancer association studies. Genome Res. 12, 627–639 (2002) 25. Uziel, T., Savitsky, K., Platzer, M., Ziv, Y., Helbitz, T., et al.: Genomic organization of the ATM gene. Genomics 33, 317–320 (1996) 26. Teraoka, S.N., Telatar, M., Becker-Catania, S., Liang, T., Onengut, S., et al.: Splicing defects in the ataxia-telangiectasia gene, ATM: underlying mutations and consequences. Am. J. Hum. Genet. 64, 1617–1631 (1999) 27. Li, A., Swift, M.: Mutations at the ataxia-telangiectasia locus and clinical phenotypes of A-T patients. Am. J. Med. Genet. 92, 170–177 (2000) 28. Golding, S.E., Rosenberg, E., Neill, S., Dent, P., Povirk, L.F., Valerie, K.: Extracellular Signal-Related Kinase Positively Regulates Ataxia Telangiectasia Mutated, Homologous Recombination Repair, and the DNA Damage Response. Cancer Res. 67, 1046–1053 (2007) 29. Schneider, J., Philipp, M., Yamini, P., Dork, T., Woitowitz, H.J.: ATM Gene Mutations in Former Uranium Miners of SDAG Wismut: a Pilot Study. Oncol. Rep. 17, 477–482 (2007) 30. Siitonen, H.A., Kopra, O., Haravuori, H., Winter, R.M., Saamanen, A.M., et al.: Molecular defect of RAPADILINO syndrom expands the phenotype spectrum of RECQL diseases. Hum. Mol. Genet. 12(21), 2837–2844 (2003)
31. Yusa, K., Horie, K., Kondoh, G., Kouno, M., Maeda, Y., et al.: Genome-wide phenotype analysis in ES cells by regulated disruption of Bloom’s syndrome gene. Nature 429, 896–899 (2004) 32. Karow, J.K., Constantinou, A., Li, J.-L., West, S.C., Hickson, I.D.: The Bloom’s syndrome gene product promotes branch migration of Holliday junctions. Proc. Nat. Acad. Sci. USA 97, 6504–6508 (2000) 33. Wu, L., Hickson, I.D.: The Bloom’s syndrome helicase suppresses crossing over during homologous recombination. Nature 426, 870–874 (2003) 34. Adams, M.D., McVey, M., Sekelsky, J.J.: Drosophila BLM in double-strand break repair by synthesis-dependent strand annealing. Science 299, 265–267 (2003) 35. Yu, C.-E., Oshima, J., Wijsman, E.M., Nakura, J., Miki, T., Piussan, C., et al.: Werner’s Syndrome Collaborative Group: Mutations in the consensus helicase domains of the Werner syndrome gene. Am. J. Hum. Genet. 60, 330–341 (1997) 36. Sinclair, D.A., Mills, K., Guarente, L.: Accelerated aging and nucleolar fragmentation in yeast sgs1 mutants. Science 277, 1313–1316 (1997) 37. Huang, S., Li, B., Gray, M.D., Oshima, J., Mian, I.S., Campisi, J.: The premature ageing syndrome protein, WRN, is a 3-prime-5-prime exonuclease. Nature Genet. 20, 114–115 (1998) 38. Ellis, N.A., Roe, A.M., Kozloski, J., Proytcheva, M., Falk, C., German, J.: Linkage disequilibrium between the FES, D15S127, and BLM loci in Ashkenazi Jews with Bloom syndrome. Am. J. Hum. Genet. 55, 453–460 (1994) 39. McDonald, J.H., Kreitman, M.: Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654 (1991) 40. Akashi, H.: Inferring weak selection from pattern of polymorphism and divergence at ’silent’ sites in Drosophila DNA. Genetics 139, 1067–1076 (1995) 41. Nielsen, R., Weinreich, D.M.: The Age of Nonsynonymous and Synonymous Mutations and Implications for the Slightly Deleterious Theory. Genetics 153, 497–506 (1999) 42. Hudson, R.R., Kreitman, M., Aguade, M.: A test of neutral molecular evolution based on nucleotide data. Genetics 116, 153–159 (1987) 43. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. With discussion. J. Roy. Stat. Soc. Ser. B 39, 1–38 (1977) 44. Excoffier, L., Slatkin, M.: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927 (1995) 45. Polanska, J.: The EM Algorithm and its Implementation for the Estimation of the Frequencies of SNP-Haplotypes. Int. J. Appl. Math. Comp. Sci. 13, 419–429 (2003) 46. Greco, S., Matarazzo, B., Slowinski, R.: The use of rough sets and fuzzy sets in MCDM. In: Gal, T., Hanne, T., Stewart, T. (eds.) Advances in Multiple Criteria Decision Making, vol. 14. Kluwer Academic Publishers, Dordrecht (1999) 47. Greco, S., Matarazzo, B., Slowinski, R.: Handling missing values in rough set analysis of multi-attribute and multi-criteria decision problems. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 146–157. Springer, Heidelberg (1999)
Satisfiability Judgement under Incomplete Information
Anna Gomolińska
Bialystok University, Department of Mathematics, Akademicka 2, 15267 Bialystok, Poland
[email protected]
Abstract. In this paper we continue the discussion of satisfiability of conditions by objects when information about the situation considered, including objects of some sort and concepts comprised of them, is incomplete. Our approach to satisfiability is that of concept modelling, and we take a rough granular view of the problem. The objects considered are known only partially, in terms of the values of attributes of Pawlak information systems. Additional knowledge (domain knowledge) is assumed to be available. We choose descriptor languages for Pawlak information systems as the specification languages in which we express conditions about objects and concepts. Keywords: satisfiability of formulas, judgement making, concept modelling, knowledge discovery, Pawlak information system, descriptor language, approximation space.
To Ewa
1 Introduction
Making judgement of satisfiability is a common activity of all truly intelligent systems. Before performing an action or applying a rule any such system has to judge whether or not appropriate pre-conditions are satisfied. That is, according to [1], it has to make a sensible decision, after a careful consideration, whether or not some objects fulfil certain conditions or, in other words, have desired properties. Construction of an object which possesses given properties (e.g., writing of a programme satisfying a specification) involves satisfiability judgement as well. Therefore, the main motivation for studying satisfiability judgement under imperfect (uncertain [2], vague [3], incomplete [4], imprecise, and noisy) information comes from the area of intelligent systems, including multi-agent systems and agent technology. The future intelligent systems should themselves, among other things, verify their performance and adapt in a sensible way. The design and analysis of self-managed adaptive systems belong to major goals of autonomy-oriented computing [5,6,7]. Also, as pointed out in [8], the ability of making good, adaptive judgements by intelligent systems is crucial for the development of wistech (wisdom technology), a successor of knowledge technology.
The problem of making good judgements about satisfiability under incomplete or, more generally, imperfect information is also important for the foundational research both in machine learning [9,10,11,12] and in knowledge discovery [13,14, 15]. First, notice that the problem of matching patterns by objects may be seen as a counterpart of the question of satisfiability of formulas by objects. In the former case, instead of conditions expressed by formulas, we deal with patterns.1 More importantly, satisfiability links the world of objects with the world of linguistic expressions. By classifying objects to concepts labelled by formulas, satisfiability works as a concept approximator. It also aggregates simpler concepts into more complex ones by means of production rules which, e.g., enable to compute the satisfiability degree of a compound formula (finite set of formulas) φ on the basis of satisfiability degrees of components of φ. According to [1], reasoning is the process of thinking about things in a logical way. Thus, to reason may be understood as to form a judgement about a situation by considering the facts and using one’s power to think in a logical way. However, judgement making and reasoning are incomparable in general in the sense that none of them subsumes the other. On one hand, judgement making can take the form of reasoning, on the other – simple forms of judgement making can be steps in reasoning. In the paper we analyse satisfiability of formulas when information and knowledge about the situation considered, including objects and concepts (i.e., sets of objects), is available to an intelligent system S partially only. Why do we prefer a less known concept of judgement making to the more familiar notion of reasoning when dealing with satisfiability? Judgement making (see, e.g., [16,17,18,19] for judgement in general) allows any method of arriving at an opinion (decision), whereas reasoning is, in most cases, limited to logical ways of thinking in order to form opinions. This is one of the main reasons why judgement making is viewed as particularly suitable for complex, real-life situations when information/knowledge is imperfect. A less nice feature of judgement making is that judgements can be uncertain, so there is a risk of making wrong decisions. Satisfiability is a fundamental notion considered in logic. It is also familiar to and used by the computer science community and, especially, by the computational complexity researchers. In the field of logic, formulas are evaluated over abstract objects (states, worlds) according to a pre-defined concept of satisfiability provided by an expert (logician). Such practical aspects as suitability of the definition to the real-life situations or choice of the method of application of the definition are of minor interest. Satisfiability is often treated as an auxiliary concept making possible to define such central notions of logic as the logical truth and consequence [20]. From the semantical standpoint, logically true formulas together with a logical consequence relation constitute a logic. Obviously, there are many logics as there are lots of possible definitions of satisfiability of formulas (see, e.g., [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]). In computational complexity theory, the problem of satisfiability judgement given a formal language and a class of relational structures (models), called the SAT 1
¹ Patterns can, in particular, be described by formulas, e.g. by conjunctions of descriptors.
problem, is a well-known research problem where the task is to find the most efficient possible method of satisfiability checking [40, 41]. In computer science, methods based on satisfiability, e.g. some model checking methods, are used to verify such properties of programmes as satisfaction of a specification and reachability [42, 43, 44], to name a few. Despite the fact that satisfiability of formulas has been the subject of long-term research in logic, judgement of satisfiability of formulas (sets of formulas) is still a challenge when the information available is imperfect. The problem becomes especially hard when formulas refer to complex, real-life concepts such as ‘safe situation on a road’.

Our approach is partly inspired by solutions provided by logic and partly by those worked out in rough set theory, concept modelling, and knowledge discovery. Unlike in purely logical approaches, the languages used to specify conditions and to represent knowledge, a relational structure or a class of structures in which to evaluate formulas, and a suitable notion of satisfiability are to be discovered by an intelligent system S, possibly with the help of an expert. In our framework, information is provided to S in several forms. First, S is given descriptions of objects under consideration which form a Pawlak information system [45, 46, 47, 48]. Apart from that, S is supplied by an expert with a non-empty finite collection of examples of how to judge satisfiability. Additionally, the expert may propose a (usually parameterized) definition of satisfiability. Let us emphasize that in order to apply the definition in real-life cases, S may need to discover the most suitable values of such parameters as the satisfiability degree, the similarity relation, or the inclusion measure. We also assume that some additional, firm information about objects, concepts, and relationships among them is available to S. Such information, called the domain knowledge, is often represented in the form of concept ontologies (see, e.g., [49, 50, 51, 52, 53, 54]).

Descriptor languages for Pawlak information systems are chosen as the specification and description languages in which various conditions about objects and concepts are expressed. Here, ‘condition’ is understood as an umbrella term for properties, premises of rules, and various conditions such as, e.g., pre-conditions for executing actions. The domain knowledge may but need not be represented within a descriptor language. It can refer to complex vague concepts expressed in a natural language, for example. The system S is supposed to discover, by inductive learning and with the help of the expert, how to make good judgements about satisfiability of conditions by objects.

Each information system gives rise to a family of approximation spaces (see, e.g., [55, 56, 47, 57, 58, 59, 60, 61, 62, 63, 64]). Typically, not all concepts can be expressed precisely by means of formulas of the language used. Approximation spaces provide us with tools to approximate such concepts.2 Since objects are known to S by their attribute (or feature) values, they can be indiscernible. It can also happen that objects are distinguishable but so similar to one another that the differences may be neglected. As a consequence, the set of all objects
2 In fact, all concepts can be approximated but only those which are inexact need such an approximation.
considered (the universe) is perceived as being granulated into a family of clusters called information granules [65, 66]. Yet another reason for granulation is functionality, i.e., objects can be drawn together to form a new, more complex object (information granule), just as single instructions are composed to form procedures and, in turn, computer programmes. From such a perspective, it is clear that granulation of the universe should somehow be taken into account when making satisfiability judgements.

The paper, being a substantially extended and revised version of [67], is organized as follows. Sections 2 and 3 recall elements of Pawlak information systems, descriptor languages, and approximation spaces. In Sect. 4 we describe three purely theoretical models of satisfiability of (sets of) formulas. The need for a more realistic approach to satisfiability judgement under incomplete information is emphasized in Sect. 5. Drawing on the results from [68, 69, 70, 71], we propose such an approach in Sect. 6. Section 7 contains conclusions.
2 Pawlak Information Systems: Descriptor Languages
Conditions and properties are usually expressed as formulas of some language. We will consider descriptor languages for Pawlak information systems (infosystems, in short) [45, 46, 47, 48], which should be sufficient for our purposes. By an infosystem we understand any pair of the form IS = (U, A) where U is a set with at least two elements, called objects, and A is a non-empty set of (possibly partial) mappings on U called attributes.3 Objects are denoted by u and attributes by a, with sub/superscripts whenever needed. With every attribute a there is associated a set Va of values of a. Assume that each attribute of A takes at least two different values on U. To avoid dealing with partial mappings directly, one may introduce a symbol denoting the lack of information (or knowledge) about the value of an attribute at an object, say ∗, so a(u) = ∗ is understood as the lack of information about the value of a at u. As a consequence, attributes may be treated as mappings a : U → Va ∪ {∗}. Let V = ⋃{Va | a ∈ A}. Elements of V will be denoted by v, with sub/superscripts whenever needed.

The primitive symbols of the descriptor language for IS, L, are symbols denoting attributes, symbols denoting attribute values, and the propositional connectives ∧, ∨, and ¬, understood as conjunction, disjunction, and negation, respectively. The auxiliary symbols are the parentheses (, ) and the comma. Pairs of the form (a, v), where v ∈ Va, are called descriptors. Descriptors are the atomic formulas of L and their set is denoted by AT(L). Formulas will, in general, be denoted by α, β, with sub/superscripts if needed. The set of all formulas of L, FOR(L), is the least set of expressions of this language containing AT(L) and such that for any α, β ∈ FOR(L), we have (α ∧ β), (α ∨ β), (¬α) ∈ FOR(L). If no confusion results, then the parentheses will be omitted for the sake of simplicity.
3 Thus, missing values are allowed [72, 73, 74, 75, 76]. Moreover, the usual assumption of finiteness of U and A is dropped for a technical reason. However, we only deal with finite universes and sets of attributes in practice. Such infosystems will be referred to as finite infosystems.
In order to classify objects of a universe, some attributes are distinguished and called decision attributes. Their values are interpreted as decisions on some issues. The remaining attributes are then referred to as conditional attributes. In real-life applications, values of decision attributes are provided by domain experts, in contrast to values of conditional attributes, which are typically the results of measurements and observations. Infosystems with decision attributes, introduced by Pawlak and usually represented by decision tables [47, 48], will be referred to as decision infosystems here. In keeping with the usual notation, decision attributes will be denoted by d, with sub/superscripts when needed. Also, slightly abusing our earlier notation, we will define a decision infosystem as a triple of the form (U, A, D) where U, A are as previously and D is a non-empty set of decision attributes on U, disjoint from A. If D = {d}, then the brackets {, } will be dropped along the usual lines.
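As a small, purely illustrative sketch (the objects, attributes, and values below are hypothetical and not taken from the paper), a finite decision infosystem with a missing-value symbol and crisp checking of a single descriptor can be written in Python as follows:

    # A minimal sketch: a finite decision infosystem (U, A, d) stored as a table,
    # with '*' standing for a missing value, and crisp checking of a descriptor (a, v).
    MISSING = '*'

    # Hypothetical U = {u1, u2, u3}; conditional attributes A = {colour, size}; decision d.
    infosystem = {
        'u1': {'colour': 'red',  'size': 'small', 'd': 'yes'},
        'u2': {'colour': 'blue', 'size': MISSING, 'd': 'no'},
        'u3': {'colour': 'red',  'size': 'large', 'd': 'yes'},
    }

    def satisfies_descriptor(u, attribute, value):
        """Crisp satisfiability of the atomic formula (attribute, value) by object u."""
        return infosystem[u].get(attribute, MISSING) == value

    # Example: which objects satisfy the descriptor (colour, red)?
    print([u for u in infosystem if satisfies_descriptor(u, 'colour', 'red')])   # ['u1', 'u3']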
3 Approximation Spaces
Infosystems give rise to approximation spaces where sets of objects, viewed as concepts, can be approximated in many ways. In this section we recall elements of approximation spaces in a nutshell.
3.1 Granulation of the Universe
Objects of an infosystem IS = (U, A) are mainly known by their descriptions in terms of attribute values. Even if additional knowledge (domain knowledge) about elements of U is given, e.g. in the form of a concept ontology [49, 50, 51, 52, 53, 54], the available information/knowledge is usually insufficient to discern each and every object of U. As a matter of fact, the descriptions can match many objects of U. This is merely an apparent drawback since grouping objects into information granules will become necessary and useful from the standpoint of efficiency if the number of objects is huge. Let us also note that object descriptions can deliberately be shortened in accordance with the minimum description length principle [77]. As a result, some objects become indistinguishable. It also happens frequently that objects can be differentiated but their descriptions are very similar in some respect. Yet another reason for granulating objects is functionality: single objects, e.g. agents, actions, or programme instructions, are composed into more complex ones, structured in a suitable way as multi-agent systems, strategies, and computer programmes, respectively. Following Zadeh [65, 66], classes of objects grouped together with respect to indistinguishability, similarity or functionality are called information granules (infogranules, in short). For simplicity, we view indistinguishability as a limit case of similarity. Summarizing, the universe U is perceived as granulated into a family of infogranules, and this view will influence our approach to satisfiability judgement.

As a mathematical model of similarity of objects of U we take any reflexive relation on U, called a similarity relation henceforth. Thus, to obtain a
similarity-based granulation of U, we start with a similarity relation ρ ⊆ U × U.4 The term ‘(u, u′) ∈ ρ’ is understood as ‘u is ρ-similar to u′’. For any set X, ℘X denotes its power set. Every relation ρ induces mappings ρ→, ρ← : ℘U → ℘U which assign to subsets of U their images and counter-images given by ρ, respectively. Sets ρ→{u} and ρ←{u} are viewed as the elementary infogranules drawn to u by means of ρ. The former infogranule consists of all objects to which u is ρ-similar, whereas the latter one contains all objects ρ-similar to u.5 Henceforth we will only deal with elementary infogranules of the second kind. Elementary infogranules can serve as building blocks to obtain compound infogranules, e.g. by means of the set-theoretical sum and generalized sum operations.6
3.2 Rough Inclusion
Rough inclusion is a tool by means of which one can measure the degree of inclusion of one set in another.7 The formal concept of a rough inclusion was worked out by Polkowski and Skowron, who extended Leśniewski’s mereology [88, 89] to a theory of graded parthood called rough mereology [90, 91, 92, 86]. The usual set-theoretical inclusion may be viewed as a special case of rough inclusion. Apart from the literature on rough mereology, there are several papers where the problem of graded inclusion and, in particular, rough inclusion is addressed (see, e.g., [93, 94, 95, 96, 97, 87, 98]).

The most popular function realizing the idea of rough inclusion is the standard rough inclusion function (RIF), going back to Łukasiewicz [99] and defined for finite first arguments. Assume for a while that U is finite. The cardinality of a set X will be denoted by #X. The standard RIF over U is a mapping κ£ : ℘U × ℘U → [0, 1] such that for any X, Y ⊆ U,

    κ£(X, Y) =def #(X ∩ Y)/#X if X ≠ ∅, and κ£(X, Y) =def 1 otherwise.    (1)

In general, by a RIF over U we mean any mapping from ℘U × ℘U into [0, 1] fulfilling rif1(κ) and rif2(κ) given below:

    rif1(κ) ⇔def ∀X, Y ⊆ U. (κ(X, Y) = 1 ⇔ X ⊆ Y),
    rif2(κ) ⇔def ∀X, Y, Z ⊆ U. (Y ⊆ Z ⇒ κ(X, Y) ≤ κ(X, Z)).
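As a small illustration only (a sketch for finite sets, not code from the paper), the standard RIF (1) and the postulates rif1 and rif2 can be exercised in Python as follows:

    # A minimal sketch of the standard rough inclusion function (1) for finite sets.
    def standard_rif(X, Y):
        """Degree of inclusion of X in Y: #(X ∩ Y)/#X, and 1 when X is empty."""
        X, Y = set(X), set(Y)
        if not X:
            return 1.0
        return len(X & Y) / len(X)

    # rif1: the degree is 1 exactly when X is a subset of Y.
    assert standard_rif({1, 2}, {1, 2, 3}) == 1.0
    # rif2 (monotonicity in the second argument): enlarging Y cannot decrease the degree.
    assert standard_rif({1, 2, 4}, {1}) <= standard_rif({1, 2, 4}, {1, 2})
    print(standard_rif({1, 2, 4}, {1, 2}))   # 2/3 ≈ 0.667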
3.3 Approximation of Concepts
In rough set theory, there exist several different notions of an approximation space (see, e.g., [55, 56, 47, 58, 59, 60, 61, 62, 63, 64]). Here is an approximation space
4 When ρ is also symmetrical, it will be called a tolerance relation.
5 Notice that ρ→{u}, equal to ρ←{u} if ρ is symmetrical, is actually the counter-image of {u} given by the converse relation of ρ.
6 For the scarcity of space we have to stop the discussion. An interested reader is referred to a vast literature on infogranules and granular computing (see, e.g., [78, 73, 79, 80, 81, 82, 83, 84, 66]).
7 Thus, rough inclusion can be used as a measure of similarity of sets [85, 80, 86, 87].
understood as a structure M = (U, ρ, κ) where U is as earlier, ρ is a reflexive relation on U, and κ is a RIF over U. Within such structures we can approximate concepts in many ways (see, e.g., [73] and the forementioned papers). In this section we only recall a few examples of approximation operators, well known from the literature. Thus, let the lower and upper approximation operators, low∪, upp∪ : ℘U → ℘U, and the t-positive and t∗-negative region operators, pos∪_t, neg∪_t∗ : ℘U → ℘U where 0 ≤ t∗ < t ≤ 1, be defined as follows, for any X ⊆ U:

    low∪X = ⋃{ρ←{u} | ρ←{u} ⊆ X},
    upp∪X = ⋃{ρ←{u} | ρ←{u} ∩ X ≠ ∅},
    pos∪_t X = ⋃{ρ←{u} | κ(ρ←{u}, X) ≥ t},
    neg∪_t∗ X = ⋃{ρ←{u} | κ(ρ←{u}, X) ≤ t∗}.    (2)

The concept approximations obtained in this way are definable in the sense that they are set-theoretical sums of elementary infogranules. Observe that the result of approximation of a concept obviously depends not only on the kind of approximation operation and the input concept but also on the underlying approximation space. In practice, the latter is obtained from an infosystem. Thus, discovery of an approximation of a concept which is possibly the best in a given situation will resolve itself into discovery of the most suitable descriptor language, similarity relation, RIF, and approximation operators. Properties of the above operators can be found in the literature. Let us only note that pos∪_1 = low∪. Along the standard lines we will also say that X is exact if upp∪X − low∪X = ∅; otherwise, X is rough.
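A minimal sketch (not from the paper, with a hypothetical universe and granulation) of the operators in (2), where the elementary infogranule ρ←{u} is stored explicitly for every object u and the standard RIF (1) is used:

    # Approximation operators of (2) over a finite universe U.
    def rif(X, Y):
        """Standard rough inclusion function (1): #(X ∩ Y)/#X, or 1 if X is empty."""
        return 1.0 if not X else len(X & Y) / len(X)

    def union_of(granules):
        out = set()
        for g in granules:
            out |= g
        return out

    def lower(U, gran, X):                 # low∪X
        return union_of(gran[u] for u in U if gran[u] <= X)

    def upper(U, gran, X):                 # upp∪X
        return union_of(gran[u] for u in U if gran[u] & X)

    def positive(U, gran, X, t):           # pos∪_t X
        return union_of(gran[u] for u in U if rif(gran[u], X) >= t)

    def negative(U, gran, X, t_star):      # neg∪_t* X
        return union_of(gran[u] for u in U if rif(gran[u], X) <= t_star)

    U = {1, 2, 3, 4}
    gran = {1: {1, 2}, 2: {1, 2}, 3: {3}, 4: {3, 4}}   # hypothetical ρ←{u}
    X = {1, 2, 3}
    print(lower(U, gran, X), upper(U, gran, X))        # {1, 2, 3} {1, 2, 3, 4}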
4 Mathematical Models of Satisfiability
In this section we theorize about how to understand satisfiability of single conditions and of sets of conditions by objects. We deliberately simplify the models presented by neglecting situational factors, including time and context, as well as the imperfectness of the accessible information/knowledge. In particular, we neglect the lack of information about most objects of the universe considered. The starting point is an infosystem IS as earlier and the descriptor language L for it. Satisfiability of sets of formulas by objects is secondary with respect to satisfiability of single formulas, and it is usually defined by means of the latter notion. We treat both cases in a uniform way to save space. At the end we describe three research problems where satisfiability plays a prominent role. The first two are typical for logic, whereas the last one is studied in theoretical computer science. Although theoretical in most cases, the results on satisfiability worked out in logic and computer science are a useful and valuable contribution to the understanding of practical satisfiability. Admitting their importance, we treat them as a basis in our approach.
4.1 The Relational Model
If we are only interested in whether or not an object of U satisfies a condition (set of conditions) expressed by a formula (set of formulas) of L, then satisfiability may be modelled as a relation |= ⊆ U × F where F = FOR(L) (resp., F = ℘(FOR(L))). If the question is to which extent an object satisfies a condition (set of conditions), then we will assume an ordered set of satisfiability degrees (T, ≤) containing at least two elements: the least element and the greatest element, denoted, e.g., by 0 and 1, respectively. In this case, satisfiability may be modelled as a family of relations {|=t}t∈T where for any t ∈ T, a relation |=t ⊆ U × F is a model of satisfiability to the degree t, understood, e.g., in the strict sense.8 It is worth emphasizing that satisfiability degrees need not be numbers. They may be tuples of elements of some sort (in particular, tuples of numbers) or words of a natural language (e.g., ‘low’, ‘sufficient’, ‘medium’, ‘high’), viewed as labels of infogranules in accordance with Zadeh’s idea [100].
4.2 The Logical Value-Based Model
A model based on the concept of a logical value is another, but equivalent, model of satisfiability. Satisfiability degrees are treated as logical values here. In keeping with tradition, 0 denotes falsity and 1 is understood as the logical truth. By an evaluation mapping (in short, an evaluation) we understand any mapping f : U × F → T. Every evaluation f defines a satisfiability relation |= ⊆ U × F such that u |= φ if and only if f(u, φ) = 1, for any formula (set of formulas) φ and any object u. On the other hand, every satisfiability relation |= defines a binary evaluation f : U × F → T such that f(u, φ) = 1 if and only if u |= φ, and f(u, φ) = 0 otherwise. Every evaluation f also induces a family of relations of graded satisfiability {|=t}t∈T such that for any φ and u as earlier, we have u |=t φ if and only if f(u, φ) = t, and vice versa.9 If undecidability is allowed, T may be augmented by an extra symbol, say ⊥, denoting such a possibility. An evaluation is then defined as any mapping f : U × F → T ∪ {⊥}. In this context, f(u, φ) = ⊥ is read as the lack of decision about the degree of satisfiability of φ by u.
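A tiny sketch (not from the paper; the evaluation and degrees below are hypothetical) of the correspondence described above between an evaluation and the induced graded satisfiability relations:

    # An evaluation f : U × F -> T induces graded relations |=_t, and a crisp
    # satisfiability relation induces a binary evaluation.
    def induced_relation(f, t):
        """Return the relation |=_t as a predicate on (u, phi)."""
        return lambda u, phi: f(u, phi) == t

    def binary_evaluation(satisfies):
        """Turn a crisp satisfiability relation into a 0/1 evaluation."""
        return lambda u, phi: 1 if satisfies(u, phi) else 0

    # Hypothetical evaluation over degrees T = {0, 0.5, 1}.
    f = lambda u, phi: {('u1', 'p'): 1, ('u2', 'p'): 0.5}.get((u, phi), 0)
    sat_1 = induced_relation(f, 1)
    print(sat_1('u1', 'p'), sat_1('u2', 'p'))   # True False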
4.3 The Extension-Based Model
The last mathematical model of satisfiability recalled here is the model based on the concept of an extension of a formula (set of formulas). Namely, any formula (set of formulas) φ may be treated as the label of some infogranule comprised of all objects satisfying φ. This infogranule is called an extension
8 Depending on particular needs one can speak of satisfiability to a degree greater than t, lesser than t, etc.
9 Let us observe that every evaluation easily generates families of relations of satisfiability to a degree greater than t, lesser than t, and so on. Conversely, each of the families gives rise to a corresponding evaluation.
of φ and denoted by Sat(φ). The family of all extensions {Sat(φ)}φ∈F defines a satisfiability relation |= such that for any φ and u, we have u |= φ if and only if u ∈ Sat(φ). On the other hand, every satisfiability relation |= defines an extension of a formula (set of formulas) φ by Sat(φ) = {u ∈ U | u |= φ}. The notion of an extension can be generalized as follows. We view any formula (set of formulas) φ as the label of a compound infogranule {Satt(φ)}t∈T consisting of infogranules of the form Satt(φ). The latter infogranule, referred to as a t-extension of φ or a t-satisfiability class of φ, is the set of all objects satisfying φ to the degree t. Arguing as earlier, one can show the equivalence of this model and the relational one.
4.4 Three Research Problems Concerning Satisfiability
Satisfiability of formulas plays an important role in the following three problems. The first two are frequently studied in logic, whereas the last one is typical for computational complexity theory. A formal language L, which is a descriptor language in our case, is fixed in all of them.

The first problem can be specified as follows. We are given a logical system (R, Ax), where R is a set of primitive inference rules and Ax is a set of axioms in L, and a notion of provability of formulas in (R, Ax). The task is to find a class of mathematical structures10 M, a satisfiability notion, and a corresponding concept of logical truth such that every formula provable in (R, Ax) is true in M. Then we will say that (R, Ax) is sound with respect to M.

In the second case, a class of structures M, a satisfiability notion, and a concept of truth are assumed in advance. The task is to axiomatize the set of all formulas true in M. That is, we search for a logical system (R, Ax) and a provability concept such that for every formula α of L, α is provable in (R, Ax) if and only if α is true in M. Then we will say that (R, Ax) is sound and complete with respect to M.

We can formulate the last problem now. We are given a class of structures M and a satisfiability notion as in the preceding case. The task is to find the most effective and efficient possible method (algorithm, classifier) of deciding whether or not any given formula of L is satisfied in M.
5 A More Realistic Scenario
The general models of satisfiability using satisfiability relations, evaluations, and families of extensions, described in the previous section, cover an enormously large number of more specific models, many of which have already been defined and examined in logic and its applications (see, e.g., [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39]). Unfortunately, even
10 For instance, they might be approximation spaces.
those specific models can be both too simple and too abstract to fit well the real-life situations of satisfiability judgement. Actually, satisfiability of formulas is a complex vague concept in general, heavily depending on the judgemental situation and drifting with it, which cannot be grasped by means of one global model.
5.1 Actors and Their Roles
Two actors are considered in our approach: an intelligent system S and an expert, called just Expert. S is an open system which acts in a dynamically changing environment. Satisfiability judgement of a good quality is expected from S before execution of an action, before application of a rule, and in the course of or after construction of a complex object such as an action plan or an algorithm. In the first and second cases, S needs to judge whether or not, or to which extent, pre-conditions of the action and premises of the rule are fulfilled, respectively. In the last case, S is supposed to judge whether or not, or to which extent, the object constructed meets expectations expressed, e.g., in the form of a specification. Our idea is that S learns inductively (in other words, discovers) how to judge satisfiability of conditions by objects, where ‘conditions’ is the umbrella term for a variety of different requirements, properties, rule premises, or action pre-conditions. Since it is rather unlikely that S itself can learn how to make good judgements, some help from Expert is assumed. Expert’s role may vary from case to case. In purely logical approaches, Expert is a logician who provides S with a ready-to-use definition of satisfiability. On the opposite side are approaches where S is provided with a finite collection of examples only, in particular, with sets of positive and negative cases of satisfiability. By an example we mean a triple of the form (u, φ, t) where u is an object, φ is a formula (set of formulas) representing a condition (set of conditions), and t is a satisfiability degree.
5.2 Information Available to System S
The system S is assumed to have access to information about a situation under consideration, s0, including information about objects of some sort, concepts comprised of these objects, and relationships among them. The information is of several kinds. First of all, S is given a finite infosystem IS0 = (U0, A0) which comprises descriptions of objects in terms of attribute values. Secondly, in order to judge formulas of descriptor languages, S needs to construct appropriate approximation spaces. To realize this goal, S has to be given information about how to understand similarity of objects, how to measure inclusion of a set in a set, and how to approximate concepts. Next, Expert should supply S with a non-empty collection of examples E0 of how to judge satisfiability in s0. The satisfiability problem is split into two subcases in our approach. In the former one, S is given a family of various satisfiability relations. Except for the case of a unique relation, S has to find the most appropriate relation, e.g. by optimizing parameters. In the second case, S has to discover how to judge satisfiability of formulas (sets of formulas) directly on the basis of
E0. Clearly, examples are useful and necessary in both cases. Last but not least, S is given knowledge K0 about the domain considered and about judgement of satisfiability in general. This knowledge, being a kind of information and referred to as the domain knowledge, takes the form of a concept ontology, a set of rules, or a theory, for example.

Information available to S is, of necessity, imperfect. Incompleteness11 is partly caused by the fact that only a small fragment of the reality under investigation can be captured. Taking into account all possibly interesting features of objects is simply unfeasible. On the other hand, consideration of a huge number of attributes is in contradiction with the commonly accepted principle of minimum description length [77]. Even if the choice of relevant attributes is made by an expert, some important features can be omitted. It is also common that some values of attributes are missing for various reasons.12 Thus, IS0 is viewed as merely a subsystem of an infosystem IS = (U, A) playing the role of a universal infosystem in s0. The latter infosystem, introduced for technical reasons only, need not be available in practice. By assumption, U contains all objects which may possibly be taken into account in s0, including all elements of U0. Assume also that all attributes of A are mappings and that each attribute of A0, being a (partial) mapping defined on U0, is extended to some attribute of A. For simplicity, we may use the same symbols to denote elements of A0 and their expansions belonging to A.

The quality of information can be questionable.13 Descriptions of objects may come from several, sometimes incompatible sources which, in addition, may be reliable to some extent only. Other sources of uncertainty are noise and errors caused by technical problems with storing, transferring, and processing of data. Last but not least, information can concern vague concepts (e.g., ‘safe situation on a road’) where the main difficulty consists in the identification of borderline cases [3]. The problem of separating objects belonging to a concept from non-elements is of primary importance here.
5.3 Languages for Specifying Properties and Representing Knowledge
Properties of objects and various conditions to be satisfied by objects are represented by formulas of descriptor languages. We use L0 and L to denote the descriptor languages for the infosystems IS0 and IS mentioned above, respectively. By assumption, IS is an expansion of IS0, so L extends L0 by symbols denoting new attributes and new attribute values. Apart from the description and specification languages mentioned above, other languages are also needed, e.g. a language to represent the domain knowledge14 or a mathematical language to speak of various aspects concerning similarity of objects, infogranules, inclusion of concepts, concept approximation, etc.
11 A reader is referred to [4] for a detailed study of incomplete information.
12 See, e.g., [72, 73, 74, 75, 76] for treatment of missing values in infosystems and concept approximation.
13 For uncertainty of information see, e.g., [2].
14 A natural language may serve this purpose, for example.
In our approach, the specification, description, and knowledge representation languages are not fixed forever. Their initial choice and further development heavily depend, among other things, on the judgemental situation, the available information, Expert’s experience, knowledge, and preferences, and the ability of the system S to learn and to discover new concepts. For instance, in the case of descriptor languages, the choice of the most useful, interesting attributes in a given situation s0, determining the initial language L0, is very important, but S has no influence on it. Nevertheless, S can participate in the language development and transformation processes at later stages. Namely, when new concepts are discovered and named in the course of learning, L0 (and, subsequently, L) will be augmented step by step to form a new language, say L1.
5.4 From Infosystems to Approximation Spaces
Starting with information about objects in the form of an infosystem IS0, the learning system S searches for an approximation space or a class of approximation spaces based on IS0 in order to judge satisfiability of (sets of) formulas. To realize this goal, S has to have some understanding of similarity of objects, inclusion of concepts in concepts (i.e., of sets of objects in sets of objects), and concept approximation. More precisely, S should be able to judge in the majority of interesting cases, e.g. on the basis of knowledge provided by Expert, whether or not, or to which extent, an object is similar to another one. Furthermore, S should be able to group objects into similarity-based elementary infogranules, to measure the degree of inclusion of concepts in concepts by means of a RIF or a related inclusion measure, and – last but not least – to approximate concepts by means of some approximation operators. In this way, IS0 gives rise to an approximation space M0 = (U0, ρ0, κ0) or a class of such spaces, where ρ0 ⊆ U0 × U0 is a similarity relation derived from object descriptions,15 and κ0 is a RIF over U0. From a purely theoretical standpoint, M0 can be extended to an approximation space M = (U, ρ, κ), available to S merely in part and playing a technical role only, where ρ is a similarity relation on U subsuming ρ0 and κ is a RIF over U extending κ0. Let us emphasize that approximation spaces are not fully determined by the underlying infosystems: almost every infosystem induces a number of different approximation spaces. The role of S is to discover by inductive learning which of the spaces is the most appropriate in a given situation s0.
5.5 Learning of Satisfiability vs. Concept Approximation
It is clear that satisfiability judgement is highly expert-dependent. In fact, the notion of satisfiability, discovered by S either in an explicit form (e.g., as a satisfiability relation or an evaluation) or implicitly (e.g., as an algorithm or a
15 The relation ρ0 can, in particular, be an equivalence relation like an indistinguishability relation. Apart from similarity, we can also take into account dissimilarity [56] or distinguishability [101].
heuristics), will always be a better or worse approximation of Expert’s original concept of satisfiability. Therefore, discovery of satisfiability judgement is an instance of the problem of concept approximation (see, e.g., [49, 102, 47, 57, 103, 104, 60, 105] for the rough set approach). Let us note that satisfiability relations and evaluations work as concept approximators. They classify objects to concepts labelled by (sets of) formulas. These concepts are just extensions of (sets of) formulas and are only partially known to S. Production rules, which make it possible to compute the satisfiability degree of a compound formula (finite set of formulas) φ on the basis of the satisfiability degrees of the components of φ, aggregate approximations of simpler concepts into approximations of more complex ones.

There arises a question which, at first glance, seems to be irrelevant for concept approximation: why and when should one consider more than two degrees of satisfiability, and how many degrees should be allowed? In view of the above remarks on aggregation of approximations of concepts, it can happen that approximations of simpler concepts and their complements prove too coarse for the construction of an appropriate approximation of a more complex concept X or its complement. In such a case, a finer granulation of the universe into graded extensions of concepts will be more suitable. The problem of how many satisfiability degrees to admit is not easy. The number may depend on such factors as the particular judgemental situation, Expert’s preferences, or the expected cost assessment, to name a few.
5.6 The Importance of Domain Knowledge
The domain knowledge K0 is supposed to contain firm and useful information (facts) about relationships among objects of U, various concepts under consideration and relationships among them, and all other things which might be important when learning how to judge satisfiability. The usefulness of K0 lies in reducing the search spaces by posing constraints on them. As a consequence, the system S can find possibly optimal solutions to problems more easily.16 K0 may be represented in several ways. It can take the form of a concept ontology or a collection of such ontologies (see, e.g., [49, 51, 52, 53] for the usage of ontologies in rough classification and approximation). In another setting, K0 can be a theory understood as a collection of facts expressed by formulas. In yet another approach it can be represented by an infogranule consisting of rules and/or infogranules of rules. Clearly, S needs to know how to apply the domain knowledge in practice. As mentioned earlier, K0 is, of necessity, incomplete and, hence, it can merely capture a small fragment of the domain considered.
5.7 Problems with Learning of the Notion of Satisfiability
Learning how to make good judgements about satisfiability of conditions or, in other words, learning of the notion of satisfiability of (sets of) formulas is a long
16 By a solution we mean values of parameters, an approximation space, approximation operators, a specification or knowledge representation language, a classifier, a judgement, to name a few.
and complex process which could continue without limit along with changes in the judgemental situation. As regards change, satisfiability judgement and, hence, the resulting concept of satisfiability drift together with the situation. Time and other situational factors have an impact on them. Therefore, the system S has to test the learned concept of satisfiability from time to time and to adapt or re-discover it whenever the results of satisfiability judgement seem to be doubtful (see, e.g., [102] for the problem of adaptive learning in rough set theory).

When learning satisfiability judgement from examples, local models of satisfiability should suffice because the chance of discovering a global model, suitable for all situations under consideration, is close to zero. For example, instead of discovering an evaluation f : U × F → [0, 1], where F is the set of all formulas of a descriptor language or its power set, S should rather limit itself to the discovery of a restriction of f to a finite subset of F. In particular, such a subset could consist of a single formula. Unfortunately, even the single-formula case can cause difficulties. Namely, a formula α, judged for satisfaction by an object u, can refer to a complex, real-life concept C like ‘safe situation on a road’, mentioned previously. Then, it can be impossible to make a good judgement about satisfiability of α by u directly on the basis of information contained in a given infosystem IS0. The reason is that C is semantically “too far” from every concept which can be approximated within any approximation space induced by IS0. A possible solution to this problem can be provided by hierarchical learning [49,106,107,108,109,110], where a hierarchy of approximation spaces is constructed step by step, within which more and more complex concepts can be approximated. The discussion will be continued in the next section.
6 Discovery of Satisfiability Judgement
In this section, drawing on the ideas presented in [68, 69, 70, 71], we give more details about our proposal, which is inspired by logic, rough set theory, concept modelling, and knowledge discovery. Two directions can be observed in our approach. They differ from each other by the fact that the first one is more logically oriented. From another perspective, the second approach is more general because fewer assumptions are made there. More precisely, the learning system S is supplied by Expert with parameterized concepts of satisfiability of (sets of) formulas in the first case, as opposed to the latter one, other things being equal.
6.1 Rough Satisfiability Relations
In [68, 69, 70], several kinds of satisfiability of formulas and sets of formulas are proposed which take into account the granularity of the universe of an approximation space. We call them rough as they are defined by means of such tools as rough inclusion and/or rough approximation operators. In this framework, satisfiability is defined in line with the relational model as a parameterized family
of satisfiability relations.17 As a consequence, we deal with a great variety of satisfiability relations equipped with ordinal and/or nominal parameters.18 By a suitable tuning of the parameters, the learning system S can optimize and adapt the current model to the reality observed. For the scarcity of space, we only recall one specific kind of rough satisfiability of single formulas and one of satisfiability of sets of formulas here. Their properties can be found in [68, 70].

Crisp semantics. Consider an infosystem IS, extending IS0, and an approximation space M, derived from IS and extending M0, as earlier. First, we recall the classical crisp interpretation of formulas of the descriptor language L in M [111]. In our approach we often use it as a background for defining rough forms of satisfiability. Thus, the crisp satisfiability relation, |=c, is defined as follows, for any descriptor (a, v) ∈ AT(L), any formulas α, β ∈ FOR(L), any set of formulas X ⊆ FOR(L), and any object u ∈ U:

    u |=c (a, v) ⇔ a(u) = v,
    u |=c α ∧ β ⇔ u |=c α & u |=c β,
    u |=c α ∨ β ⇔ u |=c α or u |=c β,
    u |=c ¬α ⇔ u ⊭c α,
    u |=c X ⇔ ∀α ∈ X. u |=c α.    (3)
The crisp extension of a formula (set of formulas) φ, Satc(φ), is given by

    Satc(φ) = {u ∈ U | u |=c φ}.    (4)
Hence, Satc(a, v) = {u ∈ U | a(u) = v}, Satc(α ∧ β) = Satc(α) ∩ Satc(β), Satc(α ∨ β) = Satc(α) ∪ Satc(β), Satc(¬α) = U − Satc(α), and Satc(X) = ⋂{Satc(α) | α ∈ X}.

Example of rough satisfiability of single formulas. Obviously, the crisp semantics has nothing to do with granularity of U and with roughness: it is an instance of the well-known classical 2-valued semantics. An exemplary relation of satisfiability of formulas of L, taking the rough granular structure of M into account, is a relation of satisfiability to a degree t ∈ [0, 1], written |=t, defined for any α ∈ FOR(L) and any u ∈ U, as follows:

    u |=t α ⇔ κ(ρ←{u}, Satc(α)) ≥ t.    (5)
17 The corresponding extensions of formulas and sets of formulas as well as evaluations can easily be derived. A reader is referred to the forementioned articles for details concerning extensions.
18 Examples of such parameters are the satisfiability degree, infosystem, similarity relation, RIF, approximation operation, and an underlying satisfiability relation. Satisfiability degrees can be simple, as numbers or words (e.g., ‘low’, ‘medium’, ‘high’), or compound, e.g. tuples of tuples of numbers and/or words.
That is, u satisfies α to the degree t if and only if the infogranule of objects ρ-similar to u is included to the degree at least t in the infogranule of all objects of U which satisfy α in the crisp sense. The corresponding t-extension of α is given by

    Satt(α) = {u ∈ U | u |=t α}.    (6)

Notice that Satt(α) induces the t-positive region of the crisp extension of α if t > 0, viz.,

    pos∪_t(Satc(α)) = ⋃{ρ←{u} | u ∈ Satt(α)}.    (7)

Example of rough satisfiability of sets of formulas. To proceed further, we need a RIF over FOR(L), say κ∗, apart from κ. Let t1 ∈ [0, 1] ∪ {c} and t2 ∈ [0, 1]. We define a relation of satisfiability to the degree (t1, t2) of sets of formulas of L in M, |=t1,t2, for any X ⊆ FOR(L) and u ∈ U, by

    u |=t1,t2 X ⇔ κ∗(X, |=→_t1{u}) ≥ t2.    (8)

In words, u satisfies X to the degree (t1, t2) if and only if X is included to the degree at least t2 in the infogranule of all formulas of L which are satisfied by u to the degree t1 if t1 ∈ [0, 1], and in the crisp sense if t1 = c. The notion of the (t1, t2)-extension of X, Satt1,t2(X), is obtained along the usual lines, i.e.,

    Satt1,t2(X) = {u ∈ U | u |=t1,t2 X}.    (9)
Notice that crisp satisfiability is a special case of graded satisfiability, viz., Satc,1(X) = Satc(X), for any X ⊆ FOR(L). Moreover, Satt1,t2(∅) = U, for any t1, t2 as earlier. A less intuitive feature of the above notion of rough satisfiability is that Satt1,t2({α}) need not be equal to Satt1(α), but this peculiarity can easily be removed whenever needed.19

Discussion. First, let us address the problem of computation of the degree of satisfiability of α and X by u. Due to incompleteness of information, the infogranules ρ←{u} and Satc(α) are known by the system S in part only. Hence, κ(ρ←{u}, Satc(α)) and, subsequently, κ∗(X, |=→_t1{u}), for any t1 as earlier, are computable by S solely in theory. Now, consider the case that u ∈ U0 and α ∈ FOR(L0). As regards ρ←{u}, consisting of all objects of U ρ-similar to u, the system S actually has access to ρ0←{u} and perhaps to a small, finite set U′ ⊆ U − U0 whose elements are also known to S as ρ-similar to u. In the same vein, a rather small than large sample of Satc(α), say U″, is actually available to S. Thus, instead of κ(ρ←{u}, Satc(α)),20 S can merely compute κ(ρ0←{u} ∪ U′, U″). Summarizing, S can judge about the degree of satisfiability of α by u approximately only. It is worth noting that we deal with an instance of analogical reasoning here.
19 If κ∗ is standard, it will suffice to set t2 > 0.
20 Recall that κ|℘U0×℘U0 = κ0. Moreover, if κ is the standard RIF, then Satc(α) need not be replaced by U″.
After elimination of some technical problems, the same method can be applied to arbitrary u ∈ U and α ∈ FOR(L). In [71] we show, among other things, how the concept of graded satisfiability recalled above can be used to classify objects of U − U0 with respect to their satisfiability of a formula. Unfortunately, it can happen that α refers to a concept C which cannot be described in terms of attributes of IS0 even approximately.21 In other words, C is semantically too complex to be directly approximated by concepts of M0. In such a case, S cannot make any judgement about crisp satisfiability of α directly on the basis of the object descriptions available to S. This is a serious obstacle since judgement of crisp satisfiability is a key step in making judgements about the graded satisfiability proposed above. Fortunately, one can apply the hierarchical learning methods mentioned earlier.

Another important issue is how to obtain a good quality of satisfiability judgements by S. As in inductive learning in general, S can achieve this purpose using various optimization techniques (e.g., genetic algorithms). By fine-tuning the parameters, values will be obtained which are possibly the most suitable in a given judgemental situation s0. The domain knowledge K0 can help reduce the search spaces by constraining them. Last but not least, S can use the examples E0 supplied by Expert, e.g. for testing the results obtained.
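To make the graded notions of this subsection concrete, the following sketch (an illustration only; the object descriptions and infogranules are hypothetical, and the computation is carried out on an available sample as discussed above) evaluates crisp satisfiability (3) of descriptor formulas and the graded relation (5) using the standard RIF:

    # A minimal sketch of crisp satisfiability (3) and graded satisfiability (5)
    # over a small finite sample; all data below are hypothetical.
    def rif(X, Y):
        return 1.0 if not X else len(X & Y) / len(X)

    def sat_crisp(u, formula, table):
        """Crisp satisfiability of formulas built as ('atom', a, v), ('and', f, g),
        ('or', f, g), or ('not', f)."""
        kind = formula[0]
        if kind == 'atom':
            _, a, v = formula
            return table[u].get(a) == v
        if kind == 'and':
            return sat_crisp(u, formula[1], table) and sat_crisp(u, formula[2], table)
        if kind == 'or':
            return sat_crisp(u, formula[1], table) or sat_crisp(u, formula[2], table)
        if kind == 'not':
            return not sat_crisp(u, formula[1], table)
        raise ValueError(kind)

    def sat_graded(u, formula, table, gran, t):
        """u |=_t α  iff  κ(ρ←{u}, Sat_c(α)) >= t, computed on the available sample."""
        sat_c = {x for x in table if sat_crisp(x, formula, table)}
        return rif(gran[u], sat_c) >= t

    # Hypothetical sample: object descriptions and elementary infogranules ρ0←{u}.
    table = {'u1': {'speed': 'low'}, 'u2': {'speed': 'low'}, 'u3': {'speed': 'high'}}
    gran = {'u1': {'u1', 'u2'}, 'u2': {'u1', 'u2', 'u3'}, 'u3': {'u3'}}
    alpha = ('atom', 'speed', 'low')
    print(sat_graded('u2', alpha, table, gran, 0.6))   # κ({u1,u2,u3}, {u1,u2}) = 2/3 ≥ 0.6 -> True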
6.2 Case-Based Satisfiability Judgement
Satisfiability is a complex concept, and one may attempt to discover (or learn) it directly from data and domain knowledge, as in the case of other concepts. There are a number of methods which can be used here, e.g. analogy-based reasoning, inductive learning, relational learning, rough set methods, and statistical methods, to name a few [112, 113, 114, 49, 106, 13, 115, 116, 117, 118, 119, 120, 121, 122, 14, 15, 123, 124, 125, 103, 59, 126, 87, 105, 127, 110, 128, 129].

Description of the idea. We will keep the same notation as earlier unless stated otherwise. Here we discuss a more general case than in the preceding subsection. Namely, the system S is only given an infosystem IS0, a domain knowledge K0, a finite set of examples E0 provided by Expert, and some knowledge necessary for concept approximation.22 As earlier, S aims at discovering how to judge satisfiability of formulas (sets of formulas) of L by objects of U in a way appropriate to a given situation s0. Learning of satisfiability judgement in general is clearly unfeasible. However, in a concrete situation s0, S may only need to judge about satisfiability of a single formula (finite set of formulas23) φ. Therefore, we propose a local, point-wise
21 Recall the concept ‘safe situation on a road’ mentioned before.
22 Thus, no satisfiability relation is suggested to S by Expert.
23 In the classical crisp semantics, any finite set of formulas X can be replaced by a conjunction of all elements of X, so a separate consideration of finite sets of formulas is superfluous. This may but need not be true of approximate semantics (see, e.g., the rough semantics described in the preceding section, where the graded extension of X is usually different from the graded extension of a conjunction of all formulas of X).
approach to satisfiability judgement. In s0, instead of discovering an evaluation f : U × F → T (see Sect. 4), S may limit itself to the discovery of fφ : U → T, understood as a restriction of f to the set U × {φ}. Suppose that examples of E0 have the form (u, φ, t) where u ∈ U0 and t ∈ T. For simplicity, each object of U0 is assumed to be given exactly one example. In an obvious way, E0 induces a mapping dφ,0 : U0 → T which may be taken as a restriction to U0 of a mapping fφ yet to be discovered. Thus, S’s task is to extend dφ,0 to fφ. In [71], where the single-formula case is considered, we view this task as an instance of the classification problem. In order to accomplish the task we propose to use a combination of analogy-based reasoning and rough set methods (see [129] for a survey on the application of analogy-based reasoning to rough classification). By way of example, three such methods are presented there: a method based on one of Bazan’s classification algorithms [114], a method using a simple instance of the k-nearest neighbour algorithm [115], and a method based on the concept of graded satisfiability, recalled in the preceding subsection. The mapping dφ,0 is attached to IS0 as a decision attribute, which results in a decision infosystem ISd0 = (U0, A0, dφ,0). Using a rough or other classifier construction method, S can obtain a possibly partial classifier dφ for the classification of objects of U into classes of t-satisfiability of φ, where t ∈ T. When applied to a particular object u, dφ will return a degree of satisfiability of φ by u, or ⊥ (denoting ‘I don’t know’) if dφ cannot make a decision. Thus, dφ works as a partial mapping from U into T (or a mapping from U into T ∪ {⊥}).24 The classifier dφ, extending dφ,0, is an approximation of fφ, where the latter is known to Expert only, if ever. The quality of dφ should be tested from time to time and improved if needed. In particular, dφ should be adapted appropriately when s0 changes to another situation of judgement about satisfiability of φ, say s1.

The case of compound formulas. The inner structure of compound formulas may but need not be taken into account by a learning method. In the crisp semantics – like in many other cases – the extension of a compound formula α ◦ β (◦ ∈ {∧, ∨}) is obtained by an aggregation of the extensions of α and β. The aggregation operators are the intersection (∩) if ◦ = ∧, and the union (∪) if ◦ = ∨. The extension of ¬α is computed as the complement of the extension of α. In fuzzy logic, the intersection, the union, and the complementation are refined to triangular norms (t-norms), triangular co-norms (t-co-norms), and fuzzy complementations, respectively [2, 33, 130, 131, 39, 100]. The choice of the most appropriate aggregation operators is left to the expert (a logician). In our approach, S is expected to discover production rules aggregating degrees of satisfiability of α and β into degrees of satisfiability of α ◦ β. In the same vein, production rules computing the degree of satisfiability of ¬α from a degree of satisfiability of α might be learned by S. The discovery of rules which will be suitable for all cases is hardly to be expected since satisfiability judgement depends on the situation and many other factors. Nevertheless, assuming Expert’s help, the
24 It may be viewed as the decision attribute of ISd = (U, A, dφ).
goal seems to be achievable locally, i.e., for concrete, not very complex formulas, objects, and situations.

The case of finite sets of formulas. Now consider the case of finite sets of formulas containing at least two elements.25 This case is similar to that of a compound formula. Given such a set of formulas X, S may try to discover rules aggregating degrees of satisfiability of the formulas which comprise X into degrees of satisfiability of X. If X = {α, β}, one can ask about the difference between satisfiability of X and satisfiability of α ◦ β, where ◦ ∈ {∧, ∨}. We leave this question open for the time being since it requires a longer discussion.

Satisfiability by complex objects. The last and perhaps most difficult problem touched upon here is satisfiability of (finite sets of) formulas by complex objects. The problem is closely related to the approximation and classification of complex objects [49]. Assume that the domain knowledge K0 provided to S in s0 contains, among other things, concept ontologies describing, possibly in an approximate way, the inner structure of objects. For a given object u and a formula (set of formulas) φ, S is expected to discover production rules making it possible to judge satisfiability of φ by u on the basis of judgements about satisfiability of (finite sets of) formulas by objects being rough parts of u. Let (T1, ≤1), (T2, ≤2), and (T3, ≤3) be ordered sets of degrees of being a part, degrees of satisfiability of formulas, and degrees of satisfiability of sets of formulas, respectively. For i = 1, . . . , n, let t∗i ∈ T1, let ti, t ∈ T2 ∪ T3, let ui be a member of a set Ui which may but need not be U, and let φi be a formula of a relevant descriptor language. Any production rule mentioned above might have the following form:

    In s0, if u1 is a part of u to a degree t∗1, . . . , un is a part of u to a degree t∗n, and u1 satisfies φ1 to a degree t1, . . . , un satisfies φn to a degree tn, then u satisfies φ to a degree t.

In this case, both hierarchical learning and rough mereology can be useful. A detailed discussion of the case is left to a separate paper.
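As an illustration of the case-based direction only (the paper leaves the concrete classifier open; the 1-nearest-neighbour rule, the similarity measure, and all data below are hypothetical), a sketch of extending an expert-given mapping dφ,0 over U0 to a partial classifier dφ might look as follows:

    # A minimal sketch of case-based satisfiability judgement: the expert's examples
    # (u, φ, t) over U0 induce dφ,0, which is extended to new objects by a simple
    # 1-nearest-neighbour rule over attribute descriptions.
    def overlap_similarity(x, y):
        """Share of attributes on which two object descriptions agree."""
        attrs = set(x) | set(y)
        return sum(x.get(a) == y.get(a) for a in attrs) / len(attrs) if attrs else 0.0

    def d_phi(u_desc, examples, threshold=0.5):
        """Return a satisfiability degree for u, or None ('⊥', no decision) when no
        example is similar enough. `examples` is a list of (description, degree) pairs."""
        best, best_sim = None, 0.0
        for desc, degree in examples:
            sim = overlap_similarity(u_desc, desc)
            if sim > best_sim:
                best, best_sim = degree, sim
        return best if best_sim >= threshold else None

    # Hypothetical expert examples dφ,0 for a fixed formula φ.
    examples = [
        ({'speed': 'low', 'visibility': 'good'}, 1.0),
        ({'speed': 'high', 'visibility': 'poor'}, 0.0),
    ]
    print(d_phi({'speed': 'low', 'visibility': 'poor'}, examples))   # 1.0 (agrees on 1 of 2 attributes)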
7 Summary
The central notion of this paper is satisfiability of (sets of) conditions by objects of some sort when information about the judgemental situation, including objects and concepts comprised of them, is incomplete. Under such circumstances, it is hardly to be expected that any fixed notion of satisfiability can be suitable for all cases considered. Therefore, instead of providing an intelligent system S with a
25 If a set of formulas is infinite, discovery of its satisfiability will be unfeasible in most cases. The empty set may be viewed as satisfied by any object. Also, for simplicity, satisfiability of a single-formula set {α} and satisfiability of a formula α may be treated as equivalent.
precisely defined satisfiability concept, we propose that S learns how to judge satisfiability of (sets of) formulas by objects in a given situation. Our approach to satisfiability goes in two directions, overviewed in Sect. 6. In both of them, S is supposed to learn how to judge satisfiability of (sets of) formulas with the help of Expert, given partial information about objects, concepts, relationships among them, and satisfiability judgement making. The information may take several forms, such as an infosystem containing object descriptions, domain knowledge, a non-empty finite collection of examples, and some knowledge necessary for concept approximation. The main difference between the two research lines lies in the fact that in the first case presented – as opposed to the second one – S is also supplied by Expert with a parameterized definition of satisfiability. The problem is challenging, especially in the case of compound formulas, sets of formulas, complex objects, and formulas referring to complex concepts even if their syntactical structure is simple. In those cases, advanced learning methods, such as hierarchical learning, should be used. Finding a solution of a satisfactory quality to the problem of discovery of satisfiability judgement will open new possibilities, e.g. in the field of autonomic intelligent systems. The problem is also important for foundational studies in machine learning and knowledge discovery.

Acknowledgements. Many thanks to Professor Andrzej Skowron for insightful comments which helped improve the final version of the paper. The research has been partially supported by the grants N N516 368334 and N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland.
References 1. Hornby, A.S. (ed.): Oxford Advanced Learner’s Dictionary of Current English, 7th edn. with Vocabulary Trainer. Oxford University Press, Oxford (2007) 2. Klir, G.J., Wierman, M.J.: Uncertainty-based Information: Elements of Generalized Information Theory. Physica, Heidelberg (1998) 3. Keefe, R.: Theories of Vagueness. Cambridge University Press, Cambridge (2000) 4. Demri, S., Orlowska, E. (eds.): Incomplete Information: Structure, Inference, Complexity. Springer, Heidelberg (2002) 5. Kephart, J.O.: Research challenges of autonomic computing. In: Proc. 27th Int. Conf. on Software Engineering (ICSE 2005), May 2005, pp. 15–22. ACM Press, New York (2005) 6. Liu, J.: Autonomy-oriented computing (AOC): The nature and implications of a paradigm for self-organized computing. In: Proc. 4th Int. Conf. on Natural Computation (ICNC 2008), Jinan, China, October 2008, pp. 3–11. IEEE Computer Society Press, Los Alamitos (2008) 7. Liu, J., Jin, X., Tsui, K.C.: Autonomy Oriented Computing: From Problem Solving to Complex Systems Modeling. Kluwer, Dordrecht (2005) 8. Jankowski, A., Skowron, A.: A wistech paradigm for intelligent systems. In: Peters, J.F., Skowron, A., D¨ untsch, I., Grzymala-Busse, J.W., Orlowska, E., Polkowski, L. (eds.) Transactions on Rough Sets VI. LNCS, vol. 4374, pp. 94–132. Springer, Heidelberg (2007)
9. Kondratoff, Y., Michalski, R.S. (eds.): Machine Learning: An Artificial Intelligence Approach, vol. 3. Morgan Kaufmann, San Mateo (1990) 10. Michalski, R.S., Carbonell, T.J., Mitchell, T.M. (eds.): Machine Learning: An Artificial Intelligence Approach. TIOGA Publ., Palo Alto (1983) 11. Michalski, R.S., Tecuci, G. (eds.): Machine Learning – A Multistrategy Approach, vol. 4. Morgan Kaufmann, San Mateo (1994) 12. Mitchell, T.M.: Machine Learning. McGraw-Hill, Portland (1998) 13. Cios, K.J., Pedrycz, W., Swiniarski, R.W., Kurgan, L.A.: Data Mining: A Knowledge Discovery Approach. Springer Science + Business Media, LLC (2007) ˙ 14. Kloesgen, W., Zytkow, J.: Handbook of Knowledge Discovery and Data Mining. Oxford University Press, Oxford (2002) 15. Maimon, O., Rokach, L. (eds.): The Data Mining and Knowledge Discovery Handbook. Springer, Heidelberg (2005) 16. Kahneman, D., Slovic, P., Tversky, A. (eds.): Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, New York (1982) 17. Kant, I.: Critique of Judgment. Clarendon, Oxford (1988); Transl. by Meredith, J. C. 18. Plous, S.: The Psychology of Judgement and Decision Making. McGraw-Hill, New York (1993) 19. Thiele, L.P.: The Heart of Judgment: Practical Wisdom, Neuroscience, and Narrative. Cambridge University Press, New York (2006) 20. Tarski, A.: The semantical concept of truth and the foundations of semantics. Philosophy and Phenomenological Research 4, 341–375 (1944) 21. Banerjee, M., Chakraborty, M.K.: Rough consequence and rough algebra. In: Ziarko, W. (ed.) Proc. 2nd Int. Workshop on Rough Sets and Knowledge Discovery (RSKD 1993), Banff, Canada, October 1993, pp. 196–207. Springer/British Computer Society, Berlin/London (1994) 22. Barwise, J., Seligman, J.: Information Flow: The Logic of Distributed Systems. Cambridge University Press, Cambridge (1997) 23. Belnap, N.D.: A useful four-valued logic. In: Dunn, J.M., Epstein, G. (eds.) Modern Uses of Multiple-valued Logic, pp. 8–37. Reidel, Dordrecht (1977) 24. Bolc, L., Borowik, P.: Many-valued Logics, vol. 1. Springer, Berlin (1992) 25. Chellas, B.F.: Modal Logic: An Introduction. Cambridge University Press, Cambridge (1980); Reprinted with corrections in 1988 26. Emerson, E.A.: Temporal and modal logic. In: Leeuwen, J.v. (ed.) Handbook of Theoretical Computer Science, vol. B, pp. 995–1072. Elsevier/The MIT Press (1990) 27. Fagin, R., Halpern, J.Y., Moses, Y., Vardi, M.Y.: Reasoning About Knowledge. The MIT Press, Cambridge (1995) 28. Kleene, S.C.: Introduction to Metamathematics. North-Holland, Amsterdam (1952) 29. Kripke, S.A.: Semantical analysis of modal logic I: Normal propositional calculi. Zeit. Math. Logik. Grund. 9, 67–96 (1963) 30. Kripke, S.A.: Semantical analysis of modal logic II: Non-normal propositional calculi. In: Addison, J.W., et al. (eds.) The Theory of Models, pp. 206–220. NorthHolland, Amsterdam (1965) 31. L ukasiewicz, J.: On three-valued logic (in Polish). Ruch Filozoficzny 5, 170–171 (1920); English transl. in [132], pp. 87–88 32. L ukasiewicz, J.: Philosophische Bemerkungen zu mehrwertigen Systemen des Aussagenkalk¨ uls. C. R. Soc. Sci. Lettr. Varsovie 23, 51–77 (1930); English transl. in [132], pp. 153–178
Satisfiability Judgement under Incomplete Information
87
33. Pavelka, J.: On fuzzy logic I. Zeit. Math. Logic Grund. Math. 25, 45–52 (1979); See also parts II and III in the same volume, pp. 119–134, 447–464 34. Pawlak, Z.: Rough logic. Bull. Polish Acad. Sci. Tech. 35, 253–258 (1987) 35. Pogorzelski, W.A.: Notions and Theorems of Elementary Formal Logic. Bialystok Division of Warsaw University, Bialystok (1994) 36. Rescher, N.: Many-valued Logic. McGraw-Hill, New York (1969) 37. Rosser, J.B., Turquette, A.R.: Many-valued Logics. North Holland, Amsterdam (1958) 38. Segerberg, K.: An Essay in Classical Modal Logic, vol. 1-3. Uppsala Universitet (1971) 39. Zadeh, L.A.: Fuzzy logic and approximate reasoning. Synthese 30, 407–428 (1975) 40. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading (1974) 41. Cook, S.A.: The complexity of theorem proving procedure. In: Proc. 3rd Annual ACM Symp. on Theory of Computing, pp. 151–158 (1971) 42. Penczek, W., Szreter, M.: SAT-based unbounded model checking of timed automata. Fundamenta Informaticae 85, 425–440 (2008) 43. Penczek, W., Wo´zna, B., Zbrzezny, A.: Bounded model checking for the universal fragment of CTL. Fundamenta Informaticae 51, 135–156 (2002) 44. Wo´zna, B., Zbrzezny, A., Penczek, W.: Checking reachability properties for timed automata via SAT. Fundamenta Informaticae 55, 223–241 (2003) 45. Pawlak, Z.: Information systems – theoretical foundations. Information Systems 6, 205–218 (1981) 46. Pawlak, Z.: Information Systems: Theoretical Foundations (in Polish). Wydawnictwo Naukowo-Techniczne, Warsaw (1983) 47. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991) 48. Pawlak, Z.: Rough set elements. In: [103], vol. 1, pp. 10–30 (1998) 49. Bazan, J.G.: Hierarchical classifiers for complex spatio-temporal concepts. In: Peters, J.F., Skowron, A., Rybi´ nski, H. (eds.) Transactions on Rough Sets IX. LNCS, vol. 5390, pp. 474–750. Springer, Heidelberg (2008) 50. Fensel, D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer, Berlin (2003) 51. Nguyen, S.H., Nguyen, H.S.: Improving rough classifiers using concept ontology. In: Ho, T.-B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 312–322. Springer, Heidelberg (2005) 52. Nguyen, S.H., Nguyen, T.T., Nguyen, H.S.: Ontology driven concept approximation. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowi´ nski, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 547–556. Springer, Heidelberg (2006) ´ ezak, 53. Skowron, A., Stepaniuk, J.: Ontological framework for approximation. In: Sl D., Wang, G., Szczuka, M.S., D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 718–727. Springer, Heidelberg (2005) 54. Staab, S., Studer, R. (eds.): Handbook on Ontologies. Springer, Heidelberg (2004) 55. Gomoli´ nska, A.: Variable-precision compatibility spaces. Electronical Notices in Theoretical Computer Science 82, 1–12 (2003), http://www.elsevier.nl/locate/entcs/volume82.html 56. Gomoli´ nska, A.: Approximation spaces based on relations of similarity and dissimilarity of objects. Fundamenta Informaticae 79, 319–333 (2007) 57. Pawlak, Z.: A treatise on rough sets. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 1–17. Springer, Heidelberg (2005)
88
A. Gomoli´ nska
58. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996) 59. Slowi´ nski, R., Greco, S., Matarazzo, B.: Dominance-based rough set approach to reasoning about ordinal data. In: Kryszkiewicz, M., Peters, J.F., Rybi´ nski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 5–11. Springer, Heidelberg (2007) 60. Slowi´ nski, R., Vanderpooten, D.: Similarity relation as a basis for rough approximations. In: Wang, P.P. (ed.) Advances in Machine Intelligence and Soft Computing, vol. 4, pp. 17–33. Duke University Press (1997) 61. Yao, Y.Y., Wong, S.K.M.: A decision theoretic framework for approximating concepts. Int. J. of Man–Machine Studies 37, 793–809 (1992) 62. Yao, Y.Y., Wong, S.K.M., Lin, T.Y.: A review of rough set models. In: Lin, T.Y., Cercone, N. (eds.) Rough Sets and Data Mining: Analysis of Imprecise Data, pp. 47–75. Kluwer, Dordrecht (1997) 63. Ziarko, W.: Variable precision rough set model. J. Computer and System Sciences 46, 39–59 (1993) ´ ezak, D., Wang, G., Szczuka, M.S., 64. Ziarko, W.: Probabilistic rough sets. In: Sl D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 283– 293. Springer, Heidelberg (2005) 65. Zadeh, L.A.: Outline of a new approach to the analysis of complex system and decision processes. IEEE Trans. on Systems, Man, and Cybernetics 3, 28–44 (1973) 66. Zadeh, L.A.: Fuzzy sets and information granularity. In: Gupta, M., Ragade, R., Yager, R. (eds.) Advances in Fuzzy Set Theory and Applications, pp. 3–18. NorthHolland, Amsterdam (1979) 67. Gomoli´ nska, A.: Judgement of satisfiability under incomplete information. In: Czaja, L., Szczuka, M. (eds.) Proc. 18th Workshop on Concurrency, Specification and Programming (CS& P 2009), Krak´ ow Przegorzaly, September 2009, vol. 1. Warsaw University, Warsaw, pp. 164–175 (2009) 68. Gomoli´ nska, A.: A graded meaning of formulas in approximation spaces. Fundamenta Informaticae 60, 159–172 (2004) 69. Gomoli´ nska, A.: On rough judgment making by socio-cognitive agents. In: Skowron, A., et al. (eds.) Proc. 2005 IEEE/WIC/ACM Int. Conf. on Intelligent Agent Technology (IAT 2005), Compi`egne, France, September 2005, pp. 421–427. IEEE Computer Society Press, Los Alamitos (2005) 70. Gomoli´ nska, A.: Satisfiability and meaning of formulas and sets of formulas in approximation spaces. Fundamenta Informaticae 67, 77–92 (2005) 71. Gomoli´ nska, A.: Satisfiability of formulas from the standpoint of object classification: The RST approach. Fundamenta Informaticae 85, 139–153 (2008) 72. Greco, S., Matarazzo, B., Slowi´ nski, R.: Handling missing values in rough set analysis of multi-attribute and multi-criteria decision problems. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 146–157. Springer, Heidelberg (1999) 73. Grzymala-Busse, J.W.: Characteristic relations for incomplete data: A generalization of the indiscernibility relation. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 58–68. Springer, Heidelberg (2005) 74. Kryszkiewicz, M.: Rough set approach to incomplete information system. Information Sciences 112, 39–49 (1998) 75. Lipski, W.: Informational systems with incomplete information. In: Proc. 3rd Int. Symp. on Automata, Languages and Programming, pp. 120–130. Edinburgh University Press, Edinburgh (1976)
Satisfiability Judgement under Incomplete Information
89
76. Stefanowski, J., Tsouki` as, A.: Incomplete information tables and rough classification. Computational Intelligence 17, 545–566 (2001) 77. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978); See also An introduction to the MDL Principle, http://www.mdl-research.org/jorma.rissanen 78. Gomoli´ nska, A.: Construction of rough information granules. In: [82], pp. 449–470 (2008) 79. Inuiguchi, M., Hirano, S., Tsumoto, S. (eds.): Rough Set Theory and Granular Computing. Springer, Heidelberg (2003) 80. Nguyen, H.S., Skowron, A., Stepaniuk, J.: Granular computing: A rough set approach. Computational Intelligence 17, 514–544 (2001) 81. Pedrycz, W. (ed.): Granular Computing: An Emerging Paradigm. Physica, Heidelberg (2001) 82. Pedrycz, W., Skowron, A., Kreinovich, V. (eds.): Handbook of Granular Computing. John Wiley & Sons, Chichester (2008) 83. Skowron, A., Stepaniuk, J.: Towards discovery of information granules. In: ˙ Zytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 542–547. Springer, Heidelberg (1999) 84. Skowron, A., Swiniarski, R., Synak, P.: Approximation spaces and information granulation. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 175–189. Springer, Heidelberg (2005) 85. Gomoli´ nska, A.: Possible rough ingredients of concepts in approximation spaces. Fundamenta Informaticae 72, 139–154 (2006) 86. Polkowski, L., Skowron, A.: Rough mereology in information systems. A case study: Qualitative spatial reasoning. In: [104], pp. 89–135 (2001) 87. Stepaniuk, J.: Knowledge discovery by application of rough set models. In: [104], pp. 137–233 (2001) 88. Le´sniewski, S.: Foundations of the General Set Theory 1 (in Polish), Moscow. Works of the Polish Scientific Circle, vol. 2 (1916); Also in [89], pp 128–173 89. Surma, S.J., Srzednicki, J.T., Barnett, J.D. (eds.): Stanislaw Le´sniewski Collected Works. Kluwer/Polish Scientific Publ., Dordrecht/Warsaw (1992) 90. Polkowski, L., Skowron, A.: Rough mereology. In: Ra´s, Z.W., Zemankova, M. (eds.) ISMIS 1994. LNCS (LNAI), vol. 869, pp. 85–94. Springer, Heidelberg (1994) 91. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. Int. J. Approximated Reasoning 15, 333–365 (1996) 92. Polkowski, L., Skowron, A.: Towards adaptive calculus of granules. In: [133], vol. 1, pp. 201–228 (1999) 93. Drwal, G., Mr´ ozek, A.: System RClass – software implementation of a rough classifier. In: Klopotek, M.A., Michalewicz, M., Ra´s, Z.W. (eds.) Proc. 7th Int. Symp. Intelligent Information Systems (IIS 1998), Malbork, Poland, Warsaw, PAS Institute of Computer Science, June 1998, pp. 392–395 (1998) 94. Gomoli´ nska, A.: On certain rough inclusion functions. In: Peters, J.F., Skowron, A., Rybi´ nski, H. (eds.) Transactions on Rough Sets IX. LNCS, vol. 5390, pp. 35–55. Springer, Heidelberg (2008) 95. Gomoli´ nska, A.: Rough approximation based on weak q-RIFs. In: Peters, J.F., et al. (eds.) Transactions on Rough Sets X. LNCS, vol. 5656, pp. 117–135. Springer, Heidelberg (2009) 96. Polkowski, L.: A note on 3-valued rough logic accepting decision rules. Fundamenta Informaticae 61, 37–45 (2004)
90
A. Gomoli´ nska
97. Polkowski, L.: Rough mereology in analysis of vagueness. In: Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds.) RSKT 2008. LNCS (LNAI), vol. 5009, pp. 197–205. Springer, Heidelberg (2008) 98. Xu, Z.B., Liang, J.Y., Dang, C.Y., Chin, K.S.: Inclusion degree: A perspective on measures for rough set data analysis. Information Sciences 141, 227–236 (2002) 99. L ukasiewicz, J.: Die logischen Grundlagen der Wahrscheinlichkeitsrechnung. In: [132], pp. 16–63 (1970); First published Krak´ ow (1913) 100. Zadeh, L.A.: Fuzzy logic = computing with words. IEEE Trans. on Fuzzy Systems 4, 103–111 (1996) 101. Zhao, Y., Yao, Y.Y., Luo, F.: Data analysis based on discernibility and indiscernibility. Information Sciences 177, 4959–4976 (2007) 102. Bazan, J.G., Skowron, A., Swiniarski, R.: Rough sets and vague concept approximation: From sample approximation to adaptive learning. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 39–62. Springer, Heidelberg (2006) 103. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery, vol. 1-2. Physica, Heidelberg (1998) 104. Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.): Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems. Physica, Heidelberg (2001) 105. Stepaniuk, J.: Approximation spaces in multi-relational knowledge discovery. In: Peters, J.F., Skowron, A., D¨ untsch, I., Grzymala-Busse, J.W., Orlowska, E., Polkowski, L. (eds.) Transactions on Rough Sets VI. LNCS, vol. 4374, pp. 351–365. Springer, Heidelberg (2007) 106. Bazan, J.G., Nguyen, S.H., Nguyen, H.S., Skowron, A.: Rough set methods in approximation of hierarchical concepts. In: Tsumoto, S., Slowi´ nski, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 346– 355. Springer, Heidelberg (2004) 107. Nguyen, S.H., Bazan, J.G., Skowron, A., Nguyen, H.S.: Layered learning for concept synthesis. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B.z., ´ Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 187–208. Springer, Heidelberg (2004) 108. Peters, J.F.: Approximation spaces for hierarchical intelligent behavioral system models. In: Dunin-K¸eplicz, B., Jankowski, A., Skowron, A., Szczuka, M. (eds.) Monitoring, Security, and Rescue Techniques in Multiagent Systems, pp. 13–30. Springer, Heidelberg (2005) 109. Stone, P.: Layered Learning in Multi-agent Systems: A Winning Approach to Robotic Soccer. The MIT Press, Cambridge (2000) 110. Synak, P., Bazan, J.G., Skowron, A., Peters, J.F.: Spatio-temporal approximate reasoning over complex objects. Fundamenta Informaticae 67, 249–269 (2005) 111. Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets and rough logic: A KDD perspective. In: [104], pp. 583–646 (2001) 112. Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7, 39–52 (1994) 113. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991) 114. Bazan, J.G.: Discovery of decision rules by matching new objects against data tables. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 521–528. Springer, Heidelberg (1998) 115. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. on Information Theory 13, 21–27 (1967)
Satisfiability Judgement under Incomplete Information
91
116. Duda, R.O., Hart, P.E., Stork, R.: Pattern Classification. John Wiley & Sons, New York (2002) 117. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001) 118. Greco, S., Matarazzo, B., Slowi´ nski, R.: Dominance-based rough set approach to case-based reasoning. In: Torra, V., Narukawa, Y., Valls, A., Domingo-Ferrer, J. (eds.) MDAI 2006. LNCS (LNAI), vol. 3885, pp. 7–18. Springer, Heidelberg (2006) 119. Grzymala-Busse, J.W.: LERS – a system for learning from examples based on rough sets. In: Slowi´ nski, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory, pp. 3–18. Kluwer, Dordrecht (1992) 120. Grzymala-Busse, J.W.: LERS – A data mining system. In: [15], pp. 1347–1351 (2005) 121. Grzymala-Busse, J.W.: Rule induction. In: [15], pp. 255–267 (2005) 122. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Science + Business Media, LLC, New York (2009) 123. Michalski, R.S.: Inferential theory of learning as a conceptual basis for multistrategy learning. Machine Learning 11, 111–151 (1993) 124. Mitchell, M.: Analogy-making as Perception: A Computer Model. The MIT Press, Cambridge (1993) 125. Mitchell, M.: Analogy-making as a complex adaptive system. In: Segel, L.E., Cohen, I.R. (eds.) Design Principles for the Immune System and Other Distributed Autonomous Systems, pp. 335–360. Oxford University Press, New York (2001) 126. Stefanowski, J.: On rough set based approaches to induction of decision rules. In: [103], vol. 1, pp. 500–529 (1998) 127. Stepaniuk, J., Ho´ nko, P.: Learning first-order rules: A rough set approach. Fundamenta Informaticae 61, 139–157 (2004) 128. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, New York (1998) 129. Wojna, A.G.: Analogy-based reasoning in classifier construction. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 277–374. Springer, Heidelberg (2005) 130. Polkowski, L.: Rough Sets: Mathematical Foundations. Physica, Heidelberg (2002) 131. Torra, V., Narukawa, Y.: Modeling Decisions: Information Fusion and Aggregation Operators. Springer, Heidelberg (2007) 132. Borkowski, L. (ed.): Jan L ukasiewicz – Selected Works. North Holland/Polish Scientific Publ., Amsterdam/Warsaw (1970) 133. Zadeh, L.A., Kacprzyk, J. (eds.): Computing with Words in Information/ Intelligent Systems. Physica, Heidelberg (1999)
Irreducible Descriptive Sets of Attributes for Information Systems

Mikhail Moshkov (1), Andrzej Skowron (2), and Zbigniew Suraj (3)

(1) Division of Mathematical and Computer Science and Engineering, King Abdullah University of Science and Technology, P.O. Box 55455, Jeddah 21534, Saudi Arabia
[email protected]
(2) Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
[email protected]
(3) Chair of Computer Science, University of Rzeszów, Rejtana 16A, 35-310 Rzeszów, Poland
[email protected]
Abstract. The maximal consistent extension Ext(S) of a given information system S consists of all objects corresponding to attribute values from S which are consistent with all true and realizable rules extracted from the original information system S. An irreducible descriptive set for the considered information system S is a minimal (relative to inclusion) set B of attributes which defines exactly the set Ext(S) by means of true and realizable rules constructed over attributes from the considered set B. We show that there exists only one irreducible descriptive set of attributes. We present a polynomial-time algorithm for the construction of this set. We also study relationships between the cardinality of the irreducible descriptive set of attributes and the number of attributes in S. The obtained results will be useful for the design of concurrent data models from experimental data.

Keywords: rough sets, information systems, maximal consistent extensions, irreducible descriptive sets.
1 Introduction
Let S = (U, A) be an information system [12], where U is a finite set of objects and A is a finite set of attributes defined on U. We identify objects with the tuples of values of attributes on these objects. The information system S can be considered as a representation of a concurrent system: attributes are interpreted as local processes of the concurrent system, values of attributes – as states of local processes, and objects – as global states of the considered concurrent system. This idea is due to Pawlak [11]. Let Rul(S) be the set of all true realizable rules in S of the kind

a1(x) = b1 ∧ ... ∧ at−1(x) = bt−1 ⇒ at(x) = bt,
where a1, ..., at are pairwise different attributes from A and b1, ..., bt are values of attributes a1, ..., at. True means that the rule is true for any object from U. Realizable means that the left hand side of the rule is true for at least one object from U. Let V(S) be the Cartesian product of the ranges of attributes from A. The knowledge encoded in a given information system S can be represented by means of rules from Rul(S). Besides "explicit" global states, corresponding to objects from U, the concurrent system generated by the considered information system can also have "hidden" global states, i.e., tuples of attribute values from V(S) not belonging to U but consistent with all rules from Rul(S). Such "hidden" states can also be considered as realizable global states. This was a motivation for introducing in [16] the maximal consistent extensions of information systems with both "explicit" and "hidden" global states. More exactly, the maximal consistent extension of U is the set Ext(S) of all objects from V(S) for which each rule from Rul(S) is true. The maximal consistent extensions of information systems were considered in [15,16,21,22,1]. In this paper, we study the problem of construction of an irreducible descriptive set of attributes. A set of attributes B ⊆ A is called a descriptive set for S if there exists a set of rules Q ⊆ Rul(S) constructed over the attributes from B only such that Ext(S) coincides with the set of all objects from V(S) for which all rules from Q are true. Note that if Q is the empty set, then the set of all objects from V(S) for which all rules from Q are true coincides with V(S). A descriptive set B for S is called irreducible if no proper subset of B is a descriptive set for S. Let us consider an example of an information system S = (U, A) for which Ext(S) ≠ U [16]. Let A = {a1, a2} and U = {(0, 1), (1, 0), (0, 2), (2, 0)}. One can show that V(S) = {0, 1, 2}^2, Rul(S) = {a1 = 1 ⇒ a2 = 0, a1 = 2 ⇒ a2 = 0, a2 = 1 ⇒ a1 = 0, a2 = 2 ⇒ a1 = 0}, and Ext(S) = U ∪ {(0, 0)}. It is clear that {a1, a2} is a descriptive set for S, and any set containing only one attribute from A is not a descriptive set for S. Therefore {a1, a2} is an irreducible descriptive set for S, and there are no other irreducible descriptive sets for S. We prove that for any information system S there exists only one irreducible descriptive set of attributes (if Ext(S) = V(S), then the irreducible descriptive set for S is empty), and we present a polynomial-time algorithm for the construction of this set. Let us recall that there is no polynomial-time algorithm for constructing the set Ext(S) from a given information system S [5]. We also study possible relationships between the cardinality of the unique irreducible descriptive set for S and the number of attributes in S for information systems S = (U, A) such that Ext(S) ≠ U. The obtained results will be useful for the study of concurrent systems generated by information systems [9,17,20,23].
For other issues concerning information systems and dependencies in information systems, the reader is referred to, e.g., [2,3,8,10,13,14,18]. This paper is an extension of [7]. The paper consists of nine sections. Irreducible descriptive sets of attributes are considered in Sects. 2–8. Section 9 contains brief conclusions.
2 Maximal Consistent Extensions
Let S = (U, A) be an information system [12], where U = {u1, ..., un} is a set of objects and A = {a1, ..., am} is a set of attributes (functions defined on U). We assume that for any two different numbers i1, i2 ∈ {1, ..., n} the tuples (a1(ui1), ..., am(ui1)) and (a1(ui2), ..., am(ui2)) are different. Hence, for i = 1, ..., n we identify the object ui ∈ U and the corresponding tuple (a1(ui), ..., am(ui)). For j = 1, ..., m let Vaj = {aj(ui) : ui ∈ U}. We assume that |Vaj| ≥ 2 for j = 1, ..., m. We consider the set V(S) = Va1 × ... × Vam as the universe of objects and study extensions U* of the set U such that U ⊆ U* ⊆ V(S). We assume that for any aj ∈ A and any u ∈ V(S) the value aj(u) is equal to the j-th component of u. Let us consider a rule

aj1(x) = b1 ∧ ... ∧ ajt−1(x) = bt−1 ⇒ ajt(x) = bt,   (1)

where t ≥ 1, aj1, ..., ajt ∈ A, b1 ∈ Vaj1, ..., bt ∈ Vajt, and the numbers j1, ..., jt are pairwise different. The rule (1) is called true for an object u ∈ V(S) if there exists l ∈ {1, ..., t − 1} such that ajl(u) ≠ bl, or ajt(u) = bt. The rule (1) is called true if it is true for any object from U. The rule (1) is called realizable if there exists an object ui ∈ U such that aj1(ui) = b1, ..., ajt−1(ui) = bt−1, or t = 1 (i.e., the left hand side of the rule is empty). By Rul(S) we denote the set of all rules each of which is true and realizable. By Ext(S) we denote the set of all objects from V(S) for which each rule from Rul(S) is true. The set Ext(S) is called the maximal consistent extension of U relative to the set of rules Rul(S).
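To make the definitions above concrete, the following minimal Python sketch enumerates Rul(S) and Ext(S) by brute force for the two-attribute example from the Introduction. It is exponential in the number of attributes and only meant for tiny systems; all function and variable names are ours, chosen for illustration.

```python
from itertools import product, combinations

U = [(0, 1), (1, 0), (0, 2), (2, 0)]                   # objects of the introductory example
m = 2                                                  # attributes a1, a2
values = [sorted({u[j] for u in U}) for j in range(m)] # Vaj: value sets observed in U

def true_and_realizable(lhs, rhs):
    """lhs: dict {attribute index: value}; rhs: (attribute index, value)."""
    jt, bt = rhs
    fires = [u for u in U if all(u[j] == b for j, b in lhs.items())]
    realizable = bool(fires) or not lhs                # empty left hand side (t = 1)
    true = all(u[jt] == bt for u in fires)             # true for every object of U
    return realizable and true

rules = []                                             # Rul(S)
for jt in range(m):
    others = [j for j in range(m) if j != jt]
    for k in range(len(others) + 1):
        for lhs_attrs in combinations(others, k):
            for lhs_vals in product(*(values[j] for j in lhs_attrs)):
                lhs = dict(zip(lhs_attrs, lhs_vals))
                for bt in values[jt]:
                    if true_and_realizable(lhs, (jt, bt)):
                        rules.append((lhs, (jt, bt)))

def rule_true_for(u, rule):
    lhs, (jt, bt) = rule
    return any(u[j] != b for j, b in lhs.items()) or u[jt] == bt

V = list(product(*values))                             # V(S)
Ext = [u for u in V if all(rule_true_for(u, r) for r in rules)]
print(sorted(Ext))   # [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)] = U ∪ {(0, 0)}
```

Running the sketch reproduces Ext(S) = U ∪ {(0, 0)}, in agreement with the example discussed in the Introduction.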
3 On Membership to Ext(S)
First, we recall a polynomial-time algorithm B1 from [4] which, for a given information system S = (U, A) and an element u ∈ V(S), recognizes whether this element belongs to Ext(S) or not (see Algorithm 1). Let U = {u1, ..., un} and A = {a1, ..., am}. Let us observe that, using the indiscernibility relation IND(Ai(u)) [12], where Ai(u) = {al : l ∈ Mi(u)}, we obtain that Pij(u) = aj([u]_{IND(Ai(u))}), i.e., Pij(u) is equal to the image under aj of the Ai(u)-indiscernibility class [u]_{IND(Ai(u))} defined by u. The considered algorithm is based on the following criterion.

Proposition 1. [4] The relation u ∈ Ext(S) holds if and only if |Pij(u)| ≥ 2 for any i ∈ {1, ..., n} and j ∈ {1, ..., m} \ Mi(u).
Algorithm 1. Algorithm B1
Input: Information system S = (U, A), where U = {u1, ..., un}, A = {a1, ..., am}, and u ∈ V(S).
Output: "Yes" if u ∈ Ext(S), and "No" otherwise.
for i = 1, ..., n do
    Mi(u) ← {j ∈ {1, ..., m} : aj(u) = aj(ui)};
end
for i ∈ {1, ..., n} and j ∈ {1, ..., m} \ Mi(u) do
    Pij(u) ← {aj(ut) : ut ∈ U and al(ut) = al(u) for each l ∈ Mi(u)};
end
if |Pij(u)| ≥ 2 for any i ∈ {1, ..., n} and j ∈ {1, ..., m} \ Mi(u) then
    return "Yes";
else
    return "No";
end
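The following is a direct Python transcription of Algorithm B1, under the same conventions as the sketch in Section 2 (objects are value tuples, attribute a_{j+1} is component j); the function name in_extension is ours.

```python
def in_extension(U, u):
    """Sketch of Algorithm B1: decide whether u belongs to Ext(S) via Proposition 1."""
    n, m = len(U), len(U[0])
    for i in range(n):
        # M_i(u): indices of attributes on which u agrees with u_i
        M = {j for j in range(m) if u[j] == U[i][j]}
        for j in set(range(m)) - M:
            # P_ij(u): values of a_j on objects of U agreeing with u on M_i(u)
            P = {ut[j] for ut in U if all(ut[l] == u[l] for l in M)}
            if len(P) < 2:
                return False    # criterion of Proposition 1 violated
    return True
```

For the introductory example, [u for u in V if in_extension(U, u)] returns the same set Ext(S) as the brute-force enumeration of Rul(S) above, but without ever constructing the rule set.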
4 Separating Sets of Attributes
A set of attributes B ⊆ A is called a separating set for Ext(S) if for any two objects u ∈ Ext(S) and v ∈ V(S) \ Ext(S) there exists an attribute aj ∈ B such that aj(u) ≠ aj(v) or, which is the same, the tuples u and v are different in the j-th component. A separating set for Ext(S) is called irreducible if none of its proper subsets is a separating set for Ext(S). It is clear that the set of irreducible separating sets for Ext(S) coincides with the set of decision reducts for the decision system D = (V(S), A, d), where for any u ∈ V(S) we have d(u) = 1 if u ∈ Ext(S), and d(u) = 0 if u ∉ Ext(S). Let us show that the core for this decision system is a reduct. This means that D has exactly one reduct, coinciding with the core. We denote by C(Ext(S)) the set of attributes aj ∈ A such that there exist two objects u ∈ Ext(S) and v ∈ V(S) \ Ext(S) which are different only in the j-th component. It is clear that C(Ext(S)) is the core for D, and C(Ext(S)) is a subset of each reduct for D.

Proposition 2. The set C(Ext(S)) is a reduct for the decision system D = (V(S), A, d).

Proof. Let us consider two objects u ∈ Ext(S) and v ∈ V(S) \ Ext(S). Let us show that these objects are different on an attribute from C(Ext(S)). Let u and v be different in p components j1, ..., jp. Then there exists a sequence u1, ..., up+1 of objects from V(S) such that u = u1, v = up+1, and for i = 1, ..., p the objects ui and ui+1 are different only in the component with the number ji. Since u1 ∈ Ext(S) and up+1 ∈ V(S) \ Ext(S), there exists i ∈ {1, ..., p} such that ui ∈ Ext(S) and ui+1 ∈ V(S) \ Ext(S). Therefore, aji ∈ C(Ext(S)). It is clear that u and v are different on the attribute aji. Thus, C(Ext(S)) is a reduct for D.
From Proposition 2 it follows that C(Ext(S)) is the unique reduct for the decision system D. Thus, a set B ⊆ A is a separating set for Ext(S) if and only if C(Ext(S)) ⊆ B. One can show that C(Ext(S)) = ∅ if and only if Ext(S) = V (S).
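For comparison, the core C(Ext(S)) can be computed directly from its definition by enumerating all of V(S). The sketch below (reusing the hypothetical in_extension function from Section 3) is exponential in |A|; the construction presented in the next section avoids this cost.

```python
from itertools import product

def core_by_definition(U):
    """Exponential computation of C(Ext(S)) straight from the definition."""
    m = len(U[0])
    values = [sorted({u[j] for u in U}) for j in range(m)]
    V = list(product(*values))
    Ext = {u for u in V if in_extension(U, u)}          # needs the B1 sketch above
    core = set()
    for u in Ext:
        for j in range(m):
            for b in values[j]:
                v = u[:j] + (b,) + u[j + 1:]
                if v != u and v not in Ext:
                    core.add(j)                          # attribute a_{j+1} is in the core
    return core
```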
5 On Construction of C(Ext(S))
In this section, we present a polynomial-time algorithm for the construction of C(Ext(S)). First, we define an auxiliary set N(Ext(S)). Next, we present a polynomial-time algorithm for constructing this set, and finally we show that this auxiliary set N(Ext(S)) is equal to C(Ext(S)). Let us define the set N(Ext(S)). An attribute aj ∈ A belongs to N(Ext(S)) if and only if there exist objects u ∈ U and v ∈ V(S) \ Ext(S) such that u and v are different only in the j-th component. Notice that the only difference between the definition of N(Ext(S)) and the definition of C(Ext(S)) is the first condition, on u. In the former case we require u ∈ U and in the latter case u ∈ Ext(S). We now describe a polynomial-time algorithm B2 for the construction of the set N(Ext(S)).

Algorithm 2. Algorithm B2
Input: Information system S = (U, A), where A = {a1, ..., am}.
Output: The set N(Ext(S)).
N(Ext(S)) ← ∅;
for u ∈ U do
    for j ∈ {1, ..., m} and b ∈ Vaj \ {bj}, where u = (b1, ..., bm) do
        v ← (b1, ..., bj−1, b, bj+1, ..., bm);
        Apply algorithm B1 to v;
        if algorithm B1 returns "No" then
            N(Ext(S)) ← N(Ext(S)) ∪ {aj};
        end
    end
end
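Below is a Python sketch of Algorithm B2, again reusing the hypothetical in_extension function from Section 3; it only modifies objects of U one component at a time, so the number of calls to B1 is polynomial in the size of S.

```python
def n_ext(U):
    """Sketch of Algorithm B2: construct N(Ext(S)) using the B1 membership test."""
    m = len(U[0])
    values = [sorted({u[j] for u in U}) for j in range(m)]
    N = set()
    for u in U:
        for j in range(m):
            for b in values[j]:
                if b == u[j]:
                    continue
                v = u[:j] + (b,) + u[j + 1:]            # change only the j-th component
                if not in_extension(U, v):              # B1 answers "No"
                    N.add(j)                            # a_{j+1} belongs to N(Ext(S))
    return N
```

For the example from the Introduction this returns {0, 1}, i.e., N(Ext(S)) = {a1, a2}, coinciding with the core computed by definition above, as Theorem 1 below guarantees in general.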
Theorem 1. C(Ext(S)) = N(Ext(S)).

Proof. Let ar ∈ A. It is clear that if ar ∈ N(Ext(S)) then ar ∈ C(Ext(S)). We now show that if ar ∉ N(Ext(S)) then ar ∉ C(Ext(S)). To this end we must prove that for any two objects u and v from V(S), if u ∈ Ext(S) and v is different from u only in the r-th component, then v ∈ Ext(S). Let us assume that u ∈ Ext(S) and v ∈ V(S) is different from u only in the r-th component. We now show that v ∈ Ext(S). Taking into account that u ∈ Ext(S) and using Proposition 1, we conclude that |Pij(u)| ≥ 2 for any i ∈ {1, ..., n} and j ∈ {1, ..., m} \ Mi(u). We now show that |Pij(v)| ≥ 2 for any i ∈ {1, ..., n} and j ∈ {1, ..., m} \ Mi(v). Let us consider four cases.
1. Let r ∉ Mi(u) and ar(v) = ar(ui). Then Mi(v) = Mi(u) ∪ {r} and j ≠ r. Since |Pij(u)| ≥ 2, there exists an object ut ∈ U such that al(ut) = al(u) for each l ∈ Mi(u) and aj(ut) ≠ aj(ui). If ar(v) = ar(ut) then |Pij(v)| ≥ 2. Let ar(v) ≠ ar(ut). We denote by w an object from V(S) which is different from ut only in the r-th component and for which ar(w) = ar(v). Since ar ∉ N(Ext(S)), we have w ∈ Ext(S). Let us assume that Ki = {s ∈ {1, ..., m} : as(w) = as(ui)}. It is clear that Mi(v) ⊆ Ki and j ∉ Ki. Taking into account that w ∈ Ext(S) and using Proposition 1, we conclude that there exists an object up ∈ U such that al(up) = al(ui) for each l ∈ Ki and aj(up) ≠ aj(ui). Since Mi(v) ⊆ Ki, we obtain |Pij(v)| ≥ 2.
2. Let r ∉ Mi(u) and ar(v) ≠ ar(ui). Then Mi(v) = Mi(u). Since |Pij(u)| ≥ 2, there exists an object ut ∈ U such that al(ut) = al(u) for each l ∈ Mi(u) and aj(ut) ≠ aj(ui). Taking into account that Mi(v) = Mi(u) and al(ut) = al(v) for each l ∈ Mi(u), we obtain |Pij(v)| ≥ 2.
3. Let r ∈ Mi(u) and r ≠ j. Then Mi(v) = Mi(u) \ {r}. Since |Pij(u)| ≥ 2, there exists an object ut ∈ U such that al(ut) = al(u) for each l ∈ Mi(u) and aj(ut) ≠ aj(ui). It is clear that al(ut) = al(v) for each l ∈ Mi(v) and aj(ut) ≠ aj(ui). Therefore, |Pij(v)| ≥ 2.
4. Let r ∈ Mi(u) and r = j. Then Mi(v) = Mi(u) \ {r}. By w we denote an object from V(S) which is different from ui only in the r-th component. Since ar ∉ N(Ext(S)), we have w ∈ Ext(S). Using Proposition 1, one can show that there exists an object up ∈ U which is different from ui only in the r-th component. It is clear that al(up) = al(v) for each l ∈ Mi(v), and ar(up) ≠ ar(ui). Therefore, |Pij(v)| ≥ 2.
Using Proposition 1, we obtain v ∈ Ext(S). Thus, ar ∉ C(Ext(S)).
6 Descriptive Sets of Attributes
In this section, we show that the maximal consistent extension Ext(S) of a given information system S cannot be defined by any system of true and realizable rules in S constructed over a set of attributes not including C(Ext(S)).

Proposition 3. Let Q be a set of true realizable rules in S such that the set of objects from V(S), for which any rule from Q is true, coincides with Ext(S), and let B be the set of attributes from A occurring in rules from Q. Then C(Ext(S)) ⊆ B.

Proof. Let us assume the contrary: aj ∉ B for some attribute aj ∈ C(Ext(S)). Since aj ∈ C(Ext(S)), there exist objects u ∈ Ext(S) and v ∈ V(S) \ Ext(S) which are different only in the component with the number j. Let us consider a rule from Q which is not true for the object v. Since u and v differ only in the j-th component and this rule does not contain the attribute aj, the considered rule is not true for u either, which is impossible because u ∈ Ext(S).
Now, we will show that, using true realizable rules in S with attributes from C(Ext(S)) only, it is possible to describe exactly the set Ext(S).

Proposition 4. There exists a set Q of true realizable rules in S such that the set of objects from V(S), for which any rule from Q is true, coincides with Ext(S), and the rules from Q use only attributes from C(Ext(S)).

Proof. Let us consider an arbitrary rule from the set Rul(S). For definiteness, let it be the rule
(2)
We show that at ∈ C(Ext(S)). Let us assume the contrary, i.e., at ∈ / C(Ext(S)). Since (2) is realizable, there exists an object ui ∈ U such that a1 (ui ) = b1 , . . . , at−1 (ui ) = bt−1 . Since (2) is true, at (ui ) = bt . Using Theorem 1, we obtain at ∈ / N (Ext(S)). Let w be an object from V (S) which is different from ui only in the component with the number t. Since at ∈ / N (Ext(S)), we have w ∈ Ext(S). Using Proposition 1, we conclude that there exists an object up ∈ U which is different from ui only in the component with the number t. It is clear that the rule (2) is not true for up which is impossible. Thus, at ∈ C(Ext(S)). Let us assume that there exists j ∈ {1, . . . , t − 1} such that aj ∈ / C(Ext(S)). Now, we consider the rule al (x) = bl ⇒ at (x) = bt . (3) l∈{1,...,t−1}\{j}
We show that this rule belongs to Rul(S). Since (2) is realizable, (3) is realizable too. We now show that (3) is true. Let us assume the contrary, i.e., that there exists an object ui ∈ U for which (3) is not true. It means that al(ui) = bl for any l ∈ {1, ..., t − 1} \ {j}, and at(ui) ≠ bt. Since (2) is true, aj(ui) ≠ bj. Let us consider the object w ∈ V(S) such that w is different from ui only in the j-th component, and aj(w) = bj. Taking into account that aj ∉ C(Ext(S)) we obtain w ∈ Ext(S), but this is impossible: since (2) belongs to Rul(S), (2) must be true for any object from Ext(S); however, (2) is not true for w. Thus, if we remove from the left hand side of a rule from Rul(S) all conditions with attributes from A \ C(Ext(S)), we obtain a rule from Rul(S) which uses only attributes from C(Ext(S)). We denote by Rul*(S) the set of all rules from Rul(S) which use only attributes from C(Ext(S)). It is clear that the set of objects from V(S), for which each rule from Rul*(S) is true, contains all objects from Ext(S). Let u ∈ V(S) \ Ext(S). Then there exists a rule from Rul(S) which is not true for u. If we remove from the left hand side of this rule all conditions with attributes from A \ C(Ext(S)), we obtain a rule from Rul*(S) which is not true for u. Therefore, the set of objects from V(S), for which each rule from Rul*(S) is true, coincides with Ext(S). Thus, as the set Q we can take the set of rules Rul*(S).
We will say that a subset of attributes B ⊆ A is a descriptive set for S if there exists a set of rules Q ⊆ Rul(S) that uses only attributes from B, and the set of objects from V(S), for which each rule from Q is true, coincides with Ext(S). A descriptive set B will be called irreducible if no proper subset of B is a descriptive set for S. The next statement follows immediately from Propositions 3 and 4.

Theorem 2. The set C(Ext(S)) is the unique irreducible descriptive set for S.

From Theorem 1 it follows that C(Ext(S)) = N(Ext(S)). The algorithm B2 allows us to construct the set N(Ext(S)) in polynomial time. Let us consider three examples.

Example 1. We now consider an example of an information system S = (U, A) for which Ext(S) ≠ U and the irreducible descriptive set for S is empty. Let A = {a1, a2} and U = {(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)}. One can show that V(S) = {0, 1, 2}^2 and Rul(S) = ∅. Therefore Ext(S) = V(S) and Ext(S) \ U = {(0, 0), (1, 1), (2, 2)}. It is clear that the empty set of attributes is the unique irreducible descriptive set for S.

Example 2. Let m ≥ 3, U ⊆ {0, 1, 2}^m, U = {(0, 0, ..., 0), (1, 0, ..., 0), ..., (0, 0, ..., 1), (2, 2, ..., 2), (1, 2, ..., 2), ..., (2, 2, ..., 1)}, A = {a1, ..., am}, and for any aj ∈ A and any u ∈ {0, 1, 2}^m let the value aj(u) be equal to the j-th component of u. It is clear that V(S) = {0, 1, 2}^m, where S = (U, A). In [6] we prove that Ext(S) \ U = {(1, ..., 1)} ∪ ({0, 2}^m \ {(0, ..., 0), (2, ..., 2)}). It means, in particular, that {(1, 2, 0, ..., 0), ..., (1, 0, 0, ..., 2), (2, 0, ..., 0, 1)} ∩ Ext(S) = ∅. From this and from the description of U it follows that N(Ext(S)) = {a1, ..., am}. Using Theorems 1 and 2 we conclude that {a1, ..., am} is the unique irreducible descriptive set for S.

Example 3. Let m ≥ 3, U ⊆ {0, 1, 2}^m, U = {(0, 0, ..., 0), (1, 0, ..., 0), ..., (0, 0, ..., 1), (2, 2, ..., 2), (1, 2, ..., 2), ..., (2, 2, ..., 1), (2, 0, ..., 0), ..., (0, 0, ..., 2)}, A = {a1, ..., am}, and for any aj ∈ A and any u ∈ {0, 1, 2}^m let the value aj(u) be equal to the j-th component of u. It is clear that V(S) = {0, 1, 2}^m, where
S = (U, A). In [6] we prove that Ext(S) \ U = {(1, ..., 1)}. It means, in particular, that {(1, 2, 0, ..., 0), ..., (1, 0, 0, ..., 2), (2, 0, ..., 0, 1)} ∩ Ext(S) = ∅. From this and from the description of U it follows that N(Ext(S)) = {a1, ..., am}. Using Theorems 1 and 2 we conclude that {a1, ..., am} is the unique irreducible descriptive set for S.
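As a sanity check of Examples 1 and 2 for a small m, one can feed the corresponding object sets to the sketches given earlier (this is only feasible for small m, since the brute-force parts are exponential):

```python
# Example 1: the irreducible descriptive set is empty
U1 = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
print(n_ext(U1))                       # set(): N(Ext(S)) is empty

# Example 2 with m = 3: every attribute belongs to the irreducible descriptive set
m = 3
U2 = ([(0,) * m, (2,) * m]
      + [tuple(1 if i == k else 0 for i in range(m)) for k in range(m)]
      + [tuple(1 if i == k else 2 for i in range(m)) for k in range(m)])
print(sorted(n_ext(U2)))               # [0, 1, 2]: N(Ext(S)) = {a1, a2, a3}
```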
7 On Cardinality of Irreducible Descriptive Sets
In this section, we study relationships between the cardinality of the irreducible descriptive set for an information system S = (U, A) and the number of attributes in S in the case when Ext(S) ≠ U. We show that there are no nontrivial relationships between these parameters. To this end we study the set R = {(|C(Ext(S))|, |A|) : S = (U, A) ∈ IS, Ext(S) ≠ U}, where IS is the set of all information systems S = (U, A) for which each attribute from A has at least two values on objects from U, and for different objects from U the tuples of values of attributes from A are different. By Theorems 1 and 2, the set C(Ext(S)) coincides with the set N(Ext(S)) and is the unique irreducible descriptive set for S. We show that R = {(p, q) : p ∈ N(2) ∪ {0}, q ∈ N(2), p ≤ q}, where N(2) = {2, 3, 4, ...} is the set of natural numbers which are greater than 1. It means that there are no nontrivial relationships between the parameters |C(Ext(S))| and |A| in the set of information systems {S : S = (U, A) ∈ IS, Ext(S) ≠ U}. First, we consider a transformation of an arbitrary information system S = (U, A) ∈ IS into an information system S^(t) = (U^(t), A^(t)) with |A| + t attributes, where t ≥ 1. Let us assume A = {a1, ..., am} and u = (u1, ..., um) ∈ V(S). We set clone(u) = {(u1, ..., um, um+1, ..., um+t) : um+1, ..., um+t ∈ {0, 1}}. Then A^(t) = {a1, ..., am, am+1, ..., am+t} and U^(t) = ⋃_{u∈U} clone(u). It is clear that V(S^(t)) = ⋃_{u∈V(S)} clone(u).

Proposition 5. Ext(S^(t)) = ⋃_{v∈Ext(S)} clone(v).

Proof. It is clear that Rul(S) ⊆ Rul(S^(t)). Therefore Ext(S^(t)) ⊆ ⋃_{v∈Ext(S)} clone(v).
Let us assume that there exist u ∈ Ext(S) and v ∈ clone(u) for which v ∉ Ext(S^(t)). Then there exists a rule r ∈ Rul(S^(t)) which is not true for v. One
can show that r contains an attribute from A on the right hand side. We denote by r′ the rule obtained from r by removing from its left hand side all conditions with attributes from {am+1, ..., am+t}. One can show that r′ ∈ Rul(S^(t)) and r′ ∈ Rul(S). It is clear that r′ is not true for u. But this is impossible. Hence ⋃_{v∈Ext(S)} clone(v) ⊆ Ext(S^(t)).
Proposition 6. N(Ext(S)) = N(Ext(S^(t))).

Proof. Let us assume ai ∈ N(Ext(S)). Then there exist tuples u ∈ U and v ∈ V(S) \ Ext(S) which are different only in the i-th component. Let u = (u1, ..., um) and v = (v1, ..., vm). We now consider the two tuples u′ = (u1, ..., um, 0, ..., 0) and v′ = (v1, ..., vm, 0, ..., 0) from V(S^(t)). It is clear that u′ ∈ U^(t). From Proposition 5, it follows that v′ ∈ V(S^(t)) \ Ext(S^(t)). Therefore ai ∈ N(Ext(S^(t))). Let us assume ai ∈ N(Ext(S^(t))). Then there exist tuples u ∈ U^(t) and v ∈ V(S^(t)) \ Ext(S^(t)) which are different only in the i-th component. It is clear that i ∈ {1, ..., m}. Let us assume u = (u1, ..., um, um+1, ..., um+t) and v = (v1, ..., vm, vm+1, ..., vm+t). We now consider the two tuples u′ = (u1, ..., um) and v′ = (v1, ..., vm) from V(S). It is clear that u′ ∈ U. From Proposition 5 it follows that v′ ∈ V(S) \ Ext(S). Therefore ai ∈ N(Ext(S)).
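The clone transformation itself is easy to state in code; the sketch below (again with our own helper names) can be combined with the n_ext sketch from Section 5 to check Proposition 6 empirically on very small systems.

```python
from itertools import product

def clone_system(U, t):
    """Build U^(t): append t binary attributes taking all 0/1 combinations on each object."""
    return [u + tail for u in U for tail in product((0, 1), repeat=t)]

U_t = clone_system([(0, 1), (1, 0), (0, 2), (2, 0)], t=1)   # S^(1) for the introductory example
print(sorted(n_ext(U_t)))                                   # [0, 1]: N(Ext(S^(t))) = N(Ext(S))
```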
From Theorem 1 and Propositions 5 and 6 the next two statements follow.

Corollary 1. Let us assume that S = (U, A) ∈ IS, Ext(S) ≠ U and t is a natural number. Then S^(t) ∈ IS, Ext(S^(t)) ≠ U^(t), |C(Ext(S^(t)))| = |C(Ext(S))| and |A^(t)| = |A| + t.

Corollary 2. Let us assume that (p, q) ∈ R and t is a natural number. Then (p, q + t) ∈ R.

Now we are able to describe exactly the set R = {(|C(Ext(S))|, |A|) : S = (U, A) ∈ IS, Ext(S) ≠ U}. Recall that N(2) = {2, 3, 4, ...}.

Theorem 3. R = {(p, q) : p ∈ N(2) ∪ {0}, q ∈ N(2), p ≤ q}.

Proof. Let us assume that (p, q) ∈ R. Then there exists an information system S = (U, A) ∈ IS such that Ext(S) ≠ U, |C(Ext(S))| = p and |A| = q. It is clear that p ≤ q, p ∈ N(2) ∪ {0, 1} and q ∈ N(2) ∪ {1}. Let us assume that p = 1. Then there exist an attribute ai ∈ A and a set Q ⊆ Rul(S) constructed over the attribute ai only such that Ext(S) coincides with the set of all objects from V(S) for which all rules from Q are true. We know that the attribute ai has at least two values on objects from U. Therefore Q is the empty set and p = 0. Hence p ≠ 1.
Let us assume that q = 1. One can show that in this case V(S) = Ext(S) = U, which contradicts the assumption Ext(S) ≠ U. Therefore q ≠ 1. Thus, R ⊆ {(p, q) : p ∈ N(2) ∪ {0}, q ∈ N(2), p ≤ q}. From Example 1, we have (0, 2) ∈ R. Using the example considered in the introduction, we conclude that (2, 2) ∈ R. Let us assume m ∈ N(2) and m ≥ 3. From Example 2, (m, m) ∈ R. Let (p, q) ∈ R. From Corollary 2 it follows that (p, q + t) ∈ R for any natural t. Therefore {(p, q) : p ∈ N(2) ∪ {0}, q ∈ N(2), p ≤ q} ⊆ R.
8 Descriptions of Ext(S) and Rul(S)
In this section, we outline some problems of more compact description of the sets Ext(S) and Rul(S) which we would like to investigate in our further study. Let us start with a proposal for (approximate) description of maximal extensions. We consider an extension of the language of boolean combinations of descriptors [14] of a given information system by taking, instead of descriptors of the form a = v over a given information system S = (U, A), where a ∈ A, v ∈ Va, and Va is the set of values of a, their generalization to a ∈ W, where W is a nonempty subset of Va. Such new descriptors are called generalized descriptors. The semantics of the generalized descriptor a ∈ W relative to a given information system S = (U, A) is defined by the set ‖a ∈ W‖_{V(S)} = {u ∈ V(S) : a(u) ∈ W}, or by ‖a ∈ W‖_{V(S)} ∩ U if one would like to restrict attention to the set U only. This semantics can be extended, in the standard way, to boolean combinations of descriptors defined by the classical propositional connectives, i.e., conjunction, disjunction, and negation. Let us consider boolean combinations of generalized descriptors defined by conjunctions of generalized descriptors only. We call them templates. Now, we define decision systems with conditional attributes defined by generalized descriptors. Let us consider a sample U′ of objects from V(S) \ U and the set GD of all binary attributes a ∈ W such that (a ∈ W)(u) = 1 if and only if a(u) ∈ W, where u ∈ V(S). Next, we consider decision systems of the form DS_B = (U ∪ U′, B, d), where B ⊆ GD and d(u) = 1 if and only if u ∈ Ext(S). Using such decision systems one can construct classifiers for the set Ext(S). The problem is to search for classifiers with high quality of classification. Searching for such classifiers can be based on the minimal length principle. For example, for any DS_B one can measure the size of a classifier by the size of the generated set of decision rules. The size of a set of decision rules can be defined as the sum of the sizes of the left hand sides of the decision rules from the set. Observe that the left hand sides of the considered decision rules are templates, i.e., conjunctions of generalized descriptors. In this way, some approximate but compact descriptions of Ext(S) by classifiers can be obtained. Another possibility is to use lazy classifiers for Ext(S) based on the DS_B decision systems. Dealing with all rules of a given kind, e.g., all realizable and true deterministic rules [6], one may face problems related to the large size of the set of such rules
in a given information system. Hence, it is necessary to look for a more compact description of such sets of rules. It is worth mentioning that this problem is of great importance in data and knowledge visualization. A language which can help to describe the rule set Rul(S) in a more compact way can be defined by dependencies, i.e., expressions of the form B → C, where B, C ⊆ A (see, e.g., [14]). A dependency B → C is true in S (in symbols, B →_S C = 1) if and only if there is a functional dependency between B and C in S, which can be expressed using the positive region by POS_B(C) = U. Certainly, each dependency B → C true in S represents a set of deterministic, realizable and true decision rules in S. The aim is to select dependencies true in S which represent as many rules as possible from the given rule set Rul(S). For example, in investigating decompositions of information systems [21,20,17], some special dependencies in a given information system, called components, were used. One could also use dependencies called association reducts [19]. The remaining rules from the set Rul(S) which are not represented by the chosen functional dependencies can be added as links between components. They are interpreted in [21,20,17] as constraints or interactions between modules defined by components. The selected dependencies and links create a covering of Rul(S). Assuming that a quality measure for such coverings is fixed, one can consider the minimal exact (or approximate) covering problem for the set Rul(S) by functional dependencies from the selected set of dependencies and some rules from Rul(S). Yet another possibility is to search for minimal subsets of a given Rul(S) from which Rul(S) can be generated using, e.g., some derivation rules.
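As a small illustration of the generalized descriptors and templates introduced at the beginning of this section, the following hypothetical sketch encodes a descriptor a ∈ W as a 0/1 attribute and a template as a conjunction of such descriptors; the attribute indices and value subsets are purely illustrative.

```python
def descriptor(j, W):
    """Generalized descriptor 'a_{j+1} in W' viewed as a binary attribute on value tuples."""
    W = set(W)
    return lambda u: 1 if u[j] in W else 0

def template(*descriptors):
    """Template: conjunction of generalized descriptors."""
    return lambda u: 1 if all(d(u) == 1 for d in descriptors) else 0

def meaning(formula, objects):
    """Semantics of a formula relative to a set of objects, e.g. V(S) or U."""
    return {u for u in objects if formula(u) == 1}

t = template(descriptor(0, {0, 1}), descriptor(1, {0}))      # a1 in {0,1} and a2 in {0}
print(meaning(t, [(0, 0), (0, 1), (1, 0), (2, 0)]))          # {(0, 0), (1, 0)}
```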
9 Conclusions
We proved that for any information system S there exists only one irreducible descriptive set of attributes, and we proposed a polynomial-time algorithm for the construction of this set. We plan to use the obtained results in applications of information systems to the analysis and design of concurrent systems.
Acknowledgements

The research has been supported by the grants N N516 368334 and N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland.
References

1. Delimata, P., Moshkov, M., Skowron, A., Suraj, Z.: Inhibitory Rules in Data Analysis. A Rough Set Approach. Studies in Computational Intelligence, vol. 163. Springer, Heidelberg (2009)
2. Düntsch, I., Gediga, G.: Algebraic Aspects of Attribute Dependencies in Information Systems. Fundamenta Informaticae 29(1-2), 119–134 (1997)
3. Marek, W., Pawlak, Z.: Rough Sets and Information Systems. Fundamenta Informaticae 7(1), 105–116 (1984)
4. Moshkov, M., Skowron, A., Suraj, Z.: On Testing Membership to Maximal Consistent Extensions of Information Systems. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 85–90. Springer, Heidelberg (2006)
5. Moshkov, M., Skowron, A., Suraj, Z.: On Maximal Consistent Extensions of Information Systems. In: Conference Decision Support Systems, Zakopane, Poland, December 2006, vol. 1, pp. 199–206. University of Silesia, Katowice (2007)
6. Moshkov, M., Skowron, A., Suraj, Z.: Maximal Consistent Extensions of Information Systems Relative to Their Theories. Information Sciences 178(12), 2600–2620 (2008)
7. Moshkov, M., Skowron, A., Suraj, Z.: On Irreducible Descriptive Sets of Attributes for Information Systems. In: Chan, C.-C., Grzymala-Busse, J.W., Ziarko, W.P. (eds.) RSCTC 2008. LNCS (LNAI), vol. 5306, pp. 21–30. Springer, Heidelberg (2008)
8. Novotný, J., Novotný, M.: Notes on the Algebraic Approach to Dependence in Information Systems. Fundamenta Informaticae 16, 263–273 (1992)
9. Pancerz, K., Suraj, Z.: Synthesis of Petri Net Models: A Rough Set Approach. Fundamenta Informaticae 55, 149–165 (2003)
10. Pawlak, Z.: Information Systems: Theoretical Foundations. WNT, Warsaw (1983) (in Polish)
11. Pawlak, Z.: Concurrent Versus Sequential – The Rough Sets Perspective. Bulletin of the EATCS 48, 178–190 (1992)
12. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
13. Pawlak, Z., Rauszer, C.: Dependency of Attributes in Information Systems. Bull. Polish. Acad. Sci. Math. 9-10, 551–559 (1985)
14. Pawlak, Z., Skowron, A.: Rudiments of Rough Sets. Information Sciences 177(1), 3–27 (2007); Rough Sets: Some Extensions. Information Sciences 177(1), 28–40 (2007); Rough Sets and Boolean Reasoning. Information Sciences 177(1), 41–73 (2007)
15. Rząsa, W., Suraj, Z.: A New Method for Determining of Extensions and Restrictions of Information Systems. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 197–204. Springer, Heidelberg (2002)
16. Skowron, A., Suraj, Z.: Rough Sets and Concurrency. Bulletin of the Polish Academy of Sciences 41, 237–254 (1993)
17. Skowron, A., Suraj, Z.: Discovery of Concurrent Data Models from Experimental Tables: A Rough Set Approach. In: First International Conference on Knowledge Discovery and Data Mining, pp. 288–293. AAAI Press, Menlo Park (1995)
18. Skowron, A., Stepaniuk, J., Peters, J.F.: Rough Sets and Infomorphisms: Towards Approximation of Relations in Distributed Environments. Fundamenta Informaticae 54(1-2), 263–277 (2003)
19. Ślęzak, D.: Association Reducts: A Framework for Mining Multi-attribute Dependencies. In: Hacid, M.-S., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds.) ISMIS 2005. LNCS (LNAI), vol. 3488, pp. 354–363. Springer, Heidelberg (2005)
20. Suraj, Z.: Discovery of Concurrent Data Models from Experimental Tables: A Rough Set Approach. Fundamenta Informaticae 28, 353–376 (1996)
21. Suraj, Z.: Some Remarks on Extensions and Restrictions of Information Systems. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 204–211. Springer, Heidelberg (2001)
22. Suraj, Z., Pancerz, K.: A New Method for Computing Partially Consistent Extensions of Information Systems: A Rough Set Approach. In: 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, vol. III, pp. 2618–2625. E.D.K., Paris (2006)
23. Suraj, Z., Pancerz, K.: Reconstruction of Concurrent System Models Described by Decomposed Data Tables. Fundamenta Informaticae 71, 121–137 (2006)
Computational Theory Perception (CTP), Rough-Fuzzy Uncertainty Analysis and Mining in Bioinformatics and Web Intelligence: A Unified Framework

Sankar K. Pal

Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata - 700108
[email protected]
Abstract. The concept of computational theory of perceptions (CTP), its characteristics and its relation with fuzzy granulation (f-granulation) are explained. The role of f-granulation in machine and human intelligence and its modeling through rough-fuzzy integration are discussed. The significance of rough-fuzzy synergistic integration is highlighted through three examples, namely, rough-fuzzy case generation, rough-fuzzy c-means and rough-fuzzy c-medoids, along with the role of fuzzy granular computation. Their superiority, in terms of performance and computation time, is illustrated for the tasks of case generation (mining) in large-scale case-based reasoning systems, segmenting brain MR images, and analyzing protein sequences. Different quantitative measures for rough-fuzzy clustering are explained. The effectiveness of rough sets in constructing an ensemble classifier is also illustrated in a part of the article, along with its performance for web service classification. The article includes some of the existing results published elsewhere under different topics related to rough sets and attempts to integrate them with CTP in a unified framework, providing a new direction of research.

Keywords: soft computing, fuzzy granulation, rough-fuzzy computing, bioinformatics, MR image segmentation, case based reasoning, data mining, web service classification.
1 Introduction
Rough set theory [33] is a popular mathematical framework for granular computing. The focus of rough set theory is on the ambiguity caused by limited discernibility of objects in the domain of discourse. Granules are formed by objects that are drawn together by the limited discernibility among them. A rough set represents a set in terms of lower and upper approximations. The lower approximation contains granules that completely belong in the set and the upper approximation contains granules that partially or completely belong in the set. Rough set-based techniques have been used in the fields of pattern recognition
[25,41], image processing [38], and data mining and knowledge discovery from large data sets [5,31]. Recently, rough sets have been found to have extensive application in dimensionality reduction [41] and knowledge encoding [2,19], particularly when the uncertainty is due to granularity in the domain of discourse. They have also been found to be an effective machine learning tool for designing ensemble classifiers. Recently, rough-fuzzy computing has drawn the attention of researchers in the machine learning community. Rough-fuzzy techniques are efficient hybrid techniques based on a judicious integration of the principles of rough sets and fuzzy sets. While the membership functions of fuzzy sets enable efficient handling of overlapping classes, the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in class definitions. Since the rough-fuzzy approach has the capability of providing a stronger paradigm for uncertainty handling, it has greater promise in application domains, e.g., pattern recognition, image processing, dimensionality reduction, data mining and knowledge discovery, where fuzzy sets and rough sets are being effectively used. Its effectiveness in handling large data sets (both in size and dimension) is also evident because of its "fuzzy granulation" characteristics. Some of the challenges arising out of those posed by massive data and high dimensionality, nonstandard and incomplete data, knowledge discovery using linguistic rules, and over-fitting problems can be dealt with well using soft computing and rough-fuzzy approaches. The World Wide Web (WWW) and bioinformatics are the two major forefront research areas where recent data mining finds significant applications. A detailed review explaining the state of the art and the future directions for web mining research in a soft computing framework is provided by Pal et al. [21]. One may note that web mining, although considered to be an application area of data mining on the WWW, demands a separate discipline of research. The reason is that web mining has its own characteristic problems (e.g., page ranking, personalization), because of the typical nature of the data, components involved and tasks to be performed, which usually cannot be handled within the conventional framework of data mining and analysis. Moreover, the web being an interactive medium, the human interface is a key component of most web applications. Bioinformatics can be viewed as a discipline of using computational methods to make biological discoveries [1]. It is an interdisciplinary field mainly involving biology, computer science, mathematics and statistics to analyze biological sequence data, genome content and arrangement, and to predict the function and structure of macromolecules. The ultimate goal is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be derived. With the need to handle large heterogeneous data sets in biology in a robust and computationally efficient manner, soft computing, which provides machinery for handling uncertainty, learning and adaptation with massive parallelism, and powerful search and imprecise reasoning, has recently gained the attention of researchers for efficient mining.
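To make the lower/upper approximation idea recalled at the beginning of this section concrete, here is a minimal Python sketch; the granules (indiscernibility classes) and the target set are illustrative toy data, not taken from the article.

```python
def approximations(granules, X):
    """Rough lower/upper approximations of X with respect to a family of crisp granules."""
    X = set(X)
    lower = set().union(*[set(g) for g in granules if set(g) <= X])     # granules wholly inside X
    upper = set().union(*[set(g) for g in granules if set(g) & X])      # granules overlapping X
    return lower, upper

granules = [{1, 2}, {3, 4}, {5}]     # indiscernibility classes of the universe {1, ..., 5}
X = {2, 3, 4}                        # the concept to approximate
print(approximations(granules, X))   # ({3, 4}, {1, 2, 3, 4})
```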
The significance of some of the soft computing tools for bioinformatics research is reported in different surveys [22,35]. Any discussion of pattern recognition and decision-making in the 21st century would remain incomplete without mention of the Computational Theory of Perceptions (CTP) explained by Zadeh [44,45], which is governed by perception-based computation. Since the boundaries of perceptions (e.g., perception of direction, time, speed, age) are not crisply defined and the attribute values they can accept are granules, the concept of rough-fuzzy computing seems to have a significant role in modeling the f-granulation (i.e., fuzzy-granule) characteristics of CTP. In the present article, we mention some of the results published elsewhere in the areas of the rough-fuzzy approach and fuzzy granular computing, with application to tasks like case generation, classification, and clustering/segmentation in protein sequence and web data, and integrate them with the concept of f-granulation of CTP in a unified framework, thereby showing the greater promise of this line of research. The organization of this paper is as follows. Section 2 introduces the basic notions of the computational theory of perceptions and f-granulation, while Section 3 presents the rough-fuzzy approach to granular computation in general. Section 4 explains the application of rough-fuzzy granulation in case-based reasoning, where the problem of case generation is considered. Sections 5 and 6 demonstrate the concept of rough-fuzzy clustering and some quantitative measures for evaluating the performance of clustering; the problem of segmenting brain MR images is considered as an example. Section 7 demonstrates an application of rough-fuzzy clustering to analyzing protein sequences for determining bio-bases. Section 8 deals with a rough set theoretic ensemble classifier with application to web services. Concluding remarks are given in Section 9.
2 Computational Theory of Perceptions and F-Granulation
The computational theory of perceptions (CTP) [44,45] is inspired by the remarkable human capability to perform a wide variety of physical and mental tasks, including recognition tasks, without any measurements and without any computations. Typical everyday examples of such tasks are parking a car, driving in city traffic, cooking a meal, understanding speech, and recognizing similarities. This capability is due to the crucial ability of the human brain to manipulate perceptions of time, distance, force, direction, shape, color, taste, number, intent, likelihood, and truth, among others. Recognition and perception are closely related. In a fundamental way, a recognition process may be viewed as a sequence of decisions. Decisions are based on information. In most realistic settings, decision-relevant information is a mixture of measurements and perceptions; e.g., the car is six years old but looks almost new. An essential difference between measurement and perception is that, in general, measurements are crisp, while perceptions are fuzzy. In existing
theories, perceptions are converted into measurements, but such conversions are in many cases infeasible, unrealistic or counterproductive. An alternative, suggested by the CTP, is to convert perceptions into propositions expressed in a natural language, e.g., it is a warm day, he is very honest, it is very unlikely that there will be a significant increase in the price of oil in the near future. Perceptions are intrinsically imprecise. More specifically, perceptions are f-granular, that is, both fuzzy and granular, with a granule being a clump of elements of a class that are drawn together by indistinguishability, similarity, proximity or functionality. For example, a perception of height can be described as very tall, tall, medium or short, with very tall, tall, and so on constituting the granules of the variable 'height'. F-granularity of perceptions reflects the finite ability of sensory organs and, ultimately, the brain, to resolve detail and store information. In effect, f-granulation is a human way of achieving data compression. It may be mentioned here that although information granulation in which the granules are crisp (c-granular) plays key roles in both human and machine intelligence, it fails to reflect the fact that in much, perhaps most, of human reasoning and concept formation the granules are fuzzy (f-granular) rather than crisp. In this respect, generality increases as the information ranges from singular (age: 22 yrs) and c-granular (age: 20-30 yrs) to f-granular (age: "young"). This means that CTP has, in principle, a higher degree of generality than qualitative reasoning and qualitative process theory in AI [12,40]. The types of problems that fall under the scope of CTP typically include: perception-based function modeling, perception-based system modeling, perception-based time series analysis, solution of perception-based equations, and computation with perception-based probabilities, where perceptions are described as a collection of different linguistic if-then rules. F-granularity of perceptions puts them well beyond the meaning representation capabilities of predicate logic and other available meaning representation methods [44]. In CTP, meaning representation is based on the use of so-called constraint-centered semantics, and reasoning with perceptions is carried out by goal-directed propagation of generalized constraints. In this way, the CTP adds to existing theories the capability to operate on and reason with perception-based information. This capability is already provided, to an extent, by fuzzy logic and, in particular, by the concept of a linguistic variable and the calculus of fuzzy if-then rules. The CTP extends this capability much further and in new directions. In application to pattern recognition and data mining, the CTP opens the door to a much wider and more systematic use of natural languages in the description of patterns, classes, perceptions and methods of recognition, organization, and knowledge discovery. Upgrading a search engine to a question-answering system is another prospective candidate in web mining for CTP application. However, one may note that dealing with perception-based information is more complex and more effort-intensive than dealing with measurement-based information, and this complexity is the price that has to be paid to achieve superiority.
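To make the distinction between singular, c-granular and f-granular information concrete, the following Python sketch (not taken from the cited works; the membership functions and their breakpoints are assumptions chosen only for illustration) contrasts overlapping fuzzy granules of the variable "age" with a crisp interval granule:

```python
# Illustrative sketch: fuzzy granules ("f-granulation") of the linguistic variable
# "age", contrasted with a crisp interval granule.  All breakpoints are assumed.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# f-granular: overlapping fuzzy granules of "age"
fuzzy_granules = {
    "young":       lambda x: trapezoid(x, 0, 0, 25, 35),
    "middle-aged": lambda x: trapezoid(x, 25, 35, 50, 60),
    "old":         lambda x: trapezoid(x, 50, 60, 100, 100),
}

# c-granular: a crisp interval granule (age 20-30) admits no partial membership
crisp_granule = lambda x: 1.0 if 20 <= x <= 30 else 0.0

age = 28  # singular information (age: 28 yrs)
print({name: round(mu(age), 2) for name, mu in fuzzy_granules.items()})
print("crisp 20-30:", crisp_granule(age))
```

A measurement such as age = 28 receives graded membership in several fuzzy granules at once, which is exactly the flexibility that crisp granulation lacks.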
3 Granular Computation and Rough-Fuzzy Approach
Rough set theory [33] provides an effective means for the analysis of data by synthesizing or constructing approximations (upper and lower) of set concepts from the acquired data. The key notions here are those of "information granule" and "reducts". An information granule formalizes the concept of finite-precision representation of objects in real-life situations, and reducts represent the core of an information system (both in terms of objects and features) in a granular universe. Granular computing (GrC) refers to the domain where computation and operations are performed on information granules (clumps of similar objects or points). It therefore leads to both data compression and a gain in computation time, and finds wide applications [29]. An important use of rough set theory and granular computing in data mining has been in generating logical rules for classification and association [33]. These logical rules correspond to different important regions of a feature space, which represent data clusters roughly. For example, given the object region in Figure 1, rough set theory can, whether supervised or unsupervised, extract the rule F1M ∧ F2M (i.e., feature F1 is M AND feature F2 is M) to encode the object region. This rule, which represents the rectangle (shown by the bold line), provides a crude description of the object region.
Fig. 1. Rough set theoretic rules for an object (feature space spanned by F1 and F2, each granulated into the linguistic values L, M and H; the rule F1M ∧ F2M encloses the object region)
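The situation of Figure 1 can be illustrated with a small Python sketch (not the implementation of any of the cited systems): objects are granulated by indiscernibility of their discretized feature values, the target concept is approximated from below and above, and each granule fully contained in the concept yields a crude rule of the kind shown in the figure. The toy data and linguistic values L/M/H are assumptions.

```python
# Minimal sketch of lower/upper approximations over an indiscernibility granulation.
from collections import defaultdict

# toy data: (F1, F2) already discretized into linguistic values L/M/H (assumed values)
objects = {
    1: ("M", "M"), 2: ("M", "M"), 3: ("M", "M"),
    4: ("L", "M"), 5: ("H", "H"), 6: ("M", "M"),
}
concept = {1, 2, 3, 6}  # objects belonging to the "object region"

# granules = indiscernibility classes with respect to (F1, F2)
granules = defaultdict(set)
for obj, values in objects.items():
    granules[values].add(obj)

lower = set().union(*(g for g in granules.values() if g <= concept))
upper = set().union(*(g for g in granules.values() if g & concept))

print("lower approximation:", sorted(lower))   # granules fully inside the concept
print("upper approximation:", sorted(upper))   # granules at least partly inside
# each granule in the lower approximation yields a crude rule, e.g. F1=M AND F2=M -> concept
for values, g in granules.items():
    if g <= concept:
        print(f"rule: F1={values[0]} AND F2={values[1]} -> object region")
```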
In many situations, when a problem involves incomplete, uncertain and vague information, it may be difficult to differentiate distinct elements and one is forced to consider granules. On the other hand, in some situations, even though detailed information is available, it may be sufficient to use granules in order to have an efficient and practical solution. Granulation is an important step in the human cognition process. From a more practical point of view, the simplicity derived from granular computing is useful for designing scalable data mining algorithms.
There are two aspects of granular computing: one deals with the formation, representation and interpretation of granules (the algorithmic aspect), while the other deals with the utilization of granules for problem solving (the semantic aspect). Several approaches to granular computing have been suggested in the literature, including fuzzy set theory, rough set theory, power algebras and interval analysis [43,47]. The rough set theoretic approach is based on the principles of set approximation and provides an attractive framework for knowledge encoding and discovery. Over the past few years, rough set theory and granular computation have proven to be another soft computing tool which, in various synergistic combinations with fuzzy logic, artificial neural networks and genetic algorithms, provides a stronger framework for achieving tractability, robustness, low-cost solutions and a close resemblance to human-like decision making [27,29,31,46]. For example, rough-fuzzy integration can be considered as a way of emulating the basis for f-granulation in CTP, where perceptions have fuzzy boundaries and granular attribute values. Similarly, rough-neural synergistic integration helps in extracting crude domain knowledge in the form of rules for describing different concepts/classes, and then encoding them as network parameters, thereby constituting the initial knowledge-based network for efficient learning. Since, in granular computing, computations/operations are performed on granules (clumps of similar objects or points) rather than on the individual data points, the computation time is greatly reduced. The results of these investigations, covering both theory and real-life applications, are available in different journals and conference proceedings [32,42]. Some special issues and edited volumes have also come out [23,24,25]. Rough-fuzzy computing is one of the hybridization techniques that has drawn the attention of researchers in recent times, as it promises to provide a much stronger paradigm for uncertainty handling than the individual ones. Recently, a generalized rough set was defined by Sen and Pal [39] for uncertainty handling and for defining rough entropy, based on the following four cases, namely, (i) the set is crisp and the granules are crisp, (ii) the set is fuzzy and the granules are crisp, (iii) the set is crisp and the granules are fuzzy, and (iv) the set is fuzzy and the granules are fuzzy. The f-granulation property of CTP can therefore be modeled using the rough-fuzzy computing framework with one or more of the aforesaid cases. Two examples of rough-fuzzy computing, in case-based reasoning and clustering, are explained in the following sections together with their characteristic features. In the former the granules are fuzzy and the classes are crisp, while in the latter the clusters are fuzzy and the granules are crisp. An application of rough-fuzzy clustering in bioinformatics is mentioned in Section 7, as an example of amino acid sequence analysis for determining bio-bases.
4 Rough-Fuzzy Granulation and Case Based Reasoning
Case-based reasoning (CBR) [10], which is a novel Artificial Intelligence (AI) problem-solving paradigm, involves adaptation of old solutions to meet new
demands, explanation of new situations using old instances (called cases), and performance of reasoning from precedence to interpret new problems. It has a significant role to play in today's pattern recognition and data mining applications involving CTP, particularly when the evidence is sparse. The significance of soft computing to CBR problems has been adequately explained in a recent book by Pal, Dillon and Yeung [26] and by Pal and Shiu [27].
Fig. 2. Rough-fuzzy case generation for a two-dimensional data set
In this section we give an example [28,29] of using the concept of f-granulation, through rough-fuzzy computing, for performing an important task, namely case generation, in large-scale CBR systems. A case may be defined as a contextualized piece of knowledge representing evidence that teaches a lesson fundamental to achieving the goals of a system. While case selection deals with selecting informative prototypes from the data, case generation concerns the construction of 'cases' that need not necessarily include any of the given data points. For generating cases, a linguistic representation of patterns is used to obtain a fuzzy granulation of the feature space. Rough set theory is used to generate dependency rules corresponding to informative regions in the granulated feature space. The fuzzy membership functions corresponding to the informative regions are stored as cases. Figure 2 shows an example of such case generation for a two-dimensional data set having two classes. The granulated feature space has 3² = 9 granules. These granules of different sizes are characterized by three membership functions along each axis, and have ill-defined (overlapping) boundaries. Two dependency rules, class1 ← L1 ∧ H2 and class2 ← H1 ∧ L2, are obtained using rough set theory. The fuzzy membership functions, marked bold, corresponding to the attributes appearing in the rules for a class are stored as its case. Unlike in conventional case selection methods, the cases illustrated in Figure 2 are cluster granules and not sample points. Also, since all the original features may not be required to express the dependency rules, each case involves a reduced number of relevant features. The methodology is therefore suitable
for mining data sets, large both in dimension and size, due to its low time requirement in case generation as well as retrieval.
Fig. 3. Performance of different case generation schemes for the forest cover-type GIS data set with 7 classes, 10 features and 586012 samples
Fig. 4. Performance of different case generation schemes for the handwritten numeral recognition data set with 10 classes, 649 features and 2000 samples
The aforesaid characteristics are demonstrated in Figures 3 and 4 [28,29] for two real-life data sets with 10 and 649 features and 586012 and 2000 samples, respectively. Their superiority over IB3, IB4 [10] and random case selection algorithms, in terms of classification accuracy (with the one-nearest-neighbor rule), case generation (tgen) and retrieval (tret) times, and average storage requirement (average number of features) per case, is evident. The numbers of cases considered for comparison are 545 and 50, respectively. Recently, Li et al. reported a CBR-based classification system combining efficient feature reduction and case selection based on the concept of rough sets [13].
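A rough sketch of the case-generation idea is given below in Python (it is not the published implementation; the membership function breakpoints, the assumed dependency rule and the case layout are illustrative only): each normalized axis is granulated by three overlapping membership functions, and a case stores only the membership functions of the attributes appearing in a rough-set dependency rule.

```python
# Sketch of rough-fuzzy case generation: fuzzify a point into the 3^n granulated
# space and store a case as a small set of relevant membership functions.

def tri(x, a, b, c):
    """Triangular membership with peak at b over support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

granulation = {  # three linguistic granules per (normalized) feature axis, assumed breakpoints
    "L": lambda x: tri(x, -0.5, 0.0, 0.5),
    "M": lambda x: tri(x, 0.0, 0.5, 1.0),
    "H": lambda x: tri(x, 0.5, 1.0, 1.5),
}

def fuzzify(point):
    """Memberships of a feature vector in each linguistic granule, per axis."""
    return [{lab: round(mu(x), 2) for lab, mu in granulation.items()} for x in point]

# suppose rough-set analysis produced the dependency rule  class1 <- L1 and H2;
# the corresponding case keeps only the two relevant membership functions (as parameters)
case_class1 = {"class": 1, "descriptors": {("F1", "L"): (-0.5, 0.0, 0.5),
                                           ("F2", "H"): (0.5, 1.0, 1.5)}}

print(fuzzify([0.2, 0.9]))   # memberships of a sample in each granule, per axis
print(case_class1)           # a case: a fuzzy granule, not a data point
```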
5 Rough-Fuzzy Clustering
Incorporating both fuzzy and rough sets, a new clustering algorithm is described here. This method adds the concept of fuzzy membership from fuzzy sets, and of lower and upper approximations from rough sets, to a clustering algorithm that results in c clusters. While the membership of fuzzy sets enables efficient handling of overlapping partitions, rough sets deal with uncertainty, vagueness, and incompleteness in class definition [15]. In other words, fuzziness is involved here not in determining granules (unlike the case-based method of Section 4), but in handling the uncertainty arising from overlapping regions. Here each cluster is represented by a centroid, a crisp lower approximation, and a fuzzy boundary. The lower approximation influences the fuzziness of the final partition.
Fig. 5. Rough-fuzzy c-means: each cluster is represented by a crisp lower approximation and a fuzzy boundary
Fig. 6. Comparison of DB and Dunn index, and execution time, of HCM, FCM [3], RCM [14], RFCMMBP [18], and RFCM for the Iris data
According to the definitions of the lower approximation and boundary of rough sets, if an object belongs to the lower approximation of a cluster, then it does not belong to any other cluster; that is, the object is contained in that cluster definitely. Thus, the weights of the objects in the lower approximation of a cluster should be independent of other centroids and clusters, and should not be coupled with their similarity with respect to other centroids. Also, the objects in the lower approximation of a cluster should have a similar influence on the corresponding centroid and cluster, whereas if an object belongs to the boundary of a cluster, then it possibly belongs to that cluster and potentially
belongs to another cluster. Hence, the objects in boundary regions should have different influences on the centroids and clusters. So, in the rough-fuzzy c-means algorithm (RFCM), the membership values of objects in the lower approximation are 1, while those in the boundary region are the same as in fuzzy c-means. In other words, RFCM first partitions the data into two classes, the lower approximation and the boundary, and only the objects in the boundary are fuzzified. The new centroid is calculated as the weighted average of the crisp lower approximation and the fuzzy boundary; the computation of the centroid is thus modified to include the effects of both the fuzzy memberships and the lower and upper bounds. In essence, rough-fuzzy clustering tends to strike a compromise between restrictive (hard clustering) and descriptive (fuzzy clustering) partitions.
Fig. 7. Comparison of β index of HCM, FCM [3], RCM [14], RFCMMBP [18], and RFCM
Fig. 8. Some original and segmented images of HCM, FCM [3], RCM [14], RFCMMBP [18], and RFCM
The effectiveness of the algorithm is shown, as an example, for classification of the Iris data set and segmentation of brain MR images, where the centroid of a
cluster is taken as the cluster mean, i.e., rough-fuzzy c-means (RFCM). The Iris data set is a four-dimensional data set containing 50 samples of each of three types of Iris flowers. One of the three clusters (class 1) is well separated from the other two, while classes 2 and 3 have some overlap. The performance of the different c-means algorithms with respect to the DB and Dunn indices [3] is shown in Fig. 6. The reported results establish that RFCM provides the best result, having the lowest DB index and the highest Dunn index with a lower execution time. For segmentation of brain MR images, 100 MR images with different sizes and 16-bit gray levels are tested. All the MR images are collected from the Advanced Medicare and Research Institute (AMRI), Kolkata, India. The comparative performance of the different c-means algorithms is shown in Fig. 7 with respect to the β index [30]. The β index is defined as the ratio of the total variation to the cluster variation of intensity in an image; therefore, for a given number of clusters in an image, a higher β value is desirable. Some of the original images along with their segmented versions obtained with the different c-means algorithms are shown in Fig. 8. The results show that the RFCM algorithm produces more promising segmented images than the conventional methods, both visually and in terms of the β index.
Fig. 9. Scatter plots of the two highest membership values of all the objects in the image data set
Figure 9 shows the scatter plots of the highest and second highest memberships of all the objects in the image data set at the first and final iterations, respectively, considering w = 0.95, ḿ1 = 2.0, and c = 4. The diagonal line represents the zone where the two highest memberships of an object are equal. From Fig. 9, it is observed that although the average difference between the two highest memberships of the objects is very low at the first iteration (δ = 0.145), it becomes very high at the final iteration (δ = 0.652).
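A simplified single-iteration sketch of the RFCM update described above is given below in Python. It is not the authors' implementation: the lower/boundary split threshold delta, the weights w and 1-w, the fuzzifier m and the toy data are all assumed, and the membership formula is the standard fuzzy c-means one.

```python
import numpy as np

def rfcm_step(X, centroids, m=2.0, w=0.95, delta=0.1):
    """One RFCM iteration: FCM memberships, rough split into lower/boundary, centroid update."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)  # FCM memberships

    order = np.argsort(u, axis=1)            # per object: clusters sorted by membership
    best, second = order[:, -1], order[:, -2]
    ambiguous = u[np.arange(len(X)), best] - u[np.arange(len(X)), second] <= delta

    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        lower = (best == i) & ~ambiguous                      # crisp lower approximation
        boundary = ambiguous & ((best == i) | (second == i))  # fuzzy boundary region
        parts = []
        if lower.any():
            parts.append((w, X[lower].mean(axis=0)))                               # weight w
        if boundary.any():
            uw = u[boundary, i] ** m
            parts.append((1.0 - w, (uw[:, None] * X[boundary]).sum(axis=0) / uw.sum()))
        if parts:
            total = sum(wt for wt, _ in parts)
            new_centroids[i] = sum(wt * v for wt, v in parts) / total
    return u, new_centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.5, 0.5]])
u, C = rfcm_step(X, np.array([[0.0, 0.0], [1.0, 1.0]]))
print(np.round(u, 2), np.round(C, 2), sep="\n")
```

Objects whose two highest memberships differ by more than delta fall into the lower approximation of their best cluster and contribute crisply; the remaining, ambiguous objects form the fuzzy boundary and contribute through their memberships.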
6 Quantitative Measures
In this section we describe some quantitative indices for evaluating the performance of rough-fuzzy clustering algorithms, incorporating the concepts of rough sets [15]. The α index is defined in (1):

\alpha = \frac{1}{c} \sum_{i=1}^{c} \frac{w A_i}{w A_i + \tilde{w} B_i} \qquad (1)

where

A_i = \sum_{x_j \in \underline{A}(\beta_i)} (\mu_{ij})^{\acute{m}_1} = |\underline{A}(\beta_i)| \quad \text{and} \quad B_i = \sum_{x_j \in B(\beta_i)} (\mu_{ij})^{\acute{m}_1} \qquad (2)

In (2), μij represents the probabilistic membership of object xj in cluster βi. The parameters w and w̃ correspond to the relative importance of the lower approximation and the boundary region, respectively. The α index represents the average accuracy of the c clusters. It is the average of the ratio of the number of objects in the lower approximation to that in the upper approximation of each cluster. In effect, it captures the average degree of completeness of knowledge about all clusters. A good clustering procedure should make all objects as similar to their centroids as possible. The α index increases with an increase in similarity within a cluster. Therefore, for a given data set and value of c, the higher the similarity values within the clusters, the higher the α value. The value of α also increases with c. In the extreme case when the number of clusters is maximal, i.e., c = n, the total number of objects in the data set, the value of α is 1. When A̲(βi) = Ā(βi) for all i, that is, all the clusters {βi} are exact or definable, we have α = 1; whereas if Ā(βi) = B(βi) for all i, that is, all the lower approximations are empty, the value of α is 0. Thus, 0 ≤ α ≤ 1.
The ρ index represents the average roughness of the c clusters and is defined in (3) by subtracting the average accuracy α from 1:

\rho = 1 - \alpha = 1 - \frac{1}{c} \sum_{i=1}^{c} \frac{w A_i}{w A_i + \tilde{w} B_i} \qquad (3)

where Ai and Bi are given by Equation (2). Note that the lower the value of ρ, the better the overall cluster approximations; also, 0 ≤ ρ ≤ 1. Basically, the ρ index represents the average degree of incompleteness of knowledge about all clusters. The α* index is defined in (4):

\alpha^{*} = \frac{C}{D}, \quad \text{where } C = \sum_{i=1}^{c} w A_i \text{ and } D = \sum_{i=1}^{c} \{ w A_i + \tilde{w} B_i \} \qquad (4)

where Ai and Bi are given by Equation (2). The α* index represents the accuracy of approximation of all clusters; it captures the exactness of the approximate clustering, and a good clustering procedure should make the value of α* as high as possible. The τ index is the ratio of the total number of objects in the lower approximations of all clusters to the cardinality of the universe of discourse U and is given in (5):

\tau = \frac{R}{S}, \quad \text{where } R = \sum_{i=1}^{c} |\underline{A}(\beta_i)| \text{ and } S = |U| = n. \qquad (5)

The τ index basically represents the quality of approximation of a clustering algorithm.
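As a small illustration, the following Python sketch computes the four indices directly from formulas (1)-(5) for a rough-fuzzy partition given as, per cluster, the set of objects in its lower approximation and the memberships of the objects in its boundary. The data layout, the toy partition and the parameter values (w, w̃, ḿ1) are assumptions.

```python
def rough_fuzzy_indices(lower, boundary_mu, n, w=0.95, w_tilde=0.05, m1=2.0):
    A = [float(len(L)) for L in lower]                               # A_i = |lower approximation|
    B = [sum(mu ** m1 for mu in bd.values()) for bd in boundary_mu]  # B_i = sum of mu^m1 over boundary
    per_cluster = [w * a / (w * a + w_tilde * b) if (a or b) else 0.0 for a, b in zip(A, B)]
    alpha = sum(per_cluster) / len(lower)                            # (1) average accuracy
    rho = 1.0 - alpha                                                # (3) average roughness
    alpha_star = sum(w * a for a in A) / sum(w * a + w_tilde * b for a, b in zip(A, B))  # (4)
    tau = sum(len(L) for L in lower) / n                             # (5) quality of approximation
    return alpha, rho, alpha_star, tau

# toy partition: 2 clusters over n = 6 objects (object ids and memberships are arbitrary)
lower = [{0, 1}, {3, 4}]
boundary_mu = [{2: 0.6, 5: 0.5}, {2: 0.4, 5: 0.5}]
print(rough_fuzzy_indices(lower, boundary_mu, n=6))
```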
7 Rough-Fuzzy C-Medoids and Amino Acid Sequence Analysis
In most pattern recognition algorithms, the symbolic representation of amino acids cannot be used directly as input, since the symbols are non-numerical variables; they therefore need encoding prior to input. In this regard, a bio-basis function maps a non-numerical sequence space to a numerical feature space. It uses a kernel function to transform biological sequences to feature vectors directly. Bio-bases consist of sections of biological sequences that code for a feature of interest in the study and are responsible for the transformation of biological data to a high-dimensional feature space. Transformation of the input data to a high-dimensional feature space is performed based on the similarity of an input sequence to a bio-basis with reference to a biological similarity matrix. Thus, the biological content in the sequences can be maximally utilized for accurate modeling. The use of similarity matrices to map features allows the bio-basis function to analyze biological sequences without the need for encoding. One of the important issues for the bio-basis function is how to select a minimum set of bio-bases with maximum information. Here we present an application of the rough-fuzzy clustering algorithms in which the c centroids stand
for c medoids, i.e., we use the rough-fuzzy c-medoids (RFCMdd) algorithm [16] to select the most informative bio-bases. The objective of the RFCMdd algorithm for the selection of bio-bases is to assign all amino acid subsequences to different clusters. Each cluster is represented by a bio-basis, which is the medoid of that cluster. The process begins by randomly choosing a desired number of subsequences as the bio-bases. The subsequences are assigned to one of the clusters based on the maximum value of the similarity between the subsequence and the bio-basis. After the assignment of all the subsequences to the various clusters, the new bio-bases are modified accordingly. The performance of the RFCMdd algorithm for bio-basis selection is presented using five whole human immunodeficiency virus (HIV) protein sequences and the Cai-Chou HIV data set, which can be downloaded from the National Center for Biotechnology Information [20]. The performance of different c-medoids algorithms, such as hard c-medoids (HCMdd), fuzzy c-medoids (FCMdd) [11], rough c-medoids (RCMdd) [16], and rough-fuzzy c-medoids (RFCMdd) [16], is reported with respect to the β index and γ index in [16]. Some of the results (shown in Fig. 10) establish the superiority of RFCMdd, with the lowest γ index and the highest β index. Here the β index signifies the average normalized homology alignment score of the input subsequences with respect to their corresponding medoids or bio-bases; that is, β, providing a measure of the homology alignment score within a cluster, should be as high as possible. The γ index, on the other hand, gives the maximum normalized homology alignment score between all bio-bases, and therefore a low value is desirable. Note that the homology alignment score between a pair of amino acid sequences measures the similarity between them in terms of the probability of mutation of two amino acids, as computed from the 20 × 20 Dayhoff mutation matrix [4].
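The assignment step of the bio-basis idea can be illustrated with the small Python sketch below; it is not the RFCMdd implementation, and the tiny similarity matrix is a made-up stand-in for the 20 × 20 Dayhoff mutation matrix, with arbitrary sequences used only to show the interface.

```python
# Illustrative sketch: each amino-acid subsequence is scored against each bio-basis by a
# position-wise homology score and assigned to the most similar one (the cluster medoid).
SIM = {  # toy pairwise amino-acid similarity scores (symmetric, assumed values)
    ("A", "A"): 2, ("G", "G"): 2, ("V", "V"): 2, ("L", "L"): 2,
    ("A", "G"): 1, ("A", "V"): 0, ("A", "L"): -1,
    ("G", "V"): -1, ("G", "L"): -2, ("V", "L"): 1,
}

def sim(x, y):
    return SIM.get((x, y), SIM.get((y, x), 0))

def homology(subseq, basis):
    """Non-gapped, position-wise alignment score against a bio-basis of equal length."""
    return sum(sim(x, y) for x, y in zip(subseq, basis))

def assign_to_biobases(subsequences, biobases):
    """Map each subsequence to the bio-basis with the highest homology score."""
    return {s: max(biobases, key=lambda b: homology(s, b)) for s in subsequences}

biobases = ["AGVL", "VLLA"]                       # candidate medoids (arbitrary)
subsequences = ["AGVL", "AGGL", "VLLG", "VVLA"]
print(assign_to_biobases(subsequences, biobases))
```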
8 Rough Ensemble Classifier for Web Services
In the previous sections we have explained the concept of knowledge encoding using rough sets and the rough-fuzzy approach for modeling the concept of f-granulation of CTP, and have shown, as an example, how fuzzy granular computation provides a case generation method for decision making which is efficient in terms of classification performance, retrieval time and feature storage. Apart from granulation, the capability of rough sets (in terms of lower and upper approximations) for determining exactness in class definition for ambiguous regions has been explained. Its merits for both hard and fuzzy clustering have also been illustrated for brain image segmentation and the determination of bio-bases from protein sequences. It has been shown that the incorporation of rough sets makes rough-fuzzy clustering faster than fuzzy clustering. In the present section, we describe another application of rough sets, as an ensemble classifier with characteristic features. Here the problem of web service classification is considered as an example to demonstrate its efficiency through various experimental results. (The concept may be extended further into the rough-fuzzy framework by considering the classes and the granules as fuzzy, either singly or together, depending on the problem domain.)
Fig. 10. β and γ values corresponding to different c-medoids (i.e., c bio-bases) for different databases
Rough Ensemble Classifier
In the problem of classification we train a learning algorithm and validate the trained algorithm. This task is performed using some train-test split of a given labeled dataset. In rough set notation, let U be the given categorized dataset and let P = {C1, C2, ..., Ck}, where Ci ≠ ∅ for i = 1, 2, ..., k, C1 ∪ ... ∪ Ck = U, and Ci ∩ Cj = ∅ for i ≠ j, be the partition of U which provides the given k categories of U. The output of a classifier determines a new partition of U. In rough set terminology, each class of the given partition P is a given concept about the dataset, and the output of a classifier determines new concepts about the same dataset. The given concepts can be expressed approximately by upper and lower approximations constructed from the generated concepts. The rough ensemble classifier is designed to extract decision rules from trained classifier ensembles that perform classification tasks [36]. The classification method (RSM) utilizes the trained ensembles to generate a number of instances consisting of the predictions of the individual classifiers as conditional attribute values and the actual classes as decision attribute values. A decision table is then constructed using all the instances, with one instance in each row. Once the decision table is constructed, rough set attribute reduction is performed to determine the core and minimal reducts. The classifiers corresponding to a minimal reduct are then taken to form the classifier ensemble of the RSM classification system. From the minimal reduct, the decision rules are computed by finding the mapping between the decision attribute and the conditional attributes. These decision rules, obtained by the rough set technique, are then used to perform classification tasks. The following theorems exist in this regard.
Theorem 1. The rough set based combination of classifiers provides an optimal classifier combination technique [36].
Theorem 2. The performance of the rough set based ensemble classifier is at least the same as that of every one of its constituent single classifiers [36].
Some experimental results are shown in Fig. 11 to evaluate the performance of RSM, especially in comparison with other methods for combining classifiers, such as bagging, boosting, voting and stacking. Five learning algorithms have been used in the base-level experiments: the tree-learning algorithm C4.5, the rule-learning algorithm CN2, the k-nearest neighbor (k-NN) algorithm, the support vector machine (SVM), and the naive Bayes method. The data sets used are Reuters [6], 20NG, WebKB [7], Dmoz [8], and Looksmart [9]. On each of the five text corpora, RSM, as shown in Fig. 11, is found to perform better [36] than the other three combination techniques, namely AdaBoost, Bagging and Stacking.
Fig. 11. Accuracy comparison of rough set based ensemble classifier with other ensemble classifiers on large datasets
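A schematic Python sketch of the first RSM step, the construction of the meta decision table from base-classifier predictions, is given below. It is not the implementation of [36]; the data layout, classifier names and labels are assumptions, and the subsequent reduct computation and rule extraction are only indicated in a comment.

```python
def build_meta_decision_table(base_predictions, true_labels):
    """base_predictions: dict classifier_name -> list of predicted labels (one per instance).
    Returns the conditional attribute names and the rows of the decision table."""
    names = sorted(base_predictions)
    rows = []
    for i, y in enumerate(true_labels):
        conditional = {name: base_predictions[name][i] for name in names}
        rows.append({**conditional, "decision": y})
    return names, rows

# toy output of three base classifiers on five documents (labels are arbitrary)
preds = {
    "C4.5": ["news", "money", "games", "news", "money"],
    "SVM":  ["news", "money", "news",  "news", "money"],
    "kNN":  ["web",  "money", "games", "news", "web"],
}
truth = ["news", "money", "games", "news", "money"]
attrs, table = build_meta_decision_table(preds, truth)
for row in table:
    print(row)
# Rough set attribute reduction would next be run on this table; a reduct such as
# {C4.5, kNN} would select the classifiers forming the ensemble, and the decision
# rules read off the reduced table would perform the final classification.
```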
Web Service Classification
The transition of the World Wide Web from a paradigm of static Web pages to one of dynamic Web services raises the new and challenging problem of locating desired web services. With the expected growth of the number of Web services available on the web, the need for mechanisms that enable automatic categorization to organize this vast amount of data becomes important. A major limitation of the Web services technology is that finding and composing services require manual effort, which becomes a serious burden with the increasing number of Web services. Describing and organizing this vast amount of resources is essential for realizing the web as an effective information resource. Web service classification has become an important tool for helping the discovery and integration process to organize this vast amount of data. For instance, for categorization in the UDDI (Universal Description Discovery and Integration) registry, one needs to divide the publicly available Web services into a number of categories so that users can limit the search scope. Moreover, Web service classification helps the developer to build integrated Web services. Traditionally, Web service classification is performed manually by domain experts. However, human classification is unlikely to keep pace with the rate of growth of the number of Web services. Hence, as the web continues to grow, automatic Web service classification becomes necessary. We treat the determination of a web service's category as a tag-based text classification problem, where the text comes from different tags of the WSDL file and from UDDI text. Unlike standard texts, WSDL (Web Services Description Language) descriptions are highly structured. We therefore use a tensor space model (TSM) [37] for data representation and a rough set based approach [36] for the classification of Web services. A WSDL page can be better represented as a tensor, i.e., a set of vectors corresponding to different vector spaces representing the different tags of WSDL pages. In other words, the tag-based TSM for web services consists of a two-dimensional tensor where one dimension represents the tags of WSDL and the other represents the terms extracted from WSDL. (Unlike in a matrix, the numbers of terms corresponding to different tags may differ.) Compared with the vector space model (VSM), TSM has lower complexity and captures the structural representation of WSDL better [37]; the tensor space model thus captures the information from the internal structure of WSDL documents along with the corresponding text content. Rough sets are used here to combine the information of the individual tensor components to provide the classification results. A two-step improvement on the existing classification results for web services is shown here: in the first step we achieve better classification results than the existing ones by using the proposed tensor space model, and in the second step a further improvement of the results is obtained by using the rough set based ensemble classifier [36]. The experimental results demonstrate that splitting the feature set based on structure improves the performance of a learning classifier, and that by combining different classifiers it is possible to improve the performance even further.
We gathered corpuses of web services from SALCentral and webservicelist, two categorized web service indexes. The actual taxonomy existing in the web service indexes consists of more classes organized in a tree structure, but in order to simplify the task for our experiment we used only the classes from the taxonomy that were direct descendants of the root. We then discarded categories with fewer than ten instances; the remaining categories have been used in our experiments. The discarded web services tended to be quite obscure, such as a search tool for a music teacher in an area specified by ZIP code. Details of the corpuses are given below.
SALCentral dataset: Business-22, Communication-44, Converter-43, Country Info-62, Developers-34, Finder-44, Games-42, Mathematics-10, Money-54, News-30, Web-39.
Webservicelist dataset: Access & Security-27, Address / Locations-57, Business & Finance-97, Developer Tools-54, Content & Databases-24, Politics & Government-56, Online Validations-26, Stock Quotes-31, Search & Finders-22, Sales Automation-20, Retail Services-30.
In Table 1 the percentage accuracies of classification are compared for the two representation models: WSDL documents corresponding to the above datasets have been represented in the vector space model and in the tag-based tensor space model, respectively [37]. Three well-known classifiers, namely naive Bayes (NB), support vector machine (SVM) and decision tree (C4.5), have been used to provide classification results in the two models [37]. Note that the classification results in the tensor space model have been computed on the individual tensor components and then combined by majority voting. The results show that classification in the tensor space model provides better percentage accuracy than the vector space model for both datasets and for all classifiers considered. In Fig. 12, bar charts of the percentage accuracies of the combined classifiers are given; here all the classifiers have been tested on metadata generated by the base-level classifiers from each tensor component.
Table 1. VSM vs. TSM
Data            Model   NB      SVM     C4.5
salcentral      VSM     47.06   55.50   49.62
salcentral      TSM     53.45   58.78   56.31
webservicelist  VSM     62.21   65.57   64.19
webservicelist  TSM     69.48   73.63   70.90
Fig. 12. Comparison of percentage accuracies of classifiers on salcentral and webservicelist datasets
Naive Bayes (NB), support vector machine (SVM), decision tree (DT), majority vote and rough set (RS) combiners are applied to the outputs of the base-level classifiers corresponding to the individual tensor components. The results show that the rough set combiner provides better classification results than the other methods on both datasets considered.
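As a minimal illustration of the tag-based tensor space model, the Python sketch below turns a WSDL description into one term-frequency vector per tag and combines per-tag predictions by simple majority vote. The tag set, the tokenizer and the toy document are assumptions and do not correspond to the representation used in [37].

```python
from collections import Counter

TAGS = ["service", "operation", "message", "documentation"]   # assumed WSDL tag groups

def to_tensor(tagged_text):
    """tagged_text: dict tag -> extracted text; returns dict tag -> term-frequency Counter."""
    return {tag: Counter(tagged_text.get(tag, "").lower().split()) for tag in TAGS}

def majority_vote(component_predictions):
    """component_predictions: dict tag -> predicted category for one web service."""
    return Counter(component_predictions.values()).most_common(1)[0][0]

wsdl = {
    "service": "StockQuoteService",
    "operation": "GetQuote GetHistoricalQuote",
    "documentation": "returns stock quotes for a given ticker symbol",
}
tensor = to_tensor(wsdl)
print({tag: dict(vec) for tag, vec in tensor.items()})
print(majority_vote({"service": "Stock Quotes", "operation": "Stock Quotes",
                     "documentation": "Business & Finance", "message": "Stock Quotes"}))
```

In the rough set based combiner, the per-tag predictions would instead populate the meta decision table sketched in the previous subsection.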
9 Conclusions
The concepts of knowledge encoding using rough sets, the judicious integration of the merits of rough and fuzzy sets, rough-fuzzy granulation, and their relevance to the computational theory of perceptions (CTP) have been explained. The significance of granular computing and the formulation of the rough ensemble classifier have been illustrated, and ways of modeling the f-granulation property of CTP have been discussed. Three examples of judicious integration, viz., rough-fuzzy case generation, rough-fuzzy c-means and rough-fuzzy c-medoids, have been explained along with their merits and some quantitative indices. These rough-fuzzy methodologies can be viewed under the generalized rough set theoretic framework with four cases, namely, (i) crisp set – crisp granules, (ii) crisp set – fuzzy granules, (iii) fuzzy set – crisp granules, and (iv) fuzzy set – fuzzy granules. For example, the ensemble classifier corresponds to case (i), whereas the case-based reasoning and clustering methods belong to cases (ii) and (iii), respectively.
The significance of rough-fuzzy clustering in protein sequence analysis for determining bio-bases and in the segmentation of brain MR images has been explained. The effectiveness of the rough ensemble classifier, which provides an optimum combination, has been demonstrated in web service classification. The concept of fuzzy granulation through rough-fuzzy computing, and of performing operations on fuzzy granules, provides both information compression and a gain in computation time, thereby making it suitable for data mining applications. Another application, of fuzzy information measures in fuzzy approximation spaces for the feature selection problem, has recently been reported in [17]. The concept of the rough ensemble classifier can be extended into the rough-fuzzy framework with an appropriate criterion depending on the application domain. Rough-fuzzy granular computing (GrC), coupled with the computational theory of perceptions (CTP), has great promise for human-like decision making, for the efficient mining of large, heterogeneous data, and for providing solutions to various real-life ambiguous recognition problems. Apart from the problem of defining granules and their sizes appropriately, future challenges in GrC and CTP include formulating efficient methodologies based on fuzzy granular computing and granular-fuzzy computing for making the aforesaid decision-making tasks more natural and efficient. While the former concerns computation using fuzzy granules, the latter deals with fuzzy computing using granules. Another important application of fuzzy GrC would be granular information retrieval from heterogeneous media like the WWW.
Acknowledgement
The author acknowledges the J.C. Bose Fellowship of the Govt. of India, as well as his co-investigators Dr. C.A. Murthy, Dr. P. Mitra, Dr. P. Maji, Mr. S. Saha, and Mr. D. Sen.
References
1. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge (1998)
2. Banerjee, M., Mitra, S., Pal, S.K.: Rough Fuzzy MLP: Knowledge Encoding and Classification. IEEE Trans. Neural Networks 9, 1203–1216 (1998)
3. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithm. Plenum, New York (1981)
4. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A Model of Evolutionary Change in Proteins. Matrices for Detecting Distant Relationships, Atlas of Protein Sequence and Structure 5, 345–358 (1978)
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2005)
6. http://www.daviddlewis.com/resources/testcollections/reuters21578/
7. http://www.cs.cmu.edu/~WebKB/
8. http://www.dmoz.org/
9. http://www.looksmart.com
10. Kolodner, J.L.: Case-Based Reasoning. Morgan Kaufmann, San Mateo (1993)
11. Krishnapuram, R., Joshi, A., Nasraoui, O., Yi, L.: Low complexity fuzzy relational clustering algorithms for web mining. IEEE Trans. on Fuzzy System 9, 595–607 (2001)
12. Kuipers, B.J.: Qualitative Reasoning. MIT Press, Cambridge (1984)
13. Li, Y., Shiu, S.C.K., Pal, S.K.: Combining feature reduction and case selection in building CBR classifiers. IEEE Trans. on Knowledge and Data Engineering 18, 415–429 (2006)
14. Lingras, P., West, C.: Interval set clustering of web users with rough K-means. Journal of Intelligent Information Systems 23, 5–16 (2004)
15. Maji, P., Pal, S.K.: Rough set based generalized fuzzy C-means algorithm and quantitative indices. IEEE Trans. on Systems, Man and Cybernetics, Part B 37, 1529–1540 (2007)
16. Maji, P., Pal, S.K.: Rough-fuzzy C-medoids algorithm and selection of bio-basis for amino acid sequence analysis. IEEE Trans. Knowledge and Data Engineering 19, 859–872 (2007)
17. Maji, P., Pal, S.K.: Feature Selection Using f-Information Measures in Fuzzy Approximation Spaces. IEEE Trans. Knowledge and Data Engineering (to appear)
18. Mitra, S., Banka, H., Pedrycz, W.: Rough-fuzzy collaborative clustering. IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics 36, 795–805 (2006)
19. Mitra, S., De, R.K., Pal, S.K.: Knowledge Based Fuzzy MLP for Classification and Rule Generation. IEEE Trans. Neural Networks 8, 1338–1350 (1997)
20. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov
21. Pal, S.K., Talwar, V., Mitra, P.: Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Trans. Neural Networks 13, 1163–1177 (2002)
22. Pal, S.K., Bandyopadhyay, S., Ray, S.S.: Evolutionary Computation in Bioinformatics: A Review. IEEE Transactions on Systems, Man, and Cybernetics, Part C 36, 601–615 (2006)
23. Pal, S.K., Skowron, A. (eds.): Rough-Fuzzy Hybridization: A New Trend in Decision Making. Springer, Singapore (1999)
24. Pal, S.K., Polkowski, L., Skowron, A. (eds.): Rough-neuro Computing: A Way to Computing with Words. Springer, Berlin (2003)
25. Pal, S.K., Skowron, A. (eds.): Special issue on Rough Sets, Pattern Recognition and Data Mining. Pattern Recognition Letters 24 (2003)
26. Pal, S.K., Dillon, T.S., Yeung, D.S. (eds.): Soft Computing in Case Based Reasoning. Springer, London (2001)
27. Pal, S.K., Shiu, S.C.K.: Foundations of Soft Case Based Reasoning. John Wiley, NY (2003)
28. Pal, S.K., Mitra, P.: Case generation using rough sets with fuzzy discretization. IEEE Trans. Knowledge and Data Engineering 16, 292–300 (2004)
29. Pal, S.K., Mitra, P.: Pattern Recognition Algorithms for Data Mining. Chapman & Hall CRC Press, Boca Raton (2004)
30. Pal, S.K., Ghosh, A., Sankar, B.U.: Segmentation of remotely sensed images with fuzzy thresholding, and quantitative evaluation. International Journal of Remote Sensing 2, 2269–2300 (2000)
31. Pal, S.K., Polkowski, L., Skowron, A. (eds.): Rough-Neural Computing Techniques for Computing with Words. Springer, Berlin (2004)
32. Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds.): PReMI 2005. LNCS, vol. 3776. Springer, Heidelberg (2005)
33. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic, Dordrecht (1991)
34. Pedrycz, W., Skowron, A., Kreinovich, V. (eds.): Handbook of Granular Computing. John Wiley, N.Y. (2008)
35. Ray, S.S., Bandyopadhyay, S., Mitra, P., Pal, S.K.: Bioinformatics in Neurocomputing Framework. IEE Proc. Circuits Devices & Systems 152, 556–564 (2005)
36. Saha, S., Murthy, C.A., Pal, S.K.: Rough set Based Ensemble Classifier for Web Page Classification. Fundamenta Informaticae 76, 171–187 (2007)
37. Saha, S., Murthy, C.A., Pal, S.K.: Classification of Web Services using Tensor Space Model and Rough Ensemble Classifier. In: Proc. 17th International Symposium on Methodologies for Intelligent Systems, Toronto, Canada, pp. 508–513 (2008)
38. Sen, D., Pal, S.K.: Histogram Thresholding using Fuzzy and Rough Measures of Association Error. IEEE Trans. Image Processing 18, 879–888 (2009)
39. Sen, D., Pal, S.K.: Generalized Rough Sets, Entropy and Image Ambiguity Measures. IEEE Trans. Syst., Man and Cyberns. Part B 39, 117–128 (2009)
40. Sun, R.: Integrating Rules and Connectionism for Robust Commonsense Reasoning. Wiley, N.Y. (1994)
41. Swiniarski, R.W., Skowron, A.: Rough Set Methods in Feature Selection and Recognition. Pattern Recognition Letters 24, 833–849 (2003)
42. Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds.): RSKT 2008. LNCS (LNAI), vol. 5009. Springer, Heidelberg (2008)
43. Yao, Y.Y.: Granular Computing: Basic Issues and Possible Solutions. In: Proceedings of the 5th Joint Conference on Information Sciences, vol. I, pp. 186–189 (2000)
44. Zadeh, L.A.: A new direction in AI: Toward a computational theory of perceptions. AI Magazine 22, 73–84 (2001)
45. Zadeh, L.A., Pal, S.K., Mitra, S.: Foreword. In: Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing. Wiley, New York (1999)
46. Zadeh, L.A.: Fuzzy Logic, Neural Networks, and Soft Computing. Commun. ACM 37, 77–84 (1994)
47. Zadeh, L.A.: Towards a Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets Systems 19, 111–127 (1997)
Decision Rule-Based Data Models Using TRS and NetTRS – Methods and Algorithms
Marek Sikora
Silesian University of Technology, Institute of Computer Sciences, 44-100 Gliwice, Poland
[email protected]
Abstract. The internet service NetTRS (Network TRS), which enables induction, evaluation, and postprocessing of decision rules, is presented in the paper. The TRS (Tolerance Rough Sets) library is the main part of the service. The TRS library makes it possible to induce, generalize and filter decision rules; moreover, TRS enables rule evaluation and carrying out the classification process. The NetTRS service wraps the library in a user interface and makes it accessible over the Internet. NetTRS puts principal emphasis on the induction and postprocessing of decision rules; the paper describes the methods and algorithms that are available in the service.
1 Introduction
Algorithms of rule induction based on examples are part of the wider group of algorithms that realize the learning-by-examples paradigm [25]. Generally, the problem of machine learning is solved not only by algorithms that learn from examples but also by algorithms that facilitate the learning process by analogy, deduction, abduction or explanation [25]. Recently, the application of machine learning methods in practice has increased dramatically, and rule induction algorithms are intensively used in the field of knowledge discovery in databases [23]. At present, research on rule induction algorithms focuses on the possibility of analyzing big data sets and on the induction of rules with a richer representation language. Of course, work on improving the quality of induction algorithms is conducted simultaneously. The purpose of machine learning algorithms, including rule induction, is to describe a given set of examples in a synthetic way; more precisely, to build a description of concepts (a data model) for the examples we have at our disposal. Rule induction algorithms build descriptions in the form of a rule set. In relation to knowledge discovery in databases, we require the determined rules to represent nontrivial and important dependencies that are present in the data. The fact that knowledge represented by single rules can be easily interpreted is a typical feature of rule induction algorithms. However, to understand the knowledge discovered by an algorithm entirely, interpretation of a group of rules describing the same
concept is necessary. From this point of view, the requirement on a rule induction algorithm is to determine small rule sets. The second significant feature that determines the quality of rule induction algorithms is the generalization ability of the determined descriptions: we want the decision algorithm [33] developed on the basis of the induced rules to assign new, unknown examples to the proper concepts as precisely as possible. To recapitulate, a desired feature of rule induction algorithms is the generation of sets with small numbers of rules (good descriptive power) and good generalization abilities. Rough set theory is one of the methodologies that make it possible to obtain rule-based descriptions of data [33]. On the grounds of that theory, as well as its numerous generalizations (tolerance based rough sets [48], variable precision rough sets [59], dominance based rough sets [14]), many algorithms for data discretization, attribute reduction, rule induction, etc. have been proposed. Many programs exploiting rough set theory in data analysis have arisen (LERS [18,19], RSES [6], Rose [38], 4emka2 [13], Rosetta [32], ARES [36]). It seems that RSES, Rose2 and Rosetta are the most developed programs as regards functionality, and RSES offers the most friendly environment for conducting experiments (somewhat similar to the SAS Enterprise Miner environment). Each of the programs mentioned above offers a collection of standard algorithms (reducts, minimal rules, the LEM algorithm) and its own unique set of algorithms (among others: dynamic reducts [5] - RSES, Rosetta; table decomposition [28], constructive induction [56] - RSES; quality based filtering [2] - Rosetta; rule induction using a similarity relation - Rose2). The Tolerance Rough Sets (TRS) library was created as a tool for researching selected methods of obtaining a rule-based data model by means of the tolerance rough sets model [48,54]. For that reason, methods of finding tolerance thresholds, determining reducts and verifying attribute significance were implemented in the library. Afterwards, algorithms that make rule evaluation and postprocessing (joining, filtration) possible were added to the library. The MODLEM algorithm [51,52], which enables rule induction without the necessity of discretizing conditional attributes, was also implemented in it. Modifications of existing methods and unique propositions of methods included in the library were published, among others, in [39,42,43,45]. From the functionality point of view, the library is distinguished from other solutions by the possibility of evaluating rules by means of various rule quality measures. The measures can be used for rule induction, generalization and filtration, and are also used in the classification process. Modified forms of the rule quality measures are applied in searching for tolerance thresholds. With the addition of new algorithms, the TRS library became mostly an environment for postprocessing and rule evaluation. The TRS library is a tool devoid of a graphical interface; experiments are conducted in batch mode by control scripts passed to the executable part of the library. Recently, a WWW service (http://nettrs.polsl.pl/nettrs) that provides a graphical interface for the automatic generation of scripts for the TRS library was created.
2 TRS Library Functionality
The functionality of the TRS library is formally presented below. Algorithms and methods that can be recognized as commonly known are not described here. The example set based on which we induce rules is included in a decision table. The decision table is a set of objects characterized by a feature vector; the last feature is called the decision attribute. The values of the decision attribute assign each example included in the table to the concept that the given example represents.
2.1 Tolerance Thresholds
Let DT = (U, A ∪ {d}) be a decision table, where U is a set of objects, A is a set of conditional attributes, and d is a decision attribute. The domain of an attribute a is denoted by Da. For each attribute a ∈ A a distance function δa: Da × Da → [0, ∞) is defined with the following properties: for all x, y ∈ U, δa(a(x), a(x)) = 0 and δa(a(x), a(y)) = δa(a(y), a(x)), where a(x) means the value of attribute a ∈ A for x ∈ U. The TRS library has two distance functions implemented, known as diff and vdm [54,31]. For each attribute a ∈ A, the value εa is called the tolerance threshold. A similarity relation determined by the set of attributes B = {a1, a2, ..., ak} ⊆ A, τB(εa1, εa2, ..., εak), is defined in the following way:

\forall x, y \in U \quad \langle x, y \rangle \in \tau_B(\varepsilon_{a_1}, \ldots, \varepsilon_{a_k}) \Leftrightarrow \forall a_i \in B \; [\delta_{a_i}(a_i(x), a_i(y)) \le \varepsilon_{a_i}] \qquad (1)

The set of objects similar to the object x ∈ U with regard to the attribute set B ⊆ A determines the value of the uncertainty function IB: U → 2^U, which is defined in the following way [54,31]:
(2)
In other words we can say that IB (x) set is a tolerance set of the element x. When tolerance thresholds are determined it is possible to determine a set of decision rules [54,31] that create descriptions of decision classes. As a decision class we mean the set Xv = {x ∈ U : d(x) = v}, where v ∈ Dd . Accuracy and generality (coverage) of obtained rules is dependent on the distance measure δa and tolerance thresholds values εa . One can choose the form of the δa function arbitrarily, however selection of the εa thresholds is not obvious. If there are decision table DT = (U, A ∪ {d})) and distance function δa for each a ∈ A defined, the new decision table DT’ = (U , A ∪ {d}) can be created, and: U = {< x, y >: (< xi , yj >∈ U × U ) ∧ (i ≤ j)}, A = {(a : U → R+ ): a (< x, y >) = δa (a(x), a(y))}, 0 for d(x) = d(y) d (< x, y >) = 1 otherwise.
It can easily be noticed that, by sorting the objects from U' in ascending order according to the values of any a' ∈ A', one obtains every possible value at which the cardinality of the set Ia(x), x ∈ U, changes. These values are all the reasonable εa values that should be considered during the search for tolerance thresholds. Choosing an εa for each a ∈ A, one obtains a tolerance thresholds vector. Because the set of all possible vectors is very large, namely \prod_{a \in A} \left( \tfrac{1}{2}(|D_a|^2 - |D_a|) + 1 \right) [54], to establish the desired tolerance threshold
values, heuristic [29,31] or evolutionary [54,55] strategies are used. In the TRS library an evolutionary strategy was applied. In the paper [55] Stepaniuk presented a method of searching for a tolerance threshold vector. A genetic algorithm was chosen as the strategy for searching the solution set. A single specimen was represented as a tolerance thresholds vector. For every a ∈ A, the reasonable εa values were enumerated, obtaining a sequence of integers, of which every single integer corresponded to one of the possible values of the tolerance threshold. A single specimen was then a vector of integers. Such a representation of the specimen led to a situation in which only a subset of all possible tolerance threshold vectors, limited by the population size and the mutation probability, was searched through. In other words, mutation was the only factor capable of introducing new tolerance threshold values, not previously drawn into the starting population, into the population. In our library, when searching for good tolerance threshold vectors, we assumed a binary representation of a single specimen. We used so-called block positional notation with standardization [16]. Every attribute was assigned the number of bits in the binary vector representing the specimen that was necessary to encode all numbers of tolerance thresholds acceptable for that attribute. Such a representation of a specimen makes it possible to consider every possible thresholds vector. In each case the user can choose one of various criteria of threshold optimality, applying the standard criterion given by Stepaniuk [54] or criteria adapted from rule quality measures. For a given decision table DT, the standard tolerance thresholds vector quality measure is defined in the following way:
(3)
where γ(d) = |P OS(d)| , P OS(d) is is a positive region [33] of the decision table |U| DT , Rd = {< x, y >∈ U × U: d(x) = d(y)}, RIA = {< x, y >∈ U × U: y ∈ IA (x)}, vSRI (X, Y ) is a standard rough inclusion [48]. We expect from tolerance thresholds vector that most of all objects from the same decision class (Rd ) will be admitted as similar (RIA ). The above mentioned function makes possible to find such tolerance thresholds that as many as possible objects of the same decision stay in a mutual relation, concurrently limiting to minimum cases in which the relation concerns objects of different decisions.
Since criterion (3) does not always allow one to obtain high classification accuracy and a small set of decision rules, other optimality criteria, adapted from rule quality measures, were also implemented in the TRS library. The rule quality measures allow us to evaluate a given rule, taking into account its accuracy and coverage [9,61]. The accuracy of a rule usually decreases when its coverage increases and vice versa; a similar dependence appears among the components of measure (3). Properly adapted rule quality measures may therefore be used to evaluate a tolerance thresholds vector [40]. For the table DT' and any tolerance thresholds vector ε = (εa1, εa2, ..., εan) the following matrix can be determined:

nRd RIA      nRd ¬RIA      nRd
n¬Rd RIA     n¬Rd ¬RIA     n¬Rd
nRIA         n¬RIA
where: nRd = nRd RIA + nRd ¬RIA is the number of object pairs with the same value of the decision attribute; n¬Rd = n¬Rd RIA + n¬Rd ¬RIA is the number of object pairs with different values of the decision attribute; nRIA = nRd RIA + n¬Rd RIA is the number of object pairs staying in the relation τA; n¬RIA = nRd ¬RIA + n¬Rd ¬RIA is the number of object pairs not staying in the relation τA; nRd RIA is the number of object pairs with the same value of the decision attribute that stay in the relation τA. The values n¬Rd RIA, nRd ¬RIA and n¬Rd ¬RIA are defined analogously to nRd RIA. Two rule quality measures (WS and J-measure [53]) adapted to the evaluation of a tolerance thresholds vector are presented below:

qWS(ε) = (nRd RIA / nRd) · w + (nRd RIA / nRIA) · (1 − w),   w ∈ (0, 1);

qJ-measure(ε) = (1/|U'|) · [ nRd RIA · ln( (nRd RIA · |U'|) / (nRd · nRIA) ) + nRd ¬RIA · ln( (nRd ¬RIA · |U'|) / (nRd · n¬RIA) ) ].
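Both adapted measures can be computed directly from the four cells of the matrix above. The helper below is a sketch under our own naming (it assumes the marginal counts are positive and treats 0·ln(0) as 0); it is not part of the TRS API.

```python
import math

def ws_measure(n_rd_ria, n_rd_nria, n_nrd_ria, n_nrd_nria, w=0.5):
    """Adapted WS measure for a tolerance thresholds vector."""
    n_rd = n_rd_ria + n_rd_nria      # pairs with equal decisions
    n_ria = n_rd_ria + n_nrd_ria     # pairs in the tolerance relation
    return (n_rd_ria / n_rd) * w + (n_rd_ria / n_ria) * (1 - w)

def j_measure(n_rd_ria, n_rd_nria, n_nrd_ria, n_nrd_nria):
    """Adapted J-measure computed over all object pairs of DT'."""
    n = n_rd_ria + n_rd_nria + n_nrd_ria + n_nrd_nria   # |U'|
    n_rd = n_rd_ria + n_rd_nria
    n_ria = n_rd_ria + n_nrd_ria
    n_nria = n_rd_nria + n_nrd_nria

    def term(count, col_total):
        # one summand of the J-measure; 0 * ln(0) is taken as 0
        return count * math.log(count * n / (n_rd * col_total)) if count else 0.0

    return (term(n_rd_ria, n_ria) + term(n_rd_nria, n_nria)) / n
```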
Modified versions of the following measures are also available in the TRS library: Brazdil, Gain, Cohen, Coleman, IKIB and Chi-2. Analytic formulas for these measures can be found, among others, in [2,3,9,40,61].

The other algorithm implemented in the TRS library searches for a vector with an identical tolerance threshold value for each attribute. The searched vector has the form (ε, ..., ε). The initial (minimal) value of ε is 0 and the final (maximal) value is ε = 1. Additionally, a parameter k is defined by which ε is increased in each iteration (usually k = 0.1). For each vector, from (0, ..., 0) to (1, ..., 1), the tolerance thresholds vector quality measure is calculated (it
is possible to use the same evaluation measures as for the genetic algorithm). The vector with the highest evaluation is admitted as the optimal one.
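A minimal sketch of this uniform-threshold scan, assuming a generic `evaluate` function (for instance the standard criterion or one of the adapted measures above); the names are our own and not TRS functions:

```python
def search_uniform_threshold(evaluate, attributes, k=0.1):
    """Scan vectors (eps, ..., eps) for eps = 0, k, 2k, ..., 1 and keep the best one."""
    best_vector, best_score = None, float("-inf")
    steps = int(round(1.0 / k))
    for i in range(steps + 1):
        eps = i * k
        vector = {a: eps for a in attributes}     # identical threshold for every attribute
        score = evaluate(vector)
        if score > best_score:
            best_vector, best_score = vector, score
    return best_vector, best_score
```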
2.2 Decision Rules
The TRS library generates decision rules of the following form (4):

IF a1 ∈ Va1 and ... and ak ∈ Vak THEN d = vd,
(4)
where {a1, ..., ak} ⊆ A, d is the decision attribute, Vai ⊆ Dai and vd ∈ Dd. An expression a ∈ V is called a conditional descriptor. For fixed distance functions and tolerance thresholds it is possible to determine the set of all minimal decision rules [33,47]. Determining minimal decision rules is equivalent to finding object-related relative reducts [47]. Finding object-related relative reducts is one of the rule determination methods in the TRS library; in this case decision rules are generated on the basis of the discernibility matrix [47]. There are three algorithm variants: all rules (from each object all relative reducts are generated, and from each reduct one minimal decision rule is obtained); one rule (from each object only one relative reduct, the quasi-shortest one, is generated and used for rule creation; the reduct is determined by a modified Johnson algorithm [27]); and a variant in which a user-given number of rules is generated from each object by a randomized algorithm [27].

Rules determined from object-related relative reducts have reasonably good classification abilities, but many of them have poor descriptive features. This happens because minimal decision rules obtained from object-related relative reducts express local regularities; only a few of them reflect general (global) regularities in the data. For this reason, among others, certain heuristic approaches [17,18,19,28,30,57,58,60] or further generalizations of rough sets [50] are used in practice. The heuristic rule induction algorithm RMatrix, which exploits both the information included in the discernibility matrix [47] and rule quality measure values, was implemented in the TRS library.

Definition 1. Generalized discernibility matrix modulo d. Let DT = (U, A ∪ {d}) be a decision table, where U = {x1, x2, ..., xn}. The generalized discernibility matrix modulo d for the table DT is the square matrix Md(DT) = {ci,j : 1 ≤ i, j ≤ n} whose elements are defined as follows:

cij = {a ∈ A : (⟨xi, xj⟩ ∉ τa(εa)) ∧ (d(xi) ≠ d(xj))},   and   cij = ∅ if d(xi) = d(xj).
Each object xi ∈ U corresponds to the i-th row (and i-th column) of the discernibility matrix. The attribute that appears most frequently in the i-th row discerns, in the best way, the objects whose decisions differ from the decision of xi.
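As an illustration, the generalized discernibility matrix modulo d of Definition 1 could be built as sketched below (a Python sketch with assumed helper names; in the tolerance model ⟨xi, xj⟩ ∈ τa(εa) iff δa(a(xi), a(xj)) ≤ εa). The frequency helper reflects how RMatrix orders attributes.

```python
def discernibility_matrix_modulo_d(U, A, d, delta, eps):
    """Return M_d(DT) as a dict keyed by (i, j); each entry c_ij is a set of attributes."""
    n = len(U)
    M = {}
    for i in range(n):
        for j in range(n):
            if d(U[i]) == d(U[j]):
                M[(i, j)] = set()                       # c_ij = empty set when decisions agree
            else:
                M[(i, j)] = {a for a in A
                             if delta[a](U[i][a], U[j][a]) > eps[a]}  # <x_i, x_j> not in tau_a(eps_a)
    return M

def attribute_frequencies(M, i, n):
    """How often each attribute appears in the i-th row (used to order attributes in RMatrix)."""
    freq = {}
    for j in range(n):
        for a in M[(i, j)]:
            freq[a] = freq.get(a, 0) + 1
    return freq
```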
RMatrix algorithm
input: DT = (U, A ∪ {d}); the tolerance thresholds vector (εa1, εa2, ..., εam); q – a quality evaluation measure; x – the object acting as rule generator; an order of conditional attributes (ai1, ai2, ..., aim) such that the attribute appearing most frequently in the row cx comes first (the attribute appearing most rarely comes last)
begin
  create the rule r which has the decision descriptor d = d(x) only;
  rbest := r;
  for every j := 1, ..., m
    add the descriptor aij ∈ Vaij to the conditional part of the rule r
      (where Vaij = {aij(y) ∈ Daij : y ∈ Iaij(x)});
    if q(r) > q(rbest) then rbest := r;
  return rbest
end

The RMatrix algorithm generates one rule from each object x ∈ U. Adding a succeeding descriptor makes the rule more accurate and simultaneously limits its coverage. The rule quality measure ensures that the output rule is not too closely fitted to the training data. Using the algorithm, it is possible to define one rule for each object or to define only those rules which are sufficient to cover the training set. Considering the number of rules obtained, it is best to generate all the rules and then, starting with the best ones, construct a coverage of each decision class. The computational complexity of determining one rule by the RMatrix algorithm is O(m²n), where m = |A|, n = |U|. A detailed description of this algorithm can be found in [40,42]. Both the RMatrix algorithm and the methods of determining rules from reducts require information about tolerance threshold values.

The MODLEM algorithm was also implemented in the TRS library. MODLEM was proposed by Stefanowski [51,52] as a generalization of the LEM2 algorithm [18]. The author's intention was to obtain strong decision rules from numeric data without prior discretization. A short description of how MODLEM works follows. For each conditional attribute, successive values of the attribute appearing in the currently examined set of objects Ub (initially the whole training set U), sorted in non-decreasing order, are tested in search of a limit point ga. A limit point lies in the middle between two successive values of the attribute a (for example va < ga < wa) and divides the current range of values of a into two parts (values greater than ga and values smaller than ga). Such a division also splits the current set of training objects Ub into two subsets U1 and U2. Of course, only those values ga which lie between two values of the attribute a characterizing objects from different decision classes are considered as limit points. The optimal limit point is the one that minimizes the conditional entropy

(|U1|/|Ub|)·Entr(U1) + (|U2|/|Ub|)·Entr(U2),

where Entr(Ui) denotes the entropy of the set Ui.
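A minimal sketch of this limit-point search is given below. It is our own illustration of the step just described, not the TRS implementation; objects are dicts, and the candidate test (adjacent sorted objects with different values and decisions) is a simplification of the condition stated above.

```python
import math
from collections import Counter

def entropy(objects, d):
    """Shannon entropy of the decision distribution in a set of objects."""
    counts = Counter(d(x) for x in objects)
    total = len(objects)
    return -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0

def best_limit_point(Ub, a, d):
    """Find the limit point g_a of attribute a minimizing conditional entropy on Ub."""
    objs = sorted(Ub, key=lambda x: x[a])
    best_g, best_h = None, float("inf")
    for left, right in zip(objs, objs[1:]):
        if left[a] == right[a] or d(left) == d(right):
            continue                        # candidates must separate different values and classes
        g = (left[a] + right[a]) / 2.0      # midpoint between two successive attribute values
        U1 = [x for x in objs if x[a] < g]
        U2 = [x for x in objs if x[a] > g]
        h = (len(U1) / len(objs)) * entropy(U1, d) + (len(U2) / len(objs)) * entropy(U2, d)
        if h < best_h:
            best_g, best_h = g, h
    return best_g, best_h
```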
As the conditional descriptor we choose the interval (out of the two) for which the corresponding set U1 or U2 contains more objects from the decision class pointed to by the rule. The created descriptor limits the set of objects examined in the later steps of the algorithm. The descriptor is added to those created earlier and, together with them (as a conjunction of conditions), forms the conditional part of the rule. As can easily be noticed, if two limit points are chosen for an attribute a in successive steps of the algorithm, we can obtain a descriptor of the form [va1, va2]. If there is one such limit point, the descriptor has the nature of an inequality, and if a given attribute generates no limit point, then the attribute does not appear in the conditional part of the rule. The process of creating the conditional part of the rule finishes when the set of objects Ub is included in the lower (or upper) approximation of the decision class being described. In other words, the process of creating the rule finishes when the rule is accurate, or as accurate as is possible in the training set. The algorithm creates a coverage of a given decision class; hence, after a rule is generated, all objects supporting it are deleted from the training set, and the algorithm is applied to the remaining objects of the class being described.

The stop criterion, which demands that the objects recognized by the conditional part of the rule be included in the proper approximation of a given decision class, creates two unfavorable situations. Firstly, the process of creating rules takes longer than creating a decision tree. Secondly, the number of rules obtained, though smaller than for induction methods based on the rough set methodology or the LEM2 algorithm, is still relatively large. Moreover, the rules tend to fit the data too closely, and hence we deal with the phenomenon of overlearning. The simplest way of dealing with the problem is to arbitrarily establish a minimal accuracy of the rule; after reaching this accuracy the algorithm stops. However, as numerous studies show (among others [2,3,9,40,45,52]), it is better to use rule quality measures, which try to estimate simultaneously the two most important rule features – accuracy and coverage.

We apply rule evaluation measures in the version of the MODLEM algorithm implemented in TRS. We do not interfere with the algorithm itself; only after adding (or modifying) a successive conditional descriptor do we evaluate the current form of the rule, remembering as the output rule the one with the best evaluation (not necessarily the one which fulfills the stop conditions). As shown in [41,45], such an approach makes it possible to significantly reduce the number of rules being generated. Rules generated in this way are characterized by smaller accuracy (about 70–100%) but are definitely more general, and they also have good classifying features. During the experiments we observed that the quality of the rule being built increases until it reaches a certain maximal value and then decreases (the quality function has one maximum). This observation allowed us to modify the stop criterion in the MODLEM algorithm so as to stop the creation of the rule's conditional
part at the moment when the value of the quality measure used in the algorithm begins to decrease. This modification limits the number of rules being generated and makes the algorithm work faster.
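The modified stop criterion can be sketched as a simple wrapper around the rule-growing loop; `grow_one_descriptor`, `quality` and `can_grow` are assumed placeholders for the MODLEM descriptor step, a rule quality measure and the original stop condition, not TRS functions:

```python
def grow_rule_with_quality_stop(rule, grow_one_descriptor, quality, can_grow):
    """Add descriptors while the rule quality keeps improving; stop at the first decrease."""
    best_rule, best_q = rule, quality(rule)
    while can_grow(rule):
        rule = grow_one_descriptor(rule)        # add or refine one conditional descriptor
        q = quality(rule)
        if q < best_q:                          # the quality has a single maximum: stop here
            break
        best_rule, best_q = rule, q
    return best_rule
```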
2.3 Rules Generalization and Filtration
Independently of the rule generation method (whether all minimal rules are obtained or heuristic algorithms are used), the set of generated rules can still be large, which decreases its descriptive abilities. The TRS library implements several algorithms responsible for postprocessing the generated rule set. The main goal of postprocessing is to limit the number of decision rules (in other words, to increase their descriptive abilities) while simultaneously keeping their good classification quality. The TRS library realizes postprocessing in two ways: rules generalization (rules shortening and rules joining) and rules filtration (rules that are not needed in view of a certain criterion are removed from the final rule set).

Rules shortening

Shortening a decision rule consists in removing some conditional descriptors from the conditional part of the rule [5]. Every unshortened decision rule has an assigned quality measure value. The shortening process continues until the quality of the shortened rule decreases below a defined threshold. A rule shortening algorithm was applied, among others, in [4], where increasing the classification accuracy of minimal decision rule induction algorithms was the main purpose of the shortening. The authors propose to create a hierarchy of classifiers, in which a classifier composed of minimal rules is placed on the lowest level and classifiers composed of rules shortened (by 5%, 10%, etc. with relation to their original accuracy) are located on higher levels. The classification process runs from the lowest level (the accurate rules) to the highest. This proposition enables an increase of classification accuracy, which was probably the authors' main intention. The authors did not consider the descriptive power of the obtained rule set, which undoubtedly worsens significantly in this case because of the increased number of rules taking part in the classification (theoretically, one could consider some method of filtering the determined rules). In the TRS library the standard, non-hierarchical shortening algorithm was implemented; the quality threshold is defined for each decision class separately. The order of descriptor removal is set in accordance with a hill climbing strategy. The computational complexity of shortening a rule composed of l conditional descriptors is O(l²mn), where m = |A|, n = |U|.

Rules joining

Rules joining consists in obtaining one more general rule from two (or more) less general ones [24,26,35,40,43]. The joining algorithm implemented in the TRS library is based on the following assumptions: only rules from the same decision class can be joined, and two rules can
be joined if their conditional parts are built from the same conditional attributes, or if the set of conditional attributes of one rule is a subset of the set of conditional attributes of the second one. The joining process consists in joining the sets of values of corresponding conditional descriptors. If the conditional descriptor (a, Va1) occurs in the conditional part of the rule φ1 → ψ and the descriptor (a, Va2) occurs in the conditional part of the rule φ2 → ψ, then, as the result of joining, the final conditional descriptor (a, Va) has the following properties: Va1 ⊆ Va and Va2 ⊆ Va. After joining the descriptors, the rule representation language does not change; that is, depending on the type of the conditional attribute a, the set Va of values of the descriptor created by joining is defined in the following way:

– if the attribute is of the symbolic type and joining concerns descriptors (a, Va1) and (a, Va2), then Va = Va1 ∪ Va2,
– if the attribute a is of the numeric type and joining concerns descriptors (a, [va1, va2]) and (a, [va3, va4]), then Va = [vmin, vmax], where vmin = min{vai : i = 1, 3} and vmax = max{vai : i = 2, 4}.

Therefore, a situation in which the conditional descriptor obtained after joining has the form (a, [va1, va2] ∨ [va3, va4]) for va2 < va3 is impossible. In other words, so-called internal alternatives cannot appear in descriptors.

The other assumptions concern the control over the rules joining process. They can be presented as follows:

– Of two rules r1 and r2 with qr1 > qr2, the rule r1 is chosen as the "base" from which a new joined rule will be created. The notation qr1 denotes the value of the quality measure of the rule r1;
– Conditional descriptors are joined sequentially; the order of descriptor joining depends on the value of the measure which evaluates the newly created rule r (qr). The hill climbing strategy is used to select the descriptor which is best for joining;
– The joining process finishes when the new rule r recognizes all positive training examples recognized by the rules r1 and r2;
– If r is the joined rule and qr ≥ λ, then the rule r is inserted into the decision class description in place of the rules r1 and r2; otherwise the rules r1 and r2 cannot be joined. The parameter λ defines the minimal rule quality value. In particular, before beginning the joining process one can create a table which contains the "initial" values of the rule quality measure and afterwards use it as the table of threshold values for rules joining; then, for all joined rules r1 and r2, λ = max{qr1, qr2}.

The computational complexity of joining two rules r1 and r2 (with qr1 > qr2 and l descriptors in the rule r1) is O(l²mn), where m = |A|, n = |U|. A detailed description of this algorithm can be found in [43].

Another joining algorithm was proposed by Mikołajczyk [26]. In that algorithm, rules from each decision class are grouped with respect to the similarity of their
structure. Due to the computational complexity of the algorithm, an iterative approach was proposed in [24] (rules are joined in pairs, beginning with the most similar rules). In contrast to the joining algorithm presented in this paper, rules built on a certain number of different attributes (i.e. occurring in one rule and not in the other) can be joined. The ranges of descriptor values in a joined rule are the set union of the joined rules' ranges. The algorithm does not verify the quality of joined rules. Initially [26] the algorithm was described for rules with descriptors of the form attribute = value; later [24] the occurrence of a set of values in a descriptor was admitted. In each approach a value can be a symbolic value or the code of some discretization interval (for numerical attributes). Thus, in the case of numerical data, the algorithm admits introducing so-called internal alternatives into the description of a joined rule, because the description may contain the codes of two intervals that do not lie side by side.

The joining algorithm proposed by Pindur, Susmaga and Stefanowski [35] operates on rules determined by the dominance-based rough set model. Rules that lie close to each other are joined if joining them causes no deterioration of the accuracy of the obtained rule. In the resulting joined rule, a new descriptor appears which is a linear combination of the already existing descriptors. In this way a rule can describe a solid different from a hypercube in the feature space. Therefore, it is possible to describe the space occupied by examples from a given decision class by means of fewer rules.

Rules filtration

Let us assume that a set of rules determined from a decision table DT is available; such a set will be denoted RULDT. An object x ∈ U recognizes a rule of the form (4) if and only if ai(x) ∈ Vai for all i ∈ {1, ..., k}. If the rule is recognized by the object x and d(x) = vd, then the object x supports the rule of the form (4). The set of objects from the table DT which recognize the rule r will be denoted matchDT(r).

Definition 2. Description of a decision class. If DT = (U, A ∪ {d}) is any decision table, v ∈ Dd and Xv = {x ∈ U : d(x) = v} is a decision class, then each set of rules RULvDT ⊂ RULDT satisfying the following conditions:

1. if r ∈ RULvDT, then r has the form φ → (d, v),
2. ∀x ∈ U: d(x) = v ⇒ ∃r ∈ RULvDT such that x ∈ matchDT(r),

is called a description of the decision class Xv.
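For illustration, the two conditions of Definition 2 can be checked mechanically. The sketch below assumes rules are represented as (conditions, decision_value) pairs, with conditions mapping attributes to sets of admissible values (our own representation, not the TRS one):

```python
def matches(rule_conditions, x):
    """x recognizes the rule iff every conditional descriptor a in V_a is satisfied."""
    return all(x[a] in values for a, values in rule_conditions.items())

def is_class_description(rules, U, d, v):
    """Check Definition 2 for a candidate description of the decision class X_v."""
    # Condition 1: every rule points at the decision value v.
    if any(decision != v for _, decision in rules):
        return False
    # Condition 2: every object of the class is recognized by at least one rule.
    return all(any(matches(cond, x) for cond, _ in rules)
               for x in U if d(x) == v)
```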
Each of the rule induction algorithms presented above creates so-called descriptions of decision classes which are compatible with Definition 2. One usually distinguishes minimal, satisfying, and complete descriptions. Rejecting any rule from a minimal description of the decision class Xv causes the second condition of Definition 2 to no longer be satisfied. This means that after removing any rule from the minimal description, there exists at least one object x ∈ Xv that is recognized by none of the rules remaining in the description.
The complete description is a rule set which includes all decision rules that can be determined from a given decision table. Sometimes the definition of the complete description is constrained to the set of all minimal decision rules, i.e. to rules determined from object-related relative reducts. The satisfying description is defined as an intermediate description between the minimal and the complete description. A satisfying description can be obtained by determining some number of rules from the training table (for example, a subset of the minimal decision rule set) or by taking a subset of a rule set that forms another description (for example, the complete description) as the description of the decision class. In particular, the RMatrix and MODLEM algorithms produce satisfying descriptions.

Opinions about the quality of the individual types of descriptions differ. From the point of view of object classification, the complete description can seem to be the best for classifying test objects: a large number of rules preserves great possibilities of recognizing them. However, for complete descriptions we often encounter an overlearning effect which, in the case of uncertain data, leads to worse classification results. From the knowledge discovery in databases point of view, the complete description is usually useless, since a large number of rules is, in practice, impossible to interpret. Besides, the complete description certainly contains a large number of redundant rules and rules excessively matched to the training data. Satisfying and minimal descriptions have the widest application in data exploration tasks. But in constructing a minimal or quasi-minimal description we employ heuristic solutions which can cause some dependences in the data that are interesting for a user to go unnoticed.

Filtration of the rule set can be a solution to the problem mentioned above. Having a rule set which is too large from the point of view of its interpretability, we can remove from it those rules which are redundant with respect to some criterion [2,39,40]. The criterion is usually a quality measure that takes into account the accuracy and coverage of a rule [3,9], and also its length, strength [60], and so on. Two kinds of filtration algorithms are implemented in the TRS library: those considering rule quality only, and those considering rule quality together with the classification accuracy of the filtered rule set. The first approach is represented by the Minimal quality and From coverage algorithms. The Minimal quality algorithm removes from decision class descriptions those rules whose quality is worse than a rule acceptance threshold fixed for a given decision class. The algorithm gives no guarantee that the resulting rule set will fulfill the conditions of Definition 2. The first step of the From coverage algorithm is to determine a ranking of rule qualities; then a coverage of the training set is built, starting from the best rule. Subsequent rules are added according to their position in the ranking. When the rule set covers all training examples, all remaining rules are rejected.
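A sketch of the From coverage filtration described above, using the same assumed rule representation as before; skipping rules that cover no new example is our own simplification, and none of this is TRS code:

```python
def from_coverage_filter(rules, U, quality):
    """Keep the best-ranked rules until every training example is covered.

    rules   -- list of (conditions, decision) pairs; conditions maps attribute -> admissible values
    quality -- function assigning a quality value to a rule
    """
    def matches(cond, x):
        return all(x[a] in vals for a, vals in cond.items())

    ranked = sorted(rules, key=quality, reverse=True)   # rule quality ranking, best first
    kept, uncovered = [], list(U)
    for rule in ranked:
        if not uncovered:
            break                                       # all training examples are covered
        cond, _ = rule
        newly_covered = [x for x in uncovered if matches(cond, x)]
        if newly_covered:                               # skip rules that add no coverage (simplification)
            kept.append(rule)
            uncovered = [x for x in uncovered if not matches(cond, x)]
    return kept
```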
The second approach is represented by two algorithms: Forward and Backwards. Both of them, besides the ranking of rule qualities, take into consideration the classification accuracy of the whole rule set. To guarantee the independence of the filtration result, a separate tuning set of examples is applied. In the case of the Forward algorithm, the initial description of each decision class contains only one decision rule – the best one. Then single rules are added to each decision class description. If the accuracy for the decision class increases, the added rule remains in the description; otherwise the next decision rule is considered. The process of adding rules to a decision class description stops when the obtained rule set has the same classification accuracy as the initial one, or when all rules have been considered. The Backwards algorithm is based on the opposite conception: the weakest rules are removed from each decision class description. Keeping the difference between the most accurate and the least accurate decision class guarantees that the filtered rule set keeps the same sensitivity as the initial one. The Forward and Backwards algorithms, like the Minimal quality algorithm, give no guarantee that the filtered rule set will fulfill Definition 2. Both filtration algorithms give rules that are more specific (in the light of the training set), but contribute to raising the classification accuracy on the tuning set, a chance of being kept in the filtered set.

To estimate the computational complexities of the presented algorithms we use the following notation: L is the number of rules subject to filtration, m = |A| is the number of conditional attributes, and n = |U| is the number of objects in the training table. It is readily noticeable that the computational complexity of the Minimal quality algorithm is O(Lmn). The From coverage algorithm requires, as in the previous case, the determination of the quality of all rules (O(Lnm)), then their sorting (O(L log L)) and checking which examples from the training set support the successively considered rules. In the extreme case, such verification must be performed L times (O(Ln)). To recapitulate, the computational complexity of the whole algorithm is of order O(Lnm). The complexity of the Forward and Backwards algorithms depends on the chosen classification method. If we assume that the whole training set is the tuning set and we use the classification algorithm presented in the next section, then the complexity analysis runs as follows. To prepare the classification, the values of the rule quality measure must be determined for each rule (O(Lnm)) and the rules sorted (O(L log L)). After adding a new rule to (or removing one from) a decision class description, the classification process is carried out, which requires checking which rules support each training example (O(Lmn)). In the extreme case we add (remove) rules L − 1 times, and the classification must be carried out just as many times. Hence, the computational complexity of the Forward and Backwards filtration algorithms is of order O(L²nm). All of the mentioned filtration algorithms are described in [40].

In the literature, besides the quality-based filtering approach, a genetic algorithm methodology for limiting the rule set [1] is also met. A population consists of specimens that
are rule classifiers. Each specimen consists of a rule set that is a subset of the input rule set. The fitness function of a specimen is the classification accuracy obtained on the tuning set. It is also possible to apply a function which is a weighted sum of the classification accuracy and the number of rules forming the classifier. In [1,2] a quality-based filtration algorithm is also described. That algorithm likewise applies a rule ranking established by a selected quality measure, but filtration does not take place for each decision class separately.
2.4 Classification
Classification is considered as a process of assigning objects to the corresponding decision classes. The TRS library uses a "voting mechanism" to perform the classification of objects. Each decision rule has an assigned confidence grade (simply, the value of its rule quality measure). The classification process in the TRS library consists in summing up the confidence grades of all rules from each decision class that recognize the test object (5). The test object is assigned to the decision class with the highest value of this sum. Sometimes an object is not recognized by any rule from the given decision class descriptions. In this case, it is possible to calculate a distance between the object and a rule and to admit that rules close enough to the object recognize it.
conf(Xv, u) = Σ_{r ∈ RULXv(DT), dist(r,u) ≤ ε} (1 − dist(r, u)) · q(r).     (5)
In formula (5), dist(r, u) is a distance between the test object u and the rule r (Euclidean or Hamming), ε is the maximal acceptable distance between the object and the rule (in particular, when ε = 0 the classification is performed only by rules that exactly recognize the test object), and q(r) is the voting strength of the rule.
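A sketch of this voting scheme (the rule representation, the distance function and all names are our own assumptions for illustration):

```python
def classify(u, rules_by_class, quality, dist, eps=0.0):
    """Assign u to the class whose rules give the highest summed confidence (formula (5))."""
    best_class, best_conf = None, float("-inf")
    for decision_class, rules in rules_by_class.items():
        conf = sum((1.0 - dist(r, u)) * quality(r)
                   for r in rules if dist(r, u) <= eps)   # only rules close enough vote
        if conf > best_conf:
            best_class, best_conf = decision_class, conf
    return best_class
```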
2.5 Rules Quality Measures
All of the algorithms mentioned above exploit measures that decide either about the form of the determined rule or about which of the already determined rules may be removed or generalized. These measures are called rule quality measures, and their main goal is to steer the induction and/or reduction processes so that the output rule set contains rules of the best quality. The values of the most frequently applied rule quality measures [9,20,45,61] can be determined from a contingency table that describes the behavior of a rule with relation to the training set. The contingency table for the rule r ≡ (ϕ → ψ) is defined in the following way:

nϕψ      nϕ¬ψ      nϕ
n¬ϕψ     n¬ϕ¬ψ     n¬ϕ
nψ       n¬ψ
where: nϕ = nϕψ + nϕ¬ψ = |Uϕ| is the number of objects that recognize the rule ϕ → ψ; n¬ϕ = n¬ϕψ + n¬ϕ¬ψ = |U¬ϕ| is the number of objects that do not recognize the rule ϕ → ψ; nψ = nϕψ + n¬ϕψ = |Uψ| is the number of objects that belong to the decision class described by the rule ϕ → ψ; n¬ψ = nϕ¬ψ + n¬ϕ¬ψ = |U¬ψ| is the number of objects that do not belong to the decision class described by the rule ϕ → ψ; nϕψ = |Uϕ ∩ Uψ| is the number of objects that support the rule ϕ → ψ; nϕ¬ψ = |Uϕ ∩ U¬ψ|; n¬ϕψ = |U¬ϕ ∩ Uψ|; n¬ϕ¬ψ = |U¬ϕ ∩ U¬ψ|.

Using the information included in the contingency table and the fact that, for a known decision rule ϕ → ψ, the values |Uψ| and |U¬ψ| are known, it is possible to determine the value of each measure from nϕψ and nϕ¬ψ alone. It can also be observed that for any rule ϕ → ψ the inequalities 1 ≤ nϕψ ≤ |Uψ| and 0 ≤ nϕ¬ψ ≤ |U¬ψ| hold. Hence, a quality measure is a function of two variables, q(ϕ → ψ) : {1, ..., |Uψ|} × {0, ..., |U¬ψ|} → R.

Two basic quality measures are the accuracy (denoted by qacc(ϕ → ψ) = nϕψ/nϕ) and the coverage (denoted by qcov(ϕ → ψ) = nϕψ/nψ) of a rule. According to the enumerative induction principle, rules with high accuracy and coverage reflect real dependences, and these dependences hold also for objects from outside the analyzed data set. It is easy to show that as accuracy increases, rule coverage decreases. Therefore, attempts are made to define quality measures that respect accuracy and coverage of a rule simultaneously. Empirical research on the generalization abilities of the obtained classifiers, depending on the rule quality measure used during rule induction, has been carried out [3]. The influence of the applied quality measure on the number of discovered rules has also been considered [45,52]. This is of special weight in the context of knowledge discovery, since a user is usually interested in discovering a model that can be interpreted, or intent on finding several rules that describe the most important dependences. In the quoted research, some quality measures achieved good results both in classification accuracy and in the size of the classifiers (number of rules). These measures are: WS proposed by Michalski, C2 proposed by Bruha [9], the IREP measure [12], and a suitably adapted Gain measure used in decision tree induction [37].

The WS measure respects rule accuracy as well as rule coverage:

qWS(ϕ → ψ) = qacc(ϕ → ψ) · w1 + qcov(ϕ → ψ) · w2,   w1, w2 ∈ [0, 1].     (6)
In the rule induction system YAILS the values of the parameters w1, w2 for the rule ϕ → ψ are calculated as w1,2 = 0.5 ± 0.25 · qacc(ϕ → ψ). The measure is monotone with respect to each of the variables nϕψ and nϕ¬ψ, and takes values from the interval [0, 1].

The measure C2 is described by the formula:

qC2(ϕ → ψ) = ((n · qacc(ϕ → ψ) − nψ) / (n − nψ)) · ((1 + qcov(ϕ → ψ)) / 2).     (7)

The first component of the product in formula (7) is a separate measure known as the Coleman measure. This measure evaluates the dependence between the events "the object u recognizes the rule" and "the object u belongs to
the decision class described by the rule". The modification proposed by Bruha [9] (the second component of formula (7)) reflects the fact that the Coleman measure puts too little emphasis on rule coverage; therefore, applying the Coleman measure in the induction process leads to a large number of rules [40]. The measure C2 is monotone with respect to the variables nϕψ and nϕ¬ψ, its range is the interval (−∞, 1], and for a fixed rule the measure attains its minimum if nϕψ = 1 and nϕ¬ψ = n¬ψ.

The IREP program uses for rule evaluation a measure of the following form:

qIREP(ϕ → ψ) = (nϕψ + n¬ψ − nϕ¬ψ) / n.     (8)
The value of the measure depends on both the accuracy and the coverage of the evaluated rule. If the rule is accurate, then its coverage is evaluated. If the rule is approximate, the number nϕ¬ψ prevents two rules with the same coverage but different accuracy from getting the same evaluation. The function is monotone with respect to the variables nϕψ and nϕ¬ψ and takes values from the interval [0, 1].

The Gain measure has its origin in information theory. The measure was adapted to rule evaluation from decision tree methods (the so-called LimitedGain criterion):

qGain(ϕ → ψ) = Info(U) − Infoϕ→ψ(U).     (9)
In formula (9), Info(U) is the entropy of the training examples and Infoϕ→ψ(U) = (nϕ/|U|)·Info(ϕ → ψ) + ((|U| − nϕ)/|U|)·Info(¬(ϕ → ψ)), where Info(ϕ → ψ) is the entropy of the examples covered by the rule ϕ → ψ, and Info(¬(ϕ → ψ)) is the entropy of the examples not covered by the rule ϕ → ψ. The measure is not monotone with respect to the variables nϕψ and nϕ¬ψ, and takes values from the interval [0, 1]. If the accuracy of a rule is less than the accuracy of the decision class described by the rule (the accuracy resulting from the distribution of examples in the training set), then qGain is a decreasing function with respect to both variables nϕψ and nϕ¬ψ; otherwise qGain is an increasing function.

Apart from the presented measures, the following quality measures were implemented in the TRS library: Brazdil [8]; J-measure [53]; IKIB [21]; Cohen, Coleman and Chi-2 [9].

Recently, the so-called Bayesian confirmation measure (denoted by f) was proposed as an alternative for evaluating rule accuracy. In [10,15] a theoretical analysis of the measure f is presented, and it is shown, among others, that the measure is monotone with respect to rule accuracy (so, in the terminology adopted in this paper, with respect to the variable nϕψ). In standard notation the Bayesian confirmation measure is defined by the formula qf(ϕ → ψ) = (P(ϕ|ψ) − P(ϕ|¬ψ)) / (P(ϕ|ψ) + P(ϕ|¬ψ)), where P(ϕ|ψ) denotes the conditional probability that objects belonging to the set U and having the property ψ also have the property ϕ. It is easy to see that qf can also be written as follows:

qf(ϕ → ψ) = (n¬ψ · nϕψ − nψ · nϕ¬ψ) / (n¬ψ · nϕψ + nψ · nϕ¬ψ).     (10)
It can be noticed that the measure qf does not take the coverage of the evaluated rule into consideration; this can be observed most clearly for two rules with identical accuracy and different coverage. If rules r1 and r2 are accurate, then for both of them the equality nϕ¬ψ = 0 holds. Hence, the formula for the value of the measure qf reduces to 1 (a constant function). Then, independently of the number of objects that support the rules r1 and r2, the value of the measure qf is equal to one for both rules. Interest in the measure qf is justified by the fact that, besides rule accuracy, it takes into consideration the probability distribution of the training examples between the decision classes. Because of the above argumentation, for the selected quality measures implemented in the TRS library that use rule accuracy, replacing the accuracy with the measure qf can be proposed. In particular, such modifications were introduced in the WS and C2 measures [45].

The number of conditional descriptors occurring in a rule is an important rule property from the point of view of the possibility of interpreting the dependences which the given rule represents. Of course, the fewer the descriptors, the easier it is to understand the dependence presented by the given rule. Let us assume that, for the decision rule r, the set descr(r) contains all attributes that create conditional descriptors in this rule. A formula evaluating the rule with respect to the number of its conditional descriptors, qlength : RUL → [0, 1), was defined in the TRS library in the following way (11):

qlength(r) = 1 − |descr(r)| / |A|.     (11)

The bigger the value of qlength, the simpler the rule, that is, the fewer conditional descriptors it has. A rule can be evaluated taking into consideration both its quality and its length: qrule_quality_measure(r) · qlength(r). In particular, the formula qaccuracy(r) · qlength(r) allows the accuracy and the complexity of the rule ϕ → ψ to be evaluated simultaneously.
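For illustration, the contingency-table-based measures discussed in this section can be computed as sketched below. This is a hedged sketch in our own notation (n_pp = nϕψ, n_pn = nϕ¬ψ, n_np = n¬ϕψ, n_nn = n¬ϕ¬ψ); the Coleman/C2 line follows the reconstruction of formula (7) above, and none of this is the TRS implementation.

```python
def quality_measures(n_pp, n_pn, n_np, n_nn, w1=0.5, w2=0.5):
    """Selected rule quality measures computed from the contingency table of a rule."""
    n_phi = n_pp + n_pn          # objects recognizing the rule
    n_psi = n_pp + n_np          # objects of the described decision class
    n_npsi = n_pn + n_nn         # objects outside the described class
    n = n_phi + n_np + n_nn      # all objects

    acc = n_pp / n_phi                                   # q_acc
    cov = n_pp / n_psi                                   # q_cov
    ws = acc * w1 + cov * w2                             # formula (6)
    coleman = (n * acc - n_psi) / (n - n_psi)            # first component of (7)
    c2 = coleman * (1 + cov) / 2                         # formula (7)
    irep = (n_pp + n_npsi - n_pn) / n                    # formula (8)
    f = ((n_npsi * n_pp - n_psi * n_pn) /
         (n_npsi * n_pp + n_psi * n_pn))                 # formula (10)
    return {"acc": acc, "cov": cov, "WS": ws, "C2": c2, "IREP": irep, "f": f}
```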
3 Selected Results Obtained by Algorithms Included in the TRS Library
To present the efficiency of the methods included in the TRS library, selected results obtained on various benchmark data sets are presented below. The results were obtained using the 10-fold cross-validation testing methodology (except for the Monks and Satimage data sets). The results were rounded to natural numbers. Since all of the applied data sets are generally known, we do not give their characteristics.
3.1 Searching Tolerance Thresholds
The results of applying the genetic algorithm and various quality measures to determine tolerance threshold values are presented in Table 1. For the
standard measure, two values of the weight w (w = 0.5, w = 0.75) were considered; the better of the two results is presented in the table. The Coleman, IKIB, J-measure, Cohen, Gain, Chi-2, IREP and WS measures were also considered. Besides the results for the standard measure, the best result among the quoted measures is also presented in Table 1. After establishing the tolerance threshold values, rules were determined from the object-related quasi-shortest relative reducts. Classification was performed by exactly matching rules; hence Table 1 contains, apart from the classification accuracy (Accuracy) and the number of generated rules (Rules), also the degree of recognition of test objects (Match). Unrecognized objects were counted as wrongly classified.

Table 1. Searching tolerance thresholds

Dataset                  Quality measure   Accuracy (%)   Rules   Match
Australian               Std               60             564     1.00
                         Chi-2             86             429     1.00
Breast                   Std               67             143     1.00
                         J-measure         72             59      1.00
Iris                     Std               95             59      1.00
                         J-measure         95             41      1.00
Iris + discretization    Std               97             9       1.00
                         J-measure         97             9       1.00
Glass                    Std               27             178     0.43
                         J-measure         61             185     0.90
Heart                    Std               65             193     0.92
                         J-measure         78             182     1.00
Lymphography             Std               72             57      1.00
                         Coleman           80             66      1.00
Monk I                   Std               100            10      1.00
                         Coleman           97             17      1.00
Monk II                  Std               73             54      1.00
                         Gain              79             64      1.00
Monk III                 Std               95             35      1.00
                         J-measure         97             12      1.00
Pima                     Std               51             597     0.73
                         J-measure         75             675     1.00
Satimage                 Std               76             437     0.95
                         Gain              75             717     1.00
The obtained results are consistent with the results presented in earlier papers [39,40]. The standard measure can yield good classification accuracy, but a proper value of the weight occurring in the measure must be established. With regard to classification quality, the adapted rule quality measures, especially the J-measure, perform better. The J-measure leads to tolerance thresholds with higher values than the standard measure; for that reason the obtained rules are somewhat less accurate but more general. A similar
situation was observed for the Gain, Chi-2 and Cohen measures. In some cases other measures turned out to be better, in particular the Coleman measure, which puts greater emphasis on consistency in a decision table (as do the IKIB and Gain measures). This shows clearly on the Monk I set, for which the measure makes it possible to obtain maximal classification accuracy. As is known, this set is easily classifiable and the classification can be done by accurate rules. Probably, by searching more values of the weight w for the standard measure, the same result would be obtained. To recapitulate, the adapted quality measures evaluate tolerance thresholds vectors better than the standard one: better accuracy and fewer rules are obtained. Applying adapted quality measures leads to the induction of approximate rules; however, in the case of noisy data, approximate rules reflect the dependencies in the set better than accurate ones and classify unknown objects better. It is also important to observe that for sets with a larger number of decision classes (Glass, Lymphography, Satimage) the method of searching for a global tolerance thresholds vector does not achieve satisfactory results. Better results could be gained by looking for a tolerance vector for each decision class separately (such an approach was postulated, i.a., in [31]); the TRS library does not have such functionality at present.
3.2 Rules Induction
The efficiency of the RMatrix algorithm can be compared with the methods generating all minimal decision rules and quasi-shortest minimal decision rules, since the idea of RMatrix joins the method of generating the quasi-shortest object-related minimal rule with a loosening of the conditions concerning the accuracy of the determined rule. The comparison was done for the same tolerance threshold values, determined by the heuristic algorithm or by the genetic algorithm (for smaller data sets). The results of the comparison are presented in Table 2. During rule induction the following rule quality measures were used: Accuracy, IKIB, Coleman, Brazdil, Cohen, IREP, Chi-2, WSY (the WS measure from the YAILS system), C2, Gain, and C2F (the C2 measure with the Bayesian confirmation measure f instead of Accuracy). For the RMatrix algorithm, the measure that gave the best results is reported. RMatrix makes it possible to obtain classification accuracies similar to (and in some cases better than) the quasi-shortest minimal decision rule induction algorithm. What is essential is that the number of generated rules is smaller (in some cases much smaller); the rules are more general and less accurate, which is sometimes a desirable property. Information about the classification accuracy and the number of rules obtained by the MODLEM algorithm and by its version using quality measures in the stop criterion is presented in Table 3. Loosening the requirements concerning the accuracy of a rule created by MODLEM, and stopping the rule generation process when its quality begins to decrease, made it possible to obtain better results than with the standard version of MODLEM. Moreover, the number of rules is smaller. Depending on the analyzed data set, the best results are obtained by applying various quality measures.
Table 2. Results of the RMatrix algorithm

Dataset                  Algorithm             Accuracy (%)   Rules
Australian               RMatrix (Gain)        86             32
                         Shortest dec. rules   86             111
                         All minimal rules     86             144
Breast                   RMatrix (Gain)        73             31
                         Shortest dec. rules   70             50
                         All minimal rules     71             57
Iris + discretization    RMatrix (Gain)        97             7
                         Shortest dec. rules   97             9
                         All minimal rules     97             17
Heart                    RMatrix (Chi-2)       80             24
                         Shortest dec. rules   80             111
                         All minimal rules     80             111
Lymphography             RMatrix (Gain)        82             46
                         Shortest dec. rules   81             62
                         All minimal rules     84             2085
Monk I                   RMatrix (C2)          100            10
                         Shortest dec. rules   100            10
                         All minimal rules     93             59
Monk II                  RMatrix (IREP)        75             39
                         Shortest dec. rules   79             64
                         All minimal rules     74             97
Monk III                 RMatrix (IREP)        97             9
                         Shortest dec. rules   97             12
                         All minimal rules     97             12
Table 3. Results of the MODLEM algorithm

                    MODLEM                     MODLEM Modif.
Dataset             Accuracy (%)   Rules       Accuracy (%)    Rules
Australian          77             274         86 (Brazdil)    29
Breast              68             101         73 (Brazdil)    64
Iris                92             18          92 (C2F)        9
Heart               65             90          80 (Brazdil)    20
Lymphography        74             39          81 (IREP)       15
Monk I              92             29          100 (IREP)      9
Monk II             65             71          66 (IREP)       18
Monk III            92             29          97 (Gain)       13

3.3 Rules Joining
The operation of the joining algorithm was verified in two ways. In the first case, rules were determined from the quasi-shortest object-related relative reducts, with the tolerance threshold values found earlier by a genetic algorithm; the results are presented in Table 4.
Table 4. Results of using the decision rules joining algorithm

Dataset (algorithm parameters)   Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (IREP, 0%)            86             130     62                   0
Breast (IREP, 10%)               73             25      25                   0
Iris (IREP, 0%)                  95             19      53                   0
Heart (IREP, 0%)                 79             56      69                   +1
Lymphography (IREP, 0%)          82             43      9                    0
Monk I (C2, 0%)                  100            8       27                   0
Monk II (Acc., 0%)               80             49      22                   +1
Monk III                         97             9       0                    0
Table 5. Results of joining of decision rules obtained by the MODLEM algorithm

Dataset (algorithm parameters)   Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (IREP, 10%)           85             24      17                   −1
Breast (Brazdil, 30%)            73             55      14                   0
Iris (IREP, 0%)                  94             8       11                   +1
Heart                            80             20      0                    0
Lymphography (IREP, 0%)          80             13      13                   −1
Monk I                           92             9       0                    0
Monk II (Brazdil, 10%)           67             17      10                   +1
Monk III (Gain, 0%)              97             6       53                   0
In the second case (Table 5), rules obtained by MODLEM exploiting quality measures were subjected to joining. Various rule quality measures were applied, and at most a thirty percent decrease of rule quality was admitted during the joining. As Table 4 shows, the algorithm makes it possible to reduce a rule set by 38% on average (median 27%) without causing considerable increases or decreases of classification accuracy. The method of rule generation and data preparation (the establishing of tolerance thresholds) means that approximate rules are subjected to joining, so a further decrease of their quality is pointless. In [43] experiments with accurate rules were also carried out; it was shown there that in the case of accurate rules it is worth admitting a decrease of rule quality of at most 30%. The IREP and C2 measures give the best results in most cases. The results presented in Table 5 suggest that, in the case of the modified version of MODLEM, applying the joining algorithm leads to an insignificant reduction of the number of rules. A general (and obvious) conclusion can also be drawn that the compression degree is greater the more numerous the input rule set is.
Table 6. Results of rules filtration algorithms

Dataset                            Algorithm       Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (R-Gain, F-Gain)        From coverage   86             18      37                   0
                                   Forward         86             2       93                   0
                                   Backwards       86             2       93                   0
Breast (R-Gain, F-Gain)            From coverage   24             12      52                   −49
                                   Forward         73             9       64                   0
                                   Backwards       68             8       68                   −5
Iris (R-Gain, F-Gain)              From coverage   97             7       0                    0
                                   Forward         97             5       28                   0
                                   Backwards       97             4       42                   0
Heart (R-Chi-2, F-Chi-2)           From coverage   80             12      50                   0
                                   Forward         82             15      37                   +2
                                   Backwards       81             9       62                   +1
Lymphography (R-Gain, F-IREP)      From coverage   79             14      65                   −3
                                   Forward         82             16      62                   0
                                   Backwards       82             13      69                   0
Monk I (R-C2, F-C2)                From coverage   100            10      0                    0
                                   Forward         100            10      0                    0
                                   Backwards       100            10      0                    0
Monk II (Q-Accuracy, F-Accuracy)   From coverage   78             42      14                   −2
                                   Forward         79             36      26                   −1
                                   Backwards       75             32      35                   −5
Monk III (R-IREP, F-IREP)          From coverage   97             6       14                   0
                                   Forward         97             3       57                   0
                                   Backwards       97             3       57                   0

3.4 Rules Filtration
In order to illustrate the operation of the filtration algorithms, the rules which gave the highest classification accuracy in the research so far were selected, and it was verified whether the filtration algorithms make it possible to limit the number of rules. The rule induction method and the name of the rule quality measure used in the algorithm are given next to the name of the data set in Table 6. The following denotations were used: the RMatrix algorithm – R, the MODLEM algorithm – M, the quasi-shortest minimal decision rules algorithm – Q, rules filtration – F. The same measure as in the filtration algorithm, or one of the basic measures (accuracy, coverage), was applied during classification. The whole training set was used as the tuning set in the Forward and Backwards algorithms. The results obtained by the filtration algorithms show that, as in the case of the joining algorithms, the compression degree is higher the larger the input rule set is. Filtration leads to a significant limitation of the rule set, though the From coverage algorithm often causes a decrease in classification accuracy. With respect to classification accuracy, the Forward algorithm behaves best, restricting the rule set considerably without losing classification abilities.
3.5 TRS Results – Comparison with Other Methods and a Real-Life Data Analysis Example
The results obtained by the methods included in TRS were compared with several decision tree and decision rule induction algorithms. Algorithms which, like TRS, use neither constructive induction methods nor soft computing solutions were selected. These algorithms were: CART [7], C5 [37], SSV [11], LEM2 [19] and the RIONA algorithm, which joins rule induction with the k-NN method [17,58]. All of the quoted algorithms (including the algorithms contained in TRS) try to create descriptions of data, in the form of decision trees or rules, that are as synthetic as possible. RIONA, which joins the two approaches, is the exception here. For some value of k set by an adaptive method, the k nearest neighbors of a fixed test object tst are found in the training set. Next, for each example trn included in the selected set of training examples, a so-called maximal local decision rule covering trn and tst is determined. If the rule is consistent in the set of selected training examples, then the example trn is added to the support set of the appropriate decision. Finally, the RIONA algorithm selects the decision whose support set has the highest cardinality.

All presented results come from experiments that were carried out with the following software: RSES (RIONA, LEM2), CART – Salford Systems (CART), See5 (C5) and GhostMiner (SSV). Optimal parameter values were matched on the training set; after establishing the optimal parameters on the training set, the efficiency of the algorithms was tested. For the presented algorithms the parameters were:

– TRS – the algorithm and the tolerance thresholds quality measure; the rule quality measure in the algorithms of rule induction, rule joining, rule filtration and classification. If a quality measure was selected during rule induction, then the same measure or one of the two basic measures (accuracy or coverage) was used during the further postprocessing and classification stages. The order of postprocessing was always the same and consisted in applying the joining algorithm and then filtration;
– RIONA – the optimal number of neighbors;
– CART – the criterion of partition optimality in a node (gini, entropy, class probability);
– SSV – the search strategy (best first or beam search – cross-validation on the whole training set);
– LEM2 – the rules shortening ratio;
– See5 – the tree pruning degree.

Apart from the 10-fold cross-validation classification accuracy (for TRS the standard deviation was also given), the classification accuracy and the number of rules obtained on the whole available data set (for the Monks and Satimage sets, on the training sets) are also given. Analysis of the obtained results leads to the conclusion that applying rule quality measures at each stage of induction makes it possible to achieve good
classification results with the induction algorithm. Applying postprocessing algorithms (especially filtration) makes it possible to reduce the number of rules in the classifier, which is important from the knowledge discovery point of view. The results obtained by TRS have the feature that the differences between the accuracy and the number of rules determined on the whole data set and in the cross-validation mode are not large. Since the majority of the applied algorithms create approximate rules, overlearning manifests itself merely in generating too large a rule set, which can still be effectively filtered. The classification accuracy obtained by TRS is comparable with the other presented algorithms. In some cases (Satimage, Monk II, Lymphography) the RIONA algorithm achieves classification accuracy higher than the other systems, which use the rule approach only.

Finally, a method of solving the real-life problem of monitoring seismic hazards in hard-coal mines by the algorithms included in TRS is presented. Seismic hazard is monitored by seismic and seismoacoustic measuring instruments placed in the mine's underground workings. The acquired data are transmitted to the Hestia computer system [44], where they are aggregated. The evaluation of bump risk in a given excavation is calculated on the basis of the aggregated data. Unfortunately, the presently used methods of risk evaluation are not very accurate; therefore, new methods enabling warning of a coming hazard are sought.
Table 8. TRS results – comparison with other methods

Dataset        TRS                  RIONA   CART   SSV             LEM2            C5
Australian     86 ± 0.2 (88, 11)    86      86     86 (86, 2)      87 (88, 126)    86 (92, 11)
Breast         73 ± 0.3 (75, 6)     73      67     76 (76, 3)      74 (78, 96)     73 (76, 3)
Iris           97 ± 0.2 (97, 4)     95      94     95 (98, 4)      94 (95, 7)      97 (97, 4)
Glass          66 ± 0.4 (87, 40)    71      70     71 (89, 21)     67 (91, 89)     70 (91, 12)
Heart          82 ± 0.3 (86, 11)    83      79     77 (86, 7)      81 (84, 43)     77 (90, 11)
Lymphography   82 ± 0.3 (90, 19)    85      82     76 (91, 16)     82 (100, 32)    80 (96, 12)
Pima           75 ± 0.2 (77, 11)    75      75     74 (76, 5)      73 (82, 194)    74 (81, 11)
Monk I         100 (100, 7)         96      87     100 (100, 11)   88 (98, 17)     84 (84, 5)
Monk II        80 (85, 49)          83      79     67 (85, 20)     73 (86, 68)     67 (66, 2)
Monk III       97 (93, 3)           94      97     97 (95, 6)      96 (94, 23)     97 (97, 6)
Satimage       83 (87, 92)          91      84     82 (83, 9)      82 (85, 663)    87 (95, 96)
In the research carried out, a decision table was created which contained information (registered during successive shifts) about the aggregated values of: the seismoacoustic and seismic energy emitted in the excavation, the number of tremors in individual energy classes (from 1·10² J to 9·10⁷ J), the risk evaluation generated by the classic evaluation methods, and the category of a shift (mining or maintenance). Data from the hazardous longwall SC508 in the Polish coal mine KWK Wesoła were collected. The data set numbered 864 objects; the data were divided into two decision classes reflecting the summary seismic energy that will be emitted in the excavation during the next shift. The limiting value for the decision classes was an energy of 1·10⁵ J. The number of objects belonging to the decision class Energy > 1·10⁵ J amounted to 97. Because of the irregular distribution of the decision classes, the global strategy of determining tolerance threshold values was inadvisable; therefore, the modified version of the MODLEM algorithm was used for rule induction. The best results were obtained for the Cohen measure and the Forward filtration algorithm: the classification accuracy in the cross-validation test was 77%, and the accuracies of the individual classes were 76% and 77%. Applying other quality measures made it possible to achieve better classification accuracy, up to 89%, but at the cost of the ability to recognize seismic hazards. In each experiment 22 rules were obtained on average; the decision class describing a potential hazard was described by two rules. The Forward filtration algorithm made it possible to remove 18 rules and obtain decision class descriptions composed of two rules. The accuracy analysis of the 22 rules made it possible to establish an accuracy threshold of 0.9 for the rules describing the larger decision class (non-hazardous states).
By applying the algorithm of arbitrary rule filtration included in TRS (the Minimal Quality algorithm), five decision rules were obtained on average (3 describing the "non-hazardous state" and 2 describing the "hazardous state"). The premises of the determined rules consist of the following conditional attributes: the maximal seismoacoustic energy registered during a shift by any geophone, the average seismoacoustic energy registered by the geophones, the maximal number of impulses registered by any geophone, and the average number of impulses registered by the geophones. Exemplary rules are presented below:

IF avg_impulses < 2786 THEN "non-hazardous state",
IF max_energy < 161560 THEN "non-hazardous state",
IF avg_energy < 48070 THEN "non-hazardous state",
IF max_energy > 218000 THEN "hazardous state",
IF max_impulses > 1444 THEN "hazardous state".
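For illustration only, the five rules above can be read as a simple two-class classifier. The sketch below (in Python) is a minimal rendering of that reading; the attribute names come from the rules listed above, while the default answer and the order of checking are our own assumptions and do not reproduce the rule-quality-based voting actually used by TRS.

    # Minimal sketch: applying the five example rules to one shift description.
    # A shift is a dict with keys: avg_impulses, max_energy, avg_energy, max_impulses.
    def classify_shift(shift):
        # "hazardous state" rules are checked first; this ordering is an assumption.
        if shift["max_energy"] > 218000 or shift["max_impulses"] > 1444:
            return "hazardous state"
        if (shift["avg_impulses"] < 2786
                or shift["max_energy"] < 161560
                or shift["avg_energy"] < 48070):
            return "non-hazardous state"
        return "no rule fired"  # TRS resolves such cases with rule confidence measures

    example = {"avg_impulses": 3100, "max_energy": 230000,
               "avg_energy": 52000, "max_impulses": 1200}
    print(classify_shift(example))  # -> "hazardous state"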
It is also important that the determined rules are consistent with the intuition of the geophysicist working in the mine's geophysical station. After viewing the rules, the geophysicist confirmed that the dependencies they reflect are reasonable; however, he could not have specified beforehand which attributes and which ranges of their values have a decisive influence on the possibility of hazard occurrence. To recapitulate, we have here a typical example of knowledge discovery in databases. It is worth stressing that the analysis of the whole rule set (without filtration) is not nearly as simple. At present, research is being carried out to verify whether a classification system that enables hazard prediction in any mine excavation can be developed.
4 Conclusions
Several modifications of the standard methods of globally establishing tolerance threshold values, determining decision rules, and postprocessing them have been presented in the paper. The methods included in the TRS library are available on the Internet, in the NetTRS service. The reduct calculation and minimal decision rule induction algorithms, as well as the RMatrix algorithm included in the TRS library, use the discernibility matrix [47]. Therefore, they can be applied to data sets composed of at most several thousand objects (in the paper, the biggest data set was the Satimage set, for which searching for a tolerance thresholds vector and rules with the RMatrix algorithm took a few minutes). Methods that avoid the need to use the discernibility matrix are described in the literature: for the standard rough set model such propositions are presented in [27], and for the tolerance model in [54]. The other algorithms of shortening, joining and filtration, as well as the MODLEM algorithm, can be applied to the analysis of larger data sets composed of a few dozen thousand objects (the main operations performed by these algorithms are rule quality calculation and classification). However, the efficiency of the TRS library is lower than that of commercial solutions (CART, GhostMiner) or of the noncommercial RSES system, which has been developed over many years [6]. Exemplary results of experimental research show that the algorithms implemented in the library can be useful for obtaining small rule sets with good generalization
abilities (especially when the sequence induction, generalization, filtration is applied). The small differences between the results obtained on the training and test sets make it possible to match the most suitable quality measures to a given data set on the basis of an analysis of the training set alone. The results of the tests also suggest that good results can be obtained by using the same quality measure at both the rule induction and the postprocessing stage. This observation significantly restricts the space of possible solutions. The measure of confidence in a rule used during classification is an important parameter with a great influence on classification accuracy. In the paper, an adaptive method of confidence measure selection was applied, limited to one of two standard measures (accuracy, coverage) or to the measure used during rule induction. Well-matched values of the tolerance thresholds make it possible to generate rules with better generalization and description abilities than the MODLEM algorithm offers; applying the RMatrix algorithm or the algorithm determining quasi-shortest minimal decision rules is a good solution here. Postprocessing algorithms, especially the Forward filtration algorithm, make it possible to limit meaningfully the set of rules used in classification, which significantly improves the description abilities of the obtained rule set. A detailed specification of experimental results illustrating the effectiveness of the majority of the algorithms presented in the paper can be found, among others, in [39,40,41,42,43,45,46]. As reported in much research [2,3,9,39,40], it is impossible to point to a measure that always gives the best results, but it is possible to distinguish two groups of measures. One of them contains measures that put emphasis on rule coverage (among others, the measures IREP, Gain, Cohen, C2, WS); the second group includes measures that put greater emphasis on rule accuracy, which leads to determining many more rules (among others, the measures Accuracy, Brazdil, Coleman, IKIB). Obviously, the application of rule quality measures is sensible only if we admit the generation of approximate rules. If we determine exact rules, then the only rule quality measure worth using is rule coverage (alternatively, rule strength [60], length, or the LEF criterion [22], if the accuracy and coverage of the compared rules are identical). Further work will concentrate on adaptive quality measure selection (during rule induction). A suggestion of such a solution is presented in [3]; however, the authors carried out no experimental research and proposed no algorithm solving the problem. In our research we want to apply an approach in which simple characteristics of each decision class are calculated separately before the main process of rule induction. Based on these characteristics, a quality measure is selected that will make it possible to describe a given decision class by a small number of rules with good generalization abilities. A meta-classifier, created on the basis of an analysis of as many benchmark data sets as possible, will point to the potentially best quality measure to be applied at a given induction stage. Moreover, since the training set changes during the application of coverage algorithms (e.g. MODLEM), we assume that the quality measure used by the algorithm will also change during rule induction.
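To make the distinction between coverage-oriented and accuracy-oriented measures concrete, the sketch below computes the two basic measures mentioned above (rule accuracy and rule coverage) from a rule's contingency counts. It is a generic illustration of these standard definitions, not the TRS implementation, and the variable names are our own.

    # p - positive examples covered by the rule (match premise and decision)
    # n - negative examples covered by the rule (match premise, wrong decision)
    # P - all examples of the rule's decision class
    def rule_accuracy(p, n):
        # fraction of covered examples that really belong to the rule's class
        return p / (p + n) if (p + n) > 0 else 0.0

    def rule_coverage(p, P):
        # fraction of the decision class that the rule covers
        return p / P if P > 0 else 0.0

    # a rule covering 40 of 50 class examples, plus 10 counter-examples
    print(rule_accuracy(40, 10))  # 0.8
    print(rule_coverage(40, 50))  # 0.8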
It would also be interesting to verify to what degree applying filtration to a set of rules obtained by the method described in [4], in which a hierarchy of increasingly general rules is built, would reduce the number of rules used in classification.
References 1. Agotnes, T.: Filtering large propositional rule sets while retaining classifier performance. MSc Thesis. Norwegian University of Science and Technology, Trondheim, Norway (1999) 2. Agotnes, T., Komorowski, J., Loken, T.: Taming Large Rule Models in Rough Set Approaches. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 193–203. Springer, Heidelberg (1999) 3. An, A., Cercone, N.: Rule quality measures for rule induction systems – description and evaluation. Computational Intelligence 17, 409–424 (2001) 4. Bazan, J., Skowron, A., Wang, H., Wojna, A.: Multimodal classification: case studies. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 224–239. Springer, Heidelberg (2006) 5. Bazan, J.: A comprasion of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 1: Methododology and Applications, pp. 321–365. Physica, Heidelberg (1998) 6. Bazan, J., Szczuka, M., Wróblewski, J.: A new version of rough set exploration system. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 397–404. Springer, Heidelberg (2002) 7. Breiman, L., Friedman, J., Olshen, R., Stone, R.: Classificzation and Regression Trees. Wadsworth, Pacific Grove (1984) 8. Brazdil, P.B., Togo, L.: Knowledge acquisition via knowledge integration. Current Trends in Knowledge Acquisition. IOS Press, Amsterdam (1990) 9. Bruha, I.: Quality of Decision Rules: Definitions and Classification Schemes for Multiple Rules. In: Nakhaeizadeh, G., Taylor, C.C. (eds.) Machine Learning and Statistics, The Interface, pp. 107–131. Wiley, NY (1997) 10. Brzeziñska, I., Greco, S., Sowiñski, R.: Mining Pareto-optimal rules with respect to support and confirmation or support and anti-support. Engineering Applications of Artificial Intelligence 20, 587–600 (2007) 11. Duch, W., Adamczak, K., Grbczewski, K.: Methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transaction on Neural Networks 12, 277–306 (2001) 12. Furnkranz, J., Widmer, G.: Incremental Reduced Error Pruning. In: Proceedings of the Eleventh International Conference of Machine Learning, New Brunswick, NJ, USA, pp. 70–77 (1994) 13. Greco, S., Matarazzo, B., Sowiñski, R.: The use of rough sets and fuzzy sets in MCDM. In: Gal, T., Hanne, T., Stewart, T. (eds.) Advances in Multiple Criteria Decision Making, pp. 1–59. Kluwer Academic Publishers, Dordrecht (1999) 14. Greco, S., Materazzo, B., Sowiñski, R.: Rough sets theory for multicriteria decision analysis. European Journal of Operational Research 129, 1–47 (2001) 15. Greco, S., Pawlak, Z., Sowiñski, R.: Can Bayesian confirmation measures be use-ful for rough set decision rules? Engineering Applications of Artificial Intelligence 17, 345–361 (2004)
16. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company Inc., Boston (1989) 17. Góra, G., Wojna, A.: RIONA: A new classification system combining rule induction and instance-based learning. Fundamenta Informaticae 51(4), 369–390 (2002) 18. Grzymaa-Busse, J.W.: LERS - a system for learning from examples based on rough sets. In: Sowiñski, R. (ed.) Intelligent Decision Support. Handbook of applications and advances of the rough set theory, pp. 3–18. Kluwer Academic Publishers, Dordrecht (1992) 19. Grzymaa-Busse, J.W., Ziarko, W.: Data mining based on rough sets. In: Wang, J. (ed.) Data Mining Opportunities and Challenges, pp. 142–173. IGI Publishing, Hershey (2003) 20. Guillet, F., Hamilton, H.J. (eds.): Quality Measures in Data Mining. Computational Intelligence Series. Springer, Heidelberg (2007) 21. Kanonenko, I., Bratko, I.: Information-based evaluation criterion for classifier‘s performance. Machine Learning 6, 67–80 (1991) 22. Kaufman, K.A., Michalski, R.S.: Learning in Inconsistent World, Rule Selection in STAR/AQ18. Machine Learning and Inference Laboratory Report P99-2 (February 1999) 23. Kubat, M., Bratko, I., Michalski, R.S.: Machine Learning and Data Mining: Methods and Applications. Wiley, NY (1998) 24. Latkowski, R., Mikoajczyk, M.: Data Decomposition and Decision Rule Joining for Classification of Data with Missing Values. In: Peters, J.F., Skowron, A., GrzymałaBusse, J.W., Kostek, B.z., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 299–320. Springer, Heidelberg (2004) 25. Michalski, R.S., Carbonell, J.G., Mitchel, T.M.: Machine Learning, vol. I. MorganKaufman, Los Altos (1983) 26. Mikoajczyk, M.: Reducing number of decision rules by joining. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 425–432. Springer, Heidelberg (2002) 27. Nguyen, H.S., Nguyen, S.H.: Some Efficient Algorithms for Rough Set Methods. In: Proceedings of the Sixth International Conference, Information Processing and Management of Uncertainty in Knowledge-Based Systems, Granada, Spain, pp. 1451–1456 (1996) 28. Nguyen, H.S., Nguyen, T.T., Skowron, A., Synak, P.: Knowledge discovery by rough set methods. In: Callaos, N.C. (ed.) Proc. of the International Conference on Information Systems Analysis and Synthesis, ISAS 1996, Orlando, USA, July 22-26, pp. 26–33 (1996) 29. Nguyen, H.S., Skowron, A.: Searching for relational patterns in data. In: Komorowski, J., Żytkow, J.M. (eds.) PKDD 1997. LNCS, vol. 1263, pp. 265–276. Springer, Heidelberg (1997) 30. Nguyen, H.S., Skowron, A., Synak, P.: Discovery of data patterns with applications to decomposition and classfification problems. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems, pp. 55–97. Physica, Heidelberg (1998) 31. Nguyen, H.S.: Data regularity analysis and applications in data mining. Doctoral Thesis, Warsaw University. In: Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.) Rough set methods and applications: New developments in knowledge discovery in information systems, pp. 289–378. Physica-Verlag/Springer, Heidelberg (2000), http://logic.mimuw.edu.pl/
32. Ohrn, A., Komorowski, J., Skowron, A., Synak, P.: The design and implementation of a knowledge discovery toolkit based on rough sets: The ROSETTA system. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications, pp. 376–399. Physica, Heidelberg (1998) 33. Pawlak, Z.: Rough Sets. Theoretical aspects of reasoning about data. Kluwer Academic Publishers, Dordrecht (1991) 34. Pednault, E.: Minimal-Length Encoding and Inductive Inference. In: PiatetskyShapiro, G., Frawley, W.J. (eds.) Knowledge Discovery in Databases, pp. 71–92. MIT Press, Cambridge (1991) 35. Pindur, R., Susmaga, R., Stefanowski, J.: Hyperplane aggregation of dominance decision rules. Fundamenta Informaticae 61, 117–137 (2004) 36. Podraza, R., Walkiewicz, M., Dominik, A.: Credibility coefficients in ARES Rough Sets Exploration Systems. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3642, pp. 29–38. Springer, Heidelberg (2005) 37. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan-Kaufman, San Mateo (1993) 38. Prêdki, B., Sowiñski, R., Stefanowski, J., Susmaga, R.: ROSE – Software implementation of the rough set theory. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, p. 605. Springer, Heidelberg (1998) 39. Sikora, M., Proksa, P.: Algorithms for generation and filtration of approximate decision rules, using rule-related quality measures. In: Proceedings of International Workshop on Rough Set Theory and Granular Computing (RSTGC 2001), Matsue, Shimane, Japan, pp. 93–98 (2001) 40. Sikora, M.: Rules evaluation and generalization for decision classes descriptions improvement. Doctoral Thesis, Silesian University of Technology, Gliwice, Poland (2001) (in Polish) 41. Sikora, M., Proksa, P.: Induction of decision and association rules for knowledge discovery in industrial databases. In: International Conference on Data Mining, Alternative Techniques for Data Mining Workshop, Brighton, UK (2004) 42. Sikora, M.: Approximate decision rules induction algorithm using rough sets and rule-related quality measures. Theoretical and Applied Informatics 4, 3–16 (2004) 43. Sikora, M.: An algorithm for generalization of decision rules by joining. Foundation on Computing and Decision Sciences 30, 227–239 (2005) 44. Sikora, M.: System for geophysical station work supporting - exploitation and development. In: Proceedings of the 13th International Conference on Natural Hazards in Mining, Central Mining Institute, Katowice, Poland, pp. 311–319 (2006) (in Polish) 45. Sikora, M.: Rule quality measures in creation and reduction of data role models. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 716–725. Springer, Heidelberg (2006) 46. Sikora, M.: Adaptative application of quality measures in rules induction algorithms. In: Kozielski, S. (ed.) Databases, new technologies, vol. I. Transport and Communication Publishers (Wydawnictwa Komunikacji i Łączności), Warsaw (2007) (in Polish) 47. Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information systems. In: Sowiñski, R. (ed.) Intelligent Decision Support. Handbook of applications and advances of the rough set theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
48. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 224–239. Springer, Heidelberg (2006) 49. Skowron, A., Wang, H., Wojna, A., Bazan, J.: Multimodal Classification: Case Studies. Fundamenta Informaticae 27, 245–253 (1996) 50. Sowiñski, R., Greco, S., Matarazzo, B.: Mining decision-rule preference model from rough approximation of preference relation. In: Proceedings of the 26th IEEE Annual Int. Conf. on Computer Software and Applications, Oxford, UK, pp. 1129– 1134 (2002) 51. Stefanowski, J.: Rough set based rule induction techniques for classification problems. In: Proceedings of the 6th European Congress of Intelligent Techniques and Soft Computing, Aachen, Germany, pp. 107–119 (1998) 52. Stefanowski, J.: Algorithms of rule induction for knowledge discovery. Poznañ University of Technology, Thesis series 361, Poznañ, Poland (2001) (in Polish) 53. Smyth, P., Gooodman, R.M.: Rule induction using information theory. In: Piatetsky-Shapiro, G., Frawley, W.J. (eds.) Knowledge Discovery in Databases, pp. 159–176. MIT Press, Cambridge (1991) 54. Stepaniuk, J.: Knowledge Discovery by Application of Rough Set Models. Institute of Computer Sciences Polish Academy of Sciences, Reports 887, Warsaw, Poland (1999) 55. Stepaniuk, J., Krêtowski, M.: Decision System Based on Tolerance Rough Sets. In: Proceedings of the 4th International Workshop on Intelligent Information Systems, Augustów, Poland, pp. 62–73 (1995) 56. Ślęzak, D., Wróblewski, J.: Classification Algorithms Based on Linear Combination of Features. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 548–553. Springer, Heidelberg (1999) 57. Wang, H., Duentsch, I., Gediga, G., Skowron, A.: Hyperrelations in version space. International Journal of Approximate Reasoning 36(3), 223–241 (2004) 58. Wojna, A.: Analogy based reasoning in classifier construction. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 277–374. Springer, Heidelberg (2005) 59. Ziarko, W.: Variable precision rough sets model. Journal of Computer and System Sciences 46, 39–59 (1993) 60. Zhong, N., Skowron, A.: A rough set-based knowledge discovery process. International Journal of Applied Mathematics and Computer Sciences 11, 603–619 (2001) 61. Yao, Y.Y., Zhong, N.: An Analysis of Quantitative Measures Associated with Rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS (LNAI), vol. 1574, pp. 479–488. Springer, Heidelberg (1999)
A Distributed Decision Rules Calculation Using Apriori Algorithm Tomasz Strąkowski and Henryk Rybiński Warsaw University of Technology, Poland
[email protected],
[email protected]
Abstract. Calculating decision rules is a very important process. There are many solutions for computing decision rules, but the algorithms that compute the complete set of rules are time consuming. We propose a recursive version of the well-known apriori algorithm [1], designed for parallel processing. We present here how to decompose the problem of calculating decision rules so that the parallel calculations are efficient. Keywords: Rough set theory, decision rules, distributed computing.
1 Introduction
The number of applications of Rough Set Theory (RST) in the field of Data Mining is growing, and there is much research in which advanced tools for RST are developed. The most valuable feature of this theory is that it discovers knowledge from vague data. Decision rules are one of the fundamental concepts in RST, and their discovery is one of the main research areas. This paper addresses the problem of generating the complete set of decision rules. One of the most popular methods of finding all the rules is the RS-apriori algorithm proposed by Kryszkiewicz [1], based on the idea of the apriori algorithm proposed in [2]. Among other techniques solving this problem it is worth noting LEM [3] and incremental learning [4]. The LEM algorithm gives a restricted set of rules, which covers all objects from the Decision Table (DT), but does not necessarily discover all the rules. Incremental learning is used to efficiently update the set of rules when new objects are added to DT. The main problem with discovering decision rules is the fact that the process is very time consuming. There are generally two ways of speeding it up: the first one is to use some heuristics, the second one consists in distributing the computation among a number of processors. In this paper we focus on the second approach. We present here a modification of the apriori algorithm, designed for
The research has been partially supported by grant No 3 T11C 002 29 received from the Polish Ministry of Education and Science, and partially by a grant of the Rector of Warsaw University of Technology No 503/G/1032/4200/000.
parallel computing. The problem of distributed generation of the rules was addressed in [5], in the form of a multi-agent system. The idea of the multi-agent system consists in splitting the information system into many independent subsystems; next, each subsystem computes rules for its part, and a specialized agent merges the partial decision rules. Unfortunately, this method does not provide the same results as the sequential version. Our aim is to provide a parallel algorithm that gives the same result as a sequential algorithm finding all the rules, but in a more efficient way. The paper is composed as follows. In Section 2 we recall basic notions related to rough set theory, referring to the concept of decision rules. Section 3 presents the original version of the rough set adaptation of the apriori algorithm for rule calculation [1]. Then we present our proposal for storing rules in a tree form, which is slightly different from the original hash tree proposed in [6,2] and makes it possible to define recursive calculation of the rules; we present such a recursive version there. Section 4 is devoted to presenting two effective ways of computing the rules in parallel with a number of processors. Section 5 provides experimental results for the presented algorithms. We conclude the paper with a discussion of the effectiveness of the proposed approaches.
2 Computing Rules and Rough Sets Basis
Let us start by recalling basic notions of rough set theory. An information system IS is a pair IS = (U, A), where U is a finite set of elements and A is a finite set of attributes which describe the elements. For every a ∈ A there is a function U → V_a, assigning a value v ∈ V_a of the attribute a to each object u ∈ U, where V_a is the domain of a. The indiscernibility relation is defined as follows: IND(A) = {(u, v) : u, v ∈ U, ∀a ∈ A a(u) = a(v)}. The IND relation can also be defined for a particular attribute: IND(a) = {(u, v) : u, v ∈ U, a(u) = a(v)}. Informally speaking, two objects u and v are indiscernible for the attribute a if they have the same value of that attribute. One can show that the indiscernibility relation for a set B of attributes, B ⊆ A, can be expressed as IND(B) = ∩_{a∈B} IND(a). In the sequel, I_B(x) denotes the set of objects indiscernible with x wrt B. IND(B) is an equivalence relation and it splits IS into abstraction classes. The set of abstraction classes wrt B will further be denoted by AC_B(IS), so AC_B(IS) = {Y_i | ∀y ∈ Y_i, Y_i = I_B(y)}. Given X ⊆ U and B ⊆ A, we say that B(X) is the lower approximation of X if B(X) = {x ∈ U | I_B(x) ⊆ X}, and that B̄(X) is the upper approximation of X if B̄(X) = {x ∈ U | I_B(x) ∩ X ≠ ∅}. B(X) is the set of objects that certainly belong to X, and B̄(X) is the set of objects that possibly belong to X, hence B(X) ⊆ X ⊆ B̄(X). A decision table (DT) is an information system DT = {U, A ∪ D}, where D ∩ A = ∅. The attributes from D are called decision attributes, and the attributes
from A are called condition attributes. The set AC_D(DT) will be called the set of decision classes. A decision rule has the form t → s, where t = ∧(c, v) with c ∈ A and v ∈ V_c, and s = ∧_{d_i∈D}(d_i, w_i) with w_i ∈ V_{d_i}; t is called the antecedent and s the consequent. By ||t|| we denote the set of objects described by the conjunction of the pairs (attribute, value). The cardinality of ||t|| is called the support of t and will be denoted by sup(t). We say that an object u ∈ U supports the rule t → s if u belongs to ||t|| and to ||s||, i.e. u ∈ ||t ∧ s||. Given a threshold, we say that the rule t → s is frequent if its support is higher than the threshold. The rule is called certain if ||t|| ⊆ ||s||, and possible if ||t|| ⊆ Ā(||s||). The confidence of the rule t → s is defined as sup(t ∧ s)/sup(t); for certain rules the confidence is 100%. Any rule with k pairs (attribute, value) in the antecedent is called a k-ant rule. Any rule t ∧ p → s is called an extension of the rule t → s, and the rule t → s is called a seed of the rule t ∧ p → s. A certain/possible rule is called optimal if there is no other certain/possible rule with fewer conditions in the antecedent.
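The notions above can be traced on a toy example. The following Python sketch is our own illustration, with a hypothetical four-object table; it computes the indiscernibility classes I_B(x), the lower and upper approximations of a decision class, and the support and confidence of one rule, directly from the definitions.

    # Toy decision table: condition attributes a, b; decision attribute d.
    U = {
        "u1": {"a": 0, "b": 1, "d": 1},
        "u2": {"a": 0, "b": 1, "d": 0},
        "u3": {"a": 1, "b": 0, "d": 0},
        "u4": {"a": 1, "b": 1, "d": 1},
    }
    B = ["a", "b"]

    def ind_class(x, B):
        # I_B(x): objects indiscernible from x with respect to B
        return {y for y in U if all(U[y][a] == U[x][a] for a in B)}

    def lower(X, B):   # B(X) = {x : I_B(x) is a subset of X}
        return {x for x in U if ind_class(x, B) <= X}

    def upper(X, B):   # B̄(X) = {x : I_B(x) intersects X}
        return {x for x in U if ind_class(x, B) & X}

    D1 = {x for x in U if U[x]["d"] == 1}          # decision class d = 1
    print(sorted(lower(D1, B)))                    # ['u4']
    print(sorted(upper(D1, B)))                    # ['u1', 'u2', 'u4']

    # Rule (a = 1) -> (d = 1): support and confidence
    t = {x for x in U if U[x]["a"] == 1}           # ||t||
    ts = {x for x in t if U[x]["d"] == 1}          # ||t and s||
    print(len(ts), len(ts) / len(t))               # support 1, confidence 0.5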
3 Apriori Algorithm
RS-apriori [1] was invented as a way to compute rules from large databases. A drawback of the approach is that the resulting set of rules cannot be too large, otherwise memory problems emerge when storing all the candidates. To this end we attempt to reduce the memory requirements and achieve better time efficiency. First, let us recall the original version of RS-apriori [1] (Algorithm 3.1). The algorithm is run for a certain threshold defined by the user. As the first step, the set of rules with one element in the antecedent (C1) is created with INIT_RULES, which is performed by a single pass over the decision table (Algorithm 3.2). Then the loop starts. Every iteration consists of the following three phases: 1. counting the support and confidence of all candidate rules; 2. selecting the rules with support higher than the threshold; 3. making (k+1)-ant rules based on the k-ant rules. The third step of the iteration cycle performs pruning. The idea of pruning is based on the fact that every seed of a frequent rule has to be frequent. Using this property, if some k-ant rule has a (k-1)-ant seed rule which is not frequent, we can delete the k-ant rule without counting its support. Thus the pruning reduces the number of candidate rules. The stop condition can be defined as follows: 1. after selecting the rule candidates, the number of rules is less than 2 (so it is not possible to perform the next step), or 2. the number k of the k-ant rule level is greater than or equal to the number of conditional attributes.
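The pruning step relies only on the apriori property quoted above (every seed of a frequent rule must itself be frequent). A small sketch of that check is given below; the representation of antecedents as frozensets of (attribute, value) pairs and the function names are our own choices, not part of the original algorithm.

    from itertools import combinations

    def prune(candidates_k1, frequent_k, k):
        """Keep a (k+1)-ant candidate only if all of its k-ant seeds are frequent.
        candidates_k1, frequent_k: dicts mapping consequent -> set of antecedents,
        where an antecedent is a frozenset of (attribute, value) pairs."""
        kept = {}
        for cons, ants in candidates_k1.items():
            freq = frequent_k.get(cons, set())
            kept[cons] = {
                ant for ant in ants
                if all(frozenset(seed) in freq for seed in combinations(ant, k))
            }
        return kept

    frequent_2 = {"d=1": {frozenset({("a", 1), ("b", 1)}),
                          frozenset({("a", 1), ("c", 0)})}}
    candidates_3 = {"d=1": {frozenset({("a", 1), ("b", 1), ("c", 0)})}}
    # the seed {(b,1),(c,0)} is not frequent, so the candidate is pruned
    print(prune(candidates_3, frequent_2, 2))   # {'d=1': set()}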
Algorithm 3.1: RS-Apriori CERTAIN(DT, minSupport)

procedure APRIORI_RULE_GEN(Ck)
  insert into Ck+1 (ant, cons, antCount, ruleCount)
    select f.ant[1], f.ant[2], ..., f.ant[k], c.ant[k], f.cons, 0, 0
    from Ck f, Ck c
    where f.cons = c.cons and f.ant[k-1] = c.ant[k-1]
      and (f.ant[k]).attribute < (c.ant[k]).attribute
  for all c ∈ Ck+1                               /* rules pruning */
    do if |{f ∈ Ck : f.cons = c.cons and f.ant ⊂ c.ant}| < k + 1
         then delete c from Ck+1
  return (Ck+1)

main
  C1 = INIT_RULES(DT)
  k = 1
  while Ck ≠ ∅ and k ≤ |A|
    do Rk = ∅
       for all candidates c ∈ Ck
         do c.antCount = 0; c.ruleCount = 0
       for all objects u ∈ U
         do for all candidates c ∈ Ck
              do if c.ant ⊂ I_A(u)
                   then c.antCount = c.antCount + 1
                        if c.cons ∈ I_A(u)
                          then c.ruleCount = c.ruleCount + 1
       for all candidates c ∈ Ck
         do if c.ruleCount ≤ minSupport
              then delete c from Ck
              else if c.ruleCount = c.antCount
                     then move c from Ck to Rk
       Ck+1 = APRIORI_RULE_GEN(Ck)
       k = k + 1
  return (∪_k Rk)

Algorithm 3.2: INIT_RULES(DT)

main
  C1 = ∅
  for all objects x ∈ DT
    do C1 = C1 ∪ {c ∉ C1 | c.cons = (d, v) ∈ I_A(x) and c.ant = (a, v) ∈ I_A(x) and a ∈ A}
  return (C1)

Even for this algorithm some parts can be performed in parallel. In particular, instead of building the set C1 from the whole decision table, we could run it separately for each decision class. The process of generating rules for a decision class is independent of the other classes, and can be run in parallel with the other processes. If we look for certain rules, we use the lower approximations
of the decision class to build C1. If we look for possible rules, we use the whole decision class.

3.1 Tree Structure for Keeping Candidate Rules
In the original algorithm [6] candidate rules are stored in a hash tree for the following two reasons: 1. it guarantees more efficient counting of the support of the candidates; 2. it is easier and more efficient to prune the candidate set. In our approach we also build a tree for storing the partial results. It has the same properties as the original Agrawal tree [6], but additionally only the rules that are in one node may be parents of the rules in the node's subtree. This property is very important, as it makes it possible to perform the process of searching for rules independently for particular subtrees, in a recursive way. Let us call a particular pair (attribute, value) a single condition. All the k-ant rules are kept on the k-th level of the tree, and every rule that can potentially be joined into a new candidate rule is kept in the same node of the tree.
Fig. 1. Tree for keeping candidate rules (the root holds the 1-ant rules (a=1) → s, (b=1) → s, (c=1) → s; the second level holds (a=1 ∧ b=1) → s and (a=1 ∧ c=1) → s in one node and (b=1 ∧ c=1) → s in another; the third level holds (a=1 ∧ b=1 ∧ c=1) → s)
Let us assume that the condition parts in all the rules stored in the tree are always in the same order (i.e. attribute A is always before B, and B is before C). At the first level the tree starts with a root, where all 1-ant rules are stored together. On the second level of the tree each node stores the rules with the same attribute for the first single condition. So the number of nodes on the second level is equal to the number of different attributes used in the first conditions of the rules. Each of the third level nodes of the tree contains the rules that start
with the same attributes in the first two conditions of the antecedents. Generally speaking, on the k-th level of the tree each node stores the rules with the same attributes in the first k-1 conditions, counted from the beginning. Below we present the proposed algorithm in more detail.

3.2 Recursive Version of Apriori
For the tree structure described above we can define an algorithm which recursively performs the process of discovering rules from a decision table. The proposed algorithm is presented as Algorithm 3.3. It consists of the definitions of the following procedures: 1. AprioriRuleGen, which for a tree node t in T generates all possible (k+1)-ant candidate rules, based on the k-ant rules stored in the node; 2. CheckRules, which checks the support of all the candidate rules in a given node t and removes the rules with support less than a threshold; 3. CountSupportForNode, which counts the support and confidence of all candidate rules in a node t; 4. ComputeRules, which for a decision table DT, minSupport and a given node t computes the rules by calling the procedure CountSupportForNode in the first step, then CheckRules in the second step, and then recursively calling itself to generate the next-level rules. Having defined the four procedures, we can perform the calculation process as follows for each decision class: 1. with INIT_RULES, in the first phase of the cycle, the root of the tree T is initialized by calculating the 1-ant rules, as in Algorithm 3.1; 2. having initialized the root for the i-th decision class, we calculate the decision rules by calling ComputeRules, which recursively finds the rules for the decision class and accumulates them in the results. The recursive approach gives a somewhat worse computation time, but has lower memory requirements. Another advantage of this version is that at any moment of its execution we can send a branch t of the tree T and a copy of DT to another processor, and start processing this branch independently. Having completed the branch calculations, we join the particular results R into one set and remove duplicates. In this way we can decompose the problem even into massively parallel computing (note that each processor receiving a branch of the tree and a copy of DT can compute by itself, or can share the work with other agents). In the next section we explain in more detail how the recursion can be used for splitting the calculations into separate tasks, so that the whole process can be performed in parallel.
Algorithm 3.3: RS-APRIORI-RECURSIVE(DT, minSupport)

procedure AprioriRuleGen(T, t)
  insert into T (ant, cons, antCount, ruleCount)
    select f.ant[1], f.ant[2], ..., f.ant[k], c.ant[k], f.cons, 0, 0
    from t f, t c                                /* f ∈ t and c ∈ t */
    where f.cons = c.cons and f.ant[k-1] = c.ant[k-1]
      and (f.ant[k]).attribute < (c.ant[k]).attribute

procedure CheckRules(t, R, minSupport)
  for all candidates c ∈ t
    do if c.ruleCount ≤ minSupport
         then delete c from t
         else if c.ruleCount = c.antCount
                then move c from t to R

procedure CountSupportForNode(t)
  for all candidates c ∈ t
    do c.antCount = 0; c.ruleCount = 0
  for all objects u ∈ U
    do for all candidates c ∈ t
         do if c.ant ⊂ I_A(u)
              then c.antCount = c.antCount + 1
                   if c.cons ∈ I_A(u)
                     then c.ruleCount = c.ruleCount + 1

procedure ComputeRules(t, R, DT, minSupport)
  if t.level = |A|
    then stop the algorithm and return (R)
    else CountSupportForNode(t)
         CheckRules(t, R, minSupport)
         for all ti that are children of t
           do ComputeRules(ti, R, DT, minSupport)

main
  D is the set of decision classes
  result = ∅
  for i ← 1 to |D|
    do T is a tree
       T.root = INIT_RULES(A(Di))
       result = result ∪ ComputeRules(T.root, result, DT, minSupport)
  return (result)
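A rough, simplified Python rendering of the recursion in Algorithm 3.3 is sketched below. It keeps the candidates as a plain list instead of the full tree bookkeeping, represents a rule as (antecedent, consequent) with antecedents as frozensets of (attribute, value) pairs, and omits the level-based grouping of candidates; it is meant only to show the control flow (count support, select, recurse on extensions), not to reproduce the TRS/NetTRS or paper code exactly.

    def matches(ant, obj):
        return all(obj.get(a) == v for a, v in ant)

    def compute_rules(candidates, table, attributes, min_support, result):
        """candidates: list of (ant, cons); table: list of dicts."""
        survivors = []
        for ant, cons in candidates:
            covered = [o for o in table if matches(ant, o)]
            correct = [o for o in covered if o.get(cons[0]) == cons[1]]
            if len(correct) <= min_support:
                continue                       # infrequent: drop
            if len(correct) == len(covered):
                result.append((ant, cons))     # certain: move to the result
            else:
                survivors.append((ant, cons))  # frequent but not certain: extend
        if not survivors:
            return result
        extensions = []
        for ant, cons in survivors:
            for a in attributes:
                if all(a != x for x, _ in ant):
                    for v in {o[a] for o in table}:
                        ext = ant | {(a, v)}
                        if (ext, cons) not in extensions:
                            extensions.append((ext, cons))
        return compute_rules(extensions, table, attributes, min_support, result)

Calling compute_rules with the 1-ant candidates generated from (the lower approximation of) a decision class corresponds to one pass of the main loop of Algorithm 3.3.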
4 Parallel Computations
The approach presented in this Section is an upgraded version of an earlier approach, presented in [7], where we have proven that by generating separately
sets of rules from disjoint parts of DT and then joining them, we obtain the same result as when computing the rules by the sequential apriori. The rationalization of the computations in the current solution consists in the fact that now every processor has its own copy of the whole decision table (even if the first k-ant rules are computed from a disjoint subset). The version described in [7] suffered from a large number of messages exchanged between the processors; it is therefore justified to provide the whole table to every processor, which drastically reduces the number of network messages. In addition, in the version described in [7] each processor used the sequential version of apriori, whereas in the presented approach we use the recursive algorithm, which gives rise to a better use of the processors. This is described in more detail below in this section. As already mentioned, even the original version of the apriori algorithm can be performed partly in parallel. The main constraint in this case is the limited number of decision classes of DT: in the majority of practical cases the number of decision classes in decision tables is rather small, so the level of distribution of the calculations is not very high. The presented recursive version (Algorithm 3.3) gives us the possibility to split the process into many parts. However, the problem is that in the first phase of the algorithm (initialization) there is nothing to split. Therefore we propose to split DT into disjoint subsets of the objects and to start parallel recursive calculations for each subset. In the sequel the starting sets will be called initial sets. Based on an assigned initial set, a processor first generates the 1-ant rules; then the computations continue on this processor recursively, based on the k-ant rules and the whole DT (which, as mentioned above, is also available to the processor). An important advantage of the algorithm is its scalability. A problem, though, is the maximum number of processors that can start computations in parallel, which is limited by the number of abstraction classes in DT. Indeed, if we split one abstraction class and process it on a few processors, we obtain identical sets of rules already in the first phase of the algorithm, so this would only give us redundant computations. Therefore, in the first phase we do not split the abstraction classes. On the other hand, after completing the first phase of the algorithm it is possible to use many more processors: after computing the first level, every processor can share its job with co-working processors, assigning them branches of the locally computed tree. Below we consider two versions of the algorithm: 1. one consisting in generating the initial sets by randomly choosing abstraction classes belonging to an approximation of a selected decision class; 2. one which additionally minimizes the redundancy of 1-ant rules among the various processors. Let us present the first version of the algorithm (Algorithm 4.1). We have defined here two procedures: 1. DistributeXintoY, which randomly splits n objects into l groups, where X1, ..., Xn are the objects and Y1, ..., Yl are the groups; 2. Associate, which splits X into l initial sets Y1, ..., Yl, where X = ∪_i Xi, X is the lower or upper approximation of a decision class, and each Xi is an abstraction class.
Algorithm 4.1: SplitDT(DT, N)

procedure distributeXintoY(X, Y, l)
  X is a set of k sets, where ∀i Xi ≠ ∅
  Y is a set of l sets, where initially ∀i Yi = ∅
  for i ← 1 to k
    do Y_((i mod l)+1) = Y_((i mod l)+1) ∪ {Xi}

procedure associate(X, Y)
  X is a set {Cg, ..., Ch}, where each Ci is an abstraction class, C is the set of all
    abstraction classes, X ⊂ C, and ∪_{i=g}^{h} Ci is the lower or upper approximation
    of a decision class
  Y is a set {Sk, ..., Sl}, where each Si is a particular initial set and Y ⊂ S
  distributeXintoY(X, Y, |Y|)

main
  N — the number of processors
  S = {S1, ..., SN} — the set of initial sets
  split DT into decision classes DC(DT) = {D1, ..., Dn}
  if n ≥ N
    then distributeXintoY({A(D1), ..., A(Dn)}, S, N)
    else Y = {Y1, ..., Yn}
         for i ← 1 to n
           do Yi = ∅
         distributeXintoY(S, Y, n)
         for i ← 1 to n
           do associate(A(Di), Yi)

As we start in the procedure with lower approximations, it calculates certain rules. If we want to find possible rules, we should use Ā(Di) instead of A(Di). With the two procedures, Algorithm 4.1 prepares tasks for N processors in the following way: 1. first, the decision classes are calculated; 2. then the number of decision classes n is compared with the number of processors N; 3. if n ≥ N, the algorithm splits the n decision classes into N groups, so that some processors may have more than one decision class to work with; 4. if n < N, we first split the N processors into n groups, so that each group of processors is dedicated to a particular decision class. If a decision class is assigned to more than one processor, this class is split into k initial sets, where k is the number of processors in the group. Then each initial set in the group will be computed by a separate processor.
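The round-robin distribution used by distributeXintoY, and the way one decision class is split into several initial sets by whole abstraction classes, can be sketched in Python as follows. This is our own simplification with our own names; the particular grouping it produces may differ from the scenarios discussed below, since the original procedure assigns classes randomly.

    def distribute_x_into_y(items, n_groups):
        # Round-robin assignment: item i goes to group i mod n_groups.
        groups = [[] for _ in range(n_groups)]
        for i, item in enumerate(items):
            groups[i % n_groups].append(item)
        return groups

    def initial_sets_for_class(abstraction_classes, n_processors):
        # Split one (approximation of a) decision class into n_processors initial
        # sets, never breaking an abstraction class apart.
        parts = distribute_x_into_y(abstraction_classes, n_processors)
        return [[obj for cls in part for obj in cls] for part in parts]

    # A lower approximation given as abstraction classes {3}, {4}, {5}, {6}, three processors:
    print(initial_sets_for_class([[3], [4], [5], [6]], 3))
    # -> [[3, 6], [4], [5]]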
Having split the problem into N processors, Algorithm 3.3 can be started on each processor, so that the process of calculating the decision rules is performed in parallel on N processors. Let us illustrate the algorithm for the decision table presented in Table 1.

Table 1. Decision Table

      a  b  c  d
 u1   0  1  1  1
 u2   0  1  0  1
 u3   1  0  0  0
 u4   1  0  1  0
 u5   0  0  0  0
 u6   0  0  0  0
 u7   1  1  0  1
 u8   1  1  1  1
 u9   1  1  1  0
We consider 4 scenarios of splitting the table into initial sets. The table contains two decision classes: D1 = {1, 2, 7, 8} and D2 = {3, 4, 5, 6, 9}, and 8 abstraction classes: C1 = {1}, C2 = {2}, C3 = {3}, C4 = {4}, C5 = {5}, C6 = {6}, C7 = {7}, C8 = {8, 9}; hence the maximal number of processors for the first phase is 8. Scenario 1. We plan to compute certain rules with 2 processors. The first step is to find the decision classes. As we have two decision classes, we assign one class to each processor. We create two initial sets S1 and S2, where S1 = A(D1) = {1, 2, 7} and S2 = A(D2) = {3, 4, 5, 6}. Scenario 2. We plan to compute certain rules with 6 processors. Having found 2 decision classes for 6 processors, we have to perform Step 4 (n < N) and generate the initial sets. Given A(D1) = {1, 2, 7} and A(D2) = {3, 4, 5, 6}, we can have the following initial sets: S1 = {1}, S2 = {2}, S3 = {7}, S4 = {3}, S5 = {4}, S6 = {5, 6}. The objects 5 and 6 belong to the same lower approximation of a decision class, and thus can belong to the same initial set. Scenario 3. Again we plan to use 6 processors, but this time for computing possible rules. For the decision classes D1 and D2 we compute the upper approximations Ā(D1) and Ā(D2), so again we have more processors than classes. Given Ā(D1) = {1, 2, 7, 8, 9} and Ā(D2) = {3, 4, 5, 6, 8, 9}, we may have the following 6 initial sets: S1 = {1, 8, 9}, S2 = {2}, S3 = {7}, S4 = {3, 6}, S5 = {4, 8, 9}, S6 = {5}, to be assigned to the 6 processors for further calculations. Scenario 4. Let us plan to compute certain rules with 3 processors. Again, having 2 decision classes, we have to use A(D1) and A(D2) (as in Scenarios 1 and 2) to generate 3 initial sets: S1 = {1, 2, 7}, S2 = {3, 5}, S3 = {4, 6}.
It may happen that for the generated initial sets we obtain redundancy in the partial results. Clearly, if two initial sets X1 and X2 contain objects belonging to the same decision class and they generate the same sets of 1-ant rules, then the two processors will give exactly the same partial results. The proof is obvious: if two identical 1-ant rules result from two different initial sets (and the two processors have identical copies of DT), the results will be the same. Scenario 4 is such an example: for the generated initial sets S2 and S3, assigned to two different processors, we obtain identical sets of 1-ant rules from the two processors. This leads to 100% redundancy in the two partial results. For this reason, the process of choosing the initial sets is of crucial importance. In the next subsection we discuss in more detail how to overcome the problem.

4.1 The Problem of Redundancy of Partial Results
An optimal solution for splitting the calculation process among N processors would be to have the N sets of 1-ant rules disjoint; unfortunately, this restriction is in practice impossible to achieve. Let us note that if n is the number of abstraction classes, X1, ..., XN are the initial sets and Y(Xi) is the set of 1-ant rules generated from the i-th initial set Xi, then for a given N and a growing number of abstraction classes n, the probability that the sets of 1-ant rules overlap grows, which results in growing redundancy of the partial results. Let us refer back to the example of Scenario 4 above. The objects of the initial sets S2 and S3 belong to the same decision class. The 1-ant rules generated from S2 are {(a = 1 → d = 0), (a = 0 → d = 0), (b = 0 → d = 0), (c = 0 → d = 0), (c = 1 → d = 0)}. The same set of 1-ant rules will be generated from the initial set S3. Consequently, processors 2 and 3 will give us the same results, so we have 100% redundancy in the calculations and no gain from using an extra processor. To avoid such situations, in the sequel we propose a modification of the algorithm generating the initial sets. Actually, only a minor modification is needed in the function Associate; we show it in Algorithm 4.2. First, let us note that if we use a particular attribute to split DT into initial sets by means of discernibility, then in the worst case we have a guarantee that every two initial sets differ by at least one pair (attribute, value). Hence, the modification guarantees that every two initial sets Si, Sj which contain objects from the same decision class have sets of 1-ant rules that differ by at least one rule; therefore, for any two processors we already have redundancy less than 100% with respect to the computation process (still, it may happen that some rules appear in the partial results of both processors).
Algorithm 4.2: associate(X, Y)

procedure associate(X, Y)
  Y is {S1, ..., Sk}, where each Si is a particular initial set and Y ⊂ S
  a = findAttribute(X, k)
  if |AC_a(X)| ≥ k
    then distributeXintoY(AC_a(X), Y, |Y|)
    else l = k − |AC_a(X)| + 1
         split Y into two subsets Y1 and Y2 such that:
           |Y1| = |AC_a(X)| − 1 and |Y2| = l
         split AC_a(X) = {C1, ..., Ch} into two subsets XX1 and XX2:
           XX1 = {C1, ..., Ch−1}, XX2 = Ch
         distributeXintoY(XX1, Y1, |Y1|)
         associate(XX2, Y2)

procedure findAttribute(X, k)
  X ⊂ U, k is an integer
  find the first attribute a ∈ A for which |AC_a(X)| ≥ k, or otherwise the attribute
    for which |AC_a(X)| is maximal

Having defined the splitting process (Algorithms 4.1 and 4.2), we can summarize the whole process of the parallel computation of rules as follows: 1. compute the decision classes; 2. depending on the number of decision classes and the number of processors, either compute the initial sets according to SplitDT (N > n), or assign the whole classes to the processors as the initial sets (N ≤ n); 3. every processor is given an initial set and a copy of DT; 4. each processor calculates the 1-ant rules, based on the assigned initial set; 5. the n-ant rules are generated locally at each processor, according to the recursive apriori; every processor has its own tree structure for keeping candidate rules; 6. after the local calculations at the processors have been completed, the results go to the central processor, where we calculate the final result set by merging all the partial results and removing duplicate rules, and then extensions of the rules.
4.2 Optimal Usage of Processors
For the approach presented above, the processor usage can be depicted as in Figure 2. It is hard to predict how much time it will take to compute all the rules from the constructed initial sets; usually this time differs between the processors. Hence, we propose the following solution: having evaluated the initial sets, the central node initializes computations on all the processors.
Fig. 2. Processor usage (for each processor and the central node the diagram distinguishes the phases: init sets, rules generation, summing rules, inactive time)
At any time, every node can ask the central node whether there are any processors that have finished their tasks and are free. If so, the processor refers to a free one and sends it a request for remote calculations, starting with the current node of its tree. This action can be repeated as long as a free processor can be assigned (or until all the local nodes have been assigned). Having initialized the remote computations, the processor goes on to the next node to be computed. The processor should remember all its co-workers, as they have to give back their results. Having completed its own computations, the processor summarizes its results together with the results from the co-workers and sends them back to the requestor of the given task. In the end, all results return to the starting node (which is the central one). This approach guarantees a massive use of all the involved processors and a better utilization of the processing power; additionally, we can utilize more processors than the number of initial sets. An example diagram of the cooperation of the processors is presented in Fig. 3. A problem is how to distribute information on free processors in an efficient way, so that the communication cost in the network does not influence the total cost of the calculations. In our experiments we used the following strategy (a sketch of the request policy is given after the list): 1. the central processor has knowledge of all free processors (all requests for a co-worker go through it); 2. every processor periodically receives a broadcast from the central node with information on how many processors are free (m) and how many are busy (k); 3. based on m and k, the processor calculates the probability of success in receiving a free processor according to the formula p = min(m/k, 1), and then with probability p it issues a request to the central node for a free processor;
4. when a processor completes a task, it sends a message to the central processor, so that the next broadcast of the central processor will provide updated values of m and k.
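A literal reading of steps 2–3 above, with the probability p = min(m/k, 1), could look like the following fragment; the function name and the random draw are our own framing of the description, not code from the implementation used in the experiments.

    import random

    def should_ask_for_coworker(m_free, k_busy):
        # p = min(m / k, 1): the more free processors per busy one, the more
        # likely a busy processor is to request one from the central node.
        if k_busy == 0:
            return True
        p = min(m_free / k_busy, 1.0)
        return random.random() < p

    # e.g. 2 free processors, 8 busy ones -> a request is issued with probability 0.25
    print(should_ask_for_coworker(2, 8))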
Fig. 3. Processor usage (as in Fig. 2, with the additional phase: rule generation as a co-worker)
4.3 Results of Experiments
In the literature there are several measures for the evaluation of distributed algorithms. In our experiments we use two indicators: speedup and efficiency. Following [8], we define the speedup as S_p = T_1/T_p and the efficiency as E_p = S_p/p, where T_1 is the execution time of the algorithm on one processor and T_p is the time needed by p processors. For our experiments we used three base data sets: (a) 4000 records and 20 conditional attributes (generated data); (b) 5500 records and 22 conditional attributes (the mushrooms set without the records having missing values on the stalk-root attribute); (c) 8000 records and 21 conditional attributes (the mushrooms set without the stalk-root attribute). For each of the databases we prepared a number of data sets: 4 sets for the base set (a), 7 sets for (b), and 12 sets for (c). Every data set was prepared by a random selection of objects from the base sets. For each series of data sets we performed one experiment for the sequential algorithm, one for two processors and one for three processors. Below we present the results of the experiments. Table 2 contains the execution times of the sequential version of the algorithm for each of the 3 test series; in this table, column 3 shows the number of rules and column 4 the total execution time. For example, for the first data set with 2500 objects, T_1 = 1570 ms and T_2 = 805 ms (Tables 2 and 3), which gives S_2 ≈ 1.95 and E_2 ≈ 0.97.
Table 2. Sequential computing

elements number   condition attributes   rules   time [ms]
the first set of data
2500              20                     40      1570
3000              20                     23      1832
3500              20                     17      2086
4000              20                     14      2349
the second set of data
2500              22                     26      1327
3000              22                     18      1375
3500              22                     23      1470
4000              22                     44      2112
4500              22                     82      2904
5000              22                     91      2509
5500              22                     116     3566
the third set of data
2500              21                     23      1188
3000              21                     16      1247
3500              21                     21      1355
4000              21                     27      1572
4500              21                     67      2210
5000              21                     76      1832
5500              21                     99      2307
6000              21                     89      2956
6500              21                     112     3001
7000              21                     131     2965
7500              21                     126     3010
8000              21                     177     3542
Table 3. Parallel computing

                       two processors                  three processors
elements number    time    speedup  efficiency     time    speedup  efficiency
the first set of data
2500               805     1.95     0.97           778     2.02     0.67
3000               929     1.97     0.98           908     2.02     0.767
3500               1062    1.96     0.98           960     2.17     0.72
4000               1198    1.96     0.98           1140    2.06     0.69
the second set of data
2500               701     1.89     0.94           631     2.10     0.70
3000               721     1.90     0.95           663     2.07     0.69
3500               778     1.88     0.94           739     1.98     0.66
4000               1184    1.78     0.89           1081    1.95     0.65
4500               1765    1.66     0.83           1460    2.01     0.67
5000               1526    1.64     0.82           1287    1.94     0.65
5500               2139    1.66     0.83           1766    2.01     0.67
the third set of data
2500               620     1.91     0.97           601     1.97     0.66
3000               652     1.91     0.96           628     1.98     0.66
3500               716     1.89     0.96           713     1.90     0.63
4000               820     1.91     0.85           854     1.84     0.61
4500               1296    1.70     0.85           1474    1.49     0.49
5000               1090    1.68     0.84           1256    1.45     0.48
5500               1377    1.67     0.84           1599    1.44     0.48
6000               1795    1.64     0.82           2128    1.38     0.46
6500               1816    1.65     0.83           1182    2.53     0.84
7000               1799    1.64     0.82           1509    1.96     0.65
7500               1816    1.65     0.82           1458    2.06     0.68
8000               2000    1.77     0.88           1633    2.16     0.72
Table 3 contains the results of the parallel computations. In the case of two processors the efficiency is close to 100%, because all the sets have 2 decision classes and the initial sets have completely different 1-ant rules (there was no redundancy). In the case of three processors the speedup is better than in the previous case but lower than 3; hence there is some redundancy, although less than 100%.
5 Conclusions and Future Work
We have presented in the paper a recursive version of apriori and have shown its suitability for distributed computation of the decision rules. The performed experiments have shown that the algorithm essentially reduces the processing time, depending on the number of processors. The experiments have also shown that the relative speedup decreases with a growing number of processors: obviously, with more processors there is a growing overhead due to the redundancy of partial rules, as well as to the communication between the processors. Nevertheless, the proposed algorithm has the property of scalability and can be used for splitting large tasks among a number of processors, giving rise to reasonable computation times for larger problems.
References
1. Kryszkiewicz, M.: Strong rules in large databases. In: Proceedings of 6th European Congress on Intelligent Techniques and Soft Computing EUFIT 1998, vol. 1, pp. 85–89 (1998)
2. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26-28, pp. 207–216. ACM Press, New York (1993)
3. Dean, J., Grzymala-Busse, J.: An overview of the learning from examples module LEM1. Technical Report TR-88-2, Department of Computer Science, University of Kansas (1988)
4. Lingyun, T., Liping, A.: Incremental learning of decision rules based on rough set theory. In: Proceedings of the 4th World Congress on Intelligent Control and Automation, vol. 1, pp. 420–425 (2001)
5. Geng, Z., Zhu, Q.: A multi-agent method for parallel mining based on rough sets. In: Proceedings of The Sixth World Congress on Intelligent Control and Automation, vol. 2, pp. 826–850 (2006)
6. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast discovery of association rules. In: Fayyad, U., Shapiro, G.P., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI, Menlo Park (1996)
7. Strąkowski, T., Rybiński, H.: A distributed version of apriori rule generation based on rough set theory. In: Lindemann, G., Burkhard, H.-D., Skowron, L.C., Schlingloff, A., Suraj, H.Z. (eds.) Proceedings of CS&P 2004 Workshop, vol. 2, pp. 390–397 (2004)
8. Karbowski, A. (ed.): E.N.S.: Obliczenia równoległe i rozproszone. Oficyna Wydawnicza Politechniki Warszawskiej (2001) (in Polish)
Decision Table Reduction in KDD: Fuzzy Rough Based Approach Eric Tsang and Zhao Suyun Department of Computing, The Hong Kong Polytechnic University Hong Kong
[email protected],
[email protected]
Abstract. Decision table reduction in KDD refers to the problem of selecting those input feature values that are most predictive of a given outcome by reducing a decision-table-like database in both the vertical and the horizontal direction. Fuzzy rough sets have been proven to be a useful tool for attribute reduction (i.e. reducing a decision table in the vertical direction). However, relatively little research on decision table reduction using fuzzy rough sets has been performed. In this paper we focus on decision table reduction with fuzzy rough sets. First, we propose attribute-value reduction with fuzzy rough sets; the structure of the proposed value reduction is then investigated by the approach of the discernibility vector. Second, a rule covering system is described to reduce the value-reduced decision table in the horizontal direction. Finally, a numerical example illustrates the proposed method of decision table reduction. The main contribution of this paper is that the decision table reduction method is well combined with the knowledge representation of fuzzy rough sets through the fuzzy rough approximation value. Strict mathematical reasoning shows that the fuzzy rough approximation value is a reasonable criterion for keeping the information invariant in the process of decision table reduction. Keywords: Fuzzy Rough Sets, Decision Table Reduction, Discernibility Vector.
1 Introduction
The decision-table-like database is one important type of knowledge representation system in knowledge discovery in databases (KDD); it is represented by a two-dimensional table with rows labeled by objects and columns labeled by attributes (composed of condition and decision attributes). Most decision problems can be formulated employing a decision table formalism; therefore, this tool is particularly useful in decision making. At present, large-scale problems are becoming more common, which raises efficiency problems in many areas such as machine learning, data mining and pattern recognition. It is not surprising that reducing the decision table has attracted the attention of many researchers. However, the existing work tends to reduce the decision table in the vertical direction, i.e. dimensionality reduction (transformation-based approaches [7], fuzzy rough
approach). A technique that can reduce a decision table in both the horizontal and the vertical direction using only the information contained within the dataset, and which needs no additional information (such as expert knowledge or assumptions about the data distribution), is clearly desirable. Rough set theory can be used as such a tool. However, one obvious limitation of traditional rough sets is that they work effectively only on symbolic problems. As a result, one extension of rough sets, fuzzy rough sets, has been proposed and studied to handle problems with real-valued attributes [10]. Owing to their wide applicability, fuzzy rough sets have attracted more and more attention from both the theoretical and the practical side [2,8-10,13-16,18,22,26,28-29,32]. Recently, some effort has been put into attribute reduction with fuzzy rough sets [12,14,24]. Shen et al. first proposed a method of attribute reduction based on fuzzy rough sets [14]; its key idea is to keep the dependency degree invariant. Unlike [14], the authors in [12] proposed an attribute reduction method based on fuzzy rough sets that adopts information entropy to measure the significance of attributes. These methods perform well on some practical problems, but one obvious limitation is that the algorithms lack a mathematical foundation and theoretical analysis; many interesting topics related to attribute reduction, e.g. the core and the structure of attribute reduction, are not discussed. In view of this limitation, a unified framework of attribute reduction was proposed in [24], which not only introduced a formal notion of attribute reduction based on fuzzy approximation operators but also analyzed the mathematical structure of attribute reduction by employing a discernibility matrix. However, none of these approaches addresses another application of fuzzy rough sets, decision table reduction (i.e. reducing the decision table in both the vertical and the horizontal dimension). Until now there has been a gap between decision table reduction and fuzzy rough sets. Decision table reduction, also called rule induction, from real-valued datasets using rough set techniques has been studied less often [25,27,30]. For most datasets containing real-valued features, rule induction methods perform a discretization step beforehand and then design the rule induction algorithm using rough set techniques [25][30]. Unlike [25][30], the method in [27] uses rough set techniques to learn fuzzy rules from a database without discretization. This method performs well on some datasets, but its theoretical foundation is rather weak: only the lower and upper approximation operators are proposed, whereas their theoretical structure, such as topological and algebraic properties, is not studied [27]. Furthermore, the lower and upper approximations are not even used in the process of knowledge discovery, that is, knowledge representation and knowledge discovery remain unrelated. All this shows that fuzzy rough sets, as a well-defined generalization of rough sets, have been little studied for the design of decision table reduction methods. It is therefore necessary to propose a method of decision table reduction (one that induces a set of rules) in which the knowledge representation part and the knowledge reduction part are well combined.
In this paper we propose a method of decision table reduction using fuzzy rough sets. First, we give definitions of attribute-value reduction. The key idea of attribute-value reduction is to keep the critical value, i.e. the fuzzy lower approximation value, invariant before and after reduction. Second, the structure of attribute-value reduction is studied completely by means of a discernibility vector approach; with this approach all attribute-value reductions of each of the original objects can be computed. Thus a solid mathematical foundation is set up for decision table reduction with fuzzy rough sets. After the description of the rule covering system, a reduced fuzzy decision table is obtained which keeps the information contained in the original decision table invariant. This reduced decision table corresponds to a set of rules which covers all the objects in the original dataset. Finally, a numerical example is presented to illustrate the proposed method. The rest of this paper is structured as follows. In Section 2 we review basic concepts of fuzzy rough sets. In Section 3 we propose the concept of attribute-value reduction with fuzzy rough sets and study its structure using the discernibility vector; we also describe the rule covering system, which helps to induce a set of rules from a fuzzy decision table. In Section 4 an illustrative example is given to show the feasibility of the proposed method. The last section concludes the paper.
2 Review of Fuzzy Rough Sets
In this section we only review the basic concepts of fuzzy rough sets found in [16]; a detailed review of existing fuzzy rough set models can be found in [32][6]. Given a triangular norm $T$, the binary operation on $I$ defined by $\vartheta_T(\alpha,\gamma)=\sup\{\theta\in I: T(\alpha,\theta)\le\gamma\}$, $\alpha,\gamma\in I$, is called the $R$-implicator based on $T$. If $T$ is lower semi-continuous, then $\vartheta_T$ is called the residual implication of $T$, or the $T$-residuated implication. The properties of the $T$-residuated implication are listed in [16]. Suppose $U$ is a nonempty universe. A $T$-fuzzy similarity relation $R$ is a fuzzy relation on $U$ which is reflexive ($R(x,x)=1$), symmetric ($R(x,y)=R(y,x)$) and $T$-transitive ($R(x,y)\ge T(R(x,z),R(y,z))$ for every $x,y,z\in U$). If $\vartheta$ is the $T$-residuated implication of a lower semi-continuous $T$-norm $T$, then the lower and upper approximation operators are defined, for every $A\in F(U)$, by
$R_\vartheta A(x)=\inf_{u\in U}\vartheta(R(u,x),A(u))$; $\quad R_T A(x)=\sup_{u\in U}T(R(u,x),A(u))$.
In [16][32] these two operators were studied in detail from the constructive and the axiomatic approach; we only list their properties as follows.
Theorem 2.1 [4][16][32]. Suppose $R$ is a fuzzy $T$-similarity relation. The following statements hold:
1) $R_\vartheta(R_\vartheta A)=R_\vartheta A$, $R_T(R_T A)=R_T A$;
2) both $R_\vartheta$ and $R_T$ are monotone;
3) $R_T(R_\vartheta A)=R_\vartheta A$, $R_\vartheta(R_T A)=R_T A$;
4) $R_T(A)=A$, $R_\vartheta(A)=A$;
5) $R_\vartheta(A)=\bigcup\{R_T x_\lambda : R_T x_\lambda\subseteq A\}$, $R_T(A)=\bigcup\{R_T x_\lambda : x_\lambda\subseteq A\}$, where $x_\lambda$ is the fuzzy set defined as
$x_\lambda(y)=\begin{cases}\lambda, & y=x\\ 0, & y\neq x.\end{cases}$
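To make the two operators concrete, the following is a minimal sketch (not taken from the paper; the matrix R, the fuzzy set A and the choice of the Łukasiewicz t-norm are illustrative assumptions) of how $R_\vartheta$ and $R_T$ can be evaluated over a finite universe $U=\{0,\dots,n-1\}$, with $R$ given as an $n\times n$ similarity matrix and $A$ as a list of membership degrees.

```python
# Illustrative sketch: T-lower and T-upper approximations of Section 2,
# instantiated with the Lukasiewicz t-norm T(a,b) = max(0, a+b-1) and its
# residual implicator theta(a,b) = min(1, 1-a+b).

def t_luk(a, b):
    return max(0.0, a + b - 1.0)

def theta_luk(a, b):
    return min(1.0, 1.0 - a + b)

def lower_approx(R, A):
    # R_theta A(x) = inf_{u in U} theta(R(u, x), A(u))
    n = len(A)
    return [min(theta_luk(R[u][x], A[u]) for u in range(n)) for x in range(n)]

def upper_approx(R, A):
    # R_T A(x) = sup_{u in U} T(R(u, x), A(u))
    n = len(A)
    return [max(t_luk(R[u][x], A[u]) for u in range(n)) for x in range(n)]
```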
3 Decision Table Reduction with Fuzzy Rough Sets
In this section we design a method of decision table reduction with fuzzy rough sets. We first formulate a fuzzy decision table. Then we propose the concept of attribute-value reduction (each attribute-value reduction corresponds to one reduction rule) and design a method, the discernibility vector, to compute the attribute-value reductions. Strict mathematical reasoning shows that with the discernibility vector approach we can find all attribute-value reductions of every original object in the fuzzy decision table. Finally, we propose a rule covering system in which each decision rule can cover several objects. As a result, a reduced decision table is obtained without losing the information contained in the original decision table, and this reduced decision table is equivalent to a set of rules.
A fuzzy decision table, denoted by $FDT=(U,C,D)$, consists of three parts: a finite universe $U$, a family of fuzzy condition attributes $C$ and a family of symbolic decision attributes $D$. For every fuzzy attribute (i.e. an attribute with real values), a fuzzy $T$-similarity relation can be employed to measure the degree of similarity between every pair of objects [10]. In the remainder of this paper we use $R_C$ to denote the fuzzy similarity relation defined by the condition attribute set $C$. A symbolic decision attribute corresponds to an equivalence relation, which generates a partition on $U$, denoted by $U/P=\{[x]_P \mid x\in U\}$, where $[x]_P$ is the equivalence class containing $x\in U$.
Given a fuzzy decision table $FDT$, with every $x\in U$ we associate $fd_x$ to represent the decision rule corresponding to $x\in U$. The restriction of $fd_x$ to $C$, denoted by $fd_x|C=\{a(x)/a \mid a\in C\}$, and the restriction of $fd_x$ to $D$, denoted by $fd_x|D=\{d(x)/d \mid d\in D\}$, are called the condition and the decision of $fd_x$, respectively. This decision rule is denoted by $fd_x|C\to fd_x|D$.
3.1 Attribute-Value Reduction
The concept of attribute-value reduction is the preliminary step of rule induction with fuzzy rough sets. The key idea of attribute-value reduction in this paper is to keep the information invariant before and after the reduction. It is therefore necessary to find the critical value that keeps the information of each object invariant. Let us consider the following theorem.
Theorem 3.1. Given an object $x$ and its corresponding decision rule $fd_x|C\to fd_x|D$ in a fuzzy decision table $FDT=(U,C,D)$, for an object $y\in U$, if $\vartheta(R_C(x,y),0)<\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$, then $[x]_D(y)=1$.
Proof. We prove it by contradiction. Assume $[x]_D(y)=0$; then
$\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))\le\vartheta(R_C(x,y),[x]_D(y))=\vartheta(R_C(x,y),0)$. This contradicts the given condition, and hence $[x]_D(y)=1$.
Theorem 3.1 shows that if $\vartheta(R_C(x,y),0)>\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$, then $[x]_D(y)=0$ may happen. That is to say, $\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$ is the maximum value that guarantees two objects are consistent (i.e. have the same decision class). When $\vartheta(R_C(x,y),0)<\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$, the objects $x$ and $y$ are always consistent; otherwise they may be inconsistent. Motivated by this theorem, we define the consistence degree of an object in $FDT$.
Definition 3.1 (consistence degree). Given an arbitrary object $x$ in $FDT=(U,C,D)$, let $Con_C(D)(x)=\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$; then $Con_C(D)(x)$ is called the consistence degree of $x$ in $FDT$. Since $fd_x|C\to fd_x|D$ is the decision rule corresponding to $x$, $Con_C(D)(x)$ is also called the consistence degree of $fd_x|C\to fd_x|D$.
It is important to remove superfluous attribute values in $FDT$. Similar to the key idea of attribute reduction, the key idea of attribute-value reduction is to keep the information invariant, that is, to keep the consistence degree of each fuzzy decision rule invariant.
Definition 3.2 (attribute-value reduction, i.e. reduction rule). Given an arbitrary object $x$ in $FDT=(U,C,D)$, if the subset $B(x)\subseteq C(x)$ satisfies the following two conditions
(D1) $Con_C(D)(x)=Con_B(D)(x)$;
(D2) $\forall b\in B$, $Con_C(D)(x)>Con_{B-\{b\}}(D)(x)$,
then the attribute-value subset $B(x)$ is an attribute-value reduction of $x$. We also say that the fuzzy rule $fd_x|B\to fd_x|D$ is a reduction rule of $fd_x|C\to fd_x|D$. The attribute value $a(x)\in B(x)\subseteq C(x)$ is dispensable in the attribute-value set $B(x)$ if $Con_C(D)(x)=Con_{B-\{a\}}(D)(x)$; otherwise it is indispensable.
Definition 3.3 (attribute-value core). Given an object $x$ in $FDT=(U,C,D)$, the collection of the indispensable attribute values is the value core of $x$, denoted by $Core(x)$.
Theorem 3.2. For $A\subseteq U$ in $FDT=(U,C,D)$, let $\lambda=R_\vartheta A(x)$; then $R_T x_\lambda\subseteq A$ for $x\in U$. Here $R$ is the $T$-similarity relation corresponding to the attribute set $C$.
Proof. By the granular representation $R_\vartheta A=\bigcup\{R_T x_\lambda : (R_T x_\lambda)_\alpha\subseteq A\}$, we get $\beta=(\bigcup\{R_T x_\lambda : (R_T x_\lambda)_\alpha\subseteq A\})(z)$. For any $x\in U$ there exist $t\in(0,1]$ and $y\in U$ satisfying $\beta=R_T y_t(z)$ and $(R_T y_t)_\alpha\subseteq A$. Then $T(R(y,z),t)=\beta$ and $T(R(y,x),t)\le A(x)$ for any $x\in U$ hold. Thus, for all $x\in U$, $R_T z_\beta(x)=T(R(z,x),\beta)=T(R(z,x),T(R(y,z),t))\le T(R(y,x),t)$ holds. If $T(R(x,z),t)>\alpha$, then $T(R(y,z),t)>\alpha$. Thus we have $T(R(x,z),\beta)\le A(x)$. Hence $R_T z_\beta\subseteq A$.
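Definition 3.1 can be made computationally concrete with a short sketch (an illustrative assumption, not the authors' implementation): here the fuzzy similarity relation $R_C$ is assumed to be available as an $n\times n$ matrix, the decision attribute as a list of symbolic labels, and theta as the residual implicator of the chosen t-norm (e.g. theta_luk from the earlier sketch).

```python
# Con_C(D)(x) = inf_{u in U} theta(R_C(x, u), [x]_D(u)),
# where [x]_D(u) is 1 if u carries the same decision as x and 0 otherwise.
def consistence_degree(RC, d, theta, x):
    n = len(d)
    return min(theta(RC[x][u], 1.0 if d[u] == d[x] else 0.0) for u in range(n))
```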
Theorem 3.3. Given an object $x$ in $FDT=(U,C,D)$, the following two statements are equivalent:
(T1) $B(x)$ contains an attribute-value reduction of $x$;
(T2) $B(x)\subseteq C(x)$ satisfies $T(R_B(x,y),\lambda)=0$ for every $y\notin[x]_D$, where $\lambda=\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$.
Proof. (T1)$\Rightarrow$(T2): Assume that $B(x)$ contains an attribute-value reduction of $x$. By Definition 3.2, $Con_C(D)(x)=Con_B(D)(x)$. Letting $\lambda=\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))$, we have $\lambda=\inf_{u\in U}\vartheta(R_B(x,u),[x]_D(u))$ by the definition of the consistence degree. By Theorem 3.2, $(R_B)_T x_\lambda\subseteq[x]_D$. Thus $T(R_B(x,y),\lambda)=0$ for every $y\notin[x]_D$.
(T2)$\Rightarrow$(T1): Clearly, $\lambda=\inf_{u\in U}\vartheta(R_C(x,u),[x]_D(u))\ge\inf_{u\in U}\vartheta(R_B(x,u),[x]_D(u))$. From the condition $T(R_B(x,y),\lambda)=0$ for every $y\notin[x]_D$ we obtain $(R_B)_T x_\lambda\subseteq[x]_D$. By Theorem 2.1 we get $(R_B)_T x_\lambda\subseteq(R_B)_\vartheta[x]_D$. Then $(R_B)_T x_\lambda(x)\le(R_B)_\vartheta[x]_D(x)\Rightarrow\lambda\le(R_B)_\vartheta[x]_D(x)\Rightarrow\lambda=(R_C)_\vartheta[x]_D(x)\le(R_B)_\vartheta[x]_D(x)$. Thus $\lambda=(R_C)_\vartheta[x]_D(x)=(R_B)_\vartheta[x]_D(x)$ holds. By the definition of the consistence degree, $Con_C(D)(x)=Con_B(D)(x)$. By the definition of attribute-value reduction, we conclude that $B(x)$ contains an attribute-value reduction of $x$.
Using Theorem 3.3, we construct the discernibility vector as follows. Suppose $U=\{x_1,x_2,\dots,x_n\}$. By $Vector(U,C,D,x_i)$ we denote an $n\times 1$ vector $(c_j)$, called the discernibility vector of $x_i\in U$, such that
(V1) $c_j=\{a: T(a(x_i,x_j),\lambda)=0\}$, where $\lambda=\inf_{u\in U}\vartheta(R_C(x_i,u),[x_i]_D(u))$, for $D(x_i,x_j)=0$;
(V2) $c_j=\emptyset$, for $D(x_i,x_j)=1$.
A discernibility function $f_x(FDT)$ for $x$ in $FDT$ is a Boolean function of $m$ Boolean variables $a_1,\dots,a_m$, corresponding to the attributes $a_1,\dots,a_m$, defined as $f_x(FDT)(a_1,\dots,a_m)=\wedge\{\vee(c_j): 1\le j\le n\}$, where $\vee(c_j)$ is the disjunction of all variables $a$ such that $a\in c_j$. Let $g_x(FDT)$ be the reduced disjunctive form of $f_x(FDT)$ obtained from $f_x(FDT)$ by applying the multiplication and absorption laws as many times as possible. Then there exist $l$ and $Reduct_k(x)\subseteq C(x)$ for $k=1,\dots,l$ such that $g_x(FDT)=(\wedge Reduct_1(x))\vee\dots\vee(\wedge Reduct_l(x))$, where every element in $Reduct_k(x)$ appears only once.
Theorem 3.4. $Red_D(C)(x)=\{Reduct_1(x),\dots,Reduct_l(x)\}$, where $Red_D(C)(x)$ is the collection of all attribute-value reductions of $x$. The proof is omitted since this theorem is similar to the one in [24].
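The construction (V1)-(V2) and the extraction of reducts from the discernibility function can be sketched as follows. This is only an illustration under assumptions of this sketch: sims[a] is an n x n similarity matrix induced by the single attribute a (playing the role of $a(x_i,x_j)$), d holds the decision labels, t_norm is the chosen t-norm, and lam is the consistence degree of object i; the prime implicants of Theorem 3.4 are found here by a brute-force minimal hitting-set search rather than by symbolic absorption, which is feasible only for small attribute sets.

```python
from itertools import combinations

def discernibility_vector(sims, d, t_norm, lam, i):
    # (V1)-(V2): c_j = {a : T(a(x_i, x_j), lam) = 0} when decisions differ,
    # and the empty set when they agree.
    n = len(d)
    vector = []
    for j in range(n):
        if d[i] == d[j]:
            vector.append(set())
        else:
            vector.append({a for a, S in sims.items()
                           if t_norm(S[i][j], lam) <= 1e-12})  # tolerance for float rounding
    return vector

def all_value_reductions(sims, d, t_norm, lam, i):
    # Attribute-value reductions of x_i = minimal attribute subsets hitting
    # every nonempty entry of the discernibility vector (cf. Theorem 3.4).
    attrs = sorted(sims)
    nonempty = [c for c in discernibility_vector(sims, d, t_norm, lam, i) if c]
    if not nonempty:
        return [set()]
    reducts = []
    for k in range(1, len(attrs) + 1):
        for B in combinations(attrs, k):
            hits_all = all(set(B) & c for c in nonempty)
            minimal = not any(r <= set(B) for r in reducts)
            if hits_all and minimal:
                reducts.append(set(B))
    return reducts
```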
3.2 Rule Covering System
Since each attribute-value reduction corresponds to a reduction rule, it is necessary to discuss the relation between decision rules and objects. In the following we propose a concept named rule covering.
Definition 3.4 (rule covering). Given a fuzzy decision rule $fd_x|C\to fd_x|D$ and an object $y$ in $FDT$, the fuzzy decision rule $fd_x|C\to fd_x|D$ is said to cover the object $y$ if $\vartheta(R_C(x,y),0)<Con_C(D)(x)$ and $[x]_D(y)=1$.
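A minimal sketch of the covering test in Definition 3.4, under the same illustrative assumptions as before (RC the similarity matrix of C, d the decision labels, theta the residual implicator, con_x the consistence degree of x computed, for example, with the consistence_degree sketch above):

```python
def rule_covers(RC, d, theta, con_x, x, y):
    # fd_x|C -> fd_x|D covers y iff theta(R_C(x,y), 0) < Con_C(D)(x)
    # and y carries the same decision value as x.
    return theta(RC[x][y], 0.0) < con_x and d[y] == d[x]
```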
One may note that when the fuzzy decision rule $fd_x|C\to fd_x|D$ covers the object $y$, it may happen that the fuzzy decision rule $fd_y|C\to fd_y|D$ does not cover the object $x$, since $\vartheta(R_C(x,y),0)<Con_C(D)(x)$ does not imply that $\vartheta(R_C(x,y),0)<Con_C(D)(y)$ holds.
Corollary 3.1. Given a fuzzy decision rule $fd_x|C\to fd_x|D$ and an object $y$ in $FDT$, if $\vartheta(R_C(x,y),0)<Con_C(D)(x)$, then the fuzzy decision rule $fd_x|C\to fd_x|D$ covers the object $y$. This is straightforward from Theorem 3.1 and Definition 3.4.
We now describe several theorems about the covering power of reduction rules.
Theorem 3.5. Given a fuzzy decision rule $fd_x|C\to fd_x|D$ and an object $y$ in $FDT$, if the fuzzy decision rule $fd_x|C\to fd_x|D$ covers the object $y$, then the reduction rule of $fd_x|C\to fd_x|D$ also covers the object $y$.
Proof. Assume $fd_x|B\to fd_x|D$ is a reduction rule of $fd_x|C\to fd_x|D$. By the definition of a reduction rule, $Con_B(D)(x)=Con_C(D)(x)$. By the definition of rule covering, $\vartheta(R_C(x,y),0)<Con_C(D)(x)$ and $fd_x|D=fd_y|D$. By the monotonicity of the residual implicator, $\vartheta(R_B(x,y),0)\le\vartheta(R_C(x,y),0)<Con_C(D)(x)$, i.e. $\vartheta(R_B(x,y),0)<Con_B(D)(x)$. By the definition of rule covering, the reduction rule of $fd_x|C\to fd_x|D$ covers the object $y$.
Theorem 3.5 shows that the covering power of the decision rules in a fuzzy decision table does not change after value reduction. That is to say, a reduction rule keeps the information contained in the original decision rule invariant.
Theorem 3.6. Suppose $fd_x|B\to fd_x|D$ is a reduction rule of $fd_x|C\to fd_x|D$ in $FDT$. For an object $y\in U$, if $\vartheta(R_B(x,y),0)<Con_B(D)(x)$, then $[x]_D(y)=1$.
Proof. This follows straightforwardly from Theorem 3.1.
Using the method of attribute-value reduction together with the rule covering system, a set of rules can be found which covers all the objects in a fuzzy decision table. This result can be seen as a decision table reduced in both the vertical and the horizontal direction, as sketched below.
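One simple way to assemble such a covering rule set is a greedy selection. The following sketch is only an illustration of the idea; the paper does not prescribe a particular selection strategy, and the choice of always picking the object whose rule covers the most still-uncovered objects is an assumption of this sketch.

```python
def greedy_rule_cover(RC, d, theta):
    n = len(d)
    # consistence degrees (Definition 3.1)
    con = [min(theta(RC[x][u], 1.0 if d[u] == d[x] else 0.0) for u in range(n))
           for x in range(n)]

    def covers(x, y):  # covering test (Definition 3.4)
        return theta(RC[x][y], 0.0) < con[x] and d[y] == d[x]

    uncovered, selected = set(range(n)), []
    while uncovered:
        # pick the object whose rule covers the most uncovered objects
        x = max(uncovered, key=lambda i: sum(covers(i, y) for y in uncovered))
        selected.append(x)
        uncovered.discard(x)  # guarantee progress even for inconsistent objects
        uncovered -= {y for y in uncovered if covers(x, y)}
    # the reduction rules of the selected objects form the reduced table
    return selected
```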
4 An Illustrative Example
As pointed out in [19], in crisp rough sets the computation of reductions via the discernibility matrix is an NP-complete problem. Similarly, the computation of reduction rules using the discernibility vector approach is also NP-complete. In this paper we do not discuss the computation of reduction rules with fuzzy rough sets; this will be our future work. In the following we use an example to illustrate the idea of this paper. Our method is instantiated with the Łukasiewicz $T$-norm, since many discussions on the selection of the triangular norm $T$ emphasize the well-known Łukasiewicz triangular norm as a suitable choice
[1][10][23]. After this specification of the triangular norm, the discernibility vector based on the Łukasiewicz $T$-norm is specified as follows. Suppose $U=\{x_1,x_2,\dots,x_n\}$. By $Vector(U,C,D,x_i)$ we denote a $1\times n$ vector $(c_j)$, called the discernibility vector of the fuzzy decision rule of $x_i$, such that
(V3) $c_j=\{a: a(x_i,x_j)+\lambda\le 1+\alpha\}$, for $D(x_i,x_j)=0$, where $\lambda=\inf_{u\in U}\vartheta(R_C(x_i,u),[x_i]_D(u))$;
(V4) $c_j=\emptyset$, for $D(x_i,x_j)=1$.
Example 4.1. Consider the simple decision table with 10 objects given in Table 1. There are 12 fuzzy condition attributes $R=\{a,b,c,d,e,f,g,h,i,j,k,l\}$ and one symbolic decision attribute $\{D\}$. There are two decision classes, 0 and 1: the objects $\{x_1,x_2,x_6,x_8\}$ belong to class 0 and $\{x_3,x_4,x_5,x_7,x_9,x_{10}\}$ belong to class 1.
Table 1. One simple fuzzy decision table
Objects  a    b    c    d    e    f    g    h    i    j    k    l    D
x1       0.9  0.1  0    0.9  0.1  0    0.8  0.2  0    0.7  0.4  0    0
x2       0.9  0.1  0.1  0.8  0.2  0.1  0.9  0.2  0    0.1  0.8  0    0
x3       0.1  0.9  0.2  0.9  0.1  0.1  0.9  0.1  0    0.9  0.1  0    1
x4       0    0.1  0.9  0.1  0.9  0    0.6  0.5  0    0.8  0.3  0    1
x5       0.1  0    0.9  0    0.1  0.9  0    1    0    0.8  0.2  0    1
x6       0.1  0.1  0.9  0    0.2  0.9  0.1  0.9  0    0.1  0.9  0    0
x7       0    1    0    0    0.1  0.9  0.1  0.9  0    0.2  0.9  0    1
x8       0.9  0.1  0    0.3  0.9  0.1  0.9  0.1  0    1    0    0    0
x9       0.8  0.2  0    0    0.4  0.6  0    1    0    1    0    0    1
x10      0    0.1  0.9  0    1    0    0    1    0    0.9  0.1  0    1
In this example, the Łukasiewicz $T$-norm $T_L(x,y)=\max(0,x+y-1)$ is selected to construct the $S$-lower approximation operator. Since the dual conorm of the Łukasiewicz $T$-norm is $S_L(x,y)=\min(1,x+y)$, the lower approximation is specified as $R_S A(x)=\inf_{u\in U}\min(1-R(x,u)+A(u),1)$. The lower approximations of each decision class are listed in Table 2. The consistence degree of each object, computed as $Con_R(D)(x)=R_S[x]_D(x)$, is listed in Table 3. The consistence degree is the critical value for removing redundant attribute values. By strict mathematical reasoning, the discernibility vector is designed to compute all the attribute-value reductions of each object in the fuzzy decision table.
Table 2. The lower approximation of each decision class
Objects                         x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
Lower approximation of class 0  0.8  0.8  0    0    0    0.7  0    0.8  0    0
Lower approximation of class 1  0    0    0.8  0.9  0.7  0    0.9  0    0.9  0.9
Table 3. The consistence degree of each object
Objects             x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
Consistence degree  0.8  0.8  0.8  0.9  0.7  0.7  0.9  0.8  0.9  0.9
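The link between Table 2 and Table 3 can be checked directly: the consistence degree of each object is the lower-approximation value of its own decision class at that object. A minimal sketch using the values above:

```python
# Rows of Table 2 (lower approximations of class 0 and class 1) and the
# decision labels of Table 1.
low = {0: [0.8, 0.8, 0, 0, 0, 0.7, 0, 0.8, 0, 0],
       1: [0, 0, 0.8, 0.9, 0.7, 0, 0.9, 0, 0.9, 0.9]}
d = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
con = [low[d[x]][x] for x in range(10)]
# con == [0.8, 0.8, 0.8, 0.9, 0.7, 0.7, 0.9, 0.8, 0.9, 0.9], matching Table 3
```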
The specified formulae for computing the discernibility vector are: $c_j=\{a: a(x_i,x_j)+\lambda\le 1+\alpha\}$ for $D(x_i,x_j)=0$, where $\lambda=Con_R(D)(x_i)$; and $c_j=\emptyset$ for $D(x_i,x_j)=1$. The discernibility vectors of all the objects are listed in Table 4. Each column in Table 4 corresponds to one discernibility vector; as a result, all discernibility vectors together compose one matrix with 10 rows and 10 columns.
Table 4. The discernibility vectors
Objects  X1         X2          X3          X4    X5          X6           X7      X8           X9      X10
x1       ∅          ∅           {ab}        {ac}  {acdfgh}    ∅            {abdf}  ∅            {d}     {acde}
x2       ∅          ∅           {abj}       {a}   {acdfghj}   ∅            {ab}    ∅            {gj}    {ag}
x3       {ab}       {abj}       ∅           ∅     ∅           {bcdfghjk}   ∅       {abe}        ∅       ∅
x4       {acde}     {ac}        ∅           ∅     ∅           {efj}        ∅       {ac}         ∅       ∅
x5       {acdfgh}   {acdfgh}    ∅           ∅     ∅           {jk}         ∅       {acefgh}     ∅       ∅
x6       ∅          ∅           {bdfghjk}   {f}   {jk}        ∅            {bc}    ∅            {cjk}   {f}
x7       {abdf}     {abdfg}     ∅           ∅     ∅           {bc}         ∅       {abefghjk}   ∅       ∅
x8       ∅          ∅           {abe}       {ac}  {acefgh}    ∅            {abk}   ∅            {gh}    {acgh}
x9       {dgh}      {dghjk}     ∅           ∅     ∅           {acjk}       ∅       {gh}         ∅       ∅
x10      {acdegh}   {acdeghj}   ∅           ∅     ∅           {efjk}       ∅       {acgh}       ∅       ∅
The attribute-value core is the collection of the most important attribute values of the corresponding object; it corresponds to the union of the singleton entries in that object's discernibility vector. For example, the attribute-value core of object $x_4$ is the collection of the values of the attributes $a$ and $f$ on the object $x_4$, i.e. $\{(a,0),(f,0.1)\}$. The attribute-value core of each object is listed in Table 5.
Table 5. The attribute-value core of every object
Objects  x1  x2  x3  x4    x5  x6  x7  x8  x9   x10
Core     ∅   ∅   ∅   {af}  ∅   ∅   ∅   ∅   {d}  {f}
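The core can be read off mechanically from the singleton entries of the vector; a minimal sketch using the discernibility vector of x4 from Table 4 as data:

```python
# Column X4 of Table 4, one entry per object x1..x10.
x4_vector = [{'a', 'c'}, {'a'}, set(), set(), set(),
             {'f'}, set(), {'a', 'c'}, set(), set()]
core_x4 = set().union(*(c for c in x4_vector if len(c) == 1))
# core_x4 == {'a', 'f'}, i.e. the attribute values (a, 0) and (f, 0.1) of x4
```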
One attribute-value reduction of each object is calculated; all the value reductions are listed in Table 6. The notation ∗ indicates that the attribute value in the corresponding position has been reduced. Table 6 shows that the fuzzy decision table has been significantly reduced after attribute-value reduction. According to the analysis in Section 3, each attribute-value reduction corresponds to one reduction rule. By using the rule covering system, we can find a set of rules which covers all the objects in the fuzzy decision table. One set of rules induced from Table 1 is listed in Table 7. This set of rules is the result of decision table reduction; it can be seen as a rule-based classifier learned from Table 1 by means of fuzzy rough sets. Each row in Table 7 corresponds to one if-then production rule. For example, the first row corresponds to the following rule:
Table 6. Attribute-value reduction of every object
Objects  a    b    c    d    e  f  g    h  i  j    k  l  D
x1       0.9  ∗    ∗    0.9  ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  0
x2       0.9  ∗    ∗    0.8  ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  0
x3       0.1  0.9  ∗    ∗    ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  1
x4       0    ∗    0.9  ∗    ∗  0  ∗    ∗  ∗  ∗    ∗  ∗  1
x5       0.1  ∗    ∗    ∗    ∗  ∗  ∗    ∗  ∗  0.8  ∗  ∗  1
x6       ∗    ∗    0.9  ∗    ∗  ∗  ∗    ∗  ∗  0.1  ∗  ∗  0
x7       0    1    ∗    ∗    ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  1
x8       0.9  ∗    ∗    ∗    ∗  ∗  0.9  ∗  ∗  ∗    ∗  ∗  0
x9       ∗    ∗    0    0    ∗  ∗  0    ∗  ∗  ∗    ∗  ∗  1
x10      0    ∗    ∗    ∗    ∗  0  ∗    ∗  ∗  ∗    ∗  ∗  1
Table 7. A reduced decision table (which can also be seen as a set of decision rules)
Objects  a    b    c    d    e  f  g  h  i  j    k  l  D  Consistence degree
x5       0.1  ∗    ∗    ∗    ∗  ∗  ∗  ∗  ∗  0.8  ∗  ∗  1  0.7
x6       ∗    ∗    0.9  ∗    ∗  ∗  ∗  ∗  ∗  0.1  ∗  ∗  0  0.7
x1       0.9  ∗    ∗    0.9  ∗  ∗  ∗  ∗  ∗  ∗    ∗  ∗  0  0.8
x3       0.1  0.9  ∗    ∗    ∗  ∗  ∗  ∗  ∗  ∗    ∗  ∗  1  0.8
If $N(P_1(object_y, object_x))<0.7$ with $x=(a,0.1),(j,0.8)$ and $P_1=\{a,j\}$, then the object $y$ belongs to decision class 1.
This example shows that the proposed method of rule induction is feasible. It is promising for handling real classification problems by building classifiers along the lines proposed in this paper.
5 Conclusion
In this paper a method of decision table reduction with fuzzy rough sets has been proposed. The proposed method is twofold: attribute-value reduction and a rule covering system. The key idea of attribute-value reduction is to keep the information invariant before and after the reduction. Based on this idea, the discernibility vector approach is proposed, by which all the attribute-value reductions of each object can be found. After designing the rule covering system, a set of rules can be induced from a fuzzy decision table. This set of rules is equivalent to a decision table reduced in both the horizontal and the vertical direction. Finally, an illustrative example shows that the proposed method is feasible. For further real applications, our future work is to extend the proposed method into a robust framework.
Acknowledgements. This research has been supported by the Hong Kong RGC CERG research grants: PolyU 5273/05E (B-Q943) and PolyU 5281/07E (B-Q06C).
References
1. Bezdek, J.C., Harris, J.O.: Fuzzy partitions and relations: an axiomatic basis of clustering. Fuzzy Sets and Systems 84, 143–153 (1996)
2. Bhatt, R.B., Gopal, M.: On fuzzy rough sets approach to feature selection. Pattern Recognition Letters 26, 1632–1640 (2005)
3. Cattaneo, G.: Fuzzy extension of rough sets theory. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 275–282. Springer, Heidelberg (1998)
4. Chen, D.G., Zhang, W.X., Yeung, D.S., Tsang, E.C.C.: Rough approximations on a complete completely distributive lattice with applications to generalized rough sets. Information Sciences 176, 1829–1848 (2006)
5. Chen, D.G., Wang, X.Z., Zhao, S.Y.: Attribute reduction based on fuzzy rough sets. In: Kryszkiewicz, M., Peters, J.F., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 381–390. Springer, Heidelberg (2007)
6. Cornelis, C., De Cock, M., Radzikowska, A.M.: Fuzzy rough sets: from theory into practice. In: Pedrycz, W., Skowron, A., Kreinovich, V. (eds.) Handbook of Granular Computing. Springer, Heidelberg (in press)
7. Devijver, P., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice Hall, Englewood Cliffs (1982)
8. Dubois, D., Prade, H.: Rough fuzzy sets and fuzzy rough sets. Internat. J. General Systems 17(2-3), 191–209 (1990)
9. Dubois, D., Prade, H.: Putting rough sets and fuzzy sets together. In: Slowinski, R. (ed.) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Dordrecht (1992)
10. Fernandez Salido, J.M., Murakami, S.: Rough set analysis of a general type of fuzzy data using transitive aggregations of fuzzy similarity relations. Fuzzy Sets and Systems 139, 635–660 (2003)
11. Greco, S., Inuiguchi, M., Slowinski, R.: A new proposal for fuzzy rough approximations and gradual decision rule representation. In: Peters, J.F., Skowron, A., Dubois, D., Grzymala-Busse, J.W., Inuiguchi, M., Polkowski, L. (eds.) Transactions on Rough Sets II. LNCS, vol. 3135, pp. 319–342. Springer, Heidelberg (2004)
12. Hu, Q.H., Yu, D.R., Xie, Z.X.: Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters 27, 414–423 (2006)
13. Hong, T.P.: Learning approximate fuzzy rules from training examples. In: Proceedings of the Tenth IEEE International Conference on Fuzzy Systems, Melbourne, Australia, vol. 1, pp. 256–259 (2001)
14. Jensen, R., Shen, Q.: Fuzzy-rough attribute reduction with application to web categorization. Fuzzy Sets and Systems 141, 469–485 (2004)
15. Mi, J.S., Zhang, W.X.: An axiomatic characterization of a fuzzy generalization of rough sets. Information Sciences 160(1-4), 235–249 (2004)
16. Morsi, N.N., Yakout, M.M.: Axiomatics for fuzzy rough sets. Fuzzy Sets and Systems 100, 327–342 (1998)
17. Pawlak, Z.: Rough sets. Internat. J. Comput. Inform. Sci. 11(5), 341–356 (1982)
18. Radzikowska, A.M., Kerre, E.E.: A comparative study of fuzzy rough sets. Fuzzy Sets and Systems 126, 137–155 (2002)
19. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Dordrecht (1992)
20. Skowron, A., Polkowski, L.: Rough Sets in Knowledge Discovery, vol. 1, 2. Springer, Berlin (1998)
21. Slowinski, R. (ed.): Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Dordrecht (1992)
22. Slowinski, R., Vanderpooten, D.: Similarity relation as a basis for rough approximations. In: Wang, P.P. (ed.) Advances in Machine Intelligence and Soft-Computing, Department of Electrical Engineering, Duke University, Durham, NC, USA, pp. 17–33 (1997)
23. Sudkamp, T.: Similarity, interpolation, and fuzzy rule construction. Fuzzy Sets and Systems 58, 73–86 (1993)
24. Tsang, E.C.C., Chen, D.G., Yeung, D.S., Wang, X.Z., Lee, J.W.T.: Attributes reduction using fuzzy rough sets. IEEE Transactions on Fuzzy Systems (in press)
25. Tsai, Y.-C., Cheng, C.-H., Chang, J.-R.: Entropy-based fuzzy rough classification approach for extracting classification rules. Expert Systems with Applications 31(2), 436–443 (2006)
26. Wang, X.Z., Hong, J.R.: Learning optimization in simplifying fuzzy rules. Fuzzy Sets and Systems 106, 349–356 (1999)
27. Wang, X.Z., Tsang, E.C.C., Zhao, S.Y., Chen, D.G., Yeung, D.S.: Learning fuzzy rules from fuzzy samples based on rough set technique. Information Sciences 177(20), 4493–4514 (2007)
28. Wu, W.Z., Mi, J.S., Zhang, W.X.: Generalized fuzzy rough sets. Information Sciences 151, 263–282 (2003)
29. Wu, W.Z., Zhang, W.X.: Constructive and axiomatic approaches of fuzzy approximation operators. Information Sciences 159(3-4), 233–254 (2004)
30. Yasdi, R.: Learning classification rules from database in the context of knowledge acquisition and representation. IEEE Transactions on Knowledge and Data Engineering 3(3), 293–306 (1991)
31. Yao, Y.Y.: Combination of rough and fuzzy sets based on level sets. In: Lin, T.Y., Cercone, N. (eds.) Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 301–321. Kluwer Academic Publishers, Boston (1997)
32. Yeung, D.S., Chen, D.G., Tsang, E.C.C., Lee, J.W.T.: On the generalization of fuzzy rough sets. IEEE Transactions on Fuzzy Systems 13, 343–361 (2005)
33. Zhao, S., Tsang, E.C.C.: The analysis of attribute reduction on fuzzy rough sets: the T-norm and fuzzy approximation operator perspective. Submitted to Information Sciences special issue
34. Ziarko, W.P. (ed.): Rough Sets, Fuzzy Sets and Knowledge Discovery. Workshops in Computing. Springer, London (1994)
Author Index
Blaszczyński, Jerzy  40
Cyran, Krzysztof A.  53
Gomolińska, Anna  66
Grzymala-Busse, Jerzy W.  1, 14
Moshkov, Mikhail  92
Pal, Sankar K.  106
Rybiński, Henryk  161
Rząsa, Wojciech  14
Sikora, Marek  130
Skowron, Andrzej  92
Slowiński, Roman  40
Stefanowski, Jerzy  40
Strąkowski, Tomasz  161
Suraj, Zbigniew  92
Tsang, Eric  177
Zhao, Suyun  177