Advances in Fuzzy Object-Oriented Databases: Modeling and Applications
Zongmin Ma Université de Sherbrooke, Canada
IDEA GROUP PUBLISHING Hershey • London • Melbourne • Singapore
Acquisitions Editor: Mehdi Khosrow-Pour
Senior Managing Editor: Jan Travers
Managing Editor: Amanda Appicello
Development Editor: Michele Rossi
Copy Editor: Lori Eby
Typesetter: Jennifer Wetzel
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.idea-group.com

and in the United Kingdom by
Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2005 by Idea Group Inc. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Library of Congress Cataloging-in-Publication Data
Advances in fuzzy object-oriented databases : modeling and applications / Zongmin Ma, editor.
p. cm.
Includes bibliographical references and index.
ISBN 1-59140-384-7 (h/c) -- ISBN 1-59140-385-5 (s/c) -- ISBN 1-59140-386-3 (eISBN)
1. Object-oriented databases. 2. Fuzzy systems. I. Ma, Zongmin, 1965-
QA76.9.D3A34833 2004
005.75’7--dc22
2004017843
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Advances in Fuzzy Object-Oriented Databases: Modeling and Applications Table of Contents
Preface

SECTION I
Chapter I. A Constraint Based Fuzzy Object Oriented Database Model
G. de Tré, Ghent University, Belgium
R. de Caluwe, Ghent University, Belgium
Chapter II. Fuzzy and Probabilistic Object Bases
T. H. Cao, Ho Chi Minh City University of Technology, Vietnam
H. Nguyen, Ho Chi Minh City Open University, Vietnam
Chapter III. Generalization Data Mining in Fuzzy Object-Oriented Databases
Rafal Angryk, Tulane University, USA
Roy Ladner, Naval Research Laboratory, USA
Frederick E. Petry, Tulane University & Naval Research Laboratory, USA
Chapter IV. FRIL++ and Its Applications
J. M. Rossiter, University of Bristol, UK & Bio-Mimetic Control Research Center, The Institute of Physical and Chemical Research (RIKEN), Japan
T. H. Cao, Ho Chi Minh City University of Technology, Vietnam

SECTION II
Chapter V. Fuzzy Information Modeling with the UML
Zongmin Ma, Université de Sherbrooke, Canada

SECTION III
Chapter VI. A Framework to Build Fuzzy Object-Oriented Capabilities Over an Existing Database System
Fernando Berzal, University of Granada, Spain
Nicolás Marín, University of Granada, Spain
Olga Pons, University of Granada, Spain
M. Amparo Vila, University of Granada, Spain
Chapter VII. Index Structures for Fuzzy Object-Oriented Database Systems
Sven Helmer, Universität Mannheim, Germany
Chapter VIII. Introducing Fuzziness in Existing Orthogonal Persistence Interfaces and Systems
Miguel Ángel Sicilia, University of Alcalá, Spain
Elena García-Barriocanal, University of Alcalá, Spain
José A. Gutiérrez, University of Alcalá, Spain

SECTION IV
Chapter IX. An Object-Oriented Approach to Managing Fuzziness in Spatially Explicit Ecological Models Coupled to a Geographic Database
Vincent B. Robinson, University of Toronto at Mississauga, Canada
Phil A. Graniero, University of Windsor, Canada
Chapter X. Object-Oriented Publish/Subscribe for Modeling and Processing Imperfect Information
Haifeng Liu, University of Toronto, Canada
Hans Arno Jacobsen, University of Toronto, Canada

About the Authors
Index
Preface
A major goal of database research has been the incorporation of additional semantics into the data model. Classical data models are often unable to represent and manipulate the imprecise and uncertain information that occurs in many real-world applications. Since the early 1980s, Zadeh’s fuzzy logic has been used to extend various data models so that uncertain and imprecise information can be represented and manipulated. This resulted in numerous contributions, mainly with respect to the popular relational model or to some related form of it. However, rapid advances in computing power brought opportunities for databases in emerging applications such as CAD/CAM, multimedia, geographic information systems, and knowledge management. These applications characteristically require the modeling and manipulation of complex objects and semantic relationships. The advances of object-oriented databases are acknowledged outside the research and academic worlds, which shows that the object-oriented paradigm lends itself extremely well to these requirements. Because the classical relational database model and its fuzzy extensions do not satisfy the need to model complex objects with imprecision and uncertainty, much current research concentrates on fuzzy object-oriented database models, which deal with complex objects and uncertain data together. This book focuses on an important extension of the object-oriented paradigm that allows fuzzy information to be included in it, and presents the latest research and application results in fuzzy object-oriented databases. Major issues concerning the concepts, semantics, models, design, implementation, and applications of fuzzy object-oriented databases are investigated in the book.
The different chapters in the book were contributed by different authors and provide possible solutions for the different types of technological problems concerning fuzzy object-oriented databases. Each of the contributors to the book is a leading researcher in the field of fuzzy object-oriented databases who has made numerous contributions to fuzzy information engineering.
Introduction
This book is organized into four major sections. The first section, comprising the first four chapters, discusses the representation, semantics, and models of fuzzy object-oriented databases. Chapter V describes fuzzy object-oriented conceptual data modeling and comprises the second section. The next three chapters, which cover implementation issues in fuzzy object-oriented databases, comprise the third section. Finally, the last two chapters, which comprise the fourth section, contain applications of fuzzy object-oriented information modeling and fuzzy databases in geographic information systems and publish/subscribe systems, respectively.
First, we will look at the problem of the representation, semantics, and models of fuzzy object-oriented databases. The authors of Chapter I, de Tré and de Caluwe, define a fuzzy object-oriented formal database model that allows us to model and manipulate information in a (true to nature) natural way. The presented model was built upon an object-oriented type system and an elaborated constraint system, which, respectively, support the definitions of types and constraints. Types and constraints are the basic building blocks of object schemes, which, in turn, are used for defining database schemes. Finally, the definition of the database model was obtained by providing adequate data definition operators and data manipulation operators. Novelties in the approach are the incorporation of generalized constraints and of extended possibilistic truth values, which allow for a better representation of data(base) semantics.
Cao and Nguyen introduce an extension of the probabilistic object base model. Their model differs from the probabilistic object base model previously investigated in the literature in that it uses fuzzy sets for representing and handling vague and imprecise values of object attributes.
A probabilistic interpretation of relations on fuzzy set values is proposed to integrate them into that probability-based framework. Then, the definitions of fuzzy-probabilistic object base schemas, instances, and algebraic operations are presented. Angryk, Ladner, and Petry extend the attribute generalization algorithms that were most commonly applied to relational databases and consider the application of generalization-based data mining to fuzzy similarity based object-oriented databases. A key aspect of generalization data mining is the use of a concept hierarchy. The objects of the database are generalized by replacing specific attribute values with the next higher-level term in the hierarchy. This will eventually result in generalizations that represent a summarization of the information in the database. The authors focus on the generalization of similarity-based simple fuzzy attributes for an object-oriented database (OODB) using approaches to the fuzzy concept hierarchy developed from the given similarity relation of the database. They then consider application of this approach to complex structure-valued data in the fuzzy OODB.
Rossiter and Cao introduce a deductive probabilistic and fuzzy object-oriented database language, called FRIL++, which can deal with both probability and fuzziness. Its foundation is a logic-based probabilistic and fuzzy object-oriented model in which a class property (i.e., an attribute or a method) can contain fuzzy set values, and uncertain class membership and property applicability are measured by lower and upper bounds on probability. Each uncertainly applicable property is interpreted as a default probabilistic logic rule, which is defeasible. Probabilistic default reasoning on fuzzy events is proposed for uncertain property inheritance and class recognition. The authors present the design, implementation, and basic features of FRIL++. FRIL++, as described in Chapter IV, can be used as a modeling and a programming language, as demonstrated by its applications to machine learning, user modeling, and modeling with words.
The next section takes another look at the semantics and representation of fuzzy object-oriented data modeling, but from the perspective of a conceptual data model. Conceptual data models were proposed for the conceptual design of databases and conceptual data modeling in some nontraditional areas. Ma concentrates on the Unified Modeling Language (UML), a set of object-oriented modeling notations and a standard of the Object Management Group (OMG), which can be applied in many areas of software engineering and knowledge engineering. In order to model complex objects and uncertain data, the author extends the class of the UML by using fuzzy set and possibility distribution theory. The different levels of fuzziness are introduced, and the corresponding graphical representations are given. The class diagrams of the UML can hereby model fuzzy information.
In the third section, we see some implementation issues of fuzzy object-oriented databases: building fuzzy object-oriented capabilities over an existing database system, indexing fuzzy object-oriented database systems, and introducing fuzziness in existing orthogonal persistence interfaces and systems.
Berzal, Marín, Pons, and Vila describe both a framework and an architecture that can be used to develop fuzzy object-oriented capabilities using the conventional features of the object-oriented data paradigm. The authors present a framework composed of a set of classical classes that gives support to fuzzily described complex objects. They also explain how to deal with fuzzy extensions of object-oriented features using, as a basis, conventional object-oriented features. The proposal given in the chapter can be used to build a fuzzy object-oriented database system on top of an existing database system, minimizing the development effort.
Helmer gives an overview of indexing techniques suitable for fuzzy object-oriented databases (FOODBS). First, the author identifies typical query patterns used in FOODBS, namely single-valued, set-valued, navigational, and type hierarchy access. Here, the description of the patterns does not follow a particular fuzzy object-oriented data model but is kept general enough to be used in different FOODBS contexts. Second, the author presents the index structures for each query pattern, which support the efficient evaluation of these queries. Rather than an exhaustive description, the chapter explains the basic techniques, from standard index structures (like B-trees) to sophisticated access methods (like Join Index Hierarchies).
Sicilia, García-Barriocanal, and Gutiérrez focus on how to integrate the models and techniques that can deal with imprecise and uncertain information in the facets of object data stores with current database design and programming practices, so that the benefits of fuzzy extensions can be easily adopted and seamlessly integrated in current applications. The authors try to provide some criteria for selecting the fuzzy extensions that integrate most seamlessly into the current object storage paradigm known as orthogonal persistence, in which programming language object models are directly stored, so that database design becomes mainly a matter of object design. They provide concrete examples and case studies as practical illustrations of the introduction of fuzziness, both at the conceptual and the physical levels of this kind of persistent system.
In the fourth section, we see the applications of fuzzy object-oriented information modeling and fuzzy databases. Robinson and Graniero use a spatially explicit, individual-based ecological modeling problem to illustrate an approach to managing fuzziness in spatial databases that accommodates the use of nonfuzzy as well as fuzzy representations of geographic databases. The approach taken in the chapter uses the Extensible Component Objects for Constructing Observable Simulation Models (ECOCOSM) system loosely coupled with geographic information systems.
The ecological modeling problem described in the chapter is used to illustrate how combining Probes and ProbeWrappers with Agent objects affords a flexible means of handling semantic variation and serves as an effective approach to utilize heterogeneous sources of spatial data.
Publish/subscribe systems follow a paradigm in which information providers disseminate publications to all consumers who have expressed interest by registering subscriptions with the system. Liu and Jacobsen notice that in all existing publish/subscribe systems, neither subscriptions nor publications can capture the uncertainty inherent to the information underlying the application domain. However, in many situations, exact knowledge of either specific subscriptions or publications is not available. To address this problem, the authors propose a new object-oriented publish/subscribe model based on possibility theory and fuzzy set theory to process imperfect information for expressing subscriptions, publications, or both combined. Furthermore, the authors define the approximate publish/subscribe matching problem and develop and evaluate algorithms for solving it.
Acknowledgments
The editor would like to acknowledge the help of all involved in the collation and review process of the book, without whose support the project could not have been satisfactorily completed. Most of the authors of chapters included in this book also served as referees for papers written by other authors. Thanks go to all those who provided constructive and comprehensive reviews.
A special note of thanks goes to all the staff at Idea Group Publishing, whose contributions throughout the whole process, from inception of the initial idea to final publication, have been invaluable. Special thanks go to the publishing team at Idea Group Publishing, in particular to Mehdi Khosrow-Pour, whose enthusiasm motivated me to initially accept his invitation for taking on this project, and to Michele Rossi, who continuously prodded via e-mail to keep the project on schedule.
In closing, I wish to thank all of the authors for their insights and excellent contributions to this book. I also want to thank all of the people who assisted in the reviewing process. In addition, this book would not have been possible without the ongoing professional support from Mehdi Khosrow-Pour and Jan Travers at Idea Group Publishing.
Zongmin Ma, Ph.D.
Sherbrooke, Canada
April 2004
SECTION I
Chapter I
A Constraint Based Fuzzy Object Oriented Database Model G. de Tré Department of Telecommunications and Information Processing, Ghent University, Belgium R. de Caluwe Department of Telecommunications and Information Processing, Ghent University, Belgium
Abstract
The objective of this chapter is to define a fuzzy object-oriented formal database model that allows us to model and manipulate information in a (true to nature) natural way. Not all the elements (data) that occur in the real world are fully known or defined in a perfect way. Classical database models only allow the manipulation of accurately defined data in an adequate way. The presented model was built upon an object-oriented type system and an elaborated constraint system, which, respectively, support the definitions of types and constraints. Types and constraints are the basic building blocks of object schemes, which, in turn, are used for defining database schemes. Finally, the definition of the database model was obtained by providing adequate data definition operators and data manipulation operators. Novelties in the approach are the incorporation of generalized constraints and of extended possibilistic truth values, which allow for a better representation of data(base) semantics.
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Introduction
In this chapter, a formal object-oriented database model that is suited to model both perfect and imperfect information is built. This model distinguishes itself from existing fuzzy object-oriented models by integrating (generalized) constraints (Zadeh, 1997). These constraints are used to define the semantics and integrity of the data and to define query criteria. Another novelty is its underlying logical framework of extended possibilistic truth values (de Tré, 2002). Moreover, the model is built upon the Object Data Management Group (ODMG) data model (Cattell & Barry, 2000), as far as its crisp components are considered. The starting point for the formalism is an algebraic foundation, in which sets of objects, operators on these sets, and constraints that are defined for these sets are central (de Tré, de Caluwe, & Van der Cruyssen, 2000). Special domain-specific elements, represented by the “⊥” symbol, are used to formalize “undefined” (or inapplicable) data. This foundation is formally defined on the basis of a type system and a constraint system. Starting from this basis, object schemes and database schemes are defined, which allow for databases to be defined rather easily. Furthermore, querying is generalized to a manageable closed set of operators. Contrary to existing proposals that extend a crisp model, an approach based on generalization allows databases to be defined that handle perfect data as special cases of imperfect data. For the generalization, fuzzy set theory and possibility theory are used. Moreover, with the presented work, it is shown how Zadeh’s theory on fuzzy information granulation and generalized constraints (Zadeh, 1996, 1997) can be applied within the context of a database model.
The underlying logic of the database model is many valued and uses so-called extended possibilistic truth values (de Tré, 2002), which are obtained by considering the three truth values — “true,” “false,” and “undefined” — and adding possibilistic uncertainty. This logic allows for a more epistemological modeling of truth and, moreover, can explicitly handle those cases where some of the data are not applicable. The remainder of the chapter is organized as follows. In the next section, an overview of different approaches in fuzzy object-oriented database modeling is given. Furthermore, some preliminary concepts and definitions are introduced. In the section entitled, “Types and Type System,” a type system, which supports the formal definition of all data types defined in the database model, is presented. These data types are compliant with the ODMG data model, as far as their crisp counterparts are considered. In “Constraints and Constraint System,” a constraint system supporting the formalization of constraints is defined. Constraints are important for defining database semantics and query criteria. In “Object
Schemes and Database Schemes,” object (scheme) and database (scheme) definitions are given. The data definition and data manipulation operators are presented in “Database Model.” Finally, the achieved results are summarized, and some ideas for future research are discussed in the concluding section.
Some Preliminaries
Simultaneously with the maturation of object-oriented database models, research on “fuzzy” object-oriented databases is getting more attention. Nowadays, several fuzzy object-oriented database models exist. Based on some of them, prototypes were already implemented.
Related Work
Among the existing “fuzzy” object-oriented database models are the following: the object-centered model of Rossazza et al. (1990, 1997); the object-oriented model of Tanaka et al. (1991); the similarity-based model of George et al. (1992, 1997); the fuzzy object-oriented data (FOOD) model of Bordogna et al. (1994, 1999, 2000); the fuzzy algebra of Rocacher et al. (1996); the UFO model of Van Gyseghem (1998); the fuzzy association algebra of Na and Park (1997); the FIRMS model of Mouaddib et al. (1997); the FOODM model of Marín et al. (2000, 2001, 2003); and the “rough” object-oriented database of Beaubouef and Petry (2002).
The Object-Centered Model of Rossazza et al.
In this model (Rossazza, 1990; Rossazza et al., 1997), all information is contained in objects that are completely described by a set of attributes. For these objects, no behavior is defined. Objects with the same attributes are collected in classes that are organized in class hierarchies. A range of allowed values and a range of typical values are specified for the attributes. These ranges may be fuzzy. Various kinds of (graded) inclusion relations can be defined between classes.
The Object-Oriented Model of Tanaka et al.
In this model, fuzziness is considered on both structural and behavioral aspects of objects (Tanaka, Kobayashi, & Sakanoue, 1991). Attribute values can be
fuzzy predicates. Furthermore, fuzziness is considered at the levels of instantiation, of inheritance, and of the relationships between objects by introducing extra special classes.
The Similarity-Based Model of George et al.
This model derives its capability to represent different types of imprecision from a similarity relation that generalizes equality to similarity (George, 1992; George et al., 1997). Similarity permits the representation of impreciseness in data and impreciseness in inheritance. An object algebra based on extensions of the five “classical” operators (union, difference, product, projection, and selection) is provided.
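To make the idea of generalizing equality to similarity concrete, the following is a minimal Python sketch. It is not taken from the model of George et al.; the colour domain, the similarity degrees, and the function names `similarity` and `similar` are illustrative assumptions.

```python
# Hypothetical similarity relation over a small colour domain.
# The values and degrees are illustrative assumptions, not data from the model.
SIMILARITY = {
    frozenset({"red", "crimson"}): 0.9,
    frozenset({"red", "orange"}): 0.6,
    frozenset({"crimson", "orange"}): 0.5,
}


def similarity(a: str, b: str) -> float:
    """Return the similarity degree of two domain values in [0, 1]."""
    if a == b:
        return 1.0  # reflexivity: every value is fully similar to itself
    # frozenset keys make the lookup symmetric; unlisted pairs default to 0
    return SIMILARITY.get(frozenset({a, b}), 0.0)


def similar(a: str, b: str, threshold: float) -> bool:
    """Generalized equality: a and b match if similar to at least `threshold`."""
    return similarity(a, b) >= threshold
```

With a threshold of 1.0 the test reduces to ordinary equality, so the crisp case is recovered as a special case; lower thresholds admit progressively less similar values.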
The FOOD Model of Bordogna et al.
This model (Bordogna, Lucarella, & Pasi, 1994; Bordogna, Pasi, & Lucarella, 1999) is based on a visualization paradigm that supports the representation of the data semantics and the direct browsing of the information. It was defined as an extension of a graph-based object model, in which the database scheme and instances are represented as directed labeled graphs. A prototype of the model was implemented (Bordogna, Leporati, Lucarella, & Pasi, 2000).
The Fuzzy Algebra of Rocacher et al.
This algebra (Rocacher & Connan, 1996) is an extension of the so-called EQUAL-algebra, which is part of the object-oriented database model, Extensible and Natural Common Object Resource (ENCORE) (Shaw & Zdonik, 1990). The extension is based on the ODMG data model (Cattell & Barry, 2000) and is aimed at the modeling and manipulation of fuzzy data.
The UFO Model of Van Gyseghem
This model (Van Gyseghem, 1998) was an attempt to extend an object-oriented database model as generally as possible in order to be able to deal with fuzziness as well as with uncertainty. Different model levels were extended (attributes, methods, objects, classes, inheritance, instantiation, etc.).
The Fuzzy Association Algebra of Na and Park
In this approach (Na & Park, 1997), a fuzzy object-oriented data model was built by means of fuzzy classes and fuzzy associations. Fuzzy databases are represented by a fuzzy schema graph at the schema level and a fuzzy object graph at the object instance level. Data manipulation is handled by means of a fuzzy association algebra, which consists of operators that can operate on the fuzzy association patterns of homogeneous and heterogeneous structures. As the result of these operators, truth values are returned with the patterns.
The FIRMS Model of Mouaddib et al.
This model (Mouaddib & Subtil, 1997) can deal with fuzzy, uncertain, and incomplete information. At the basis of the model are the concepts “nuanced value” and “nuanced domain.” Furthermore, a fuzzy thesaurus is used to restrict the allowed domain values of discrete attributes. A Chomsky grammar is used to generate the characteristic membership functions of the thesaurus terms. In the FIRMS model, no class hierarchies are supported.
The FOODM Model of Marín et al.
This model (Marín, Pons, & Vila, 2000; Blanco, Marín, Pons, & Vila, 2001) shows how different sources of vagueness can be managed over a regular object-oriented database model. It is founded on the concept of “fuzzy type,” where properties are ranked in different levels of precision according to their relationships with the type. Objects are created using α-cuts of their fuzzy types. An architecture of a prototype implementation of the model was presented in the literature (Berzal, Marín, Pons, & Vila, 2003).
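The α-cut mentioned above is a standard fuzzy set operation: it keeps the elements whose membership degree is at least α. The Python sketch below illustrates it on a hypothetical fuzzy type whose properties are graded by how strongly they belong to the type. The property names, the degrees, and the dictionary representation are assumptions made for illustration and are not the FOODM implementation.

```python
def alpha_cut(fuzzy_set: dict[str, float], alpha: float) -> set[str]:
    """Return the crisp set of elements whose membership degree is >= alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return {element for element, degree in fuzzy_set.items() if degree >= alpha}


# Hypothetical fuzzy type: each property is graded by how strongly it
# belongs to the type (names and degrees are illustrative only).
student_type = {
    "name": 1.0,
    "enrolment_year": 1.0,
    "supervisor": 0.7,
    "thesis_title": 0.4,
}

# A strict cut keeps only the core properties; a looser cut admits more.
core_properties = alpha_cut(student_type, 1.0)
extended_properties = alpha_cut(student_type, 0.5)
```

Under these assumed degrees, the cut at 1.0 keeps `name` and `enrolment_year`, and the cut at 0.5 additionally keeps `supervisor`.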
The “Rough” Object-Oriented Database of Beaubouef and Petry
In this approach (Beaubouef & Petry, 2002), the indiscernibility relation and approximation regions of rough set theory are used to incorporate uncertainty and vagueness into the database model.
The majority of these models do not conform to a single underlying object data model, as a logical consequence of the present lack of (formal) object standards. The ODMG proposal (Cattell & Barry, 2000) offers some perspectives. However, it still suffers from some shortcomings, such as the absence of formal
semantics (Kim, 1994; Alagić, 1997) and its limited ability to deal with constraints, despite the fact that a thorough support of constraints is the most obvious way to define the semantics of a database (Kuper, Libkin, & Paredaens, 2000; de Tré & de Caluwe, 2000). The presented fuzzy object-oriented database model is consistent with the ODMG data model (as far as its crisp components are considered) and, moreover, deals with constraints. Zadeh’s generalized constraints (Zadeh, 1997) were integrated in the framework and allow for a general, extensible definition of the semantics and integrity of the data and of the query criteria. Furthermore, a logic based on extended possibilistic truth values is used to be able to explicitly cope with missing information.
Generalized Constraints
The concept of generalized constraint was introduced by L. A. Zadeh (Zadeh, 1986, 1997) as the basis for a computational approach to meaning and knowledge representation. The introduction of this concept was motivated by the fact that conventional crisp constraints of the form X ∈ C, where X is a variable and C is a set, are insufficient to represent the meaning of perceptions. A generalized constraint is, in effect, a family of constraints and can be seen as a generalization of an assignment statement (Zadeh, 1997).
Definition 1 (Generalized constraint): An unconditional generalized constraint on a variable X is defined by:
X isr R
where R is the constraining relation, and isr is a variable copula in which the discrete-valued variable r defines the way in which R constrains X.
As specified by Zadeh (2002), the principal constraints are the following:
• •
Equality constraint: r = e, i.e., X ise R. X equals R. Possibilistic constraint: r = blank, i.e., X is R. R is the possibility distribution of X (Zadeh, 1978; Dubois & Prade, 1988). For example, the possibilistic constraint “car A is expensive,” on the price variable of car A, in which expensive is a disjunctive fuzzy set with membership function
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
A Constraint Based Fuzzy Object Oriented Database Model 7
µexpensive, denotes that ΠX(x) = µexpensive(x), where x is a numerical value of price and ΠX (x) is the possibility that the price of car A is x.
• Veristic constraint: r = v, i.e., X isv R. R is the verity distribution of X (Zadeh, 1999). For example, the veristic constraint “car A isv {(blue,0.1), (white,1)},” in which {(blue,0.1), (white,1)} is a conjunctive fuzzy set, denotes that the verity of the proposition “car A is blue” is 0.1, and the verity of the proposition “car A is white” is 1. This expresses that car A is almost white, but at the same time also has some blue parts.
• Probabilistic constraint: r = p, i.e., X isp R. R is the probability distribution of X. For example, the probabilistic constraint “consumption of car A isp N(8,1.5)” means that the consumption of car A is a normally distributed random variable with mean 8 and variance 1.5.
• Probability-value constraint: r = pv, i.e., X ispv R. X is the probability of a fuzzy event (Zadeh, 1968), and R is its value. For example, the proposition “it is likely that car A is expensive” can be modeled by the probability-value constraint “Prob(car A is expensive) ispv likely,” in which likely is a fuzzy probability.
• Random set constraint: r = rs, i.e., X isrs R. R is the fuzzy-set-valued probability distribution of X. For example, if the price of car A is uncertain, and the potential price values are modeled by the fuzzy sets “around 4.000 USD,” “almost 5.000 USD,” and “more than 6.000 USD,” with respective probabilities 0.5, 0.2, and 0.3, this can be expressed by the random-set constraint “car A isrs (0.5\around 4.000 USD + 0.2\almost 5.000 USD + 0.3\more than 6.000 USD).”
• Fuzzy graph constraint: r = fg, i.e., X isfg R. X is a function, and R is its fuzzy graph (Zadeh, 1997). For example, if X is a function expressing the relationship between speed and stopping distances of cars, and X is approximated by the fuzzy graph f* = low × short + average × rather long + high × very long, this can be expressed by the fuzzy graph constraint “X isfg f*.”
• Usuality constraint: r = u, i.e., X isu R. This means that “usually X is R.” A usuality constraint is a special case of a probability-value constraint. For example, the usuality constraint “Mercedes isu expensive” should be interpreted as an abbreviation of “Prob(Mercedes is expensive) ispv usually.”
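The possibilistic and veristic constraints on car A can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the class GeneralizedConstraint, the helper functions possibility and verity, and the breakpoints of the membership function for “expensive” (20,000 and 40,000 USD) are hypothetical and do not come from the chapter.

```python
from dataclasses import dataclass


def mu_expensive(price: float) -> float:
    """Hypothetical membership function of the fuzzy set 'expensive'.

    The breakpoints (20,000 and 40,000 USD) are assumptions chosen
    only to make the example concrete.
    """
    low, high = 20_000.0, 40_000.0
    if price <= low:
        return 0.0
    if price >= high:
        return 1.0
    return (price - low) / (high - low)


@dataclass(frozen=True)
class GeneralizedConstraint:
    """One constraint 'X isr R': a variable, a copula, and a relation."""
    variable: str
    copula: str       # e.g. "ise", "is", "isv"
    relation: object  # a membership function or a dict element -> grade


# Possibilistic constraint "car A is expensive" on the price of car A.
price_of_a = GeneralizedConstraint("price(car A)", "is", mu_expensive)

# Veristic constraint "car A isv {(blue,0.1), (white,1)}".
colour_of_a = GeneralizedConstraint(
    "colour(car A)", "isv", {"blue": 0.1, "white": 1.0}
)


def possibility(constraint: GeneralizedConstraint, x: float) -> float:
    """Possibility that the constrained variable equals x (copula 'is')."""
    if constraint.copula != "is":
        raise ValueError("not a possibilistic constraint")
    return constraint.relation(x)


def verity(constraint: GeneralizedConstraint, value: str) -> float:
    """Verity of 'variable is value' (copula 'isv'); 0 if value is absent."""
    if constraint.copula != "isv":
        raise ValueError("not a veristic constraint")
    return constraint.relation.get(value, 0.0)
```

The same relation object could carry either reading; in this sketch it is the copula, not the fuzzy set itself, that decides whether the grades are read as degrees of possibility (disjunctive) or degrees of verity (conjunctive).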
Extended Possibilistic Truth Values

The concept of extended possibilistic truth value (EPTV) (de Tré, 2002) is an extension of the concept of possibilistic truth value that was originally introduced in the literature by Prade (1982) and was further developed by De Cooman (1995, 1999). EPTVs provide an epistemological representation of the truth of a proposition, which allows us to reflect on our knowledge about the actual truth. They were specifically designed to deal with those cases in which the truth value of a proposition is either unknown or undefined.

The truth value of a proposition is unknown if, e.g., some data in the proposition exist but are not available. For example, the truth value of the proposition “the price of car A is 20.000 USD” is unknown if car A is for sale but no information about its price is given. The truth value of a proposition is undefined if, e.g., the proposition cannot be evaluated due to the nonapplicability of (some of) its elements. For example, the truth value of the same proposition “the price of car A is 20.000 USD” is considered to be undefined if it is known for sure that car A is not for sale, in which case it does not make sense to ask for its price (in the supposition that price information is not applicable to cars that are not for sale).

Definition 2 (EPTV): With the understanding that P represents the universe of all propositions, and ℘~(I*) denotes the set of all regular, ordinary fuzzy sets (hereby excluding the empty fuzzy set) that can be defined over the universal set I* = {T,F,⊥} of truth values (where T represents “true,” F represents “false,” and ⊥ represents an undefined truth value), the EPTV t~*(p) of a proposition p ∈ P is formally defined by means of a mapping:

t~*: P → ℘~(I*): p ↦ t~*(p)

that associates with each p ∈ P a fuzzy set t~*(p) = {(T,µt~*(p)(T)), (F,µt~*(p)(F)), (⊥,µt~*(p)(⊥))}. The semantics of this associated fuzzy set is defined in terms of a possibility distribution.

With the understanding that t*: P → I* is the mapping function that associates the value T with p if p is true, that associates the value F with p if p is false, and that associates the value ⊥ with p if (some of) the elements of p are not applicable, undefined, or not supplied, this means that:
$$\forall x \in I^*: \ \Pi_{t^*(p)}(x) = \mu_{\tilde{t}^*(p)}(x)$$
where Πt*(p)(x) denotes the possibility that the value of t*(p) conforms to x, and µt~*(p)(x) is the membership grade of x within the fuzzy set t~*(p). Special cases of EPTVs are as follows:

t~*(p)                      Interpretation
{(T,1)}                     p is true
{(F,1)}                     p is false
{(T,1), (F,1)}              p is unknown
{(⊥,1)}                     p is undefined
{(T,1), (F,1), (⊥,1)}       p is unknown or undefined
As an example, consider the modeling of an unknown truth value by the possibility distribution {(T,1), (F,1)}, which denotes that it is completely possible that the proposition is true (T), but it is also completely possible that the proposition is false (F).

New propositions can be constructed from existing propositions, using so-called logical operators that have definitions based on the operators of a strong three-valued Kleene logic (Rescher, 1969). A unary operator ¬˜ is provided for the negation of a proposition. Binary operators ∧˜, ∨˜, ⇒˜, and ⇔˜ are provided, respectively, for the conjunction, disjunction, implication, and equivalence of propositions. The arithmetic rules to calculate the EPTV of a composite proposition and the algebraic properties of extended possibilistic truth values are presented in de Tré (2002).

As illustrated in the literature (de Tré & de Caluwe, 2003), EPTVs can be used to express query satisfaction in flexible database querying. Every object o in the result set of a (flexible) query Q is assigned a calculated EPTV t~*(“o satisfies Q”), where the membership grades of T, F, and ⊥ denote, respectively, the possibility that o satisfies Q, the possibility that o does not satisfy Q, and the possibility that Q is not (fully) applicable to o.
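The special cases of EPTVs and the idea of building composite truth values can be sketched in Python. The chapter defers the actual arithmetic rules to de Tré (2002), so the combination rule used here is an assumption: the strong Kleene truth tables for negation and conjunction are lifted to fuzzy sets with a sup-min scheme. The names T, F, BOT, lift1, and lift2 are hypothetical.

```python
from itertools import product

# The three truth values of I*; BOT stands for the undefined value.
T, F, BOT = "T", "F", "BOT"

# Special cases, written as dicts mapping truth value -> membership grade.
TRUE = {T: 1.0}
FALSE = {F: 1.0}
UNKNOWN = {T: 1.0, F: 1.0}
UNDEFINED = {BOT: 1.0}


def is_eptv(v: dict) -> bool:
    """A non-empty fuzzy set over {T, F, BOT} with grades in [0, 1]."""
    return (
        set(v) <= {T, F, BOT}
        and all(0.0 <= g <= 1.0 for g in v.values())
        and any(g > 0.0 for g in v.values())
    )


def kleene_not(x: str) -> str:
    """Strong Kleene negation on crisp truth values."""
    return {T: F, F: T, BOT: BOT}[x]


def kleene_and(x: str, y: str) -> str:
    """Strong Kleene conjunction on crisp truth values."""
    if x == F or y == F:
        return F
    if x == T and y == T:
        return T
    return BOT


def lift1(op, v: dict) -> dict:
    """Lift a unary crisp operator to EPTVs (assumed sup-based rule)."""
    out = {}
    for x, g in v.items():
        y = op(x)
        out[y] = max(out.get(y, 0.0), g)
    return out


def lift2(op, v: dict, w: dict) -> dict:
    """Lift a binary crisp operator to EPTVs (assumed sup-min rule)."""
    out = {}
    for (x, gx), (y, gy) in product(v.items(), w.items()):
        z = op(x, y)
        out[z] = max(out.get(z, 0.0), min(gx, gy))
    return out
```

Under these assumptions, negating an unknown truth value leaves it unknown, and the conjunction of a false proposition with an undefined one is false, which matches the behavior of strong Kleene logic on crisp values.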
Types and Type System

The common characteristics of a data collection can be described by means of a type. For this reason, most database models, including the model presented in this chapter, support some “type” notion.

Definition of Types

In order to give a complete definition of the concept of “type,” it is necessary to provide the rules that define its syntax, as well as the rules that define its semantics.

Definition 3 (Type): Each type supported by the type system is defined by its syntax and its semantics.
• The syntax of a type. The syntax rules for a type can be formally described by means of some mathematical expressions.
• The semantics of a type. The semantic definition of a type t can be fully determined by:
  • A set of domains Dt
  • A designated domain domt ∈ Dt
  • A set of operators Ot
  • A set of axioms At
The designated domain domt defines the set of valid values for the type and is called the domain of the type. In order to deal with cases where a regular domain value does not apply, the assumption was made that every domain domt contains a special, domain-specific value ⊥t, which is used to represent “undefined” domain values. The set of operators Ot contains the operators that are defined on the domain domt. The set of domains Dt consists of the domains that are involved in the definition of the operators of Ot, whereas the set of axioms At consists of the axioms that are involved in the definition of the semantics of the operators of Ot.
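The four ingredients of a type's semantics (Dt, domt, Ot, At) can be pictured as a small record. The following Python sketch is only one possible encoding: the class names Undefined and TypeSemantics, the field names, and the choice of commutativity of + as a sample axiom are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


class Undefined:
    """Domain-specific 'undefined' value of one type (hypothetical class)."""

    def __init__(self, type_name: str) -> None:
        self.type_name = type_name

    def __repr__(self) -> str:
        return f"undefined[{self.type_name}]"


@dataclass
class TypeSemantics:
    name: str
    in_domain: Callable[[Any], bool]   # membership test for dom_t
    operators: Dict[str, Callable]     # O_t
    axioms: List[Callable[..., bool]]  # A_t, as checkable predicates
    domains: List[str] = field(default_factory=list)  # names of D_t


BOTTOM_INT = Undefined("Integer")

integer = TypeSemantics(
    name="Integer",
    # dom_Integer: the integers plus the undefined value (bool is excluded,
    # because Python treats bool as a subclass of int).
    in_domain=lambda v: v is BOTTOM_INT
    or (isinstance(v, int) and not isinstance(v, bool)),
    operators={
        "+": lambda a, b: a + b,
        "<": lambda a, b: a < b,
        # The bottom operator always yields the undefined domain value.
        "bottom": lambda: BOTTOM_INT,
    },
    # Sample axiom (an assumption): addition is commutative.
    axioms=[lambda a, b: a + b == b + a],
    # Comparison operators return Booleans, so Boolean belongs to D_t.
    domains=["Integer", "Boolean"],
)
```

Keeping the undefined value as a distinct object per type mirrors the text's requirement that ⊥t is domain-specific rather than one global null.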
Type System

In order to define the types supported by the presented database model, a type system (Lausen & Vossen, 1998) was built. The presented type system is consistent with the specifications of the ODMG object model (Cattell & Barry, 2000). To guarantee this consistency, a distinction was made between a so-called void type (which is the most primitive type of the system), literal types, object types, and reference types (which are new with respect to the ODMG model). Reference types enable us to refer to the instances of object types and are used to formalize the binary relationships between the object types in a database scheme. Each type supported by the type system is formally defined as prescribed by Definition 3. The syntax rules for the types of the presented type system are defined as in Definition 4.

Definition 4 (Types: syntax rules): Let ID denote the set of valid identifiers, and let the sets of type expressions that satisfy the syntax of a reference type, a literal type, and an object type be denoted, respectively, as Treference, Tliteral, and Tobject, where:
• The set Treference is defined by:
  Treference ≡ Tsingle_ref ∪ Tmulti_ref
  where Tsingle_ref ≡ {Ref(t) | t ∈ Tobject} and Tmulti_ref ≡ {SetRef(t), BagRef(t), ListRef(t) | t ∈ Tobject}.
  Type t is called the (most) significant type of the reference type.
• The set Tliteral is defined by induction as follows:
  • Basic types: Tbasic ≡ {Integer, Real, Boolean, Octet, String} ⊂ Tliteral
  • Collection types: Tcollect ≡ {Set(t), Bag(t), List(t), Array(t), Dict(t′,t) | t′,t ∈ Tliteral} ⊂ Tliteral
    Type t is called the significant type of the collection type. In the case of nested collection types, the significant type of the innermost collection type is called the most significant type of the collection type.
  • Enumeration types: Tenum ≡ {Enum id (id1,id2,…,idn) | {id,id1,id2,…,idn} ⊂ ID} ⊂ Tliteral
    The identifier id identifies the enumeration type, whereas (id1,id2,…,idn) represents the ordered sequence of identifiers that is described by the type.
  • Structured types: Tstruct ≡ {Struct id (id1 isr1 t1; id2 isr2 t2;…; idn isrn tn) | ({id,id1,id2,…,idn} ⊂ ID) ∧ [∀ 1 ≤ i ≤ n: (isri ∈ {ise,is,isv}) ∧ (ti ∈ Tliteral ∪ Treference)]} ⊂ Tliteral
    Hereby, id identifies the structured type, whereas (id1 isr1 t1; id2 isr2 t2;…; idn isrn tn) represents the components of the structured type. Each component idi isri ti, 1 ≤ i ≤ n, is a (generic) generalized constraint on a variable idi with associated type ti ∈ Tliteral ∪ Treference.
    • If isri = ise, the valid values of idi are restricted to the values of the domain domti of the associated type ti.
    • If isri = is, idi is interpreted as a disjunctive (possibilistic) variable, with valid values that are restricted as follows:
      • If ti ∉ Tcollect ∪ Tmulti_ref, the valid values are restricted to fuzzy sets that are defined over domain domti of the associated type ti.
      • If ti ∈ Tcollect ∪ Tmulti_ref, the valid values are restricted to collections of fuzzy sets that are defined over the domain domt′i of the most significant type t′i of type ti.
      The membership grades of all fuzzy sets are interpreted as degrees of possibility.
    • If isri = isv, idi is interpreted as a conjunctive (veristic) variable, with valid values that are restricted as follows:
      • If ti ∉ Tcollect ∪ Tmulti_ref, the valid values are restricted to fuzzy sets that are defined over domain domti of the associated type ti.
      • If ti ∈ Tcollect ∪ Tmulti_ref, the valid values are restricted to collections of fuzzy sets that are defined over the domain domt′i of the most significant type t′i of type ti.
      The membership grades of these fuzzy sets are interpreted as degrees of verity.
• The set Tobject is defined by the following:
  Let Vsignat denote the set of all valid operator signatures, which is defined as follows:
  • ∀ t' ∈ Tliteral ∪ Treference ∪ {Void}: Signat (( ) → t') ∈ Vsignat
  • ∀ t' ∈ Tliteral ∪ Treference ∪ {Void}, ∀ {id'1,id'2,…,id'p} ⊂ ID, ∀ isri ∈ {ise,is,isv}, ∀ t'i ∈ Tliteral ∪ Treference, 1 ≤ i ≤ p: Signat ((id'1 isr1 t'1; id'2 isr2 t'2;…; id'p isrp t'p) → t') ∈ Vsignat
  Hereby, Void denotes the void type, which is used in situations where a further type specification could not be given (Cattell & Barry, 2000). Furthermore, t' is the type of the returned value(s) of the operator, and id'i isri t'i, 1 ≤ i ≤ p, are the input parameters of the operator. Each input parameter is a (generic) generalized constraint on a variable id'i with associated type t'i ∈ Tliteral ∪ Treference. These generalized constraints are interpreted as specified previously.
  If id ∈ ID, {idˆ1,idˆ2,…,idˆm} ⊂ ID \ {id}, {id1,id2,…,idn} ⊂ ID, isri ∈ {ise,is,isv}, and si ∈ Tliteral ∪ Treference ∪ Vsignat, 1 ≤ i ≤ n, then:
  • Class id (id1 isr1 s1; id2 isr2 s2;…; idn isrn sn) ∈ Tobject
  • Class id : idˆ1, idˆ2,…, idˆm ( ) ∈ Tobject
  • Class id : idˆ1, idˆ2,…, idˆm (id1 isr1 s1; id2 isr2 s2;…; idn isrn sn) ∈ Tobject
The identifier id identifies the object type. Like many object models, the ODMG Object Model includes an inheritance-based type-subtype hierarchy. The identifiers idˆi, 1 ≤ i ≤ m denote the supertypes of the object type (if existent). The characteristics1 of the object type are represented by (id1 isr1 s1;id2 isr2 s2;…;idn isrn sn). Each characteristic idi isri s i, 1 ≤ i ≤ n is a (generic) generalized constraint on a variable idi with associated specification s i ∈ Tliteral ∪ T reference ∪ Vsignat. The semantics of the generalized constraints are the same as specified previously. If si ∈ Tliteral, the characteristic is called an attribute; if s i ∈ Treference, the characteristic is called a binary relationship; whereas if si ∈ Vsignat, the characteristic is a method. The generalized constraint puts a restriction on the return values of the operator. In addition to the characteristics stated in its type specification, an object type inherits the characteristics of its supertypes (if existent).
Then, the set T of all type expressions is defined by the following:

T ≡ {Void} ∪ Treference ∪ Tliteral ∪ Tobject

Furthermore, the full semantics of the types t ∈ T (cf. Definition 4) are defined by providing an appropriate definition for the set of domains Dt, the domain domt of the type, the set of operators Ot, and the set of axioms At. Below, some informal descriptions are given:
• Void type. The domain of the Void type is, by definition, {⊥Void}. Its corresponding set of operators is the singleton {⊥: → domVoid} consisting of the bottom operator ⊥, which always results in an undefined domain value (represented by the symbol ⊥Void).
• Reference types. The reference types are all generic types, designated by a type generator and an object type parameter. Reference types were introduced in order to formalize binary association relationships between object types. An association relationship between two object types has a “one-to-one,” a “one-to-many,” or a “many-to-many” cardinality, which denotes the maximum number of participating domain values of both types. To support the notion of cardinality, a distinction was made between single-valued and multivalued reference types. Multivalued reference types are subdivided into “set-of-references,” “bag-of-references,” and “list-of-references,” in order to formalize the different ODMG definitions of “one-to-many” and “many-to-many” relationships (Cattell & Barry, 2000).
  • Single-valued reference types are denoted by the type generator Ref and an object-type parameter t ∈ Tobject. The domain of the single-valued reference type Ref(t) consists of the “undefined” domain value ⊥Ref(t) and of references to regular elements (objects) of domt. The associated set of operators consists of the operators =, ≠, dereference, and ⊥. For example, with TPerson being the identifier of an object type that is used to represent information about persons, Ref(TPerson) is a single-valued reference type that allows reference to be made to a single-person object.
  • Multivalued reference types include “set-of-references,” “bag-of-references,” and “list-of-references,” and are denoted, respectively, by the type generators SetRef, BagRef, and ListRef and by an object-type parameter t ∈ Tobject. The domain of type SetRef(t) [resp. BagRef(t) and ListRef(t)] consists of the “undefined” domain value ⊥Set_Ref(t) [resp. ⊥Bag_Ref(t) and ⊥List_Ref(t)] and of sets (resp. bags and lists) of references to regular elements of domt. Furthermore, the types SetRef(t) [resp. BagRef(t) and ListRef(t)] have the same semantics as a corresponding collection type that should be defined over the single-valued reference type t (cf. description of collection types). For example, with TPerson being the identifier of an object type, SetRef(TPerson) is a multivalued reference type with domain values that are all sets of references to single-person objects.
• Basic types. The definition of the basic types is straightforward. Each basic type has a domain that consists of simple, noncomposite values. Its corresponding set of operators consists of the usual operators defined over its domain. For example, the domain of the Integer type consists of the integer numbers and of the “undefined” value ⊥Integer. The set of operators OInteger consists of the operators =, ≠, <, >, ≤, ≥, +, -, *, div, mod, and the bottom operator ⊥, which always results in an undefined domain value.
• Collection types. The collection types are all generic types, designated by a type generator and one or two type parameters, e.g., the bag types are denoted by the type generator Bag and a type parameter t. The domain of the bag type Bag(t) consists of the “undefined” domain value ⊥Bag(t) and of unordered collections of elements of the domain of type t, in which duplicates are allowed. The associated set of operators consists of =, ≠, cardinality, is_empty, count, +, ∪, ∩, \, is_element, and ⊥. For example, the collection type Set(Integer) is used to model sets of integer numbers, whereas the collection type Bag(Real) is used to model bags of real numbers.
• Enumeration types. The domain of an enumeration type Enum id (id1,id2,…,idn) consists of the “undefined” domain value ⊥id and of the identifiers id1,id2,…,idn. Its corresponding set of operators consists of =, ≠, <, >, ≤, ≥, and ⊥. For example, the enumeration type Enum TLang (French, Dutch, German) defines the set of enumeration constants {French, Dutch, German} and represents the official languages spoken by people in Belgium.
• Structured types. The domain of a structured type Struct id (id1 isr1 t1; id2 isr2 t2;…; idn isrn tn) contains the “undefined” domain value ⊥id. All other domain values are composite and consist of n values idi isr'i vi, with isr'i ∈ {ise,is}, i = 1,2,…,n. Each value in the composition is, in turn, described by a generalized constraint, for which the semantics are as follows:
  • If isr'i = ise, the value for component idi equals vi.
  • If isr'i = is, the value for component idi is uncertain and is described by possibility distribution vi.
For example, the structured type:

  Struct TCompany (
    Name ise String;
    #_Employees is Integer;
    Company_language isv TLang)

describes a simple representation for companies where the “Name ise String” component denotes the company’s name, the “#_Employees is Integer” component is used to model the number of people employed by the company, and the “Company_language isv TLang” component models the main language(s) used in the company.

By combining the generalized constraint of the type specification, denoted by the copula isri, with the generalized constraint of the domain value, denoted by the copula isr'i, we obtain the following interpretations for the values vi, i = 1,2,…,n:
• isri = ise
  • If isr'i = ise, then vi ∈ domti. The value of idi is crisply described. For example, the value “Name ise ‘My_company’” is a valid value for the “Name ise String” component of TCompany and denotes that the name of the represented company is certain and equals “My_company.”
  • If isr'i = is, then vi ∈ ℘~(domti), in which ℘~(domti) denotes the fuzzy power set of the domain domti of the associated type ti. The value of idi is uncertain. All candidate values are crisply described. For example, the value “Name is {(‘My_companyA’,1), (‘My_companyB’,0.4)}” is a valid value for the “Name ise String” component of TCompany. It denotes that the name of the represented company is uncertain and is represented by the possibility distribution equal to {(“My_companyA”,1), (“My_companyB”,0.4)}, which denotes that it is completely possible that the name of the company is “My_companyA,” and it is less possible that the name is “My_companyB.”
• isri = is
  • isr'i = ise
    • If ti ∉ Tcollect ∪ Tmulti_ref, then vi ∈ ℘~(domti). The value of idi is vague or imprecise. For example, the value “#_Employees ise About_2000,” where About_2000 is a possibility distribution defined over the set of integer values, is a valid value for the “#_Employees is Integer” component of TCompany and denotes that there are about 2000 employees in the considered company.
    • If ti ∈ Tcollect ∪ Tmulti_ref, then vi is a collection of vague or imprecise values, all specified by fuzzy sets over the domain domt'i of the most significant type t'i of ti. For example, consider a component “Ages_of_children is Set(Integer)” that is used to represent the ages of the children of a person. Then, “Ages_of_children ise Set(Around_6, Teenager)” might be the value for a person with two children, the youngest being around six years old, the other being a teenager.
  • isr'i = is
    • If ti ∉ Tcollect ∪ Tmulti_ref, then vi ∈ ℘~(℘~(domti)), in which ℘~(℘~(domti)) denotes the set of all Level 2 fuzzy sets that can be defined over domti (Gottwald, 1979). The value of idi is uncertain, which is described by the membership grades in the “outer-level” fuzzy set. Candidate values can be fuzzy or imprecise, which is described by the “inner-level” fuzzy sets (de Tré & de Caluwe, 2003a). For example, the value “#_Employees is {(About_2000,1), (About_4000,1)}” denotes that there are possibly about 2000 or possibly about 4000 employees in the considered company.
    • If ti ∈ Tcollect ∪ Tmulti_ref, then vi is uncertain and is a fuzzy set of collections of vague or imprecise values, which, in turn, are all specified by fuzzy sets over the domain domt'i of the most significant type t'i of ti. For example, the value “Ages_of_children is {(Set(Around_6,Teenager),1), (Set(Around_6,Around_22),0.4)}” denotes that the youngest child is around 6 years old, but the other child is either a teenager or, less possibly, around 22 years old.
• isri = isv
  • isr'i = ise
    • If ti ∉ Tcollect ∪ Tmulti_ref, then vi ∈ ℘~(domti). The value of idi is veristic. For example, a value “Company_language ise {(Dutch,1),(French,0.6)}” for the “Company_language isv TLang” component of TCompany denotes that the main languages used in the company are Dutch and French, of which Dutch is mostly used.
    • If ti ∈ Tcollect ∪ Tmulti_ref, then vi is a collection of veristic values, all specified by fuzzy sets over the domain domt'i of the most significant type t'i of ti.
  • isr'i = is
    • If ti ∉ Tcollect ∪ Tmulti_ref, then vi ∈ ℘~(℘~(domti)), in which ℘~(℘~(domti)) denotes the set of all Level 2 fuzzy sets that can be defined over domti. The value of idi is uncertain, which is described by the membership grades in the “outer-level” fuzzy set. Candidate values are veristic, which is described by the “inner-level” fuzzy sets. For example, a value “Company_language is {({(Dutch,1),(French,0.6)},1), ({(German,1)},0.2)}” denotes that it is uncertain whether the main languages of the company are Dutch and French (in which case, Dutch is mostly used) or German.
    • If ti ∈ Tcollect ∪ Tmulti_ref, then vi is uncertain and is a fuzzy set of collections of veristic values, which, in turn, are specified by fuzzy sets over the domain domt'i of the most significant type t'i of ti.
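The case analysis above, which combines the copula of the type specification (isri) with the copula of the stored value (isr'i), can be summarized as a lookup. The Python sketch below is a hedged restatement of that analysis only; the function name value_kind, its parameters, and the returned descriptions are hypothetical, and for isri = ise the text makes no distinction between single-valued and multivalued types, so the flag is ignored in that branch.

```python
TYPE_COPULAS = {"ise", "is", "isv"}  # copulas allowed in a type specification
VALUE_COPULAS = {"ise", "is"}        # copulas allowed in a domain value


def value_kind(type_copula: str, value_copula: str,
               multi_valued: bool = False) -> str:
    """Describe the shape of v_i for a component 'id_i isr_i t_i'.

    multi_valued is True when t_i is a collection type or a
    multivalued reference type.
    """
    if type_copula not in TYPE_COPULAS:
        raise ValueError(f"unknown type copula: {type_copula}")
    if value_copula not in VALUE_COPULAS:
        raise ValueError(f"unknown value copula: {value_copula}")

    if type_copula == "ise":
        if value_copula == "ise":
            return "crisp domain value"
        return "possibility distribution over crisp domain values"

    # 'is' gives possibilistic inner fuzzy sets, 'isv' gives veristic ones.
    inner = "possibilistic" if type_copula == "is" else "veristic"
    if multi_valued:
        if value_copula == "ise":
            return f"collection of {inner} fuzzy sets"
        return f"fuzzy set of collections of {inner} fuzzy sets"
    if value_copula == "ise":
        return f"{inner} fuzzy set"
    return f"level-2 fuzzy set of {inner} fuzzy sets"
```

For instance, the value “#_Employees is {(About_2000,1), (About_4000,1)}” for the “#_Employees is Integer” component corresponds to value_kind("is", "is"), a Level 2 fuzzy set whose outer grades express uncertainty.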
The associated set of operators consists of =, ≠, . (period member operator), set_component, get_component, and ⊥. In order to deal with values that are represented by fuzzy sets or Level 2 fuzzy sets, the operators of the sets Oti, i = 1,2,…,n, are extended with the following:

  • Operators that are extensions of the original operators in Oti and are obtained by applying Zadeh’s extension principle (Zadeh, 1975) one time (for fuzzy sets) or two consecutive times (for Level 2 fuzzy sets) (de Tré & de Caluwe, 2003a). Due to this principle, almost every classical mathematical concept and structure based on (binary) logic and set theory can be “fuzzified.” Consider the ordinary sets U1,U2,…,Un and Y and a mapping R from U1 × U2 × … × Un to Y. The extension principle of Zadeh defines the “extended” mapping R~ of R as:

$$\tilde{R}: \tilde{\wp}(U_1) \times \tilde{\wp}(U_2) \times \dots \times \tilde{\wp}(U_n) \to \tilde{\wp}(Y), \qquad (\tilde{V}_1, \tilde{V}_2, \dots, \tilde{V}_n) \mapsto \tilde{R}(\tilde{V}_1, \tilde{V}_2, \dots, \tilde{V}_n)$$

    with R~(V1~, V2~,…, Vn~) being defined as:

$$\tilde{R}(\tilde{V}_1, \tilde{V}_2, \dots, \tilde{V}_n): Y \to [0,1], \qquad y \mapsto \sup_{R(x_1, x_2, \dots, x_n) = y} \min\bigl(\mu_{\tilde{V}_1}(x_1), \mu_{\tilde{V}_2}(x_2), \dots, \mu_{\tilde{V}_n}(x_n)\bigr)$$
In the type system, “fuzzified” operators are defined using polymorphism and operator overloading, which allows a different meaning to be assigned to operators in different contexts. Operators then vary depending on whether their parameters are ordinary values, fuzzy sets, or Level 2 fuzzy sets.
  • Operators intended for the handling of fuzzy sets and of Level 2 fuzzy sets. Examples include the operators =, ∪, ∩, co, normalize, support, core, α-cut, α−-cut, and µ (where µ(F,x) returns the membership grade of element x within fuzzy set F). Each other operator preserves its usual semantics.
• Object types. The object types are the most elaborated types of the type system. Each object type is characterized by a number of properties (which describe its structure) and a number of explicitly defined operators, also called methods (which describe its behavior). As specified in Definition 4, a property is either an attribute or a binary relationship. In order to define the binary relationships between object types, a partial association relation ↔ is defined over the set Tobject. (id1 ↔ id2 denotes that “object type id1 is binary related to object type id2.”) An object type can inherit properties and methods from its parent types (Taivalsaari, 1996). In order to define the inheritance-based type-subtype relationships between object types, a partial ordering relation < is defined over the set Tobject. (idˆ < id denotes that “object type id inherits all characteristics of object type idˆ.”) The domain of an object type id contains the “undefined” domain value ⊥id and the undefined domain values ⊥idˆ of the parent types idˆ of type id. Each other domain value is composite and contains a value idi isr'i vi, with isr'i ∈ {ise,is}, for each of the (inherited) properties idi isri si, si ∈ Tliteral ∪ Treference, of the type. Each value in the composition is, in turn, described by a generalized constraint, for which the semantics are the same as that explained with the structured types. The set of operators associated with a given object type is the union of a set of implicitly defined operators and a set of explicitly defined operators. The implicitly defined operators are =, ≠, . (period member operator), set_property, get_property, and ⊥. The explicitly defined operators are the (inherited) methods idi isri si, si ∈ Vsignat, of the object type.
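Zadeh's extension principle, used above to obtain the “fuzzified” operators, can be sketched for fuzzy sets with finite support, where the supremum reduces to a maximum. The function name extend and the dict-based encoding of fuzzy sets (element to membership grade) are assumptions of this sketch, and the fuzzy numbers about_2 and about_3 are invented sample data.

```python
from itertools import product


def extend(R, *fuzzy_sets):
    """Apply the extension principle to a crisp mapping R.

    Each fuzzy set is a dict mapping an element to its membership grade
    (finite support only). The result maps each image y to the maximum,
    over all argument tuples with R(x1, ..., xn) == y, of the minimum of
    the argument grades. At least one fuzzy set must be supplied.
    """
    result = {}
    for combo in product(*(fs.items() for fs in fuzzy_sets)):
        xs = [x for x, _ in combo]
        grade = min(g for _, g in combo)
        y = R(*xs)
        result[y] = max(result.get(y, 0.0), grade)
    return result


# Invented sample data: two discrete fuzzy numbers.
about_2 = {1: 0.5, 2: 1.0, 3: 0.5}
about_3 = {2: 0.5, 3: 1.0, 4: 0.5}

# "Fuzzified" integer addition.
total = extend(lambda a, b: a + b, about_2, about_3)
```

Applying extend a second time, with the inner fuzzy sets themselves playing the role of elements, would correspond to the “two consecutive times” case for Level 2 fuzzy sets; that would require a hashable encoding of the inner sets and is left out of this sketch.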
The type system TS, which defines all the valid types supported by the presented database model, is defined by the following definition.
Definition 5 (Type system): The type system TS is defined by the quadruple:

TS ≡ [ID,T,↔,<]

where
• ID is the set of the valid identifiers
• T is the set of valid types (cf. Definition 4)
• ↔: Tobject × Tobject → {True,False} is the partial relation, which is used to define the binary association relationships between object types
• <: Tobject × Tobject → {True,False} is the partial ordering relation, which is used to define the inheritance-based type-subtype relationships between object types
Example 1: The type system allows for definitions like the following, which are intended to describe a (simplified) type representing employees. With the structured types

• Struct TAddress (Street ise String; City ise String)
• Struct TCompany (Name ise String; Location ise String)
• Struct TWorks (Company ise TCompany; Percentage is Real)

and the enumeration type

  Enum TLang (French, Dutch, German)

the object types TPerson and TEmployee can be defined by:

  Class TPerson (
    Name ise String;
    Age is Integer;
    Address ise TAddress;
    Languages isv TLang;
    Children ise SetRef (TPerson);
    Add_child ise Signat ((New_child ise TPerson) → Void)
  )

and

  Class TEmployee:TPerson (
    EmployeeID ise String;
    Works_for ise Bag(TWorks)
  )
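The object types of Example 1, together with the inheritance relation of Definition 5, can be mirrored in a small Python sketch. Everything structural here is an assumption made for illustration: the classes ClassType and TypeSystem, the representation of a characteristic as an (identifier, copula, type expression) triple with the type expression kept as plain text, and the choice to make is_subtype reflexive.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (identifier, copula, type expression kept as plain text)
Characteristic = Tuple[str, str, str]


@dataclass
class ClassType:
    name: str
    supertypes: List[str] = field(default_factory=list)
    characteristics: List[Characteristic] = field(default_factory=list)


class TypeSystem:
    """Holds object types and answers inheritance questions."""

    def __init__(self) -> None:
        self.classes: Dict[str, ClassType] = {}

    def add(self, cls: ClassType) -> None:
        for ident, copula, _ in cls.characteristics:
            if copula not in ("ise", "is", "isv"):
                raise ValueError(f"{ident}: invalid copula {copula}")
        for sup in cls.supertypes:
            if sup not in self.classes:
                raise KeyError(f"unknown supertype {sup}")
        self.classes[cls.name] = cls

    def is_subtype(self, sub: str, sup: str) -> bool:
        """True if sub equals sup or inherits from it, directly or not."""
        if sub == sup:
            return True
        return any(self.is_subtype(p, sup)
                   for p in self.classes[sub].supertypes)

    def all_characteristics(self, name: str) -> List[Characteristic]:
        """Inherited characteristics first, then the type's own."""
        cls = self.classes[name]
        inherited = [c for p in cls.supertypes
                     for c in self.all_characteristics(p)]
        return inherited + cls.characteristics


ts = TypeSystem()
ts.add(ClassType("TPerson", [], [
    ("Name", "ise", "String"),
    ("Age", "is", "Integer"),
    ("Address", "ise", "TAddress"),
    ("Languages", "isv", "TLang"),
    ("Children", "ise", "SetRef(TPerson)"),
    ("Add_child", "ise", "Signat((New_child ise TPerson) -> Void)"),
]))
ts.add(ClassType("TEmployee", ["TPerson"], [
    ("EmployeeID", "ise", "String"),
    ("Works_for", "ise", "Bag(TWorks)"),
]))

employee_fields = [ident for ident, _, _ in ts.all_characteristics("TEmployee")]
```

In this encoding, TEmployee ends up with the six characteristics of TPerson followed by its own two, which reflects the statement that an object type inherits the characteristics of its supertypes.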
Instances of Types

The instances of a reference type, a literal type, and an object type are, respectively, called reference instances, literals, and objects, whereas the Void type cannot have instances.

Definition 6 (Reference instance): Every reference instance r is defined as a pair [t,v], where t ∈ Treference and v ∈ domt.

Definition 7 (Literal): Every literal l is defined as a pair [t,v], where t ∈ Tliteral and v ∈ domt.

Depending on its lifetime, an object can be either transient or persistent.

Definition 8 (Transient object): A transient object o is defined as a triple [t,v,t~*(“o is an instance of t”)] in which:
• t ∈ Tobject is the type of the object
• v ∈ domt is the state of the object
• t~*(“o is an instance of t”) is the EPTV that expresses the truth value of the proposition “o is an instance of object type t”

Definition 9 (Persistent object): A persistent object o is defined as a quintuple [oid,N,t,v,t~*(“o is an instance of t”)] in which:
• t ∈ Tobject is the type of the object
• v ∈ domt is the state of the object
• oid is a unique object identifier
• N is a (finite) set of object names
• t~*(“o is an instance of t”) is the EPTV that expresses the truth value of the proposition “o is an instance of object type t”
The unicity of the object identifier has to be guaranteed over the whole database. The object identifier oid is used to refer to the (state of the) object. The set of
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
22 de Tré & de Caluwe
object names N can be empty. The set of all the instances of an object type t ∈ T_object is written as V_t^instance. If t is a subtype of another object type t̂, then V_t^instance ⊆ V_t̂^instance. The extent of an object type t is written as V_t^extent and is defined as the set of all the persistent instances of t within a particular database. Obviously, V_t^extent ⊆ V_t^instance. If t is a subtype of another object type t̂, then V_t^extent ⊆ V_t̂^extent.
Example 2: The instances of the object type TPerson of Example 1 are either TPerson objects or TEmployee objects (because TEmployee is a subtype of TPerson). Examples of persistent TPerson objects are as follows:
[oid1, { }, TPerson, (
Name ise “Ann”;
Age ise Around_14;
Address ise (Street ise “Cross Street, 12”; City ise “Ghent”);
Languages is {({(Dutch,1)},1), ({(Dutch,1),(French,0.4)},0.8)};
Children ise Set( ) ), {(T,1)}]
and
[oid2, { }, TPerson, (
Name ise “Tom”;
Age is {(Around_16,1), ({(19,1)},1)};
Address ise (Street ise “Cross Street, 12”; City ise “Ghent”);
Languages ise {(Dutch,1),(French,0.5),(German,0.7)};
Children ise Set( ) ), {(T,1)}]
An example of a persistent TEmployee object is as follows:
[oid3, { }, TEmployee, (
Name ise “Joe”;
Age ise {(42,1)};
Address ise (Street ise “Cross Street, 12”; City ise “Ghent”);
Languages ise {(Dutch,1),(German,0.8),(French,1)};
Children ise Set(oid1,oid2);
A Constraint Based Fuzzy Object Oriented Database Model 23
EmployeeID ise “ID25”;
Works_for ise Bag((
Company ise (Name ise “XYZ”; Location ise “Brussels”);
Percentage ise {(100,1)})) ), {(T,1)}]
Constraints and Constraint System
Constraints can be formally seen as relations that must be satisfied. With respect to database systems, constraints are considered to be an important and adequate means with which to define the semantics of the database (Kuper, Libkin, & Paredaens, 2000; de Tré & de Caluwe, 2000). For example, if information about persons is handled, constraints can be used to define the full semantics of the valid (domain) values for a person’s age, height, and weight. Other constraints can define the valid transitions for a person’s salary (e.g., to specify that a salary cannot decrease) or specify another integrity rule. An instance then belongs to the database insofar as it satisfies all of its defining constraints.
Constraints can also be used to impose selection criteria for information retrieval. In this case, every constraint defines a condition for the instances to belong to the result of the retrieval. Every instance belongs to the result insofar as it satisfies all the imposed criteria. For example, if someone wants to retrieve all the persons who are around 20 years old and who live in Paris, two constraints can be imposed: a constraint that selects all the persons around 20 years old and a constraint that selects all the persons living in Paris.
Definition of (Specific) Constraints
In order to give a complete definition of a constraint, it is necessary to provide the rules that define its syntax, as well as the rules that define its semantics.
Definition 10 (Constraint): Each constraint supported by the constraint system is defined for a set of objects V^instance and is fully specified by its syntax and its semantics.
• The syntax of a constraint. The syntax rules for a constraint can be formally described by means of some mathematical expressions.
• The semantics of a constraint. The semantic definition of a constraint c is fully determined by a logical function of the following form:
c: V^instance → ℘̃(I*): o ↦ t̃*(“o satisfies c”)
that associates an EPTV t̃*(“o satisfies c”) with each o ∈ V^instance. The extra truth value ⊥ (of EPTVs) is used to model the cases where constraint c does not (completely) apply to object o (cf. Definition 2).
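Read operationally, Definition 10 says that a constraint is a function from objects to EPTVs. The sketch below illustrates this under stated assumptions: an EPTV is encoded as a dictionary over the keys "T", "F", and "bot" (for ⊥), object states are dictionaries, and `age_in_range` is a hypothetical crisp constraint invented for the illustration.

```python
from typing import Any, Callable, Dict

# Assumption: EPTVs are dicts over "T", "F", "bot"; missing keys mean grade 0.
EPTV = Dict[str, float]
State = Dict[str, Any]
Constraint = Callable[[State], EPTV]


def age_in_range(state: State) -> EPTV:
    """Hypothetical crisp constraint: 0 <= Age <= 120.

    Returns {(bot, 1)} when the constraint does not apply, i.e. when the
    object has no Age property at all.
    """
    if "Age" not in state:
        return {"bot": 1.0}
    return {"T": 1.0} if 0 <= state["Age"] <= 120 else {"F": 1.0}


checks = [age_in_range({"Age": 42}),
          age_in_range({"Age": -1}),
          age_in_range({"Name": "XYZ"})]
```

A fuzzy constraint would return intermediate grades instead of the three crisp outcomes shown here.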
Definition of the Constraint System
In order to define the constraints supported by the presented database model, a constraint system was built. Different kinds of constraints are distinguished. A first distinction is based on whether a constraint is defined for the instances of one single object type or not (single-type dependent versus multitype dependent). A second distinction is based on whether or not the entire extent of an object type is involved in the evaluation of the constraint. All the constraints supported by the constraint system are formally defined as specified in Definition 10. Their syntax rules are defined as follows.
Definition 11 (Constraints: syntax rules): Let ID denote the set of valid identifiers, and let the constraint expressions that satisfy the syntax of the four distinguished categories be denoted, respectively, as C_is, C_es, C_im, and C_em, where:
• The set C_is consists of single-type dependent constraints that are not defined with respect to the entire extent of an object type and is defined as follows:
  • “Not null” constraints: If id ∈ ID is a path expression² that denotes a property or component³ of an object type, then c{id}not_null [ ] ∈ C_is
  • Certainty constraints: If id ∈ ID is a path expression that denotes a property or component of an object type, then c{id}certain [ ] ∈ C_is
  • Value constraints: If id ∈ ID is a path expression that denotes a property or component of an object type t, and e is a logical expression (resulting in an EPTV), without aggregation operators, that is defined over the properties and components of t and its associated types and expresses a restriction for the domain values of the property or component denoted by id, then c{id}value [e] ∈ C_is
  • Transition constraints: If id ∈ ID is a path expression that denotes a property or component of an object type t, and e is a logical expression, without aggregation operators, that is defined over the properties and components of t and its associated types and expresses a restriction for the transitions between old and new domain values of the property or component denoted by id (such transitions occur when the set_property or set_component operator is applied), then c{id}trans [e] ∈ C_is
  • Aggregate constraints: If t ∈ T_object, and e is a logical expression with at least one aggregation operator, that is defined over the properties and components of t and its associated types and expresses a restriction for the set of instances V_t^instance of t, then c{t}aggr [e] ∈ C_is
• The set C_es consists of single-type dependent constraints that are defined with respect to the entire extent of an object type, and it is defined as follows:
  • Key constraints: If t ∈ T_object and {id1,id2,…,idn} ⊂ ID is a finite set of identifiers of properties of t, then c{t}key [id1,id2,…,idn] ∈ C_es
• The set C_im consists of multitype dependent constraints that are not defined with respect to the entire extent of an object type, and it is defined as follows:
  • Value constraints: If U = {t1,t2,…,tn} ⊂ T_object, n > 1, id ∈ ID is a path expression that denotes a property or component of an object type t ∈ U, and e is a logical expression, without aggregation operators, that is defined over the properties and components of all types in U and expresses a restriction for the domain values of the property or component denoted by id, then c{id,U}value [e] ∈ C_im
  • Transition constraints: If U = {t1,t2,…,tn} ⊂ T_object, n > 1, id ∈ ID is a path expression that denotes a property or component of an object type t ∈ U, and e is a logical expression, without aggregation operators, that is defined over the properties and components of all types in U and expresses a restriction for the transitions between old and new domain values of the property or component denoted by id, then c{id,U}trans [e] ∈ C_im
  • Aggregate constraints: If U = {t1,t2,…,tn} ⊂ T_object, n > 1, t ∈ U, and e is a logical expression with at least one aggregation operator that is defined over the properties and components of all types in U and expresses a restriction for the set of instances V_t^instance of t, then c{t,U}aggr [e] ∈ C_im
• The set C_em consists of multitype dependent constraints that are defined with respect to the entire extent of an object type, and it is defined as follows:
  • Uniqueness constraints: If U ⊂ T_object and t ∈ U, then c{t,U}oid [ ] ∈ C_em and c{t,U}name [ ] ∈ C_em
  • Referential constraints: If id ∈ ID is a path expression, which denotes an association relationship of an object type t, then c{id}reference [ ] ∈ C_em. If there exists an “inverse” association relationship in the referenced object type t' and id' ∈ ID is the path expression, which denotes this relationship, then c{id,id'}reference [ ] ∈ C_em
Then the set C of all constraint expressions is defined by:
C ≡ C_is ∪ C_es ∪ C_im ∪ C_em
The full semantics of the constraints c ∈ C are defined by providing an appropriate definition for their corresponding logical function (cf. Definition 10). Below, informal descriptions are given:
• “Not null” constraints. A “not null” constraint c{id}not_null [ ] excludes the “undefined” value ⊥_t from the domain of the type t of the property or component, which is denoted by the path expression id.
• Certainty constraints. A certainty constraint c{id}certain [ ] prevents the use of the copula “is” in the allowed values for the property or component, which is denoted by the path expression id. This implies that all allowed values have to be described by a generalized constraint id ise v, which guarantees that no uncertainty exists about the value of property or component id.
• Value constraints. A value constraint c{id}value [e] or c{id,U}value [e] restricts the domain of the type t of the property or component that is denoted by the path expression id. This is done by excluding the domain values for which the expression e evaluates to the EPTV {(F,1)} (i.e., false).
• Transition constraints. A transition constraint c{id}trans [e] or c{id,U}trans [e] prevents the execution of an update of the value of the property or component that is denoted by the path expression id, in the cases where this update would result in an evaluation {(F,1)} (false) of the expression e.
• Key constraints. A key constraint is used to define a key, i.e., an irreducible set of one or more properties of an object type with value(s) that are used together to uniquely identify the persistent instances of the object type. A key constraint c{t}key [id1,id2,…,idn] defines a key for the object type t that consists of the properties identified by the identifiers id1,id2,…,idn. The constraint guarantees the (irreducibility of the) uniqueness of the values of these properties over the extent V_t^extent of type t. Furthermore, the constraint guarantees that none of these values is “undefined.”
• Aggregate constraints. An aggregate constraint c{t}aggr [e] or c{t,U}aggr [e] prevents the addition of a new instance to the set of instances V_t^instance of type t, in those cases where this addition would result in an evaluation {(F,1)} (false) of the expression e.
• Uniqueness constraints. A uniqueness constraint c{t,U}oid [ ] is used to guarantee the uniqueness of the object identifiers (oid) of the persistent instances of type t over the union of the extents of the types of set U. A uniqueness constraint c{t,U}name [ ] is used to guarantee the uniqueness of the object names (∈ N) of the instances of type t over the union of the extents of the types of set U.
• Referential constraints. Referential constraints are used to maintain the referential integrity of the (binary) association relationships between objects. A referential constraint c{id}reference [ ] guarantees that all object identifiers specified in a value of the relationship denoted by the path expression id exist (i.e., are identifiers of objects present in the database). A referential constraint c{id,id'}reference [ ] additionally guarantees that if an object with identifier oid refers to an object with identifier oid' via its value for the relationship id, then the object with identifier oid' inversely refers to the object with identifier oid via its value for the relationship id'.
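The key and referential constraints described above can be pictured as checks over an extent. The sketch below is a deliberately crisp simplification and rests on several assumptions: object states are dictionaries keyed by oid, `None` stands for the “undefined” value, the functions return a Boolean rather than an EPTV, and irreducibility of the key is not checked. The function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List

State = Dict[str, Any]


def key_holds(extent: Iterable[State], key_props: List[str]) -> bool:
    """Crisp check: key values are defined and unique over the extent."""
    seen = set()
    for state in extent:
        values = tuple(state.get(prop) for prop in key_props)
        if any(value is None for value in values):  # "undefined" key value
            return False
        if values in seen:                          # duplicate key value
            return False
        seen.add(values)
    return True


def references_exist(objects_by_oid: Dict[str, State], relationship: str) -> bool:
    """Crisp check: every oid used in the relationship is a known object."""
    known = set(objects_by_oid)
    return all(ref in known
               for state in objects_by_oid.values()
               for ref in state.get(relationship, ()))


# Simplified states of the persistent objects of Example 2.
persons = {
    "oid1": {"Name": "Ann", "Children": []},
    "oid2": {"Name": "Tom", "Children": []},
    "oid3": {"Name": "Joe", "Children": ["oid1", "oid2"]},
}
```

With these states, a key on Name and the referential constraint on Children both hold; adding a child oid that is not in `persons` would make `references_exist` fail.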
The constraint system CS, which defines all the valid constraints supported by the presented database model, is defined by the following:
Definition 12 (Constraint system): The constraint system CS is formally defined by the triple CS = [ID,E,C] where:
• ID is the set of valid identifiers
• E is the set of valid expressions
• C is the set of valid constraints (cf. Definition 11)
Example 3: With respect to the object types TPerson and TEmployee presented in Example 1, the following constraints can be considered:
• c1 = c{TEmployee.EmployeeID}not_null [ ]
• c2 = c{TPerson.Age}value [0 ≤ TPerson.Age ≤ around_120]
• c3 = c{TEmployee.Works_for.Percentage}value [0 ≤ TEmployee.Works_for.Percentage ≤ 100]
• c4 = c{TPerson}key [TPerson.Name]
• c5 = c{TPerson,{TPerson,TEmployee}}oid [ ]
• c6 = c{TPerson,{TPerson,TEmployee}}name [ ]
• c7 = c{TEmployee,{TPerson,TEmployee}}oid [ ]
• c8 = c{TEmployee,{TPerson,TEmployee}}name [ ]
• c9 = c{TPerson.Children}reference [ ]
Object Schemes and Database Schemes
The definitions of object scheme and database scheme rely on the definitions of types and constraints.
The Object Scheme and Its Instances
The full semantics of an object are described by its object scheme. This scheme “in fine” completely defines the object, now including the definitions of the specific constraints that apply to it.
Definition 13 (Object scheme): Every object scheme is a quadruple os = [id,t,M,C_t] in which:
• id ∈ ID represents the name of the object scheme
• t ∈ T_object is the type of the object scheme
• M represents the “meaning” of the object scheme. M is provided to add comments, which are usually described in a natural language.
• C_t ∈ ℘̃(C_is) is a normalized fuzzy set of constraints, which all have to be applied onto the objects of type t. The membership grades in C_t are interpreted as weights and denote the relative importance of the constraints with respect to the definition of the object scheme.
The set of all existing object schemes is denoted as OS and is defined as the union of the set of all the quadruples that satisfy Definition 13 and the singleton {⊥_OS}, with an element that represents an “undefined” object scheme.
An instance o of the object type t is defined to be an instance of the object scheme os = [id,t,M,C_t], if and only if it satisfies [with an EPTV that differs from {(F,1)}] all constraints in C_t and all constraints in the fuzzy sets C_t̂ of the object schemes [id̂,t̂,M̂,C_t̂] that were defined for the supertypes t̂ of t. In this way, inheritance has an impact on the specific constraints that have to be satisfied.
The set of all the instances of an object scheme os is denoted as V_os^instance, whereas the set of all the persistent instances of os is written as V_os^extent. Obviously, V_os^instance ⊆ V_t^instance and V_os^extent ⊆ V_t^extent.
Example 4: With the object types TPerson and TEmployee presented in Example 1 and the constraints c1,c2,…,c9 presented in Example 3, the following object schemes can be constructed:
OSPerson = [OSPerson,TPerson,“scheme to represent persons,”{(c2,1)}]
and
OSEmployee = [OSEmployee,TEmployee,“scheme to represent employees,”{(c1,1),(c3,0.7)}]
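The membership rule for object schemes, including the constraints inherited from supertypes, can be sketched as follows. This is an illustration under assumptions, not the model's definition: EPTVs are encoded as dictionaries over "T", "F", and "bot"; the weights are carried along but ignored in the membership test; `c1_not_null` and `c2_age` are crisp stand-ins for c1 and c2 of Example 3 (the fuzzy bound around_120 is replaced by 120); c3 is omitted; and all function and variable names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

EPTV = Dict[str, float]
State = Dict[str, Any]
Constraint = Callable[[State], EPTV]
FALSE: EPTV = {"F": 1.0}


def crisp(flag: bool) -> EPTV:
    """Encode a Boolean as the EPTV {(T,1)} or {(F,1)}."""
    return {"T": 1.0} if flag else {"F": 1.0}


def c1_not_null(state: State) -> EPTV:
    """Stand-in for c1: EmployeeID must not be undefined."""
    return crisp(state.get("EmployeeID") is not None)


def c2_age(state: State) -> EPTV:
    """Crisp stand-in for c2: 0 <= Age <= 120."""
    return crisp(0 <= state["Age"] <= 120)


def is_scheme_instance(state: State, type_name: str,
                       constraints_by_type: Dict[str, List[Tuple[Constraint, float]]],
                       supertypes: Dict[str, List[str]]) -> bool:
    """True if no own or inherited constraint evaluates to {(F,1)}."""
    for t in [type_name] + supertypes.get(type_name, []):
        for constraint, _weight in constraints_by_type.get(t, []):
            if constraint(state) == FALSE:
                return False
    return True


constraints_by_type = {"TPerson": [(c2_age, 1.0)], "TEmployee": [(c1_not_null, 1.0)]}
supertypes = {"TEmployee": ["TPerson"]}
joe = {"Name": "Joe", "Age": 42, "EmployeeID": "ID25"}
joe_ok = is_scheme_instance(joe, "TEmployee", constraints_by_type, supertypes)
```

Because TEmployee lists TPerson as a supertype, an employee state with an invalid age is rejected through the inherited constraint even though the employee scheme itself does not mention Age.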
The Database Scheme and Its Instances
A database scheme describes the full semantics of the objects stored in a database.
Definition 14 (Database scheme): Every database scheme ds is a quadruple ds = [id,D,M,C_D] in which:
• id ∈ ID is the name of the database scheme.
• D = {os1,os2,…,osn} ⊂ OS \ {⊥_OS} is a finite set of object schemes. Each object scheme in D has a different object type. If an object scheme os ∈ D is defined for an object type t, and t' is a supertype of t or t' is an object type for which a binary relationship with t has been defined, then an object scheme os' ∈ D has to be defined for t'.
• M denotes the “meaning” of the database scheme.
• C_D ∈ ℘̃(C_es ∪ C_im ∪ C_em) is a normalized fuzzy set of constraints that impose extra conditions on the instances of the object schemes of D. The membership grades in C_D are interpreted as weights and denote the relative importance of the constraints with respect to the definition of the database scheme. For every object scheme os ∈ D, uniqueness constraints exist in C_D that guarantee the uniqueness of the object identifiers and object names of the instances of os. Furthermore, every constraint c ∈ C_es ∪ C_em, for which µ_C_D(c) ≠ 0, has to be defined over the extent of the type t of an object scheme os ∈ D.
The set of all existing database schemes is denoted as DS and is defined as the union of the set of all the quadruples that satisfy Definition 14 and the singleton {⊥_DS}, with an element that represents an “undefined” database scheme.
Every persistent instance o of an object scheme os ∈ D of a database scheme ds has to satisfy all the constraints in C_D, with an EPTV that differs from {(F,1)}. An instance of a database scheme ds is called a database and is defined as the set of the extents of all the object schemes of ds. By this definition, every database is a set of sets of objects.
Example 5: With the object schemes OSPerson and OSEmployee of Example 4 and the constraints c1,c2,…,c9 presented in Example 3, the following database scheme can be constructed:
DSEmpl = [DSEmployee,{OSPerson,OSEmployee}, “scheme for an employee database,” {(c4,1),(c5,1),(c6,1),(c7,1),(c8,1),(c9,1)}]
By considering the object identifiers of the persistent objects of Example 2, the corresponding database can be represented by the following:
{{oid1,oid2,oid3}, {oid3}}
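The “set of sets” reading of a database can be reproduced with a few lines. In this sketch, which assumes that objects are identified by their oid only and that constraint checking has already happened, each object is placed in the extent of its own type and of every supertype; the function name `build_database` and the dictionary layout are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def build_database(type_of: Dict[str, str],
                   supertypes: Dict[str, List[str]],
                   scheme_types: List[str]) -> Set[FrozenSet[str]]:
    """Return the set of extents, one per object scheme type.

    An instance of a subtype also belongs to the extent of each supertype.
    """
    extents: Dict[str, Set[str]] = {t: set() for t in scheme_types}
    for oid, type_name in type_of.items():
        for t in [type_name] + supertypes.get(type_name, []):
            if t in extents:
                extents[t].add(oid)
    return {frozenset(extent) for extent in extents.values()}


# Persistent objects of Example 2, reduced to oid -> type.
type_of = {"oid1": "TPerson", "oid2": "TPerson", "oid3": "TEmployee"}
database = build_database(type_of, {"TEmployee": ["TPerson"]},
                          ["TPerson", "TEmployee"])
```

For the objects of Example 2 this gives the two extents {oid1, oid2, oid3} and {oid3}, matching the representation in Example 5.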
Database Model
The database model is finally obtained by extending the formalism with data definition (DDL) and data manipulation (DML) operators.
Data Definition Operators
For data definition purposes, the set of operators O_DDL^model was introduced.
Definition 15 (Data definition operators):
O_DDL^model = {create_DB, drop_DB, create_OS, drop_OS, add_Char, drop_Char, add_OSC, drop_OSC, add_DBC, drop_DBC}
All the operators of O_DDL^model operate on the set of all database schemes DS:
• The operators create_DB and drop_DB are meant to create and remove a database and its database scheme.
• The operators create_OS and drop_OS, respectively, allow an object scheme in a given database scheme to be created and an object scheme from a given database scheme to be removed.
• The operators add_Char and drop_Char are meant to add and drop a characteristic, i.e., a property or a method, in the object type of a given object scheme in a given database scheme.
• The operators add_OSC and drop_OSC are used to add and remove a weighted constraint to or from a given object scheme in a given database scheme.
• The operators add_DBC and drop_DBC are meant to add and remove a weighted constraint to or from a given database scheme.
Data Manipulation Operators
The data manipulation operators provide a facility for inserting, deleting, updating, and querying (database) objects. They operate on sets of instances associated with an object scheme and result in a new object scheme with a new associated set of instances. This way, every data manipulation operator can operate on the result of every data manipulation operator. This principle of “compositionality” guarantees the closure property of the algebra. The set of data manipulation operators is denoted as O_DML^model and is defined by Definition 16.
Definition 16 (Data manipulation operators):
O_DML^model = {∪, ∩, \, ⊗, Π, Θ, σ, τ, make_transient, make_persistent}
These operators act as follows:
• Union, intersection, and difference (∪, ∩, and \): The binary operators union, intersection, and difference are only defined for object schemes that are “scheme compatible”, i.e., object schemes as follows:
  • The types of both schemes have the same (inherited) characteristics and the associated fuzzy sets of constraints of both schemes are equal.
  • The types of both schemes are subtypes of a “common” ancestor type.
  • The type of one object scheme is a subtype of the type of the other object scheme.
With the “scheme-compatible” object schemes os1 = [id1,t1,M1,C_t1] and os2 = [id2,t2,M2,C_t2] as arguments, the operation ∪(os1,os2) [resp. ∩(os1,os2) and \(os1,os2)] results in a new object scheme:
∪(os1,os2) = os' = [id',t',M',∅]
where
  • The object type t' inherits all common characteristics of the types t1 and t2, i.e., t' inherits from the supertype or from the “common” ancestor type, and has no specific characteristics of its own.
  • The fuzzy set of specific constraints C_t' is empty, but, as a result of inheritance, all constraints that were defined for the inherited characteristics remain valid and must hold.
The set of all instances V_os'^instance of os' is constructed by preserving the objects for which the state v is in the union (resp. intersection and difference) of the sets of states of the instances of os1 and os2 and by calculating the associated EPTVs by applying the logical operators ∧̃, ∨̃, and ¬̃ for EPTVs (as presented in de Tré, 2002). The set of all the persistent instances of os' is defined to be empty, i.e., V_os'^extent = ∅.
• (Cartesian) product (⊗): With the object schemes os1 = [id1,t1,M1,C_t1] and os2 = [id2,t2,M2,C_t2], the binary (Cartesian) product operation ⊗(os1,os2) returns a new object scheme:
⊗(os1,os2) = os' = [id',t',M',C_t']
where
  • The object type t' is constructed by merging the (inherited) characteristics of the types t1 and t2 of the given object schemes.
  • The fuzzy set of specific constraints C_t' consists of all the single-type dependent constraints (with associated membership grades) that were defined for the characteristics of type t' and necessarily have to be an element of C_t1, C_t2, or C_t̂, with t̂ being an ancestor type of t1 or t2.
The set of all instances V_os'^instance is constructed by calculating the Cartesian product V_os1^instance ⊗ V_os2^instance and merging the states of the objects of the resulting pairs. The associated EPTVs are calculated by applying the logical conjunction operator ∧̃ for EPTVs. V_os'^extent = ∅
• Projection (Π): This operator is intended to select a number of characteristics from the (inherited) characteristics of the type of an object scheme and the (inherited) characteristics of the object types that are binary related to this type (via the partial association relation ↔). If {id1,id2,…,idn} ⊂ ID is the set of the identifiers of the selected characteristics of the type t of a given object scheme os = [id,t,M,C_t], then the operation Π(os,{id1,id2,…,idn}) results in a new object scheme:
Π(os,{id1,id2,…,idn}) = os' = [id',t',M',C_t']
where
  • The object type t' has as characteristics the characteristics identified by the identifiers {id1,id2,…,idn}.
  • The fuzzy set of specific constraints C_t' consists of the single-type dependent constraints (with associated membership grades) that were defined for the characteristics with identifiers id1,id2,…,idn and necessarily have to be an element of C_t or C_t'' with t'' being an ancestor type of t, a type that is binary related to t, or an ancestor type of a type that is binary related to t.
The set of all instances V_os'^instance is constructed by adapting the state of the objects of V_os^instance by keeping only the values for the selected characteristics. V_os'^extent = ∅
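The effect of projection on the set of instances can be sketched as follows. The sketch assumes that an instance is a pair of a state dictionary and an EPTV dictionary, and it only adapts the states; the derivation of the new type t' and of the constraint set C_t', as well as characteristics reached through binary related types, are left out. The function name `project` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

EPTV = Dict[str, float]
State = Dict[str, Any]
Instance = Tuple[State, EPTV]


def project(instances: List[Instance], selected: List[str]) -> List[Instance]:
    """Keep only the selected characteristics; EPTVs are left unchanged."""
    return [({name: state[name] for name in selected if name in state}, eptv)
            for state, eptv in instances]


joe = ({"Name": "Joe", "Age": 42, "EmployeeID": "ID25"}, {"T": 0.4, "F": 0.6})
result = project([joe], ["EmployeeID", "Name"])
```

The input instances are not modified; each projected state is a new dictionary.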
• Extension (Θ): This operator adds a “derived” property to the type of a given object scheme. “Derived” property values are calculated from the values of other properties and cannot be changed by the user. If os = [id,t,M,C_t] is the given object scheme, id isr s with isr ∈ {ise,is,isv} and s ∈ T_literal ∪ T_reference is the new property, and e ∈ E is the expression that will be evaluated to obtain the values of this property, then the operation Θ(os,id isr s,e) results in a new object scheme:
Θ(os,id isr s,e) = os' = [id',t',M',C_t']
where
  • Type t' is obtained by adding the extra property id isr s to the specification of type t of the object scheme os.
  • The fuzzy set of constraints C_t' = C_t.
Because values for “derived” properties are not stored in the database, the set of all instances V_os'^instance equals V_os^instance. V_os'^extent = ∅
• Restriction (σ): This operator allows extra restrictions to be imposed on the set of instances of an object scheme. This is obtained by extending the fuzzy set of constraints of the object scheme with an extra single-type dependent constraint c ∈ C_is, which has to be applied onto the objects of type t. For a given object scheme os = [id,t,M,C_t] and a given constraint c ∈ C_is with associated weight w, the operation σ(os,c,w) results in a new object scheme:
σ(os,c,w) = os' = [id',t',M',C_t']
where
  • The object type t' = t.
  • The fuzzy set of constraints C_t' = C_t ∪ {(c,w)} is obtained as the union of the fuzzy sets C_t and {(c,w)}.
The set of all instances V_os'^instance consists of all instances of V_os^instance for which the extra condition that is imposed by constraint c is satisfied [with an EPTV that differs from {(F,1)}]. V_os'^extent = ∅
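A rough sketch of restriction follows. It assumes that instances are pairs of a state dictionary and an EPTV dictionary, that the fuzzy set of constraints is a dictionary from constraint names to weights, and that the union of fuzzy sets is the standard maximum-based union (the text does not name the union operator, so this is an assumption). The sketch only filters instances; how the EPTV of a kept instance is combined with the EPTV of the new constraint is not modeled. The names `restrict` and `age_below_50` and the second test object are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

EPTV = Dict[str, float]
State = Dict[str, Any]
Instance = Tuple[State, EPTV]
FALSE: EPTV = {"F": 1.0}


def restrict(instances: List[Instance], constraints: Dict[str, float],
             name: str, constraint: Callable[[State], EPTV],
             weight: float) -> Tuple[List[Instance], Dict[str, float]]:
    """Add (constraint, weight) to the fuzzy set and drop violating instances."""
    new_constraints = dict(constraints)
    # Assumed max-based fuzzy union of C_t and {(c, w)}.
    new_constraints[name] = max(weight, constraints.get(name, 0.0))
    kept = [(state, eptv) for state, eptv in instances
            if constraint(state) != FALSE]
    return kept, new_constraints


def age_below_50(state: State) -> EPTV:
    """Hypothetical crisp constraint used only for this illustration."""
    return {"T": 1.0} if state["Age"] < 50 else {"F": 1.0}


employees = [({"Name": "Joe", "Age": 42}, {"T": 1.0}),
             ({"Name": "Sam", "Age": 61}, {"T": 1.0})]  # "Sam" is invented
kept, new_constraints = restrict(employees, {"c1": 1.0},
                                 "age_below_50", age_below_50, 0.8)
```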
• Threshold (τ): This operator is intended to restrict the set of instances of a given object scheme by applying a threshold value for each of the membership grades µ_t̃*(“o is an instance of t”)(T), µ_t̃*(“o is an instance of t”)(F), and µ_t̃*(“o is an instance of t”)(⊥) of the EPTVs t̃*(“o is an instance of t”) associated with the instances o of the (type of the) object scheme. For a given object scheme os = [id,t,M,C_t] and given threshold values τ_T, τ_F, and τ_⊥, the operation τ(os,τ_T,τ_F,τ_⊥) results in a new object scheme:
τ(os,τ_T,τ_F,τ_⊥) = os' = [id',t',M',C_t']
where
  • The object type t' = t.
  • The fuzzy set of constraints C_t' = C_t.
The set of all instances V_os'^instance consists of all instances o of V_os^instance for which the threshold restriction:
[µ_t̃*(“o is an instance of t”)(T) ≥ τ_T] ∧ [µ_t̃*(“o is an instance of t”)(F) ≤ τ_F] ∧ [µ_t̃*(“o is an instance of t”)(⊥) ≤ τ_⊥]
is satisfied. V_os'^extent = ∅
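The threshold restriction translates directly into a filter. The sketch assumes that an instance is a pair of a state dictionary and an EPTV dictionary over the keys "T", "F", and "bot" (for ⊥), with missing keys read as grade 0; the function name `threshold` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

EPTV = Dict[str, float]
Instance = Tuple[Dict[str, Any], EPTV]


def threshold(instances: List[Instance], tau_t: float,
              tau_f: float, tau_bot: float) -> List[Instance]:
    """Keep instances with grade(T) >= tau_t, grade(F) <= tau_f, grade(bot) <= tau_bot."""
    return [(state, eptv) for state, eptv in instances
            if eptv.get("T", 0.0) >= tau_t
            and eptv.get("F", 0.0) <= tau_f
            and eptv.get("bot", 0.0) <= tau_bot]


joe = ({"Name": "Joe"}, {"T": 0.4, "F": 0.6})
lenient = threshold([joe], 0.3, 0.7, 0.0)  # Joe passes all three tests
strict = threshold([joe], 0.5, 1.0, 1.0)   # Joe fails the test on T
```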
• The operators make_persistent and make_transient: By definition, all the instances of the resulting set of instances of the previous operators are transient. Therefore, the operator make_persistent and its counterpart make_transient were added in order to make transient objects (of a given object scheme) persistent, and vice versa.
Definition of the Database Model
Definition 17 (Database model): The database model DM is defined by the following:
DM = [TS, CS, OS, DS, O_DDL^model, O_DML^model]
in which:
• TS is the type system (Definition 5).
• CS is the constraint system (Definition 12).
• OS represents the set of all the object schemes.
• DS represents the set of all the database schemes.
• O_DDL^model is the set of data definition operators (Definition 15).
• O_DML^model is the set of data manipulation operators (Definition 16).
Illustrative Example
As an illustration of the flexible querying facilities of the presented database model, consider the database scheme DSEmpl as presented in Example 5.
Example 6: With the employee database with database scheme DSEmpl, consider the query: “Find the names and employee IDs of all young employees that are fluent in Dutch and French (the criterion ‘young’ is less important than the criterion ‘fluent in Dutch and French’).”
This query can be expressed by the following:
Π(σ(σ(OSEmployee, c{TEmployee.Age}value [TEmployee.Age is young], 0.8), c{TEmployee.Languages}value [µ(TEmployee.Languages, Dutch) is fluent ∧̃ µ(TEmployee.Languages, French) is fluent], 1), {EmployeeID,Name})
This results in a new object scheme:
OSResult = [OSResult,TResult,“Query result”,∅]
with
Class TResult ( EmployeeID ise String; Name ise String )
With the understanding that young is defined by the fuzzy set with membership function:
µ_young(x) = 1 if 0 ≤ x ≤ 30
µ_young(x) = −x/20 + 5/2 if 30 < x < 50
µ_young(x) = 0 if x ≥ 50
and fluent is defined by the fuzzy set with membership function
µ_fluent(x) = x if 0 ≤ x ≤ 1
the set of all instances V_OSResult^instance consists of all instances that satisfy the query conditions [with an EPTV that differs from {(F,1)}] and equals:
V_OSResult^instance = {[TResult,(EmployeeID ise “ID25”; Name ise “Joe”),{(T,0.4),(F,0.6)}]}
The EPTV {(T,0.4), (F,0.6)} was calculated as follows:
First, the degree of satisfaction of constraint c1 = c{TEmployee.Age}value [TEmployee.Age is young] is calculated. This is done by means of the following formula (as fully explained in de Tré & de Baets, 2003):
µ_t̃*(A is F)(T) = sup_{x ∈ dom_A} min(π_A(x), µ_F(x))
µ_t̃*(A is F)(F) = sup_{x ∈ dom_A \ {⊥}} min(π_A(x), 1 − µ_F(x))
µ_t̃*(A is F)(⊥) = min(π_A(⊥), 1 − µ_F(⊥))
where π_A is the possibility distribution representing the value of attribute A, and µ_F is the membership function representing the linguistic term F. Applying the previous function yields
t̃*(c1) = {(T,0.4), (F,0.6)}
Second, the degree of satisfaction of constraint c2 = c{TEmployee.Languages}value [µ(TEmployee.Languages, Dutch) is fluent ∧̃ µ(TEmployee.Languages, French) is fluent] is calculated by applying the same formula two times and calculating the conjunction of both resulting EPTVs, i.e.:
t̃*(µ(TEmployee.Languages,Dutch) is fluent) = {(T,1)}
t̃*(µ(TEmployee.Languages,French) is fluent) = {(T,1)}
so that
t̃*(c2) = {(T,1)} ∧̃ {(T,1)} = {(T,1)}
Next, the impact of the importance weights is calculated by applying the residual implicator f_im and co-implicator f_im^co functions (as fully explained in de Tré & de Baets, 2003), i.e.:
with $f_{im}$ being defined by

$$f_{im}: [0,1]^2 \to [0,1]: (w, \mu) \mapsto \sup \{v \mid v \in [0,1] \wedge \min(w, v) \le \mu\}$$

and $f_{im}^{co}$ being defined by

$$f_{im}^{co}: [0,1]^2 \to [0,1]: (w, \mu) \mapsto \inf \{v \mid v \in [0,1] \wedge \max(w, v) \ge \mu\}$$

the impact g of weight w on EPTV t is calculated by the following:

$$\mu_{g(w,t)}(T) = f_{im}(w, \mu_t(T))$$
$$\mu_{g(w,t)}(F) = f_{im}^{co}(1 - w, \mu_t(F))$$
$$\mu_{g(w,t)}(\bot) = f_{im}^{co}(1 - w, \mu_t(\bot))$$

and yields

$$g(0.8, \tilde{t}^{*}(c_1)) = \{(T, 0.4), (F, 0.6)\} \quad \text{and} \quad g(1, \tilde{t}^{*}(c_2)) = \{(T, 1)\}$$

so that the degree of satisfaction for all constraints imposed by the query yields

$$\{(T, 0.4), (F, 0.6)\} \mathbin{\tilde{\wedge}} \{(T, 1)\} = \{(T, 0.4), (F, 0.6)\}$$

Note that with the first criterion being much less important, e.g., with weight 0.2, this result would have been

$$\{(T, 1)\} \mathbin{\tilde{\wedge}} \{(T, 1)\} = \{(T, 1)\}$$

because then

$$g(0.2, \tilde{t}^{*}(c_1)) = \{(T, 1)\} \quad \text{and} \quad g(1, \tilde{t}^{*}(c_2)) = \{(T, 1)\}$$

Furthermore, $V^{extent}_{OSResult} = \emptyset$.
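The weight-impact calculation above can be checked with a small sketch. It is only an illustration under stated assumptions: an EPTV is represented as a Python dictionary over the keys "T", "F", and "⊥" (a representation chosen here, not taken from the chapter), and the supremum and infimum in the definitions of fim and fimco are replaced by the closed forms that follow directly from those definitions. The function names are hypothetical.

```python
def f_im(w, mu):
    # sup{v in [0,1] | min(w, v) <= mu}: every v qualifies when w <= mu,
    # otherwise only v <= mu does.
    return 1.0 if w <= mu else mu


def f_imco(w, mu):
    # inf{v in [0,1] | max(w, v) >= mu}: every v qualifies when w >= mu,
    # otherwise only v >= mu does.
    return 0.0 if w >= mu else mu


def weight_impact(w, t):
    # Impact g(w, t) of importance weight w on an EPTV t.
    # Missing keys are read as possibility 0 (an assumption of this sketch).
    return {
        "T": f_im(w, t.get("T", 0.0)),
        "F": f_imco(1 - w, t.get("F", 0.0)),
        "⊥": f_imco(1 - w, t.get("⊥", 0.0)),
    }


t_c1 = {"T": 0.4, "F": 0.6}
t_c2 = {"T": 1.0}
print(weight_impact(0.8, t_c1))  # weight 0.8 leaves the EPTV of c1 unchanged
print(weight_impact(0.2, t_c1))  # weight 0.2 relaxes it to full truth
print(weight_impact(1.0, t_c2))
```

With weight 0.8 the sketch reproduces {(T,0.4), (F,0.6)}, and with weight 0.2 it reproduces {(T,1)}, in line with the values given in the text.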
A Constraint Based Fuzzy Object Oriented Database Model 41
Conclusions and Future Trends
With the foregoing definitions, the fundamentals of a mathematical framework for the definition of a possibilistic, constraint-based object-oriented database model were presented. This framework is based on an algebraic type system and a related constraint system, which is meant to define the database semantics. Central to the proposed database model are the concepts of object schemes and database schemes. The proposed model is consistent with the ODMG data model (as far as its crisp components are concerned), which has proven to be very useful in unifying attempts to define object database models and query languages. The incorporation of constraints allows for a better definition of database semantics and opens new perspectives to extend the model toward other formalisms, e.g., supporting fuzzy spatio-temporal databases (de Tré, de Caluwe, Hallez, & Verstraete, 2002).
Typical for the presented approach is the integration and use of Zadeh’s generalized constraints and of logic based on extended possibilistic truth values. This allows for a general and extensible definition of the semantics and integrity of the data and of the query criteria. As generalized constraints, only the equality constraint, the possibilistic constraint, and the veristic constraint were integrated in the presented framework. In future research, the incorporation of other generalized constraints and the usability and applicability of Zadeh’s so-called “protoforms,” which can be seen as generalizations of generalized constraints, will be studied.
References
Alagić, S. (1997). The ODMG object model: Does it make sense? ACM SIGPLAN Notices, 32(10), 253–270.
Beaubouef, T., & Petry, F. E. (2002). Uncertainty in OODB modeled by rough sets. In Proceedings of the IPMU 2002 conference (Vol. III, pp. 1697–1703). Annecy, France.
Berzal, F., Marín, N., Pons, O., & Vila, M. A. (2003). FoodBi: Managing fuzzy object-oriented data on top of the Java platform. In Proceedings of the 10th IFSA World Congress (pp. 384–387). Istanbul, Turkey.
Blanco, I., Marín, N., Pons, O., & Vila, M. A. (2001). Softening the object-oriented database model: Imprecision, uncertainty and fuzzy types. In Proceedings of the IFSA/NAFIPS World Congress (pp. 2323–2328). Vancouver, Canada.
Bordogna, G., & Pasi, G. (Eds.). (2000). Recent issues on fuzzy databases. Heidelberg, Germany: Physica-Verlag.
Bordogna, G., Lucarella, D., & Pasi, G. (1994). A fuzzy object oriented data model. In Proceedings of the Third IEEE International Conference on Fuzzy Systems, FUZZ-IEEE’94 (pp. 313–318). Orlando, FL.
Bordogna, G., Pasi, G., & Lucarella, D. (1999). A fuzzy object-oriented data model for managing vague and uncertain information. International Journal of Intelligent Systems, 14(7), 623–651.
Bordogna, G., Leporati, A., Lucarella, D., & Pasi, G. (2000). The fuzzy object-oriented database management system. In G. Bordogna, & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 209–236). Heidelberg, Germany: Physica-Verlag.
Cattell, R. G. G., & Barry, D. (Eds.). (2000). The object data standard: ODMG 3.0. San Francisco, CA: Morgan Kaufmann Publishers.
de Cooman, G. (1995). Towards a possibilistic logic. In D. Ruan (Ed.), Fuzzy set theory and advanced mathematical applications (pp. 89–133). Boston, MA: Kluwer Academic Publishers.
de Cooman, G. (1999). From possibilistic information to Kleene’s strong multivalued logics. In D. Dubois, E. P. Klement, & H. Prade (Eds.), Fuzzy sets, logics and reasoning about knowledge (pp. 315–323). Boston, MA: Kluwer Academic Publishers.
de Tré, G. (2002). Extended possibilistic truth values. International Journal of Intelligent Systems, 17, 427–446.
de Tré, G., & de Baets, B. (2003). Aggregating constraint satisfaction degrees expressed by possibilistic truth values. IEEE Transactions on Fuzzy Systems, 11(3), 361–368.
de Tré, G., & de Caluwe, R. (2000). The application of generalized constraints to object-oriented database models. Mathware and Soft Computing, VII(2–3), 245–255.
de Tré, G., & de Caluwe, R. (2003). Modelling uncertainty in multimedia database systems: An extended possibilistic approach. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 5–22.
de Tré, G., & de Caluwe, R. (2003a). Level-2 fuzzy sets and their usefulness in object-oriented database modelling. Fuzzy Sets and Systems, 140, 29–49.
de Tré, G., de Caluwe, R., & Van der Cruyssen, B. (2000). A generalised object-oriented database model. In G. Bordogna, & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 155–182). Heidelberg, Germany: Physica-Verlag.
de Tré, G., de Caluwe, R., Hallez, A., & Verstraete, J. (2002). Fuzzy and uncertain spatio-temporal database models: A constraint-based approach. In Proceedings of the Ninth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU 2002 (pp. 1713–1720). Annecy, France.
Dubois, D., & Prade, H. (1988). Possibility theory. New York: Plenum Press.
Dubois, D., & Prade, H. (1997). The three semantics of fuzzy sets. Fuzzy Sets and Systems, 90(2), 141–150.
George, R. (1992). Uncertainty management issues in the object-oriented database model. Ph.D. thesis, Tulane University, New Orleans, LA.
George, R., Yazici, A., Petry, F. E., & Buckles, B. P. (1997). Modeling impreciseness and uncertainty in the object-oriented data model: A similarity-based approach. In R. de Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 63–95). Singapore: World Scientific.
Gottwald, S. (1979). Set theory for fuzzy sets of higher level. Fuzzy Sets and Systems, 2(2), 125–151.
Kim, W. (1994). Observations on the ODMG-93 proposal for an object-oriented database language. ACM SIGMOD Record, 23(1), 4–9.
Kuper, G., Libkin, L., & Paredaens, J. (Eds.). (2000). Constraint databases. Berlin, Germany: Springer-Verlag.
Lausen, G., & Vossen, G. (1998). Models and languages of object-oriented databases. Harlow, UK: Addison-Wesley.
Marín, N., Pons, O., & Vila, M. A. (2000). Fuzzy types: A new concept of type for managing vague structures. International Journal of Intelligent Systems, 15(11), 1061–1085.
Mouaddib, N., & Subtil, P. (1997). Management of uncertainty and vagueness in databases: The FIRMS point of view. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 5(4), 437–457.
Na, S., & Park, S. (1997). Fuzzy object-oriented data model and fuzzy association algebra. In R. de Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 187–206). Singapore: World Scientific.
Prade, H. (1982). Possibility sets, fuzzy sets and their relation to Lukasiewicz logic. In Proceedings of the 12th International Symposium on Multiple-Valued Logic (pp. 223–227).
Rescher, N. (1969). Many-valued logic. New York: McGraw-Hill.
Rocacher, D., & Connan, F. (1996). A fuzzy algebra for object oriented databases. In Proceedings of the Fourth European Congress on Intelligent Techniques and Soft Computing, EUFIT’96 (Vol. 2, pp. 871–876). Aachen, Germany.
Rossazza, J.-P. (1990). Utilisation de hiérarchies de classes floues pour la représentation de connaissances imprécises et sujettes à exception: le système “SORCIER.” Ph.D. thesis, Université Paul Sabatier, Toulouse, France.
Rossazza, J.-P., Dubois, D., & Prade, H. (1997). A hierarchical model of fuzzy classes. In R. de Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 21–61). Singapore: World Scientific.
Shaw, G. M., & Zdonik, S. B. (1990). A query algebra for object-oriented databases. In Proceedings of the Sixth International Conference on Data Engineering, ICDE’90 (pp. 154–162). Los Angeles, CA.
Taivalsaari, A. (1996). On the notion of inheritance. ACM Computing Surveys, 28(3), 438–479.
Tanaka, K., Kobayashi, S., & Sakanoue, T. (1991). Uncertainty management in object-oriented database systems. In D. Karagiannis (Ed.), Proceedings of the International Conference on Database and Expert System Applications, DEXA 1991 (pp. 251–256). Berlin, Germany: Springer-Verlag.
Van Gyseghem, N. (1998). Imprecision and uncertainty in the UFO database model. Journal of the American Society for Information Science, 49(3), 236–252.
Zadeh, L. A. (1968). Probability measures of fuzzy events. Journal of Mathematical Analysis and Applications, 23, 421–427.
Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (Parts I, II, and III). Information Sciences, 8, 199–251, 301–357; 9, 43–80.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3–28.
Zadeh, L. A. (1986). Outline of a computational approach to meaning and knowledge representation based on a concept of a generalized assignment statement. In M. Thoma, & A. Wyner (Eds.), Proceedings of the International Seminar on Artificial Intelligence and Man–Machine Systems (pp. 198–211). Heidelberg, Germany: Springer.
Zadeh, L. A. (1996). Fuzzy logic = Computing with words. IEEE Transactions on Fuzzy Systems, 4(2), 103–111.
Zadeh, L. A. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2), 111–127.
Zadeh, L. A. (1999). From computing with numbers to computing with words - from manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45, 105–119.
Zadeh, L. A. (2000). Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. Journal of Statistical Planning and Inference, 105, 233–264.
Endnotes
1 The term “characteristic” is used to denote the properties (structure), i.e., the attributes and relationships, and the operators (behavior) of the object type.
2 Path expressions are an adequate means with which to identify the components of a structured type or an object type. In this model, every path expression is defined as an identifier, which is obtained by applying the period member operator (.) an adequate number of times with the identifiers of (the components or characteristics of the) type as arguments.
3 Component of an object type is a short notation for component of a structured type that is associated with an (inherited) attribute of that object type.
46 Cao & Nguyen
Chapter II

Fuzzy and Probabilistic Object Bases
T. H. Cao, Ho Chi Minh City University of Technology, Vietnam
H. Nguyen, Ho Chi Minh City Open University, Vietnam

Abstract
Database systems have evolved from relational databases to those integrating different modeling and computing paradigms, in particular, object orientation and probabilistic reasoning. This chapter introduces an extension of the probabilistic object base model by Eiter et al. (2001), using fuzzy sets for representing and handling vague and imprecise values of object attributes. A probabilistic interpretation of relations on fuzzy set values is proposed to integrate them into that probability-based framework. Then, the definitions of fuzzy-probabilistic object base schemas, instances, and selection operation are presented. Other algebraic operations, namely, projection, renaming, Cartesian product, join, intersection, union, and difference of the probabilistic object base model are also adapted for its fuzzy extension.
Introduction
For modeling real-world problems and constructing intelligent systems, the integration of different methodologies and techniques has been the quest and focus of significant interdisciplinary research effort. The advantage of such a hybrid system is that the strengths of its components are combined and compensate for each other’s weaknesses. In particular, object orientation provides a hierarchical data abstraction scheme and an information hiding and inheritance mechanism. Meanwhile, probability theory and fuzzy logic provide measures and rules for representing and reasoning with uncertainty and imprecision in the real world. Many uncertain and fuzzy object-oriented models (e.g., George, Buckles, & Petry, 1993; Itzkovich & Hawkes, 1994; Rossazza, Dubois, & Prade, 1997; Van Gyseghem & De Caluwe, 1997; Bordogna, Pasi, & Lucarella, 1999; Dubitzky et al., 1999; Yazici & George, 1999; Blanco et al., 2001; Cross, 2003) were proposed and developed. However, only a few of them combine probability theory and fuzzy logic, in order to deal with both uncertainty and imprecision.
Early works on fuzzy extension of object-oriented models were done by George, Buckles, and Petry (1993) and Itzkovich and Hawkes (1994), which introduced inclusion degrees between classes in a hierarchy. An inclusion degree of one class to another could be computed on the basis of the fuzzy ranges of their common attributes. For example, Rossazza, Dubois, and Prade (1997) defined four inclusion degrees, depending on whether necessary ranges or typical ranges were used for each of the two classes. Arguing for flexible modeling, Van Gyseghem and De Caluwe (1997) introduced the notion of fuzzy property as an intermediate between the two extreme notions of required property and optional property. Each fuzzy property of a class was associated with possibility degrees of applicability of the property to the class. Meanwhile, Yazici and George (1999) presented a deductive fuzzy object-oriented model but did not address uncertain applicability of properties. A general data model including fuzzy attribute values as well as uncertain properties was proposed by Bordogna, Pasi, and Lucarella (1999), where the treatment of uncertainty was, however, based on possibility theory rather than on probability theory.
As a first attempt to integrate both probabilistic and fuzzy measures into an object-oriented model, Dubitzky et al. (1999) assumed that each property of a concept had a probability degree for it occurring in exemplars of that concept. However, the method therein for computing a membership degree of an object to a concept, based on matching the object’s properties with the uncertainly applicable properties of the concept, is in our view not justifiable. Also, the work
did not address the problem of how inheritance is performed under the membership and applicability uncertainty. Recently, Blanco et al. (2001) and De Tré (2001) sketched out general models to manage different sources of imprecision and uncertainty, including probabilistic ones, on various levels of an object-oriented database model. However, no foundation was laid to integrate probability theory and fuzzy logic, in case probability was used to represent uncertainty. Later, Cross (2003) reviewed existing proposals and presented recommendations for the application of fuzzy set theory in a flexible generalized object model.
Meanwhile, Cao (2001), Cao et al. (2002), and Cao and Rossiter (2003) introduced a logic-based fuzzy and probabilistic object-oriented model, which could represent and handle fuzzy attribute values as well as uncertain class properties. Mass assignment theory (Baldwin, Martin, & Pilsworth, 1995; Baldwin, Lawry, & Martin, 1996) was employed to compute with fuzzy sets and probabilities in an integrated framework. Nevertheless, the definition of class hierarchies in that model was crisp, that is, no uncertainty was considered on class links.
In another direction, Eiter et al. (2001) developed an algebra to handle object bases with uncertainty, called POBs, where the conditional probability for an object of a class belonging to one of its subclasses was specified in the class hierarchy of discourse. Also, for each attribute of an object, uncertainty about its value was represented by lower-bound and upper-bound probability distribution functions over a set of values. However, the major shortcoming of the POB model is that it does not allow vague and imprecise attribute values. For instance, in the Plant example therein, the values of the attribute sun are chosen to be only enumerated symbols, such as mild, medium, and heavy, without any interpretation. Meanwhile, in practice, those values are inherently vague and imprecise over degrees of sunlight.
Moreover, without an interpretation, they cannot be measured, and their probability distributions cannot be calculated. Because fuzzy set theory and fuzzy logic provide a basis for defining the semantics of, and computing with, linguistic terms (Zadeh, 1978), we apply them to extend the POB model to allow vague and imprecise attribute values. For instance, the values mild, medium, and heavy of the attribute sun in the aforementioned Plant example can be defined by fuzzy sets. Primary results of this extension were presented by Cao and Nguyen (2002). In this chapter, the second section presents fundamentals of probability and fuzzy set theories and, in particular, introduces a probabilistic interpretation of relations on fuzzy sets to integrate them into the probability-based framework of POBs. Then, the third, fourth, fifth, and sixth sections present a fuzzy extension and generalization of the definitions of POB schemas, instances, and algebraic
operations for fuzzy POBs (FPOBs). Finally, the last section concludes the chapter and suggests further work.
Fundamentals of Probabilities and Fuzzy Sets

Voting Model of Fuzzy Sets
In this work, for extending the probabilistic model of POBs with fuzzy set values, we apply the voting model interpretation of fuzzy sets (Gaines, 1978; Baldwin, Martin, & Pilsworth, 1995). That is, given a fuzzy set A on a domain U, each voter has a subset of U as his or her own crisp definition of the concept that A represents. For example, a voter may have the interval [0, 35] representing human ages from 0 to 35 years as his or her definition of the concept young, while another voter may have [0, 25] instead. The membership function value µA(u) is then the proportion of voters whose crisp definitions include u. This model defines a mass assignment (i.e., probability distribution) on the power set of U, where the mass (i.e., probability value) assigned to a subset of U is the proportion of voters who have that subset as a crisp definition for the fuzzy concept A. As such, this mass assignment corresponds to a family of probability distributions on U.
Let us take the Dice example given by Baldwin, Martin, and Pilsworth (1995). Given the dice values from the set {1, 2, 3, 4, 5, 6}, suppose that a score high is defined by the discrete fuzzy set {3:0.2, 4:0.5, 5:0.9, 6:1}, i.e., the membership of value 3 is 0.2, and so on. The voting pattern for a group of 10 persons for this score could be as shown in Table 1.

Table 1. Voting pattern for high dice values (rows: scores; columns: voters)

Score  P1  P2  P3  P4  P5  P6  P7  P8  P9  P10
1
2
3      x   x
4      x   x   x   x   x
5      x   x   x   x   x   x   x   x   x
6      x   x   x   x   x   x   x   x   x   x
That is, all voters, P1 to P10, vote for value 6 as a high score, while only two of them, P1 and P2, vote for 3 as a high score, and so on. In other words, the crisp definition of P10 for the high score is {6}, while that of P1 and P2 is {3, 4, 5, 6}, for instance. An assumption made in this voting model is that any person who accepts a value as a high score also accepts all values that have higher membership grades in the fuzzy set high. This model defines the following mass assignment (i.e., probability distribution) on the power set of {1, 2, 3, 4, 5, 6}:

{6}:0.1
{5, 6}:0.4
{4, 5, 6}:0.3
{3, 4, 5, 6}:0.2

where the mass (i.e., probability value) assigned to a subset of {1, 2, 3, 4, 5, 6} [e.g., $m_{high}(\{5, 6\}) = 0.4$] is the proportion of voters who have that subset as a crisp definition for the fuzzy concept high score. This mass assignment corresponds to a family of probability distributions on {1, 2, 3, 4, 5, 6}.
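As an illustration, the mass assignment of a discrete fuzzy set can be derived from its membership grades in a few lines. The following is a minimal sketch, assuming a normalized fuzzy set (maximum membership 1) stored as a Python dictionary from elements to grades; the function name and the rounding to ten decimals (used only to suppress floating-point noise) are choices of this sketch, not part of the chapter.

```python
def mass_assignment(fuzzy_set):
    # Distinct positive membership grades, from highest to lowest.
    levels = sorted({m for m in fuzzy_set.values() if m > 0}, reverse=True)
    masses = {}
    for i, level in enumerate(levels):
        lower = levels[i + 1] if i + 1 < len(levels) else 0.0
        # Crisp definition shared by the voters at this level:
        # all elements whose grade is at least the level.
        focal = frozenset(u for u, m in fuzzy_set.items() if m >= level)
        # Proportion of voters holding exactly this crisp definition.
        masses[focal] = round(level - lower, 10)
    return masses


high = {3: 0.2, 4: 0.5, 5: 0.9, 6: 1.0}
for subset, mass in mass_assignment(high).items():
    print(sorted(subset), mass)
```

For the fuzzy set high this yields the four subsets and masses listed above.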
Probabilistic Interpretation of Relations on Fuzzy Sets
On the basis of this voting model, we introduce a probabilistic interpretation of the following binary relations on fuzzy sets. We write Pr(E1 | E2) to denote the conditional probability of E1 given E2.

Definition 1. Let A be a fuzzy set on a domain U; B be a fuzzy set on a domain V; and θ be a binary relation from {=, ≤, <, ⊆, ∈} assumed to be valid on (U × V). The probabilistic interpretation of a relation A θ B, denoted by prob(A θ B), is a value in [0, 1] that is defined by

$$prob(A\ \theta\ B) = \sum_{S \subseteq U,\ T \subseteq V} \Pr(u\ \theta\ v \mid u \in S, v \in T) \cdot m_A(S) \cdot m_B(T)$$

Intuitively, given fuzzy propositions x ∈ A and y ∈ B, prob(A θ B) is the probability for x θ y being true. The rationale of the above probabilistic interpretation is that, for each crisp definition S of A and T of B, the conditional probability of u θ v given u ∈ S and v ∈ T is calculated and weighted by the product of the masses associated with S and T. Then prob(A θ B) is the sum of those weighted conditional probability values. Also, we define prob(A ≥ B) = prob(B ≤ A), prob(A > B) = prob(B < A), prob(A ⊇ B) = prob(B ⊆ A), and prob(A ∋ B) = prob(B ∈ A).

Example 1: In the Dice example above, suppose that about_5 is defined by the fuzzy set {6:0.3, 5:1, 4:0.3}, whose mass assignment is:
{5}:0.7
{4, 5, 6}:0.3

Given x ∈ about_5 and y ∈ high, prob(about_5 = high) measures how likely it is that x = y, as calculated below:

$$
\begin{aligned}
prob(about\_5 = high) ={} & \Pr(u = v \mid u \in \{5\}, v \in \{6\}) \cdot m_{about\_5}(\{5\}) \cdot m_{high}(\{6\}) \\
& + \Pr(u = v \mid u \in \{5\}, v \in \{5, 6\}) \cdot m_{about\_5}(\{5\}) \cdot m_{high}(\{5, 6\}) \\
& + \Pr(u = v \mid u \in \{5\}, v \in \{4, 5, 6\}) \cdot m_{about\_5}(\{5\}) \cdot m_{high}(\{4, 5, 6\}) \\
& + \Pr(u = v \mid u \in \{5\}, v \in \{3, 4, 5, 6\}) \cdot m_{about\_5}(\{5\}) \cdot m_{high}(\{3, 4, 5, 6\}) \\
& + \Pr(u = v \mid u \in \{4, 5, 6\}, v \in \{6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \cdot m_{high}(\{6\}) \\
& + \Pr(u = v \mid u \in \{4, 5, 6\}, v \in \{5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \cdot m_{high}(\{5, 6\}) \\
& + \Pr(u = v \mid u \in \{4, 5, 6\}, v \in \{4, 5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \cdot m_{high}(\{4, 5, 6\}) \\
& + \Pr(u = v \mid u \in \{4, 5, 6\}, v \in \{3, 4, 5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \cdot m_{high}(\{3, 4, 5, 6\}) \\
={} & 0.0 \times 0.7 \times 0.1 + \tfrac{1}{2} \times 0.7 \times 0.4 + \tfrac{1}{3} \times 0.7 \times 0.3 + \tfrac{1}{4} \times 0.7 \times 0.2 \\
& + \tfrac{1}{3} \times 0.3 \times 0.1 + \tfrac{1}{3} \times 0.3 \times 0.4 + \tfrac{1}{3} \times 0.3 \times 0.3 + \tfrac{1}{4} \times 0.3 \times 0.2 \\
={} & 0.34
\end{aligned}
$$

Definition 2. Let A and B be two fuzzy sets on a domain U. The probabilistic interpretation of the relation A → B, denoted by prob(A → B), is a value in [0, 1] that is defined by

$$prob(A \to B) = \sum_{S, T \subseteq U} \Pr(u \in T \mid u \in S) \cdot m_A(S) \cdot m_B(T)$$

The intuitive meaning of prob(A → B) is that it is the probability for x ∈ B being true given x ∈ A being true. In other words, it is the fuzzy conditional probability of x ∈ B given x ∈ A as defined by Baldwin, Martin, and Pilsworth (1995). We note that the above probabilistic interpretation can also be adapted for fuzzy sets on continuous domains, using integration instead of addition, as in the definition of fuzzy conditional probability (Baldwin, Lawry, & Martin, 1996) as follows:
$$prob(A \to B) = \int_0^1 \int_0^1 \frac{\Pr({}^{x}A \cap {}^{y}B)}{\Pr({}^{x}A)} \, dx \, dy = \int_0^1 \int_0^1 \frac{|{}^{x}A \cap {}^{y}B|}{|{}^{x}A|} \, dx \, dy$$

where ${}^{x}A$ and ${}^{y}B$ are α-cuts of the fuzzy sets A and B with α = x and α = y, respectively. We also define prob(A ← B) = prob(B → A).
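Example 2: In the Dice example, one has:

$$
\begin{aligned}
prob(high \to about\_5) ={} & \Pr(u \in \{5\} \mid u \in \{6\}) \cdot m_{high}(\{6\}) \cdot m_{about\_5}(\{5\}) \\
& + \Pr(u \in \{5\} \mid u \in \{5, 6\}) \cdot m_{high}(\{5, 6\}) \cdot m_{about\_5}(\{5\}) \\
& + \Pr(u \in \{5\} \mid u \in \{4, 5, 6\}) \cdot m_{high}(\{4, 5, 6\}) \cdot m_{about\_5}(\{5\}) \\
& + \Pr(u \in \{5\} \mid u \in \{3, 4, 5, 6\}) \cdot m_{high}(\{3, 4, 5, 6\}) \cdot m_{about\_5}(\{5\}) \\
& + \Pr(u \in \{4, 5, 6\} \mid u \in \{6\}) \cdot m_{high}(\{6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \\
& + \Pr(u \in \{4, 5, 6\} \mid u \in \{5, 6\}) \cdot m_{high}(\{5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \\
& + \Pr(u \in \{4, 5, 6\} \mid u \in \{4, 5, 6\}) \cdot m_{high}(\{4, 5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \\
& + \Pr(u \in \{4, 5, 6\} \mid u \in \{3, 4, 5, 6\}) \cdot m_{high}(\{3, 4, 5, 6\}) \cdot m_{about\_5}(\{4, 5, 6\}) \\
={} & 0.0 \times 0.1 \times 0.7 + \tfrac{1}{2} \times 0.4 \times 0.7 + \tfrac{1}{3} \times 0.3 \times 0.7 + \tfrac{1}{4} \times 0.2 \times 0.7 \\
& + 1.0 \times 0.1 \times 0.3 + 1.0 \times 0.4 \times 0.3 + 1.0 \times 0.3 \times 0.3 + \tfrac{3}{4} \times 0.2 \times 0.3 \\
={} & 0.53
\end{aligned}
$$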
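Both worked examples can be reproduced mechanically. The sketch below is an illustration only: it assumes, as the numerical values in Examples 1 and 2 suggest, that u and v are taken uniformly and independently from the crisp definitions S and T, so that each conditional probability reduces to counting. Masses are written as exact fractions to avoid rounding, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
import operator

# Mass assignments of the Dice example, as exact fractions.
M_HIGH = {
    frozenset({6}): Fraction(1, 10),
    frozenset({5, 6}): Fraction(4, 10),
    frozenset({4, 5, 6}): Fraction(3, 10),
    frozenset({3, 4, 5, 6}): Fraction(2, 10),
}
M_ABOUT_5 = {
    frozenset({5}): Fraction(7, 10),
    frozenset({4, 5, 6}): Fraction(3, 10),
}


def prob_relation(mass_a, mass_b, relation):
    # prob(A theta B): sum over crisp definitions S of A and T of B.
    total = Fraction(0)
    for (s, m_s), (t, m_t) in product(mass_a.items(), mass_b.items()):
        # Pr(u theta v | u in S, v in T) under the uniformity assumption.
        favourable = sum(1 for u in s for v in t if relation(u, v))
        total += Fraction(favourable, len(s) * len(t)) * m_s * m_t
    return total


def prob_implies(mass_a, mass_b):
    # prob(A -> B): Pr(u in T | u in S) is the share of S that lies in T.
    total = Fraction(0)
    for (s, m_s), (t, m_t) in product(mass_a.items(), mass_b.items()):
        total += Fraction(len(s & t), len(s)) * m_s * m_t
    return total


print(float(prob_relation(M_ABOUT_5, M_HIGH, operator.eq)))  # Example 1
print(float(prob_implies(M_HIGH, M_ABOUT_5)))                # Example 2
```

Passing `operator.le` or `operator.lt` instead of `operator.eq` gives the corresponding interpretation for ≤ and < under the same assumption.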
Probabilistic Combination Strategies
Given two events e1 and e2 having probabilities in the intervals [L1, U1] and [L2, U2], one may need to compute the probability intervals of the conjunction event e1 ∧ e2, disjunction event e1 ∨ e2, or difference event e1 ∧ ¬e2. In this chapter, we employ the conjunction, disjunction, and difference strategies given by Lakshmanan et al. (1997) and Eiter et al. (2001) as presented in Table 2, where ⊗, ⊕, and ⊖ denote the conjunction, disjunction, and difference operators, respectively.

Table 2. Examples of probabilistic combination strategies

Strategy: Ignorance
  ([L1, U1] ⊗ig [L2, U2]) = [max(0, L1 + L2 − 1), min(U1, U2)]
  ([L1, U1] ⊕ig [L2, U2]) = [max(L1, L2), min(1, U1 + U2)]
  ([L1, U1] ⊖ig [L2, U2]) = [max(0, L1 − U2), min(U1, 1 − L2)]

Strategy: Independence
  ([L1, U1] ⊗in [L2, U2]) = [L1.L2, U1.U2]
  ([L1, U1] ⊕in [L2, U2]) = [L1 + L2 − (L1.L2), U1 + U2 − (U1.U2)]
  ([L1, U1] ⊖in [L2, U2]) = [L1.(1 − U2), U1.(1 − L2)]

Strategy: Positive correlation (when e1 implies e2, or e2 implies e1)
  ([L1, U1] ⊗pc [L2, U2]) = [min(L1, L2), min(U1, U2)]
  ([L1, U1] ⊕pc [L2, U2]) = [max(L1, L2), max(U1, U2)]
  ([L1, U1] ⊖pc [L2, U2]) = [max(0, L1 − U2), max(0, U1 − L2)]

Strategy: Mutual exclusion (when e1 and e2 are mutually exclusive)
  ([L1, U1] ⊗me [L2, U2]) = [0, 0]
  ([L1, U1] ⊕me [L2, U2]) = [min(1, L1 + L2), min(1, U1 + U2)]
  ([L1, U1] ⊖me [L2, U2]) = [L1, min(U1, 1 − L2)]

FPOB Types and Schemas

Overview of FPOB Conceptual Model
An architecture of FPOB systems is illustrated in Figure 1, which is adapted and extended with fuzzy sets from that of POB systems. The user expresses declarative queries in an FPOB-calculus through a graphical user interface. Those queries are processed and converted into procedural queries in the FPOB algebra that this chapter is presenting. They will then be executed by an FPOB algebra execution engine, accessing data in an FPOB of discourse. All the components refer to a library consisting of the following:

• A set of probabilistic combination strategies for the user to express dependencies between events.
• A set of functions for the user to specify how probabilities are distributed over the domain of values of attributes.
• A set of fuzzy sets for the user to express vague and imprecise values of attributes.
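The probabilistic combination strategies of Table 2 translate directly into interval arithmetic. The following sketch is one possible encoding, not the library of the FPOB system: the function name `combine`, the operator keys ("and", "or", "diff"), and the strategy keys ("ig", "in", "pc", "me") are labels assumed here for illustration.

```python
def combine(op, strategy, e1, e2):
    # e1 and e2 are probability intervals (L, U) of two events.
    (l1, u1), (l2, u2) = e1, e2
    table = {
        # Ignorance: nothing is known about the dependency of the events.
        ("and", "ig"): (max(0, l1 + l2 - 1), min(u1, u2)),
        ("or", "ig"): (max(l1, l2), min(1, u1 + u2)),
        ("diff", "ig"): (max(0, l1 - u2), min(u1, 1 - l2)),
        # Independence.
        ("and", "in"): (l1 * l2, u1 * u2),
        ("or", "in"): (l1 + l2 - l1 * l2, u1 + u2 - u1 * u2),
        ("diff", "in"): (l1 * (1 - u2), u1 * (1 - l2)),
        # Positive correlation: one event implies the other.
        ("and", "pc"): (min(l1, l2), min(u1, u2)),
        ("or", "pc"): (max(l1, l2), max(u1, u2)),
        ("diff", "pc"): (max(0, l1 - u2), max(0, u1 - l2)),
        # Mutual exclusion.
        ("and", "me"): (0, 0),
        ("or", "me"): (min(1, l1 + l2), min(1, u1 + u2)),
        ("diff", "me"): (l1, min(u1, 1 - l2)),
    }
    return table[(op, strategy)]


e1, e2 = (0.5, 0.75), (0.25, 0.5)
for strategy in ("ig", "in", "pc", "me"):
    print(strategy, [combine(op, strategy, e1, e2) for op in ("and", "or", "diff")])
```

Building the whole table on every call keeps the sketch close to the layout of Table 2; a real implementation would more likely dispatch to one function per strategy.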
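Figure 1. Architecture of FPOB systems
[Figure: the user submits an FPOB calculus query through the GUI; the FPOB-Algebra Query Manager turns it into an FPOB algebra query for the FPOB-Algebra Execution Engine, which accesses the FPOB; the library holds probabilistic combination strategies, probabilistic distributions, and fuzzy sets.]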
Figure 2. An example FPOB class hierarchy
[Figure: hierarchy over the classes PLANTS, ANNUALS, PERENNIALS, VEGETABLES, HERBS, FLOWERS, ANNUALS_HERBS, and PERENNIALS_FLOWERS, with each link labeled by a probability.]
For FPOBs, we use the same definition of class hierarchy as that used for POBs. Figure 2 shows an example POB hierarchy of plants given by Eiter et al. (2001), which are classified as being either perennials or annuals and, alternatively, as being vegetables, herbs, or flowers. Those subclasses of a class that are connected to a d node are mutually disjoint (i.e., an object cannot belong to any two of them at the same time), and they form a cluster of that class. In this example, the class PLANTS has two clusters, namely, {ANNUALS, PERENNIALS} and {VEGETABLES, HERBS, FLOWERS}. The value in [0, 1] associated with the link between a class and one of its immediate subclasses represents the probability for an object of the class belonging to that subclass. For instance, the hierarchy says 60% of plants are annuals, while the rest (40%) are perennials. Also, ANNUALS_HERBS is a common subclass of ANNUALS and HERBS, where ANNUALS_HERBS constitute 40% and 80% of annuals and herbs, respectively.
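A class hierarchy of this kind can be held in two small dictionaries, one for the probability-labeled links and one for the clusters. The sketch below is illustrative only: it encodes just the link probabilities stated in the paragraph above, the names `LINKS`, `CLUSTERS`, `mutually_disjoint`, and `path_probability` are hypothetical, and multiplying probabilities along a chain of subclass links relies on the fact that each class on the chain is contained in the previous one.

```python
# (class, immediate subclass) -> probability that an object of the class
# belongs to the subclass. Only values stated in the text are listed.
LINKS = {
    ("PLANTS", "ANNUALS"): 0.6,
    ("PLANTS", "PERENNIALS"): 0.4,
    ("ANNUALS", "ANNUALS_HERBS"): 0.4,
    ("HERBS", "ANNUALS_HERBS"): 0.8,
}

# Clusters: sets of mutually disjoint immediate subclasses of a class.
CLUSTERS = {
    "PLANTS": [
        {"ANNUALS", "PERENNIALS"},
        {"VEGETABLES", "HERBS", "FLOWERS"},
    ],
}


def mutually_disjoint(cls, sub1, sub2):
    # Two distinct subclasses are disjoint when they share a cluster of cls.
    return sub1 != sub2 and any(
        sub1 in cluster and sub2 in cluster for cluster in CLUSTERS.get(cls, [])
    )


def path_probability(path):
    # Probability that an object of path[0] belongs to path[-1],
    # following the given chain of immediate-subclass links.
    p = 1.0
    for parent, child in zip(path, path[1:]):
        p *= LINKS[(parent, child)]
    return p


print(mutually_disjoint("PLANTS", "ANNUALS", "PERENNIALS"))
print(mutually_disjoint("PLANTS", "ANNUALS", "HERBS"))
print(path_probability(["PLANTS", "ANNUALS", "ANNUALS_HERBS"]))
```

Under these figures, roughly 24% of plants are annual herbs (0.6 × 0.4).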
FPOB Types and Values
As in the classical object-oriented model, each class in POBs is characterized by a number of attributes with values that are of particular types. For POBs, types can be atomic types, set types, or tuple types. For FPOBs, we extend the set types to be fuzzy set types as in the following definition.

Definition 3. Let A be a set of attributes and T be a set of atomic types. Then types are inductively defined as follows:
1. Every atomic type from T is a type.
2. If τ is a type, then {τ} is the fuzzy set type of τ.
3. If A1, A2, …, Ak are pairwise different attributes from A and τ1, τ2, …, τk are types, then τ = [A1: τ1, A2: τ2, …, Ak: τk] is the tuple type over {A1, A2, …, Ak}. One writes τ.Ai to denote τi, and A1, A2, …, Ak are called top-level attributes of τ.
Example 3: In the Plant example above, the attributes can be soil, sun, water, which describe the conditions for a plant to grow, and name, size, width, and height. Some atomic types can be integer, real, string, and soil-type. Some fuzzy set and tuple types can be {real}, [soil: soil-type, sun: {real}, water: integer], and [name: string, size: [height: integer, width: integer]].

Each type has a domain of its values as defined below (cf., Eiter et al., 2001).

Definition 4. Let every atomic type τ ∈ T be associated with a domain dom(τ). Then values are defined by induction as follows:
1. For every τ ∈ T, every v ∈ dom(τ) is a value of type τ.
2. For every τ ∈ T, every fuzzy set on dom(τ) is a value of type {τ}.
3. If A1, A2, …, Ak are pairwise different attributes from A and v1, v2, …, vk are values of types τ1, τ2, …, τk, then [A1: v1, A2: v2, …, Ak: vk] is a value of type [A1: τ1, A2: τ2, …, Ak: τk].

We recall that a crisp set A on a domain U can be considered as a special fuzzy set Af on U with membership defined by, for every x ∈ U, µAf(x) = 1 if x ∈ A and µAf(x) = 0 if x ∉ A. Also, every v ∈ U can be treated as a special fuzzy set vf on U with membership defined by, for every x ∈ U, µvf(x) = 1 if x = v and µvf(x) = 0 if x ≠ v.

Example 4: In the Plant example, let soil-type be an enumerated type such that dom(soil-type) = {loamy, swampy, sandy}, and mild, medium, and heavy are linguistic labels of fuzzy sets on dom(real) as shown in Figure 3, with membership functions as follows:
\[
\mathit{mild}(x) =
\begin{cases}
1 & \text{if } x \in [0, 5] \\
0.2(5 - x) + 1 & \text{if } x \in (5, 10) \\
0 & \text{otherwise}
\end{cases}
\]

\[
\mathit{medium}(x) =
\begin{cases}
0.2(x - 10) + 1 & \text{if } x \in [5, 10) \\
1 & \text{if } x \in [10, 15) \\
0.2(15 - x) + 1 & \text{if } x \in [15, 20] \\
0 & \text{otherwise}
\end{cases}
\]

\[
\mathit{heavy}(x) =
\begin{cases}
0.2(x - 20) + 1 & \text{if } x \in [15, 20) \\
1 & \text{if } x \in [20, 25] \\
0 & \text{otherwise}
\end{cases}
\]
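The three membership functions of Example 4 can be transcribed directly into code. The following is a minimal sketch; the function names and the helper `crisp_as_fuzzy` (which mirrors the embedding of a crisp set as a special fuzzy set recalled above) are illustrative choices, not part of the FPOB model itself.

```python
def mild(x: float) -> float:
    """Membership degree of x in the fuzzy set 'mild' (Example 4)."""
    if 0 <= x <= 5:
        return 1.0
    if 5 < x < 10:
        return 0.2 * (5 - x) + 1
    return 0.0


def medium(x: float) -> float:
    """Membership degree of x in the fuzzy set 'medium' (Example 4)."""
    if 5 <= x < 10:
        return 0.2 * (x - 10) + 1
    if 10 <= x < 15:
        return 1.0
    if 15 <= x <= 20:
        return 0.2 * (15 - x) + 1
    return 0.0


def heavy(x: float) -> float:
    """Membership degree of x in the fuzzy set 'heavy' (Example 4)."""
    if 15 <= x < 20:
        return 0.2 * (x - 20) + 1
    if 20 <= x <= 25:
        return 1.0
    return 0.0


def crisp_as_fuzzy(crisp_set):
    """View a crisp set as a fuzzy set: membership 1 inside, 0 outside."""
    return lambda x: 1.0 if x in crisp_set else 0.0
```

For instance, `mild(7.5)` and `medium(7.5)` both evaluate to 0.5, which is the crossing point of the two labels over the interval (5, 10).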
Then, [soil: swampy, sun: mild, water: 3] is a value of the type [soil: soil-type, sun: {real}, water: integer].

In POBs, for each attribute of an object there can be uncertainty about its value measured by lower-bound and upper-bound probability distribution functions over a set of values. For FPOBs, we adapt the definition of probabilistic tuple values for POBs to represent that uncertain information for fuzzy set values as well.

Definition 5. Let A1, A2, …, Ak be pairwise different attributes from A and, for each i from 1 to k, Vi be a finite set of values of type τi, and αi, βi be probability distribution functions over Vi. Then ptv = [A1: 〈V1, α1, β1〉, A2: 〈V2, α2, β2〉, …, Ak: 〈Vk, αk, βk〉] is a fuzzy-probabilistic tuple value of type [A1: τ1, A2: τ2, …, Ak: τk] over {A1, A2, …, Ak}. One writes ptv.Ai to denote 〈Vi, αi, βi〉.

Example 5: Assume we know that the soil type of a thyme plant is loamy. However, we are not sure whether the plant is French thyme, Silver thyme, or Wooly thyme, with the same probability between 0.2 and 0.6 for each category.
Figure 3. Fuzzy set values of sunlight (membership functions of mild, medium, and heavy plotted against sunlight degrees)
Then this information can be represented by the fuzzy-probabilistic tuple value [soil: 〈{loamy}, u, u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]. Here, “u” represents the uniform distribution function, and “0.6u” and “1.8u” denote the distribution functions α(x) = 0.6 × 1/3 = 0.2 and β(x) = 1.8 × 1/3 = 0.6 for every x from {french, silver, wooly}.
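The arithmetic behind the "0.6u" and "1.8u" notation can be sketched as follows. The representation of a triple 〈V, α, β〉 as a Python tuple of a value list and two dictionaries is an assumption made for illustration only; exact fractions are used so that the bounds come out as the rational numbers 1/5 and 3/5.

```python
from fractions import Fraction


def scaled_uniform(values, scale):
    """Return the function x -> scale * 1/|values| as a dictionary.

    `scale` is given as a string such as "0.6" so that it converts
    to an exact fraction.
    """
    weight = Fraction(scale) / len(values)
    return {v: weight for v in values}


def triple(values, lower_scale, upper_scale):
    """Build a triple (V, alpha, beta) in the spirit of Definition 5."""
    values = list(values)
    return (values,
            scaled_uniform(values, lower_scale),
            scaled_uniform(values, upper_scale))


# Fuzzy-probabilistic tuple value of Example 5.
thyme = {
    "soil": triple(["loamy"], "1", "1"),
    "category": triple(["french", "silver", "wooly"], "0.6", "1.8"),
}
```

Here `thyme["category"][1]["french"]` is `Fraction(1, 5)` and `thyme["category"][2]["french"]` is `Fraction(3, 5)`, that is, the bounds 0.2 and 0.6 of the example.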
FPOB Schemas

FPOB schemas are now defined the same as POB schemas, as follows:

Definition 6. An FPOB schema is a quintuple (C, τ, ⇒, me, p), where:
1. C is a finite set of classes.
2. τ maps each class c to a tuple type τ(c) representing the attributes and their types of that class.
3. ⇒ is a binary relation on C such that (C, ⇒) is a directed acyclic graph, whereby each edge c1 ⇒ c2 means c1 is an immediate subclass of c2.
4. me maps each class c ∈ C to a partition of the set of all immediate subclasses of c, such that the classes in each cluster of the partition me(c) are mutually disjoint.
5. p maps each edge c1 ⇒ c2 in (C, ⇒) to a rational number p(c1 | c2) in [0, 1] measuring the conditional probability that an object picked uniformly at random from c2 belongs to c1.
Given c1 ⇒ c2 ⇒ … ⇒ ck, one can write c1 ⇒* ck, and, in particular, c ⇒* c for every c ∈ C.

Example 6: An FPOB schema for the Plant example above may be defined as follows: C = {PLANTS, ANNUALS, PERENNIALS, VEGETABLES, HERBS, FLOWERS, ANNUALS_HERBS, PERENNIALS_FLOWERS}. τ is given as in Table 3 (cf., Eiter et al., 2001). (C, ⇒), me, and p are given as in Figure 1.
Table 3. Type assignment of the plant example

| c | τ(c) |
|---|---|
| PLANTS | [name: string, soil: soil-type, water: integer] |
| ANNUALS | [name: string, soil: soil-type, water: integer, sun: {real}] |
| PERENNIALS | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer] |
| VEGETABLES | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer] |
| HERBS | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer, category: string] |
| FLOWERS | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer, category: string] |
| ANNUALS_HERBS | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer, category: string] |
| PERENNIALS_FLOWERS | [name: string, soil: soil-type, water: integer, sun: {real}, exp-years: integer, category: string] |
An FPOB schema as defined above may be inconsistent when there is no set of objects that satisfies its class hierarchy and probability assignment. It is consistent if and only if it has a taxonomic and probabilistic model as in the following definition adapted from that of POBs.

Definition 7. Let S = (C, τ, ⇒, me, p) be an FPOB schema. An interpretation of S is a mapping ε from C to the set of all finite subsets of a set O of object identifiers. It is said to be a model of S if and only if:
1. ε(c) ≠ ∅ for every c ∈ C.
2. ε(c) ⊆ ε(d) for all c, d ∈ C such that c ⇒ d.
3. ε(c) ∩ ε(d) = ∅ for all c, d ∈ C such that c and d belong to the same cluster defined by me.
4. |ε(c)| = p(c | d)·|ε(d)| for all c, d ∈ C such that c ⇒ d.
Table 4. A model of an FPOB schema

| c | ε(c) | \|ε(c)\| |
|---|---|---|
| PLANTS | O1 ∪ O2 ∪ … ∪ O10 | 800 |
| ANNUALS | O1 ∪ O2 ∪ O3 ∪ O4 ∪ O5 | 480 |
| PERENNIALS | O6 ∪ O7 ∪ O8 ∪ O9 ∪ O10 | 320 |
| VEGETABLES | O1 ∪ O9 | 160 |
| HERBS | O2 ∪ O5 ∪ O6 | 240 |
| FLOWERS | O3 ∪ O7 ∪ O10 | 320 |
| ANNUALS_HERBS | O5 | 192 |
| PERENNIALS_FLOWERS | O10 | 96 |
Example 7: As in an example given by Eiter et al. (2001), let S be the FPOB schema in Example 6 and O be a set of cardinality 800 partitioned into pairwise disjoint subsets O1, O2, …, O10 having cardinalities 90, 27, 126, 45, 192, 21, 98, 35, 70, and 96, respectively. Then ε given in Table 4 is a model of S.
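The four conditions of Definition 7 can be checked mechanically for the interpretation of Table 4. The sketch below does so under two stated assumptions, because Figure 1 is not reproduced in this section: the edge probabilities in `EDGES` are taken to be the cardinality ratios implied by Table 4 (for example 480/800 = 0.6 for ANNUALS ⇒ PLANTS), and the disjointness clusters in `CLUSTERS` are assumed to be {ANNUALS, PERENNIALS} and {VEGETABLES, HERBS, FLOWERS}. All identifiers are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations

# Build the ten pairwise disjoint blocks O1..O10 with the sizes of Example 7.
SIZES = [90, 27, 126, 45, 192, 21, 98, 35, 70, 96]
BLOCKS, start = {}, 0
for i, n in enumerate(SIZES, start=1):
    BLOCKS[i] = set(range(start, start + n))
    start += n


def union(*indices):
    """Union of the blocks O_i for the given indices."""
    return set().union(*(BLOCKS[i] for i in indices))


# Interpretation of Table 4.
EPS = {
    "PLANTS": union(*range(1, 11)),
    "ANNUALS": union(1, 2, 3, 4, 5),
    "PERENNIALS": union(6, 7, 8, 9, 10),
    "VEGETABLES": union(1, 9),
    "HERBS": union(2, 5, 6),
    "FLOWERS": union(3, 7, 10),
    "ANNUALS_HERBS": union(5),
    "PERENNIALS_FLOWERS": union(10),
}

# ASSUMPTION: (subclass, superclass) -> p(subclass | superclass).
EDGES = {
    ("ANNUALS", "PLANTS"): F(6, 10), ("PERENNIALS", "PLANTS"): F(4, 10),
    ("VEGETABLES", "PLANTS"): F(2, 10), ("HERBS", "PLANTS"): F(3, 10),
    ("FLOWERS", "PLANTS"): F(4, 10),
    ("ANNUALS_HERBS", "ANNUALS"): F(4, 10), ("ANNUALS_HERBS", "HERBS"): F(8, 10),
    ("PERENNIALS_FLOWERS", "PERENNIALS"): F(3, 10),
    ("PERENNIALS_FLOWERS", "FLOWERS"): F(3, 10),
}
# ASSUMPTION: clusters of mutually disjoint classes.
CLUSTERS = [{"ANNUALS", "PERENNIALS"}, {"VEGETABLES", "HERBS", "FLOWERS"}]


def is_model(eps, edges, clusters):
    """Check conditions 1 to 4 of Definition 7."""
    if any(not objects for objects in eps.values()):          # condition 1
        return False
    for (c, d), p in edges.items():
        if not eps[c] <= eps[d]:                               # condition 2
            return False
        if len(eps[c]) != p * len(eps[d]):                     # condition 4
            return False
    for cluster in clusters:                                   # condition 3
        for c, d in combinations(sorted(cluster), 2):
            if eps[c] & eps[d]:
                return False
    return True
```

With these inputs `is_model(EPS, EDGES, CLUSTERS)` holds, while adding the block O5 to ε(VEGETABLES) breaks both the disjointness and the cardinality conditions.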
FPOB Inheritance and Instances

FPOB Inherited Schemas

In Definition 6 of FPOB schemas, the attributes specified for a class are only the top-level attributes of that class, which do not include those attributes inherited from its superclasses. In practice, different inheritance strategies can be employed to resolve multiple inheritance (Bertino & Martino, 1993; Meyer, 1997; Cao, 2001). Given an FPOB schema S = (C, τ, ⇒, me, p), applying an inheritance strategy on S induces another FPOB schema S* = (C, τ*, ⇒, me, p), which differs from S only in the type assignment. Specifically, for each c ∈ C, τ*(c) = [A1: τ(d1).A1, A2: τ(d2).A2, …, Ak: τ(dk).Ak], where each di is either c or a proper superclass of c and, respectively, Ai is a top-level attribute of c or one of di, which c inherits. An FPOB schema S is said to be fully inherited if and only if S = S*. From now on, we assume that all FPOB schemas are consistent and fully inherited.
FPOB Instances

As for POBs, given an FPOB schema, an FPOB instance is defined as a base of objects associated with their classes and fuzzy-probabilistic tuple values in accordance with the schema. The following definition is adapted from that of POBs.

Definition 8. Let S = (C, τ, ⇒, me, p) be an FPOB schema and O be a set of object identifiers. An FPOB instance over S is a pair (π, ν) where:
1. π maps each c ∈ C to a finite subset of O such that, for different c1, c2 ∈ C, π(c1) ∩ π(c2) = ∅.
2. For each c ∈ C, ν maps each o ∈ π(c) to a fuzzy-probabilistic tuple value of type τ(c).
We note that, in the definition above, π(c) denotes only the set of the identifiers of the objects that are defined in the class c. Meanwhile, the set of the identifiers of all the objects that belong to c (i.e., those that are defined in c or its proper subclasses) is denoted by π*(c) = ∪{π(d) | d ∈ C and d ⇒* c}. Also, one writes π(C) to denote ∪{π(c) | c ∈ C}.

Example 8: An FPOB instance over the FPOB schema in Example 6 can be (π, ν), where π and π* are shown in Table 5 and ν in Table 6 (cf., Eiter et al., 2001).
Table 5. Object mappings π and π* of an FPOB instance

| c | π(c) | π*(c) |
|---|---|---|
| PLANTS | {o1} | {o1, o2, o3, o4, o5, o6, o7} |
| ANNUALS | {} | {o2, o3, o5, o6, o7} |
| PERENNIALS | {} | {o4} |
| VEGETABLES | {} | {} |
| HERBS | {} | {o2, o3, o5, o6, o7} |
| FLOWERS | {} | {o4} |
| ANNUALS_HERBS | {o2, o3, o5, o6, o7} | {o2, o3, o5, o6, o7} |
| PERENNIALS_FLOWERS | {o4} | {o4} |

Probabilistic Extents of Classes

In classical object bases, the extent of a class comprises all the objects that belong to that class. In POBs as well as FPOBs, the probabilistic extent of a class specifies the probability for each object belonging to that class. The following definition is adapted from that of POBs.

Definition 9. Let (π, ν) be an FPOB instance over an FPOB schema S = (C, τ, ⇒, me, p). Then, for each class c ∈ C, the probabilistic extent of c, denoted by ext(c), maps each o ∈ π(C) to a set of rational numbers in [0, 1] as follows:
1. If o ∈ π*(c) then ext(c)(o) = {1}.
2. If o ∈ π*(d) and ε(c) ∩ ε(d) = ∅ for every model ε of S, then ext(c)(o) = {0}.
3. Otherwise, ext(c)(o) = {p | p is the product of the edge probabilities on a path from c up to d where c ⇒* d with d being minimal and o ∈ π*(d)}.
Example 9: For the FPOB instance in Example 8, one has:

ext(ANNUALS_HERBS)(o1) = {0.24}
ext(ANNUALS_HERBS)(o2) = {1}
ext(PERENNIALS_FLOWERS)(o1) = {0.12}
ext(PERENNIALS_FLOWERS)(o2) = {0}

Intuitively, as compared with relational databases, a POB/FPOB schema corresponds to a relational schema, and each object of a POB/FPOB instance corresponds to a tuple. However, two important differences are that objects can have methods and identifiers (Garcia-Molina, Ullman, & Widom, 2000).
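The values of Example 9 can be reproduced with a small sketch of Definition 9. Several things here are assumptions and are flagged as such in the code: the edge probabilities and the disjointness clusters are not given in this section (they belong to Figure 1) and are taken from the cardinality ratios of Table 4; and the test "ε(c) ∩ ε(d) = ∅ for every model" is replaced by a sufficient syntactic check, namely that c and d have distinct ancestors in the same cluster. The object mapping is that of Table 5.

```python
from fractions import Fraction as F

# ASSUMPTION: (subclass, superclass) -> p(subclass | superclass).
EDGES = {
    ("ANNUALS", "PLANTS"): F(6, 10), ("PERENNIALS", "PLANTS"): F(4, 10),
    ("VEGETABLES", "PLANTS"): F(2, 10), ("HERBS", "PLANTS"): F(3, 10),
    ("FLOWERS", "PLANTS"): F(4, 10),
    ("ANNUALS_HERBS", "ANNUALS"): F(4, 10), ("ANNUALS_HERBS", "HERBS"): F(8, 10),
    ("PERENNIALS_FLOWERS", "PERENNIALS"): F(3, 10),
    ("PERENNIALS_FLOWERS", "FLOWERS"): F(3, 10),
}
# ASSUMPTION: clusters of mutually disjoint classes.
CLUSTERS = [{"ANNUALS", "PERENNIALS"}, {"VEGETABLES", "HERBS", "FLOWERS"}]
# Object mapping pi of Table 5 (classes with empty pi(c) are omitted).
PI = {
    "PLANTS": {"o1"},
    "ANNUALS_HERBS": {"o2", "o3", "o5", "o6", "o7"},
    "PERENNIALS_FLOWERS": {"o4"},
}
CLASSES = {c for edge in EDGES for c in edge}


def ancestors(c):
    """All classes d with c =>* d (reflexive and transitive)."""
    result = {c}
    for sub, sup in EDGES:
        if sub == c:
            result |= ancestors(sup)
    return result


def pi_star(c):
    """Objects defined in c or in any of its subclasses."""
    return set().union(*(PI.get(d, set()) for d in CLASSES if c in ancestors(d)))


def path_products(c, d):
    """Products of edge probabilities over all paths from c up to d."""
    if c == d:
        return {F(1)}
    return {p * q
            for (sub, sup), p in EDGES.items() if sub == c
            for q in path_products(sup, d)}


def provably_disjoint(c, d):
    """Sufficient check: c and d lie below distinct classes of one cluster."""
    return any(a != b and a in cluster and b in cluster
               for cluster in CLUSTERS
               for a in ancestors(c) for b in ancestors(d))


def ext(c, o):
    """Probabilistic extent ext(c)(o), following the three cases of Definition 9."""
    if o in pi_star(c):
        return {F(1)}
    home = next(d for d, objects in PI.items() if o in objects)
    if provably_disjoint(c, home):
        return {F(0)}
    candidates = [d for d in ancestors(c) if o in pi_star(d)]
    minimal = [d for d in candidates
               if not any(e != d and d in ancestors(e) for e in candidates)]
    return set().union(*(path_products(c, d) for d in minimal))
```

Both paths from ANNUALS_HERBS up to PLANTS give the same product (0.4 × 0.6 and 0.8 × 0.3), so `ext("ANNUALS_HERBS", "o1")` is the singleton `{Fraction(6, 25)}`, that is, {0.24}.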
Table 6. Value mapping ν of an FPOB instance

| oid | ν(oid) |
|---|---|
| o1 | [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{25,…, 30}, u, u〉] |
| o2 | [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, soil: 〈{loamy, sandy}, 0.7u, 1.3u〉, water: 〈{20,…, 30}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉] |
| o3 | [name: 〈{Mint}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉] |
| o4 | [name: 〈{Aster, Salvia}, u, u〉, soil: 〈{loamy, sandy}, 0.6u, 1.4u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉] |
| o5 | [name: 〈{Thyme}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3}, 0.8u, 1.2u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉] |
| o6 | [name: 〈{Mint}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.4u〉] |
| o7 | [name: 〈{Sage}, u, u〉, soil: 〈{sandy}, u, u〉, water: 〈{20, 21}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{red, tricolor}, 0.6u, 1.4u〉] |
FPOB Selection Operation

Selection Conditions

As for relational databases and object bases, selection is a basic operation for FPOBs. Intuitively, the result of a selection query on an FPOB instance I over an FPOB schema S is another FPOB instance I' over S such that the objects of the classes in I' and their attribute values satisfy the selection condition of the query. Before defining the FPOB selection operation, we present the formal syntax and semantics of selection conditions. We start with the syntax of path expressions and selection expressions. The following definition of path expressions is given by Eiter et al. (2001).

Definition 10. Given a type τ = [A1: τ1, A2: τ2, …, Ak: τk], path expressions are inductively defined for every i from 1 to k as follows:
1. Ai is a path expression for τ.
2. If Pi is a path expression for τi, then Ai.Pi is a path expression for τ.
Example 10: Given the types in Example 3, name, size.height, and size.width are path expressions for the type [name: string, size: [height: integer, width: integer]].

For selection expressions on FPOBs, we generalize the binary relations in selection expressions on POBs to the fuzzy ones, and add in the implication relation on fuzzy set values, as in the following definition.

Definition 11. Let S = (C, τ, ⇒, me, p) be an FPOB schema and X be a set of object variables. Then fuzzy selection expressions are inductively defined as having one of the following forms:
1. x ∈ c, where x ∈ X and c ∈ C.
2. x.P θ v, where x ∈ X, P is a path expression, θ is a binary relation from {=, ≠, ≤, ≥, <, >, ⊆, ⊇, ∈, ∋, →, ←}, and v is a value.
3. x.P1 =⊗ x.P2, where x ∈ X, P1 and P2 are path expressions, and ⊗ is a probabilistic conjunction strategy of combining the probabilities for x.P1 = v1 and x.P2 = v2 such that v1 = v2.
4. φ ⊗ ψ, where φ and ψ are selection expressions over the same object variable, and ⊗ is a probabilistic conjunction strategy of combining the probabilities for φ and ψ being true.
5. φ ⊕ ψ, where φ and ψ are selection expressions over the same object variable, and ⊕ is a probabilistic disjunction strategy of combining the probabilities for φ and ψ being true.
Those of the first three forms are called atomic fuzzy selection expressions. Different probabilistic conjunction and disjunction strategies are given by Eiter et al. (2001).

Example 11: In the Plant example above, the selection of “all objects that require a very mild sun” can be done using the atomic expression:

x.sun → very mild

where very mild is also a linguistic label of a fuzzy set on dom(real). Meanwhile, the selection of “all objects that require a very mild sun or over 21 units of daily water” can be expressed by the query:

x.sun → very mild ⊕ x.water > 21

Selection conditions are now defined as selection expressions to be satisfied with a probability in a given interval, as for POBs.

Definition 12. Fuzzy selection conditions are inductively defined as follows:
1. If φ is a fuzzy selection expression and [l, u] is a subinterval of [0, 1], then (φ)[l, u] is a fuzzy selection condition.
2. If α and β are fuzzy selection conditions, then ¬α, (α ∧ β), and (α ∨ β) are fuzzy selection conditions.
Example 12: In the Plant example, the selection of “all objects that require a very mild sun with a probability of at least 0.4 and over 21 units of daily water with a probability of at least 0.8” can be done using the following selection condition:

(x.sun → very mild)[0.4, 1] ∧ (x.water > 21)[0.8, 1]

Semantics of Selection Conditions

For defining the semantics of selection conditions, interpretations of path expressions and fuzzy selection expressions and conditions are introduced. First, we present the interpretation of path expressions given by Eiter et al. (2001).
Definition 13. Given a type τ = [A1: τ1, A2: τ2, …, Ak: τk] and a value v = [A1: v1, A2: v2, …, Ak: vk], the interpretation of a path expression P for τ under v, denoted by v.P, is inductively defined as follows:
1. If P = Ai, then v.P = vi.
2. If P = Ai.Pi where Pi is a path expression for τi, then v.P = vi.Pi.
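The inductive interpretation of Definition 13 amounts to following attribute names through nested tuple values. A minimal sketch, assuming tuple values are represented as nested Python dictionaries and path expressions as dot-separated strings (both representational choices made here for illustration):

```python
def interpret(value, path):
    """Return v.P for a tuple value `value` and a path expression `path`."""
    attribute, _, rest = path.partition(".")
    component = value[attribute]          # case P = Ai: v.P = vi
    if rest:                              # case P = Ai.Pi: v.P = vi.Pi
        return interpret(component, rest)
    return component


# A tuple value of type [name: string, size: [height: integer, width: integer]].
plant = {"name": "Thyme", "size": {"height": 4, "width": 12}}
```

For this value, `interpret(plant, "name")`, `interpret(plant, "size.height")`, and `interpret(plant, "size.width")` return "Thyme", 4, and 12, respectively.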
Example 13: In the Plant example, the interpretations of the path expressions name, size.height, and size.width under the value [name: Thyme, size: [height: 4, width: 12]] are the values Thyme, 4, and 12, respectively.

Definition 14. Let S = (C, τ, ⇒, me, p) be an FPOB schema, I = (π, ν) be an FPOB instance over S, and o ∈ π(C). The probabilistic interpretation with respect to S, I, and o, denoted by probS,I,o, is the partial mapping from the set of all fuzzy selection expressions to the set of all closed subintervals of [0, 1] that is inductively defined as follows:
1. probS,I,o(x ∈ c) = [min(ext(c)(o)), max(ext(c)(o))].
2. probS,I,o(x.P θ v) = [∑u∈V α(u)·prob(u.P' θ v), min(1, ∑u∈V β(u)·prob(u.P' θ v))], where P = A.P' and ν(o).A = 〈V, α, β〉.
3. probS,I,o(x.P1 =⊗ x.P2) = [∑u∈V α(u)·prob(u1.P1' = u2.P2'), min(1, ∑u∈V β(u)·prob(u1.P1' = u2.P2'))], where P1 = A1.P1', ν(o).A1 = 〈V1, α1, β1〉, P2 = A2.P2', ν(o).A2 = 〈V2, α2, β2〉, and [α(u), β(u)] = [α1(u1), β1(u1)] ⊗ [α2(u2), β2(u2)] for all u = (u1, u2) ∈ V = V1 × V2.
4. probS,I,o(φ ⊗ ψ) = probS,I,o(φ) ⊗ probS,I,o(ψ).
5. probS,I,o(φ ⊕ ψ) = probS,I,o(φ) ⊕ probS,I,o(ψ).
Intuitively, probS,I,o(x ∈ c) is the interval of the probability for o belonging to c, and probS,I,o(x.A.P' θ v) is the interval of the probability for the attribute A of o having a value u such that u.P' θ v. Also, probS,I,o(x.A1.P1' =⊗ x.A2.P2') is the interval of the probability for the attributes A1 and A2 of o (with mutual dependency reflected in the selected ⊗) having values u1 and u2, respectively, such that u1.P1' = u2.P2'. We note that P', P1', and P2' can be empty. Definition 14 is actually an extension of the probabilistic interpretation for POBs, where prob(u.P' θ v) and prob(u1.P1' = u2.P2') can have values only in {0, 1}, because attribute values are crisp. In the case of FPOBs, they are evaluated to values in [0, 1].
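The second case of Definition 14 can be sketched for an attribute whose path remainder P' is empty. In the snippet below, `degree(u)` stands for prob(u θ v): it is 0 or 1 for crisp values, as in POBs, and may be any number in [0, 1] for fuzzy set values. The function and variable names are illustrative, and the data are the water attribute of objects o2 and o4 in Table 6.

```python
from fractions import Fraction


def prob_atomic(values, alpha, beta, degree):
    """Interval [sum alpha(u)*degree(u), min(1, sum beta(u)*degree(u))]."""
    lower = sum(alpha(u) * degree(u) for u in values)
    upper = min(1, sum(beta(u) * degree(u) for u in values))
    return lower, upper


def uniform_over(values):
    """The uniform distribution u over a finite set of values."""
    weight = Fraction(1, len(values))
    return lambda _value: weight


def greater_than_21(u):
    """Crisp degree of truth of u > 21."""
    return 1 if u > 21 else 0


water_o2 = range(20, 31)   # {20, ..., 30}
water_o4 = range(20, 26)   # {20, ..., 25}
interval_o2 = prob_atomic(water_o2, uniform_over(water_o2),
                          uniform_over(water_o2), greater_than_21)
interval_o4 = prob_atomic(water_o4, uniform_over(water_o4),
                          uniform_over(water_o4), greater_than_21)
```

This gives [9/11, 9/11] for o2 and [2/3, 2/3] for o4, which round to the intervals [0.82, 0.82] and [0.67, 0.67].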
Example 14: For the FPOB instance in Example 8 and the fuzzy sets defining mild and medium as in Example 4, one has:

probS,I,o2(x ∈ ANNUALS_HERBS) = [1, 1]
probS,I,o2(x.water > 21) = [9/11, 9/11] = [0.82, 0.82]

Meanwhile:

probS,I,o2(x.sun → mild)
= [0.8 × u(mild) × prob(mild → mild) + 0.8 × u(medium) × prob(medium → mild), min(1, 1.2 × u(mild) × prob(mild → mild) + 1.2 × u(medium) × prob(medium → mild))]
= [0.8 × 1/2 × 0.903 + 0.8 × 1/2 × 0.068, min(1, 1.2 × 1/2 × 0.903 + 1.2 × 1/2 × 0.068)]
= [0.39, min(1, 0.59)] = [0.39, 0.59]

because:

\[
\mathit{prob}(\mathit{mild} \rightarrow \mathit{mild})
= \int_0^1\!\!\int_0^1 \frac{\left|{}^{x}\mathit{mild} \cap {}^{y}\mathit{mild}\right|}{\left|{}^{x}\mathit{mild}\right|}\,dx\,dy
= \int_0^1\!\!\int_0^1 \frac{\left|[0, 10 - 5x] \cap [0, 10 - 5y]\right|}{\left|[0, 10 - 5x]\right|}\,dx\,dy
= 0.903
\]

\[
\mathit{prob}(\mathit{medium} \rightarrow \mathit{mild})
= \int_0^1\!\!\int_0^1 \frac{\left|{}^{x}\mathit{medium} \cap {}^{y}\mathit{mild}\right|}{\left|{}^{x}\mathit{medium}\right|}\,dx\,dy
= \int_0^1\!\!\int_0^1 \frac{\left|[5 + 5x, 20 - 5x] \cap [0, 10 - 5y]\right|}{\left|[5 + 5x, 20 - 5x]\right|}\,dx\,dy
= 0.068
\]
We recall that prob(A → A) is not necessarily equal to 1 when A is a fuzzy set (Baldwin, Martin, & Pilsworth, 1995). Similarly, the probabilistic interpretation of the above atomic fuzzy selection expressions with respect to other objects can be computed as given in Table 7.
Table 7. Interpretation of atomic fuzzy selection expressions

| oid | probS,I,o(x ∈ ANNUALS_HERBS) | probS,I,o(x.sun → mild) | probS,I,o(x.water > 21) |
|---|---|---|---|
| o1 | [0.24, 0.24] | Undefined | [1.00, 1.00] |
| o2 | [1.00, 1.00] | [0.39, 0.59] | [0.82, 0.82] |
| o3 | [1.00, 1.00] | [0.90, 0.90] | [0.00, 0.00] |
| o4 | [0.00, 0.00] | [0.90, 0.90] | [0.67, 0.67] |
| o5 | [1.00, 1.00] | [0.39, 0.59] | [0.67, 0.67] |
| o6 | [1.00, 1.00] | [0.90, 0.90] | [0.00, 0.00] |
| o7 | [1.00, 1.00] | [0.90, 0.90] | [0.00, 0.00] |
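The two implication probabilities used in Example 14 are double integrals of a ratio of interval lengths, so they can be approximated numerically. The sketch below uses a plain midpoint rule on a grid; the cut intervals [0, 10 − 5x] for mild and [5 + 5x, 20 − 5x] for medium are those appearing in the integrals of Example 14, while the function names and the grid size are illustrative assumptions.

```python
def cut_mild(level):
    """Interval of sunlight degrees whose 'mild' membership is at least `level`."""
    return (0.0, 10.0 - 5.0 * level)


def cut_medium(level):
    """Interval of sunlight degrees whose 'medium' membership is at least `level`."""
    return (5.0 + 5.0 * level, 20.0 - 5.0 * level)


def prob_implies(cut_a, cut_b, n=200):
    """Midpoint-rule estimate of the double integral over [0, 1] x [0, 1] of
    |cut_a(x) ∩ cut_b(y)| / |cut_a(x)|."""
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        a_low, a_high = cut_a((i + 0.5) * step)
        for j in range(n):
            b_low, b_high = cut_b((j + 0.5) * step)
            overlap = max(0.0, min(a_high, b_high) - max(a_low, b_low))
            total += overlap / (a_high - a_low)
    return total * step * step


p_mild_mild = prob_implies(cut_mild, cut_mild)
p_medium_mild = prob_implies(cut_medium, cut_mild)
```

The estimates come out at approximately 0.903 and 0.0687, in line with the values 0.903 and 0.068 used in Example 14.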
The following definitions are adapted from Eiter et al. (2001) for fuzzy selection conditions in FPOBs.

Definition 15. Let S = (C, τ, ⇒, me, p) be an FPOB schema, I = (π, ν) be an FPOB instance over S, and o ∈ π(C). The satisfaction of fuzzy selection conditions under probS,I,o is defined as follows:
1. probS,I,o |= (φ)[l, u] if and only if probS,I,o(φ) ⊆ [l, u].
2. probS,I,o |= ¬φ if and only if probS,I,o |= φ does not hold.
3. probS,I,o |= (φ ∧ ψ) if and only if probS,I,o |= φ and probS,I,o |= ψ.
4. probS,I,o |= (φ ∨ ψ) if and only if probS,I,o |= φ or probS,I,o |= ψ.
Example 15: In the Plant example above, using the independence probabilistic conjunction strategy, one has:

probS,I,o2(x ∈ ANNUALS_HERBS ⊗in x.sun → mild) = [1.00 × 0.39, 1.00 × 0.59] = [0.39, 0.59] ⊄ [0.3, 0.5]

and

probS,I,o2(x.sun → mild ⊗in x.water > 21) = [0.39 × 0.82, 0.59 × 0.82] = [0.32, 0.48] ⊆ [0.3, 0.5]

therefore:

probS,I,o2 |≠ (x ∈ ANNUALS_HERBS ⊗in x.sun → mild)[0.3, 0.5]
probS,I,o2 |= (x.sun → mild ⊗in x.water > 21)[0.3, 0.5]

Similarly, the probabilistic interpretation of these two fuzzy selection expressions with respect to other objects can be computed as given in Table 8.

Table 8. Interpretation of fuzzy selection expressions

| oid | probS,I,o(x ∈ ANNUALS_HERBS ⊗in x.sun → mild) | probS,I,o(x.sun → mild ⊗in x.water > 21) |
|---|---|---|
| o1 | Undefined | Undefined |
| o2 | [0.39, 0.59] | [0.32, 0.48] |
| o3 | [0.90, 0.90] | [0.00, 0.00] |
| o4 | [0.00, 0.00] | [0.61, 0.61] |
| o5 | [0.39, 0.59] | [0.26, 0.40] |
| o6 | [0.90, 0.90] | [0.00, 0.00] |
| o7 | [0.90, 0.90] | [0.00, 0.00] |

Definition 16. Let S = (C, τ, ⇒, me, p) be an FPOB schema, I = (π, ν) be an FPOB instance over S, and φ be a fuzzy selection condition over an object variable x. The selection on I with respect to φ, denoted by σφ(I), is the FPOB instance I' = (π', ν') over S such that π'(c) = {o ∈ π(c) | probS,I,o |= φ} and ν' is ν restricted to π'(C).

Example 16: In the Plant example above, suppose that:

φ = (x.sun → mild)[0.39, 1.00] ∧ (x.water > 21)[0.80, 1.00]

one has:

σφ(I) = I' that contains only o2

That is because only o2 satisfies the fuzzy selection condition φ as shown below:

probS,I,o2(x.sun → mild) = [0.39, 0.59] ⊆ [0.39, 1.00]

and

probS,I,o2(x.water > 21) = [0.82, 0.82] ⊆ [0.80, 1.00]
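The satisfaction relation and the selection operation can be sketched on top of precomputed intervals. In the snippet below the intervals are copied from Table 7; the encoding of conditions as nested tuples, the treatment of an undefined interpretation as "does not satisfy the atomic condition", and the flattening of the result into a list of object identifiers (rather than a per-class mapping π') are simplifying assumptions made for illustration. The independence conjunction multiplies lower bounds and upper bounds, as in Example 15.

```python
# Intervals of Table 7; None marks an undefined interpretation.
TABLE7 = {
    "o1": {"sun": None, "water": (1.00, 1.00)},
    "o2": {"sun": (0.39, 0.59), "water": (0.82, 0.82)},
    "o3": {"sun": (0.90, 0.90), "water": (0.00, 0.00)},
    "o4": {"sun": (0.90, 0.90), "water": (0.67, 0.67)},
    "o5": {"sun": (0.39, 0.59), "water": (0.67, 0.67)},
    "o6": {"sun": (0.90, 0.90), "water": (0.00, 0.00)},
    "o7": {"sun": (0.90, 0.90), "water": (0.00, 0.00)},
}


def conj_independence(a, b):
    """Independence conjunction strategy on two probability intervals."""
    return (a[0] * b[0], a[1] * b[1])


def satisfies(interpretation, condition):
    """Satisfaction of a fuzzy selection condition in the sense of Definition 15."""
    kind = condition[0]
    if kind == "atom":                      # (expression)[low, high]
        _, expression, low, high = condition
        interval = interpretation.get(expression)
        # ASSUMPTION: an undefined interpretation never satisfies an atom.
        return (interval is not None
                and low <= interval[0] and interval[1] <= high)
    if kind == "not":
        return not satisfies(interpretation, condition[1])
    if kind == "and":
        return (satisfies(interpretation, condition[1])
                and satisfies(interpretation, condition[2]))
    if kind == "or":
        return (satisfies(interpretation, condition[1])
                or satisfies(interpretation, condition[2]))
    raise ValueError(f"unknown condition kind: {kind}")


def select(interpretations, condition):
    """Identifiers of the objects that satisfy the condition."""
    return [o for o, interp in interpretations.items()
            if satisfies(interp, condition)]


# Condition of Example 16.
phi = ("and", ("atom", "sun", 0.39, 1.00), ("atom", "water", 0.80, 1.00))
selected = select(TABLE7, phi)
```

Here `selected` contains only o2, and `conj_independence((0.39, 0.59), (0.82, 0.82))` rounds to [0.32, 0.48], which lies inside [0.3, 0.5].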
Other FPOB Algebraic Operations

As for relational databases, other basic operations on object base instances are projection, renaming, Cartesian product, join, intersection, union, and difference. Those operations for POBs could be straightforwardly applied to FPOBs. For this chapter to be self-contained, their definitions and examples given by Eiter et al. (2001) are adapted and presented below.
Projection and Renaming

A projection of an FPOB instance on a set of attributes is a new instance in which only the attributes in that set are considered for the type of each class and the value of each object.

Definition 17. Let I = (π, ν) be an FPOB instance over an FPOB schema S = (C, τ, ⇒, me, p) and A be a set of attributes. The projection of I on A, denoted by ΠA(I), is I' = (π', ν') over the FPOB schema ΠA(S) where:
1. ΠA(S) = (C, τ', ⇒, me, p) such that, for all c ∈ C, τ'(c) is obtained from τ(c) = [B1: τ1, …, Bk: τk] by deleting all Bj: τj with Bj ∉ A.
2. π'(c) = π(c) for all c ∈ C.
3. ν'(o) = ΠA(ν(o)) obtained from ν(o) = [B1: 〈V1, α1, β1〉, …, Bk: 〈Vk, αk, βk〉] by deleting all Bj: 〈Vj, αj, βj〉 with Bj ∉ A, for all o ∈ π(C).
Example 17: Let I = (π, ν) be the FPOB instance in Example 8, and A = {name, water}. Then the projection of I on A is the FPOB instance I' = (π', ν') on ΠA(S), where π' = π, and ν' is given in Table 9.

Table 9. ν' resulting from projection

| oid | ν'(oid) |
|---|---|
| o1 | [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water: 〈{25,…, 30}, u, u〉] |
| o2 | [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, water: 〈{20,…, 30}, u, u〉] |
| o3 | [name: 〈{Mint}, u, u〉, water: 〈{20}, u, u〉] |
| o4 | [name: 〈{Aster, Salvia}, u, u〉, water: 〈{20,…, 25}, u, u〉] |
| o5 | [name: 〈{Thyme}, u, u〉, water: 〈{20,…, 25}, u, u〉] |
| o6 | [name: 〈{Mint}, u, u〉, water: 〈{20}, u, u〉] |
| o7 | [name: 〈{Sage}, u, u〉, water: 〈{20, 21}, u, u〉] |

The renaming operation replaces some of the top-level attributes in an FPOB instance by new ones.

Definition 18. Let S = (C, τ, ⇒, me, p) be an FPOB schema and A be the set of all top-level attributes of S. A renaming expression has the form $\vec{B} \leftarrow \vec{C}$, where $\vec{B}$ = B1, B2, …, Bm is a list of distinct attributes from A, and $\vec{C}$ = C1, C2, …, Cm is a list of distinct attributes from 𝒜 − A, with 𝒜 the underlying set of all attributes.

Definition 19. Let I = (π, ν) be an FPOB instance over an FPOB schema S = (C, τ, ⇒, me, p) and N be a renaming expression. The renaming in I with respect to N, denoted by δN(I), is I' = (π', ν') over the FPOB schema δN(S) where:
1. δN(S) = (C, τ', ⇒, me, p) such that, for all c ∈ C, τ'(c) is obtained from τ(c) = [A1: τ1, …, Ak: τk] by replacing each attribute Aj = Bi for some i ∈ {1, 2, ..., m} by the new attribute Ci.
2. π'(c) = π(c) for all c ∈ C.
3. ν'(o) = δN(ν(o)) obtained from ν(o) = [A1: 〈V1, α1, β1〉, …, Ak: 〈Vk, αk, βk〉] by replacing each attribute Aj = Bi for some i ∈ {1, 2, ..., m} by the new attribute Ci, for all o ∈ π(C).
Table 10. ν' resulting from renaming

| oid | ν'(oid) |
|---|---|
| o1 | [name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water2: 〈{25,…, 30}, u, u〉] |
| o2 | [name2: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, water2: 〈{20,…, 30}, u, u〉] |
| o3 | [name2: 〈{Mint}, u, u〉, water2: 〈{20}, u, u〉] |
| o4 | [name2: 〈{Aster, Salvia}, u, u〉, water2: 〈{20,…, 25}, u, u〉] |
| o5 | [name2: 〈{Thyme}, u, u〉, water2: 〈{20,…, 25}, u, u〉] |
| o6 | [name2: 〈{Mint}, u, u〉, water2: 〈{20}, u, u〉] |
| o7 | [name2: 〈{Sage}, u, u〉, water2: 〈{20, 21}, u, u〉] |
Example 18: Let I be the FPOB instance computed in Example 17. Then the renaming in I with respect to the renaming expression name, water ← name2, water2 is the FPOB instance I' = (π', ν'), where π' = π, and ν' is given in Table 10.
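On the value mapping ν, projection and renaming are simple dictionary transformations. The sketch below assumes that ν is a dictionary from object identifiers to attribute dictionaries and that a triple 〈V, α, β〉 is stored as a tuple of a value tuple and two labels such as "u"; these representations and the two sample objects (o1 and o3 of Table 6) are chosen for illustration only.

```python
def project(nu, attributes):
    """Keep only the attributes in `attributes` in every tuple value."""
    return {o: {a: t for a, t in tuple_value.items() if a in attributes}
            for o, tuple_value in nu.items()}


def rename(nu, mapping):
    """Replace each attribute B by mapping[B]; other attributes are kept."""
    return {o: {mapping.get(a, a): t for a, t in tuple_value.items()}
            for o, tuple_value in nu.items()}


NU = {
    "o1": {"name": (("Lady-Fern", "Ostrich-Fern"), "u", "u"),
           "soil": (("loamy",), "u", "u"),
           "water": (tuple(range(25, 31)), "u", "u")},
    "o3": {"name": (("Mint",), "u", "u"),
           "soil": (("loamy",), "u", "u"),
           "water": ((20,), "u", "u")},
}

projected = project(NU, {"name", "water"})
renamed = rename(projected, {"name": "name2", "water": "water2"})
```

`projected["o3"]` keeps only name and water, and `renamed["o3"]` carries the same triples under name2 and water2.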
Cartesian Product

We recall that, in relational databases, the Cartesian product of two relations is a new relation consisting of all tuples that are obtained by concatenating a tuple in the first relation with a tuple in the second relation. Similarly, the Cartesian product of two FPOBs should be a new one such that the property list of each object is obtained by concatenating the property list of an object in the first FPOB instance with the property list of an object in the second FPOB instance. Meanwhile, in relational algebra, the Cartesian product of two relational schemas is defined only if their sets of attributes are disjoint. Thus, in FPOB algebra, we define the Cartesian product only for two FPOB schemas that do not have any common top-level attribute.

Also, the Cartesian product operation on both schemas and relations is commutative. For FPOB algebra, given two FPOB schemas S1 = (C1, τ1, ⇒, me1, p1) and S2 = (C2, τ2, ⇒, me2, p2), that should mean S1 × S2 = S2 × S1, which implies C2 × C1 = C1 × C2. The latter is achieved by using the following assumption.
Assumption 1. It is assumed that for each FPOB schema S = (C, τ, ⇒, me, p), the set of classes C is a classical relation over a classical relation schema R(S) = {A1, A2, …, Am} associated with S. That is, each class c ∈ C is considered as a tuple over R(S).

Definition 20. The FPOB schemas S1 = (C1, τ1, ⇒, me1, p1) and S2 = (C2, τ2, ⇒, me2, p2) are Cartesian product-compatible if and only if R(S1) and R(S2) are disjoint.

Definition 21. Let S1 = (C1, τ1, ⇒1, me1, p1) and S2 = (C2, τ2, ⇒2, me2, p2) be two Cartesian product-compatible FPOB schemas, and R1 = R(S1) and R2 = R(S2). The Cartesian product of S1 and S2, denoted by S1 × S2, is the FPOB schema S = (C, τ, ⇒, me, p) such that:
1. C = C1 × C2.
2. For all classes c ∈ C, τ(c[R1], c[R2]) = [A1: τ1, …, Ak: τk, Ak+1: τk+1, …, Ak+m: τk+m], where τ1(c[R1]) = [A1: τ1, …, Ak: τk] and τ2(c[R2]) = [Ak+1: τk+1, …, Ak+m: τk+m].
3. The directed acyclic graph (C, ⇒) is defined as follows. For all c, d ∈ C: c ⇒ d iff (c[R1] ⇒1 d[R1] ∧ c[R2] = d[R2]) ∨ (c[R2] ⇒2 d[R2] ∧ c[R1] = d[R1]).
4. The partitioning me is defined as follows. For all c ∈ C: me(c) = {P1 × {c[R2]} | P1 ∈ me1(c[R1])} ∪ {{c[R1]} × P2 | P2 ∈ me2(c[R2])}.
5. The probability assignment p is defined as follows. For all c ⇒ d: p(c | d) = p1(c[R1] | d[R1]) if c[R2] = d[R2], and p(c | d) = p2(c[R2] | d[R2]) if c[R1] = d[R1].
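Items 1, 3, and 5 of Definition 21 can be sketched as follows. As a simplification, classes are plain strings and a product class is a pair, so the relational view of Assumption 1 is not modeled; the small hierarchy (pl, an, pe with probabilities 0.6 and 0.4) is a hypothetical fragment in the spirit of the Plant example, and all identifiers are illustrative.

```python
from itertools import product


def schema_product(classes1, edges1, classes2, edges2):
    """Classes, edges, and edge probabilities of S1 x S2.

    `edges1` and `edges2` map (subclass, superclass) to p(subclass | superclass).
    An edge of the product changes exactly one component and inherits the
    probability of the component edge it comes from.
    """
    classes = set(product(classes1, classes2))
    edges = {}
    for c1, c2 in classes:
        for (sub, sup), p in edges1.items():
            if sub == c1:                      # step in the first component
                edges[((c1, c2), (sup, c2))] = p
        for (sub, sup), p in edges2.items():
            if sub == c2:                      # step in the second component
                edges[((c1, c2), (c1, sup))] = p
    return classes, edges


# Hypothetical fragment used for both schemas.
CLASSES = ["pl", "an", "pe"]
EDGES = {("an", "pl"): 0.6, ("pe", "pl"): 0.4}

product_classes, product_edges = schema_product(CLASSES, EDGES, CLASSES, EDGES)
```

In the result, (an, pl) ⇒ (pl, pl) and (pl, an) ⇒ (pl, pl) both carry probability 0.6, while (an, pe) has the two immediate superclasses (pl, pe) and (an, pl).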
Example 19: Let S1 and S2 be the FPOB schemas of the FPOB instances computed in Examples 17 and 18, respectively. Then the Cartesian product S1 × S2 = (C, τ, ⇒, me, p) is given as follows: A partial view on C, me, and p is illustrated in Figure 4. τ(c) = [name: string, water: integer, name2: string, water2: integer] for every c ∈ C.
Figure 4. Some classes in the Cartesian product of the plant example (class pairs such as (pl, pl), (pl, an), (an, pl), (ah, pl), and (pf, pl), with edge probabilities and disjointness marks d)
Definition 22. Let I1 = (π1, ν1) and I2 = (π2, ν2) be two FPOB instances over the Cartesian product-compatible FPOB schemas S1 = (C1, τ1, ⇒1, me1, p1) and S2 = (C2, τ2, ⇒2, me2, p2), respectively, and let R1 = R(S1) and R2 = R(S2). The Cartesian product of I1 and I2, denoted by I1 × I2, is defined as the FPOB instance (π, ν) over the FPOB schema S = S1 × S2, where:
1. π(c) = π1(c[R1]) × π2(c[R2]), for all c ∈ C1 × C2.
2. ν(o) = ν1(o[R1]) × ν2(o[R2]), for all o ∈ π(C1 × C2), where ν1(o[R1]) and ν2(o[R2]) are fuzzy-probabilistic tuple values over disjoint sets of attributes A1 and A2, respectively, and ν(o) is the fuzzy-probabilistic tuple value over A1 ∪ A2 such that ν(o).A = ν1(o[R1]).A if A ∈ A1 or ν(o).A = ν2(o[R2]).A if A ∈ A2.
Example 20: Let I1 and I2 be the FPOB instances computed in Examples 17 and 18, respectively. Then the Cartesian product I1 × I2 = (π, ν), where π, ν are given in Tables 11 and 12.

Table 11. π resulting from Cartesian product (partial view)

| c | π(c) |
|---|---|
| (pl, pl) | {(o1, o1)} |
| (an, pl) | {} |
| (ah, pl) | {(o2, o1), (o3, o1), (o5, o1), (o6, o1), (o7, o1)} |
| (pf, pl) | {(o4, o1)} |
Table 12. ν resulting from Cartesian product (partial view)

| oid | ν(oid) |
|---|---|
| (o1, o1) | [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water: 〈{25,…, 30}, u, u〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water2: 〈{25,…, 30}, u, u〉] |
| (o2, o1) | [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, water: 〈{20,…, 30}, u, u〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water2: 〈{25,…, 30}, u, u〉] |
| (o3, o1) | [name: 〈{Mint}, u, u〉, water: 〈{20}, u, u〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water2: 〈{25,…, 30}, u, u〉] |
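The Cartesian product of two instances pairs up objects class by class and concatenates their tuple values. The following sketch assumes dictionary representations for π and ν and a tuple-with-labels encoding of triples; the class abbreviations follow the partial view above, and the data cover only objects o1, o2, and o3 for brevity.

```python
from itertools import product


def instance_product(pi1, nu1, pi2, nu2):
    """Cartesian product (pi, nu) of two instances with disjoint attributes."""
    pi = {(c1, c2): set(product(objects1, objects2))
          for c1, objects1 in pi1.items()
          for c2, objects2 in pi2.items()}
    nu = {}
    for objects in pi.values():
        for o1, o2 in objects:
            if set(nu1[o1]) & set(nu2[o2]):
                raise ValueError("attribute sets must be disjoint")
            nu[(o1, o2)] = {**nu1[o1], **nu2[o2]}
    return pi, nu


ferns = (("Lady-Fern", "Ostrich-Fern"), "u", "u")
PI1 = {"pl": {"o1"}, "an": set(), "ah": {"o2", "o3"}}
NU1 = {
    "o1": {"name": ferns, "water": (tuple(range(25, 31)), "u", "u")},
    "o2": {"name": (("Cuban-Basil", "Lemon-Basil"), "u", "u"),
           "water": (tuple(range(20, 31)), "u", "u")},
    "o3": {"name": (("Mint",), "u", "u"), "water": ((20,), "u", "u")},
}
PI2 = {"pl": {"o1"}}
NU2 = {"o1": {"name2": ferns, "water2": (tuple(range(25, 31)), "u", "u")}}

pi, nu = instance_product(PI1, NU1, PI2, NU2)
```

With these inputs, π maps (pl, pl) to {(o1, o1)} and (an, pl) to the empty set, and ν((o3, o1)) has the four attributes name, water, name2, and water2.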
Join

In relational databases, the join operation is a generalization of the Cartesian product operation. That is, in the join of two relations, the value of an attribute of a tuple in the first relation and the value of the same attribute, if any, in the second relation are combined. For that combination, the types of such a common attribute name in both relations must be identical, as defined below for FPOBs.

Definition 23. The FPOB schemas S1 = (C1, τ1, ⇒, me1, p1) and S2 = (C2, τ2, ⇒, me2, p2) are join-compatible iff R(S1) and R(S2) are disjoint and, for all classes c1 ∈ C1 and c2 ∈ C2, if an attribute A is defined for both τ1(c1) and τ2(c2), then τ1(c1).A = τ2(c2).A.

Definition 24. Let S1 = (C1, τ1, ⇒, me1, p1) and S2 = (C2, τ2, ⇒, me2, p2) be two join-compatible FPOB schemas, and R1 = R(S1) and R2 = R(S2). The join of S1 and S2, denoted by S1 ⋈ S2, is the FPOB schema S = (C, τ, ⇒, me, p), where C, ⇒, me, and p are as in the definition of S1 × S2, and τ is defined such that, for all c ∈ C, the tuple type τ(c) = [A1: τ1,…, Am: τm] contains exactly all Ai: τi that belong to either τ1(c[R1]) or τ2(c[R2]).

The following definitions are for the combination of fuzzy-probabilistic tuple values of objects in two FPOB instances.

Definition 25. Let pt1 = 〈V1, α1, β1〉 and pt2 = 〈V2, α2, β2〉 be two fuzzy-probabilistic triples, and ⊗ be a probabilistic conjunction strategy. Then pt1 ⊗ pt2
is the fuzzy-probabilistic triple 〈V, α, β〉 where V = {v ∈ V1 ∩ V2 | [α(v), β(v)] = [α1(v), β1(v)] ⊗ [α2(v), β2(v)] ≠ [0, 0]}.

Definition 26. Let ptv1 and ptv2 be two fuzzy-probabilistic tuple values over the sets of attributes A1 and A2, respectively, such that for all A ∈ A1 ∩ A2, the values of ptv1.A and ptv2.A are of the same type. The join of ptv1 and ptv2 under a probabilistic conjunction strategy ⊗, denoted by ptv1 ⋈⊗ ptv2, is the fuzzy-probabilistic tuple value ptv over A1 ∪ A2 defined by the following:

ptv.A = ptv1.A for all attributes A ∈ A1 − A2
ptv.A = ptv2.A for all attributes A ∈ A2 − A1
ptv.A = ptv1.A ⊗ ptv2.A for all attributes A ∈ A1 ∩ A2

We are now ready to define the join of two FPOB instances as follows.

Definition 27. Let I1 = (π1, ν1) and I2 = (π2, ν2) be two FPOB instances over the join-compatible FPOB schemas S1 = (C1, τ1, ⇒, me1, p1) and S2 = (C2, τ2, ⇒, me2, p2), let A1 and A2 be the sets of top-level attributes of S1 and S2, respectively, and let R1 = R(S1) and R2 = R(S2). The join of I1 and I2 under a probabilistic conjunction strategy ⊗, denoted by I1 ⋈⊗ I2, is defined as the FPOB instance (π, ν) over the FPOB schema S1 ⋈ S2, where:

1. π(c) = {(o1, o2) ∈ π1(c[R1]) × π2(c[R2]) | for all A ∈ A1 ∩ A2, if (ν1(o1) ⋈⊗ ν2(o2)).A = 〈V, α, β〉, then V ≠ ∅}, for all c ∈ C1 × C2.
2. ν(o) = ν1(o[R1]) ⋈⊗ ν2(o[R2]), for all o ∈ π(C1 × C2).
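Definitions 25 and 26 can be illustrated with a small Python sketch. It makes several assumptions that are not part of this section: a triple 〈V, α, β〉 is encoded as a dictionary from each value v to the pair (α(v), β(v)), the conjunction strategy is the independence strategy of the probabilistic object base framework, taken here as [L1·L2, U1·U2], and all function names are hypothetical.

```python
from fractions import Fraction

def conj_independence(i1, i2):
    """Independence conjunction on intervals (assumed form): [L1*L2, U1*U2]."""
    (l1, u1), (l2, u2) = i1, i2
    return (l1 * l2, u1 * u2)

def combine_triples(pt1, pt2, conj=conj_independence):
    """pt1 (x) pt2 as in Definition 25; a triple is a dict value -> (alpha, beta)."""
    result = {}
    for v in pt1.keys() & pt2.keys():
        interval = conj(pt1[v], pt2[v])
        if interval != (0, 0):  # values whose interval is [0, 0] are dropped
            result[v] = interval
    return result

def join_tuple_values(ptv1, ptv2, conj=conj_independence):
    """Join of two tuple values as in Definition 26."""
    joined = {}
    for a in ptv1.keys() | ptv2.keys():
        if a in ptv1 and a in ptv2:
            joined[a] = combine_triples(ptv1[a], ptv2[a], conj)
        else:
            joined[a] = ptv1[a] if a in ptv1 else ptv2[a]
    return joined

def uniform(values):
    """Triple with the uniform distribution u as both lower and upper bound."""
    values = list(values)
    p = Fraction(1, len(values))
    return {v: (p, p) for v in values}

# Tuple values shaped like those of o2 and the renamed o1 in Example 21
o2 = {"name": uniform(["Cuban-Basil", "Lemon-Basil"]),
      "water": uniform(range(20, 31))}
o1 = {"name2": uniform(["Lady-Fern", "Ostrich-Fern"]),
      "water": uniform(range(25, 31))}

joined = join_tuple_values(o2, o1)
```

Under these assumptions the common attribute water keeps only the values 25 to 30, each with bounds 1/11 · 1/6 = 1/66. A pair of objects would be excluded from π(c) in Definition 27 whenever such a combined triple came out empty.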
Example 21: Let I1 be the FPOB instance in Example 17 and I2 be the renaming in I with respect to the renaming expression water2 ← water. Then the join of I1 and I2 under the independence probabilistic conjunction strategy is I1 ⋈⊗in I2 = (π, ν), where π is given in Table 13 and ν in Table 14.
Table 13. π resulting from join (partial view)

c           π(c)
(pl, pl)    {(o1, o1)}
(an, pl)    {}
(ah, pl)    {(o2, o1), (o5, o1)}
(pf, pl)    {(o4, o1)}

Table 14. ν resulting from join (partial view)

oid         ν(oid)
(o1, o1)    [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, water: 〈{25,…, 30}, u/6, u/6〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉]
(o2, o1)    [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, water: 〈{25,…, 30}, u/11, u/11〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉]
(o5, o1)    [name: 〈{Thyme}, u, u〉, water: 〈{25}, u/36, u/36〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉]
(o4, o1)    [name: 〈{Aster, Salvia}, u, u〉, water: 〈{25}, u/36, u/36〉, name2: 〈{Lady-Fern, Ostrich-Fern}, u, u〉]

Intersection, Union, and Difference

As for the intersection of two relations on the same schema, the intersection of two FPOB instances on the same FPOB schema is a new FPOB instance whose objects are those common to both instances, and in which the attribute values of each object are obtained by combining the respective attribute values of that object in the two instances. First, the intersection of two fuzzy-probabilistic tuple values is defined as follows.

Definition 28. Let ptv1 and ptv2 be two fuzzy-probabilistic tuple values over the same set of attributes A. The intersection of ptv1 and ptv2 under a probabilistic conjunction strategy ⊗, denoted by ptv1 ∩⊗ ptv2, is the fuzzy-probabilistic tuple value ptv over A defined by ptv.A = ptv1.A ⊗ ptv2.A for all attributes A ∈ A.

Definition 29. Let I1 = (π1, ν1) and I2 = (π2, ν2) be two FPOB instances over the same FPOB schema S = (C, τ, ⇒, me, p). The intersection of I1 and I2 under a probabilistic conjunction strategy ⊗, denoted by I1 ∩⊗ I2, is the FPOB instance (π, ν) over S, where:

1. π(c) = π1(c) ∩ π2(c) for every c ∈ C.
2. ν(o) = ν1(o) ∩⊗ ν2(o) for every o ∈ π(C).
Table 15. Object mappings π1, π2, and π

c                     π1(c)       π2(c)    π(c)
PLANTS                {o1}        {o1}     {o1}
ANNUALS               {}          {}       {}
PERENNIALS            {}          {}       {}
VEGETABLES            {}          {}       {}
HERBS                 {}          {}       {}
FLOWERS               {}          {}       {}
ANNUALS_HERBS         {o2, o3}    {o5}     {}
PERENNIALS_FLOWERS    {o4}        {o4}     {o4}
Table 16. Value mapping ν1 of FPOB instance I1

oid    ν1(oid)
o1     [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{25,…, 30}, u, u〉]
o2     [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, soil: 〈{loamy, sandy}, 0.7u, 1.3u〉, water: 〈{20,…, 30}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o3     [name: 〈{Mint}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o4     [name: 〈{Aster, Salvia}, u, u〉, soil: 〈{loamy, sandy}, 0.6u, 1.4u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
Example 22: Let S be the FPOB schema in Example 6, and I1 = (π1, ν1) and I2 = (π2, ν2) be the FPOB instances as defined in Tables 15, 16, and 17. Then the intersection of I1 and I2 under the independence probabilistic conjunction strategy is I1 ∩⊗in I2 = (π, ν), where π is given in Table 15 and ν in Table 18.
Table 17. Value mapping ν2 of FPOB instance I2

oid    ν2(oid)
o1     [name: 〈{Lady-Fern, Ostrich-Fern}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{25,…, 30}, u, u〉]
o4     [name: 〈{Aster, Salvia}, u, u〉, soil: 〈{loamy, sandy}, 0.6u, 1.4u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o5     [name: 〈{Thyme}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3}, 0.8u, 1.2u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
Table 18. ν resulting from intersection

oid    ν(oid)
o1     [name: 〈{Lady-Fern, Ostrich-Fern}, 0.5u, 0.5u〉, soil: 〈{loamy}, u, u〉, water: 〈{25,…, 30}, u/6, u/6〉]
o4     [name: 〈{Aster, Salvia}, 0.5u, 0.5u〉, soil: 〈{loamy, sandy}, 0.18u, 0.98u〉, water: 〈{20,…, 25}, u/6, u/6〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.12u, 1.08u〉, category: 〈{french, silver, wooly}, 0.12u, 1.08u〉]
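The instance-level intersection of Definition 29 can be sketched as follows. The encoding is an assumption made for illustration: π is a dictionary from class names to sets of object identifiers, ν is a dictionary from object identifiers to tuple values, a triple is a dictionary from values to (α, β) pairs, and the independence conjunction is taken as [L1·L2, U1·U2]. Only a few classes and attributes of Example 22 are encoded, and the function names are hypothetical.

```python
from fractions import Fraction

def conj_independence(i1, i2):
    """Independence conjunction on intervals (assumed form): [L1*L2, U1*U2]."""
    (l1, u1), (l2, u2) = i1, i2
    return (l1 * l2, u1 * u2)

def intersect_triples(pt1, pt2, conj=conj_independence):
    """pt1 (x) pt2: keep common values whose combined interval is not [0, 0]."""
    out = {}
    for v in pt1.keys() & pt2.keys():
        interval = conj(pt1[v], pt2[v])
        if interval != (0, 0):
            out[v] = interval
    return out

def intersect_instances(pi1, nu1, pi2, nu2, conj=conj_independence):
    """Sketch of Definition 29 for two instances over the same schema."""
    pi = {c: pi1[c] & pi2[c] for c in pi1}
    nu = {}
    for objs in pi.values():
        for o in objs:
            nu[o] = {a: intersect_triples(nu1[o][a], nu2[o][a], conj)
                     for a in nu1[o]}
    return pi, nu

half = Fraction(1, 2)
one = Fraction(1)
name_o1 = {"Lady-Fern": (half, half), "Ostrich-Fern": (half, half)}
# soil of o4: <{loamy, sandy}, 0.6u, 1.4u> with u = 1/2
soil_o4 = {"loamy": (Fraction(3, 10), Fraction(7, 10)),
           "sandy": (Fraction(3, 10), Fraction(7, 10))}

pi1 = {"PLANTS": {"o1"}, "ANNUALS_HERBS": {"o2", "o3"}, "PERENNIALS_FLOWERS": {"o4"}}
pi2 = {"PLANTS": {"o1"}, "ANNUALS_HERBS": {"o5"}, "PERENNIALS_FLOWERS": {"o4"}}
nu1 = {"o1": {"name": name_o1}, "o2": {"name": {"Cuban-Basil": (half, half)}},
       "o3": {"name": {"Mint": (one, one)}}, "o4": {"soil": soil_o4}}
nu2 = {"o1": {"name": name_o1}, "o4": {"soil": soil_o4},
       "o5": {"name": {"Thyme": (one, one)}}}

pi, nu = intersect_instances(pi1, nu1, pi2, nu2)
```

With u = 1/2, the resulting bounds 1/4 for name and [9/100, 49/100] for soil correspond to the entries 0.5u and 〈0.18u, 0.98u〉 of Table 18.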
The union and difference operations are then defined similarly, on the basis of the union and difference operations on fuzzy-probabilistic triples and tuple values.

Definition 30. Let pt1 = 〈V1, α1, β1〉 and pt2 = 〈V2, α2, β2〉 be two fuzzy-probabilistic triples, and ⊕ be a probabilistic disjunction strategy. Then pt1 ⊕ pt2 is the fuzzy-probabilistic triple 〈V, α, β〉 defined as follows: V = V1 ∪ V2, and

\[
[\alpha(v), \beta(v)] =
\begin{cases}
[\alpha_1(v), \beta_1(v)] & \text{if } v \in V_1 - V_2 \\
[\alpha_2(v), \beta_2(v)] & \text{if } v \in V_2 - V_1 \\
[\alpha_1(v), \beta_1(v)] \oplus [\alpha_2(v), \beta_2(v)] & \text{if } v \in V_1 \cap V_2.
\end{cases}
\]
Definition 31. Let ptv1 and ptv2 be two fuzzy-probabilistic tuple values over the same set of attributes A. The union of ptv1 and ptv2 under a probabilistic disjunction strategy ⊕, denoted by ptv1 ∪⊕ ptv2, is the fuzzy-probabilistic tuple value ptv over A defined by ptv.A = ptv1.A ⊕ ptv2.A for all attributes A ∈ A.

Definition 32. Let I1 = (π1, ν1) and I2 = (π2, ν2) be two FPOB instances over the same FPOB schema S = (C, τ, ⇒, me, p). The union of I1 and I2 under a probabilistic disjunction strategy ⊕, denoted by I1 ∪⊕ I2, is the FPOB instance (π, ν) over S, where:

1. π(c) = π1(c) ∪ π2(c) for every c ∈ C.
2. For every o ∈ π(C),

\[
\nu(o) =
\begin{cases}
\nu_1(o) & \text{if } o \in \pi_1(C) - \pi_2(C) \\
\nu_2(o) & \text{if } o \in \pi_2(C) - \pi_1(C) \\
\nu_1(o) \cup_{\oplus} \nu_2(o) & \text{if } o \in \pi_1(C) \cap \pi_2(C).
\end{cases}
\]
Example 23: Let S be the FPOB schema in Example 6, and I1 = (π1, ν1) and I2 = (π2, ν2) be the FPOB instances on S in Example 22. Then the union of I1 and I2 under the ignorance probabilistic disjunction strategy is I1 ∪⊕ig I2 = (π, ν), where π is given in Table 19 and ν in Table 20.

Table 19. π resulting from union

c                     π(c)
PLANTS                {o1}
ANNUALS               {}
PERENNIALS            {}
VEGETABLES            {}
HERBS                 {}
FLOWERS               {}
ANNUALS_HERBS         {o2, o3, o5}
PERENNIALS_FLOWERS    {o4}
Table 20. ν resulting from union

oid    ν(oid)
o1     [name: 〈{Lady-Fern, Ostrich-Fern}, u, 2u〉, soil: 〈{loamy}, u, u〉, water: 〈{25,…, 30}, u, 2u〉]
o2     [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, soil: 〈{loamy, sandy}, 0.7u, 1.3u〉, water: 〈{20,…, 30}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o3     [name: 〈{Mint}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o4     [name: 〈{Aster, Salvia}, u, 2u〉, soil: 〈{loamy, sandy}, 0.6u, 2u〉, water: 〈{20,…, 25}, u, 2u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 3u〉, category: 〈{french, silver, wooly}, 0.6u, 3u〉]
o5     [name: 〈{Thyme}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20,…, 25}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3}, 0.8u, 1.2u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
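The union of two triples in Definition 30 can be sketched in Python. As before, the encoding of a triple as a dictionary from values to (α, β) pairs and the function names are assumptions. The ignorance disjunction strategy is not restated in this section; the sketch takes it in the form [max(L1, L2), min(1, U1 + U2)] used in the probabilistic object base framework.

```python
from fractions import Fraction

def disj_ignorance(i1, i2):
    """Ignorance disjunction on intervals (assumed form):
    [max(L1, L2), min(1, U1 + U2)]."""
    (l1, u1), (l2, u2) = i1, i2
    return (max(l1, l2), min(1, u1 + u2))

def union_triples(pt1, pt2, disj=disj_ignorance):
    """pt1 (+) pt2 as in Definition 30; a triple is a dict value -> (alpha, beta)."""
    result = {}
    for v in pt1.keys() | pt2.keys():
        if v in pt1 and v in pt2:
            result[v] = disj(pt1[v], pt2[v])   # v in V1 and V2
        else:
            result[v] = pt1[v] if v in pt1 else pt2[v]  # v in only one of them
    return result

# expyears of o4 in both instances: <{2, 3, 4}, 0.6u, 1.8u> with u = 1/3
expyears_o4 = {y: (Fraction(1, 5), Fraction(3, 5)) for y in (2, 3, 4)}
same = union_triples(expyears_o4, expyears_o4)

# Hypothetical pairing with <{2, 3}, 0.8u, 1.2u>, u = 1/2 (not part of Example 23)
expyears_other = {y: (Fraction(2, 5), Fraction(3, 5)) for y in (2, 3)}
mixed = union_triples(expyears_o4, expyears_other)
```

In the first case each value receives the bounds [1/5, 1], which with u = 1/3 is the entry 〈{2, 3, 4}, 0.6u, 3u〉 shown for o4 in Table 20. The second case shows the value 4, present in only one triple, keeping its original bounds.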
Definition 33. Let pt1 = 〈V1, α1, β1〉 and pt2 = 〈V2, α2, β2〉 be two fuzzy-probabilistic triples, and ⊖ be a probabilistic difference strategy. Then pt1 ⊖ pt2 is the fuzzy-probabilistic triple 〈V, α, β〉 defined as follows: V = V1 − {v ∈ V1 ∩ V2 | [α1(v), β1(v)] ⊖ [α2(v), β2(v)] = [0, 0]}, and

\[
[\alpha(v), \beta(v)] =
\begin{cases}
[\alpha_1(v), \beta_1(v)] & \text{if } v \in V - V_2 \\
[\alpha_1(v), \beta_1(v)] \ominus [\alpha_2(v), \beta_2(v)] & \text{if } v \in V \cap V_2.
\end{cases}
\]

Definition 34. Let ptv1 and ptv2 be two fuzzy-probabilistic tuple values over the same set of attributes A. The difference of ptv1 and ptv2 under a probabilistic difference strategy ⊖, denoted by ptv1 −⊖ ptv2, is the fuzzy-probabilistic tuple value ptv over A defined by ptv.A = ptv1.A ⊖ ptv2.A for all attributes A ∈ A.
Definition 35. Let I1 = (π1, ν1) and I2 = (π2, ν2) be two FPOB instances over the same FPOB schema S = (C, τ, ⇒, me, p), and A be the set of top-level attributes of S. The difference of I1 and I2 under a probabilistic difference strategy ⊖, denoted by I1 −⊖ I2, is the FPOB instance (π, ν) over S, where:

1. π(c) = π1(c) − {o ∈ π1(C) ∩ π2(C) | (ν1(o) −⊖ ν2(o)).A = 〈∅, _, _〉 for some A ∈ A} for every c ∈ C.
2. For every o ∈ π(C),

\[
\nu(o) =
\begin{cases}
\nu_1(o) & \text{if } o \in \pi_1(C) - \pi_2(C) \\
\nu_1(o) -_{\ominus} \nu_2(o) & \text{if } o \in \pi_1(C) \cap \pi_2(C).
\end{cases}
\]
Example 24: Let S be the FPOB schema in Example 6, and I1 = (π1, ν1) and I2 = (π2, ν2) be the FPOB instances on S in Example 22. Consider I2' = (π2, ν2') that is different from I2 only in ν2'(o1).soil = 〈{loamy, sandy}, u, u〉. Then the difference of I1 and I2' under the independence probabilistic difference strategy is I1 −⊖in I2' = (π, ν), where π is given in Table 21 and ν in Table 22. We note that o4 ∉ π(PERENNIALS_FLOWERS) because ν(o4).sun = 〈∅, _, _〉.

Table 21. π resulting from difference

c                     π(c)
PLANTS                {o1}
ANNUALS               {}
PERENNIALS            {}
VEGETABLES            {}
HERBS                 {}
FLOWERS               {}
ANNUALS_HERBS         {o2, o3}
PERENNIALS_FLOWERS    {}
Table 22. ν resulting from difference

oid    ν(oid)
o1     [name: 〈{Lady-Fern, Ostrich-Fern}, 0.5u, 0.5u〉, soil: 〈{loamy}, 0.5u, 0.5u〉, water: 〈{25,…, 30}, 5u/6, 5u/6〉]
o2     [name: 〈{Cuban-Basil, Lemon-Basil}, u, u〉, soil: 〈{loamy, sandy}, 0.7u, 1.3u〉, water: 〈{20,…, 30}, u, u〉, sun: 〈{mild, medium}, 0.8u, 1.2u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
o3     [name: 〈{Mint}, u, u〉, soil: 〈{loamy}, u, u〉, water: 〈{20}, u, u〉, sun: 〈{mild}, u, u〉, expyears: 〈{2, 3, 4}, 0.6u, 1.8u〉, category: 〈{french, silver, wooly}, 0.6u, 1.8u〉]
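The difference of two triples in Definition 33 can be sketched in the same style. The dictionary encoding and the function names are assumptions, and the independence difference strategy is taken in the form [L1·(1 − U2), U1·(1 − L2)] from the probabilistic object base framework, since it is not restated in this section.

```python
from fractions import Fraction

def diff_independence(i1, i2):
    """Independence difference on intervals (assumed form):
    [L1*(1 - U2), U1*(1 - L2)]."""
    (l1, u1), (l2, u2) = i1, i2
    return (l1 * (1 - u2), u1 * (1 - l2))

def difference_triples(pt1, pt2, diff=diff_independence):
    """pt1 (-) pt2 as in Definition 33; a triple is a dict value -> (alpha, beta)."""
    result = {}
    for v, interval in pt1.items():
        if v in pt2:
            interval = diff(interval, pt2[v])
            if interval == (0, 0):
                continue  # v is removed from V
        result[v] = interval
    return result

one, half, sixth = Fraction(1), Fraction(1, 2), Fraction(1, 6)

# water of o1 in both instances: <{25,...,30}, u, u> with u = 1/6
water_o1 = {w: (sixth, sixth) for w in range(25, 31)}
water_diff = difference_triples(water_o1, water_o1)

# soil of o1: <{loamy}, u, u> in I1 against <{loamy, sandy}, u, u> in I2'
soil_diff = difference_triples({"loamy": (one, one)},
                               {"loamy": (half, half), "sandy": (half, half)})

# sun of o4: <{mild}, u, u> in both instances
sun_diff = difference_triples({"mild": (one, one)}, {"mild": (one, one)})
```

Under these assumptions the water bounds become 5/36 (the entry 5u/6 of Table 22), the soil bounds become 1/2, and the sun triple of o4 becomes empty, which is why o4 is dropped from π(PERENNIALS_FLOWERS) in Example 24.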
Conclusion

We presented an extension of the POB model with vague and imprecise values. In order to integrate fuzzy set values into the probabilistic framework of POBs, we employed a probability-based voting model of fuzzy sets and introduced a probabilistic interpretation of relations on them. The definitions of FPOB schemas, instances, and algebraic operations were then presented, generalizing those of POBs. The obtained algebra provides a formal basis for the development of fuzzy and probabilistic object bases, as relational algebra does for relational databases. A prototype of this model was demonstrated, and we are investigating its full-scale implementation to be applied to build object bases for real-world problems.
References

Baldwin, J. M., Lawry, J., & Martin, T. P. (1996). A note on probability/possibility consistency for fuzzy events. In Proceedings of the Sixth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 521–525). Granada, Spain.

Baldwin, J. F., Martin, T. P., & Pilsworth, B. W. (1995). Fril: Fuzzy and evidential reasoning in artificial intelligence. Taunton: Research Studies Press/John Wiley.

Bertino, E., & Martino, L. (1993). Object-oriented database systems: Concepts and architectures. Reading, MA: Addison-Wesley.

Blanco, I., Marín, N., Pons, O., & Vila, M. A. (2001). Softening the object-oriented database model: Imprecision, uncertainty and fuzzy types. In Proceedings of the First International Joint Conference of the International Fuzzy Systems Association and the North American Fuzzy Information Processing Society (pp. 2323–2328). Vancouver, Canada.

Bordogna, G., Pasi, G., & Lucarella, D. (1999). A fuzzy object-oriented data model managing vague and uncertain information. International Journal of Intelligent Systems, 14, 623–651.

Cao, T. H. (2001). Uncertain inheritance and recognition as probabilistic default reasoning. International Journal of Intelligent Systems, 16, 781–803.

Cao, T. H., & Nguyen, H. (2002). Towards fuzzy and probabilistic object bases. In Proceedings of the Third International Conference on Intelligent Technologies and the Third Vietnam–Japan Symposium on Fuzzy Systems and Application (pp. 35–41). Hanoi, Vietnam.

Cao, T. H., & Rossiter, J. M. (2003). A deductive probabilistic and fuzzy object-oriented database language. Fuzzy Sets and Systems, 140, 129–150.

Cao, T. H., Rossiter, J. M., Martin, T. P., & Baldwin, J. F. (2002). On the implementation of Fril++ for object-oriented logic programming with uncertainty and fuzziness. In Bouchon-Meunier, B. et al. (Eds.), Technologies for Constructing Intelligent Systems, Studies in Fuzziness and Soft Computing (vol. 90, pp. 393–406). Heidelberg: Physica-Verlag.

Cross, V. V. (2003). Defining fuzzy relationships in object models: Abstraction and interpretation. International Journal of Fuzzy Sets and Systems, 140, 5–27.

De Tré, G. (2001). An algebra for querying a constraint defined fuzzy and uncertain object-oriented database model. In Proceedings of the First International Joint Conference of the International Fuzzy Systems Association and the North American Fuzzy Information Processing Society (pp. 2138–2143). Vancouver, Canada.

Dubitzky, W., Büchner, A. G., Hughes, J. G., & Bell, D. A. (1999). Towards concept-oriented databases. Data & Knowledge Engineering, 30, 23–55.

Eiter, T., Lu, J. J., Lukasiewicz, T., & Subrahmanian, V. S. (2001). Probabilistic object bases. ACM Transactions on Database Systems, 26, 264–312.

Gaines, B. R. (1978). Fuzzy and probability uncertainty logics. Journal of Information and Control, 38, 154–169.

Garcia-Molina, H., Ullman, J. D., & Widom, J. (2000). Database system implementation. Upper Saddle River, NJ: Prentice Hall.

George, R., Buckles, B. P., & Petry, F. E. (1993). Modelling class hierarchies in the fuzzy object-oriented data model. Fuzzy Sets and Systems, 60, 259–272.

Itzkovich, I., & Hawkes, L. W. (1994). Fuzzy extension of inheritance hierarchies. Fuzzy Sets and Systems, 62, 143–153.

Lakshmanan, L. V. S. et al. (1997). ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22, 419–469.

Meyer, B. (1997). Object-oriented software construction. Upper Saddle River, NJ: Prentice Hall.

Nguyen, H. (2003). An algebra to handle fuzzy and probabilistic object bases. Master’s thesis, Faculty of Information Technology, Ho Chi Minh City University of Technology.

Rossazza, J.-P., Dubois, D., & Prade, H. (1997). A hierarchical model of fuzzy classes. In R. De Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 21–61). Singapore: World Scientific.

Van Gyseghem, N., & De Caluwe, R. (1997). The UFO database model: Dealing with imperfect information. In R. De Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 123–185). Singapore: World Scientific.

Yazici, A., & George, R. (1999). Fuzzy database modelling. Studies in fuzziness and soft computing (vol. 26). Heidelberg: Physica-Verlag.

Zadeh, L. A. (1978). PRUF: A meaning representation language for natural languages. International Journal of Man-Machine Studies, 10, 395–460.
Chapter III
Generalization Data Mining in Fuzzy Object-Oriented Databases

Rafal Angryk, Tulane University, USA
Roy Ladner, Naval Research Laboratory, USA
Frederick E. Petry, Tulane University & Naval Research Laboratory, USA
Abstract

In this chapter, we consider the application of generalization-based data mining to fuzzy similarity-based object-oriented databases (OODBs). Attribute generalization algorithms have been most commonly applied to relational databases, and we extend these approaches. A key aspect of generalization data mining is the use of a concept hierarchy. The objects of the database are generalized by replacing specific attribute values by the next higher-level term in the hierarchy. This eventually results in generalizations that represent a summarization of the information in the database. We focus on the generalization of similarity-based simple fuzzy attributes for an OODB using approaches to the fuzzy concept hierarchy developed from the given similarity relation of the database. Then consideration is given to applying this approach to complex structure-valued data in the fuzzy OODB.
Introduction

Data mining and knowledge discovery have gained importance as the amount of data from various sources has rapidly increased. Faced with such volumes of data, data mining techniques attempt to make sense of them by formulating information of value for decision making. Applications range from deciding on commercial sales promotions to environmental planning to national security decisions. Much of the current work is in the context of conventional relational databases. In this chapter, we discuss how to apply one valuable data mining approach, attribute-oriented generalization, to a similarity-based fuzzy OODB.
Background

In this section, we survey the general area of data mining, discuss some of the relevant work in fuzzy data mining, and then describe the specific technique of attribute-oriented induction for generalization, which is the focus of this chapter. Additionally, we describe the fuzzy object-oriented model based on similarity relationships that is the context in which we investigate data generalization.
Data Mining

Data mining or knowledge discovery generally refers to a variety of techniques that have developed in the fields of databases, machine learning, and pattern recognition. The intent is to uncover useful patterns and associations from large databases. Although we are primarily interested here in specific algorithms for knowledge discovery, we first review the overall process of data mining (Feelders, Daniels, & Holsheimer, 2000).

The initial steps of data mining are concerned with the preparation of data, including data cleaning, intended to resolve errors and missing data, and the integration of data from multiple heterogeneous sources. Next are the steps needed to prepare for actual data mining. These include selection of the specific data relevant to the task and transformation of this data into the format required by the data mining approach. These steps are sometimes considered to be those in the development of a data warehouse, i.e., an organized format of data available for various data mining tools. A wide variety of specific knowledge discovery algorithms has been developed (Han & Kamber, 2000). These discover patterns that can then be evaluated based on some “interestingness” measure used to prune the huge number of available patterns. Finally, as is true for any decision aid system, an effective user interface with visualization and alternative representations must be developed for presentation of the discovered knowledge.

Specific data mining algorithms can be considered as belonging to two categories: descriptive and predictive data mining. In the descriptive category are class description, association rules, and classification. Class description can provide characterization or generalization of data, or comparisons between data classes to provide class discriminations. Data generalization is a process of grouping data, enabling transformation of similar item sets, stored originally in a database at the low (primitive) level, into more abstract conceptual representations. This process is a fundamental element of attribute-oriented induction, a descriptive database mining technique that compresses the original data set into a generalized relation, which provides concise, summary information about the massive set of task-relevant data.

Association rules correspond to correlations among the data items (Agrawal, Imielinski, & Swami, 1993). They are often expressed in rule form, showing attribute-value conditions that commonly occur at the same time in some set of data. An association rule of the form X ⇒ Y can be interpreted as meaning that the tuples in the database that satisfy the condition X are also “likely” to satisfy Y; the word “likely” indicates that this is not a functional dependency in the formal database sense.

Finally, a classification approach analyzes the training data (data with known class membership) and constructs a model for each class based on the features in the data. Commonly, the outputs generated are decision trees or sets of classification rules. These can be used for the characterization of the classes of existing data and to allow the classification of data in the future, and so can also be considered predictive.

Predictive analysis is also a very developed area of data mining. One common approach is clustering. Clustering analysis identifies the collections of data objects that are similar to each other. The similarity metric is often a distance function given by experts or appropriate users. A good clustering method produces high-quality clusters that yield low intercluster similarity and high intracluster similarity.

Prediction techniques are used to predict possible missing data values or distributions of values of some attributes in a set of objects. First, one must find the set of attributes relevant to the attribute of interest and then predict a distribution of values based on the set of data similar to the selected objects. A large variety of techniques is used, including regression analysis, correlation analysis, genetic algorithms, and neural networks, to mention a few.

Finally, a particular case of predictive analysis is time-series analysis. This technique considers a large set of time-based data to discover regularities and interesting characteristics. One can search for similar sequences or subsequences, then mine sequential patterns, periodicities, trends, and deviations.
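The “likely” in an association rule X ⇒ Y, discussed earlier in this section, is commonly quantified by two measures, support and confidence. The sketch below is a minimal illustration with hypothetical toy transactions; the function names and the data are assumptions, and the text above does not prescribe these exact measures.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, x, y):
    """Among transactions satisfying X, the fraction that also satisfy Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical market-basket data
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Here the rule bread ⇒ milk holds in two of the three transactions that contain bread, so it is likely but not certain, which is exactly why such a rule is not a functional dependency.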
Fuzzy Data Mining

An early and continuing significant application of fuzzy sets has been in pattern recognition, especially fuzzy clustering algorithms (Bezdek, 1974). Hence, much of the effort in fuzzy data mining has been made by using fuzzy clustering and fuzzy set approaches in neural networks and genetic algorithms (Hirota & Pedrycz, 1999). In fuzzy set theory, an important consideration is the treatment of data from a linguistic viewpoint. From this, an approach was developed that uses linguistically quantified propositions to summarize the content of a database by providing a general characterization of the analyzed data (Yager, 1991; Kacprzyk, 1999; Dubois & Prade, 2000; Feng & Dillon, 2003). Fuzzy gradual rules for data summarization were also considered (Cubero et al., 1999). A common organization of data for data mining is the multidimensional data cube in data warehouse structures. Treating the data cube as a fuzzy object has provided another approach for knowledge discovery (Laurent et al., 2000).

Fuzzy data mining for generating association rules was considered by a number of researchers. There are approaches using the set-oriented mining (SETM) algorithm (Shu et al., 2001) and other techniques (Bosc & Pivert, 2001), but most have been based on the Apriori algorithm (Delgado et al., 2003). Extensions included fuzzy set approaches to quantitative data (Zhang, 1999; Kuok et al., 1998), hierarchies or taxonomies (Chen et al., 2000; Lee, 2001), weighted rules (Gyenesei, 2001), and interestingness measures (de Graaf et al., 2001; Gyenesei, 2001; Au & Chan, 2003).
Generalization Data Mining

The basis of a generalization data mining approach rests on three aspects (Han & Kamber, 2000): the set of data relevant to a given data mining task; the expected form of knowledge to be discovered; and the background knowledge, which usually supports the whole process of knowledge acquisition. Generalization of data is typically performed with the use of concept hierarchies, which in ordinary databases are considered to be part of background knowledge and are indispensable for the process.

Despite the progress in research on data mining algorithms, the phase of data generalization remains a crucial activity. The choice of data to be analyzed, as well as of the concepts for its generalization, has a fundamental influence on retrieved results, regardless of the knowledge acquisition techniques applied. Although certain dependencies among data can be discovered at the primitive concept level, much stronger and often far more interesting dependencies can be determined at a higher concept level. With data generalization executed at the initial stage of data mining, the process of knowledge extraction can be more effective and bring concise results directly at the abstraction level desired by a user. Moreover, many relations occurring at the lower level may not meet the minimum support requirement assigned by data analysts to eliminate infrequent regularities, whereas after summarization via generalization they may occur often enough to have significant meaning.

The idea of using concept hierarchies for attribute-oriented induction in data mining was investigated by several research groups (Han et al., 1992; Han, 1995; Carter & Hamilton, 1998; Hilderman et al., 1999). Generalization of database objects is performed on an attribute-by-attribute basis, applying a separate concept hierarchy for each of the generalized attributes included in the relation of task-relevant data. The basic steps and guidelines for attribute-oriented generalization in an OODB are summarized below (Han, Nishio, & Kawano, 1994):

1. An initial query to the fuzzy OODB with a given similarity threshold provides the starting generalization class G0, which contains the set of data that is relevant to the user’s generalization interest.

2. Generalization should be performed on the smallest decomposable components (or attributes) of the data objects in each generalization class Gi.

3. If there is a large set of distinct values for an attribute but no higher-level concept is provided for the attribute, the attribute should be removed in the generalization process.

4. If a higher-level concept exists in the concept tree for an attribute value of an object, the substitution of the value by its higher-level concept generalizes the object. Minimal generalization should be enforced by ascending the tree one level at a time.

5. Two generalized objects may become similar enough to be merged (see the next section for merging of objects in a fuzzy OODB). We therefore include an added attribute, count, to keep track of how many objects were merged to form the current generalized object. The value of the count of an object should be carried to its generalized object, and the counts should be accumulated when merging identical objects in generalization.

6. The generalization is controlled by providing levels that specify how far the process should proceed. If the number of distinct values of an attribute in the given class is larger than the generalization threshold value, further generalization on this attribute should be performed. If the number of objects of a generalized class is larger than the generalization threshold value, the generalization should proceed further.
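Steps 4 to 6 above can be sketched for the crisp case in Python. This is a simplified illustration under stated assumptions, not the chapter's fuzzy algorithm: the concept hierarchy is a hypothetical child-to-parent dictionary, objects are merged only when they become identical (similarity-based merging is not modelled), and the names generalize_once and generalize are invented for the example.

```python
from collections import Counter

def generalize_once(objects, attribute, parent_of):
    """Ascend the concept hierarchy one level for one attribute (step 4)
    and merge identical generalized objects, accumulating counts (step 5).

    objects: Counter mapping a sorted tuple of (attribute, value) pairs -> count
    parent_of: dict child concept -> parent concept (hypothetical hierarchy)
    """
    merged = Counter()
    for obj, count in objects.items():
        values = dict(obj)
        # Values without a parent (top of the hierarchy) stay unchanged
        values[attribute] = parent_of.get(values[attribute], values[attribute])
        merged[tuple(sorted(values.items()))] += count
    return merged

def generalize(objects, attribute, parent_of, threshold):
    """Repeat one-level generalization while the attribute has more distinct
    values than the threshold allows (step 6)."""
    while len({dict(o)[attribute] for o in objects}) > threshold:
        generalized = generalize_once(objects, attribute, parent_of)
        if generalized == objects:
            break  # top of the hierarchy reached, nothing left to generalize
        objects = generalized
    return objects

# Hypothetical concept hierarchy and task-relevant objects
parent_of = {"Mint": "herb", "Thyme": "herb", "Aster": "flower",
             "herb": "plant", "flower": "plant"}
initial = Counter({
    (("name", "Mint"), ("sun", "mild")): 1,
    (("name", "Thyme"), ("sun", "mild")): 1,
    (("name", "Aster"), ("sun", "mild")): 1,
})

result = generalize(initial, "name", parent_of, threshold=2)
```

With a threshold of two distinct values, a single ascent suffices: Mint and Thyme merge into one object with count 2, and the total count of 3 is preserved, so no original object is lost from the summary.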
Attribute generalization should not be mistaken for simple record summarization. Summaries of data usually have a more simplified character and, in order to simplify the final report, tend to omit data that do not originally occur in large quantities. Gradual generalization through concept hierarchies allows, in contrast, detailed tracking of all data objects and can lead to the discovery of interesting patterns among data at the lowest possible abstraction level of their occurrence, while decreasing the risk of omitting them due to overgeneralization. Appropriate attribute-oriented generalization allows extraction of knowledge on a specific abstraction level without omitting even rare attribute values. Such atypical values, despite being infrequent initially (at a low level of the generalization hierarchy), can sum up to impressive cardinalities when generalized to a sufficiently high abstraction level, which can then sometimes strongly influence the suspected proportions among the original data. Depending on the approach and the intention of data analysts, generalization of collected data can be treated as a final step of data mining (e.g., summary tables are presented to users, allowing them to interpret overall information) or as an introduction to further knowledge extraction (e.g., extraction of abstract association rules directly from the generalized data).
Fuzzy Object-Oriented Model
The OODB model and object-oriented programming languages arose out of the necessity of dealing with the complexity of large software systems. Object-oriented systems view the universe as consisting of objects and try to model the interaction between objects. The object-oriented model is characterized by its properties of abstraction, encapsulation, modularity, hierarchy, typing, concurrency, and persistence. The object-oriented model is a natural successor to record-based models with explicit mechanisms to overcome their disadvantages (Bertino & Martino, 1991). The object-oriented data model (OODM) models composite objects, thereby capturing the IS-PART-OF concept, and relationships directly. Data are organized into classes, and classes are organized into an inheritance hierarchy. This
methodology is useful in capturing similarities among classes and data and abstracting them to higher levels. An object is completely specified by its identity, behavior, and state. The state of an object consists of the values of its attributes. Its behavior is specified by the set of methods that operate on the state. An object identifier maintains the identity of an object, thereby distinguishing it from all others. The use of object identifiers permits three different types of object equality (Khoshafian & Copeland, 1986):
1. Identity (=): The identity predicate corresponds to the equality of references or pointers in conventional languages.
2. Shallow equality (se): Two objects are shallow equal if their states or contents are identical; that is, the two objects need not be the same object, but the values of their corresponding instance variables must be identical objects.
3. Deep equality (de): This ignores object identities and checks whether two objects are instances of the same class (i.e., same structure or type) and whether the values of the corresponding base objects are the same.
It is clear that identity is stronger than shallow equality, and shallow equality is stronger than deep equality. If identity holds, the same can be said of shallow and deep equality; if shallow equality holds, so does deep equality. The most powerful aspect of an OODM is its ability to model inheritance. A class may inherit all the methods and attributes of its superclass. When a class inherits from one superclass, this is known as single inheritance. The situation in which a class inherits from more than one superclass is called multiple inheritance, and the inheritance structure forms a lattice. The class–subclass relationships form a class hierarchy similar to a generalization–specialization relationship. Another hierarchy that may originate at an attribute is the class composition hierarchy (Kim, 1989). The class composition hierarchy is distinct and orthogonal to the class hierarchy.
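The three equality predicates can be illustrated with a small sketch. It assumes that Python's built-in object identity stands in for the object identifier; the `Obj` class and the three functions are hypothetical names introduced only for this example.

```python
class Obj:
    """Minimal object whose state is a mapping from attribute names to values."""
    def __init__(self, **state):
        self.state = state

def identical(a, b):
    # Identity: both references denote the very same object.
    return a is b

def shallow_equal(a, b):
    # Same attribute names, and corresponding values are the same objects.
    return (a.state.keys() == b.state.keys()
            and all(a.state[k] is b.state[k] for k in a.state))

def deep_equal(a, b):
    # Same class and recursively equal base values; identities are ignored.
    if isinstance(a, Obj) and isinstance(b, Obj):
        return (type(a) is type(b)
                and a.state.keys() == b.state.keys()
                and all(deep_equal(a.state[k], b.state[k]) for k in a.state))
    return a == b

engine = Obj(power=150)
car1 = Obj(engine=engine)
car2 = Obj(engine=engine)          # shares the same engine object
car3 = Obj(engine=Obj(power=150))  # owns a separate, equal-valued engine

print(identical(car1, car2), shallow_equal(car1, car2), deep_equal(car1, car3))
```

Here `car1` and `car2` are shallow equal but not identical, while `car1` and `car3` are only deep equal, which matches the ordering of the three predicates by strength.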
A Fuzzy Class Hierarchy
In this approach (George, Buckles, & Petry, 1993), two levels of imprecision may be represented: first, the imprecision of object membership in class values (fuzzy class extents); and second, the fuzziness of object attribute values. The class composition schema is enhanced to incorporate the similarities between object instances, and the effects of the merge operator on class memberships are considered.
A class is characterized by its structure, methods, and extension, so a class is a pair $C_i = (t_i, \mathrm{ext}(t_i))$, where $t_i$ is a type. Next, $C_i$ is a subclass of $C_i'$ ($C_i \subseteq_s C_i'$) iff:
1. The structure of $C_i'$ is less or equally defined (more general) in comparison to $C_i$.
2. A class possesses every method owned by its superclasses, though the methods may be refined in the class.
A class hierarchy models class-subclass relationships and may be represented as:
$$C_i \subseteq_s C_{i+1} \subseteq_s \cdots \subseteq_s C_n$$
where $C_n$ represents the root (basic) class, and $C_i$ is the most refined (leaf) class. Analysis of class-subclass relations indicates that they can be broadly divided into two different types:
1. Specialization subclasses (also referred to as partial subclasses or object-oriented subclasses), where the subclass is a specialization of its immediate superclass, e.g., computer science is a specialization of engineering.
2. Subset subclasses, where the subclass is a subset of its immediate superclass, e.g., the class of employees is a subset subclass of the class of persons.
A fuzzy hierarchy exists whenever it is judged subjectively that a subclass or instance is not a full member of its immediate class. Consideration of a fuzzy representation of the class hierarchy should take into account the different requirements and characteristics of the class-subclass relations. We associate with a subclass a grade of membership in its immediate class, $C_i \subseteq_s C_{i+1}$, represented as $\mu_{C_i}(C_{i+1})$. A subclass is now represented by a pair $(C_i, \mu(C_{i+1}))$, the second element of which represents the membership of $C_i$ in its immediate class $C_{i+1}$. The class hierarchy is now:
$$(o_i, \mu(C_i)) \subseteq_s (C_i, \mu(C_{i+1})) \subseteq_s (C_{i+1}, \mu(C_{i+2})) \subseteq_s \cdots \subseteq_s (C_n, \mu(C_{n+1}))$$
The nature of class-subclass relationships also depends on the type of ISA links existing between the two. It is possible to have strong and weak ISA relationships between a class and its subclass. In a weak ISA relationship, the membership of a class in its superclasses is monotonically nonincreasing, while for the strong ISA link, the membership is nondecreasing. A fuzzy hierarchy possesses the following properties:
1. Membership of an instance/subclass in any of the superclasses in its hierarchy is constant, monotonically nonincreasing, or monotonically nondecreasing. If the membership is constant, the hierarchy is a subset hierarchy; if nonincreasing, a weak ISA specialization hierarchy; and if nondecreasing, a strong ISA specialization hierarchy.
2. For a weak ISA specialization hierarchy and a strong ISA specialization hierarchy:
$$\mu_{C_i}(C_n) = f(\mu_{C_i}(C_{i+1}), \mu_{C_{i+1}}(C_{i+2}), \ldots, \mu_{C_{n-1}}(C_n))$$
The function f, which is application dependent, may be a product, min, max, etc.
3. For two objects $o$ and $o'$ such that $o, o' \in \mathrm{ext}(C_i)$, if o de o' or o se o', then $\mu_o(C_i) = \mu_{o'}(C_i)$. In other words, two objects have the same membership in a class (and all its superclasses) if they are value equal.
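Property 2 can be made concrete with a short sketch that combines link memberships along a chain and classifies the result according to property 1. The function names and the sample link values are hypothetical; the choice of f is left to the application, as the text states.

```python
from functools import reduce
import operator

def product(values):
    """Product of a sequence of membership degrees."""
    return reduce(operator.mul, values, 1.0)

def memberships_along_chain(links, f):
    """links[k] is the membership of one class in its immediate superclass,
    listed from the most refined class upward. Returns the membership of the
    starting class in each successive superclass, combined with f."""
    return [f(links[:k]) for k in range(1, len(links) + 1)]

def hierarchy_kind(memberships):
    """Classify a sequence of memberships as in property 1."""
    pairs = list(zip(memberships, memberships[1:]))
    if all(a == b for a, b in pairs):
        return "subset"
    if all(a >= b for a, b in pairs):
        return "weak ISA"
    if all(b >= a for a, b in pairs):
        return "strong ISA"
    return "mixed"

links = [0.9, 0.8, 1.0]  # hypothetical link memberships
for f in (product, min, max):
    chain = memberships_along_chain(links, f)
    print(f.__name__, chain, hierarchy_kind(chain))
```

With degrees in [0, 1], a running product or min can never increase and a running max can never decrease, so in this sketch the choice of f decides whether the chain behaves as a weak or a strong ISA hierarchy.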
We have described a fuzzy hierarchy in which each instance/subclass is a member of its immediate superclass with a degree of membership, and we have described the membership of an instance in a class as a function of the memberships in the immediate classes that lie between the instance and the class of interest. However, this may not always be possible, because hierarchies are not always “pure” and mixed hierarchies are more the rule. In some applications, it might be necessary to assume that the membership of an object (class) in its class (superclass) is list directed. Thus, the expression for the class hierarchy can be generalized to account for the different types of links that can exist within an object hierarchy:
$$(o_i, \{\mu(C_i), \mu(C_{i+1}), \ldots, \mu(C_n)\}) \subseteq_s (C_i, \{\mu(C_{i+1}), \mu(C_{i+2}), \ldots, \mu(C_n)\}) \subseteq_s \cdots \subseteq_s (C_n, \mu(C_{n+1}))$$
Fuzzy Class Schema
The OODM permits data to be viewed at different levels of abstraction based on the semantics of the data and their interrelationships. By extending the model to incorporate fuzzy and imprecise data, we allow data in a given class to be viewed through another layer of abstraction, this time one based on data values. This ability of the data model to chunk information further enhances its utility. In developing the fuzzy class schema, the merge operator is defined, which
combines two object instances of a class into a single object instance, provided predefined level values are achieved. The merge operator at the same time maintains the membership relationship existing between the object/class and its class/superclass. Assume for generality two object members of a given class $C_i$ with list-directed class/superclass memberships:
$$o = (i,\ ,\ \langle \mu_o(C_i), \mu_o(C_{i+1}), \ldots, \mu_o(C_n) \rangle)$$
$$o' = (i',\ ,\ \langle \mu_{o'}(C_i), \mu_{o'}(C_{i+1}), \ldots, \mu_{o'}(C_n) \rangle)$$
So $o$ is a fuzzy object in $C_i$ if $o \in \mathrm{ext}(C_i)$ and $\mu_o(C_i)$ takes values in the range [0,1].
Now we must consider how the data values as described by similarity relations behave (Petry, 1996). Assume attribute $a_{kj}$ of class $C_i$ with a noncomposite domain $D_j$. By definition of a fuzzy object, the domain of $a_{kj}$ is $d_{kj} \subseteq D_j$. So the similarity threshold of $D_j$ is:
$$\mathrm{Thresh}(D_j) = \min \{ \min_{x, y \in d_{kj}} [ s(x, y) ] \}$$
where the outer minimum is taken over the objects $o \in \mathrm{ext}(C_i)$ and x, y are atomic elements. The threshold of a composite object is undefined. A composite domain is constituted of simple domains (at some level), each of which has a threshold value, i.e., the threshold for a composite object is a vector. The threshold value represents the minimum similarity of the values an object attribute may take. If the attribute domain is strictly atomic for all objects of the class (i.e., the cardinality of $a_{kj}$ is 1), then the threshold is 1. As the threshold value moves toward 0, larger chunks of information are grouped together, and the information conveyed about the particular attribute of the class decreases. A level value given a priori determines the objects that may be combined by the set union of the respective domains. Note that the level value may be specified via the query language with the constraint that it may never exceed the threshold value.
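As a rough illustration of the threshold definition, the sketch below computes the minimum pairwise similarity over the value sets that the objects of a class hold for one attribute. The `similarity` helper, the `threshold` function, and the sample value sets are hypothetical; the similarity degrees are a subset of the HAIR COLOR proximity values used in this chapter.

```python
def similarity(values, pairs):
    """Build a symmetric, reflexive similarity lookup from listed pairs."""
    s = {(v, v): 1.0 for v in values}
    for (a, b), degree in pairs.items():
        s[a, b] = s[b, a] = degree
    return s

S = similarity(
    ["black", "d.brown", "auburn", "red"],
    {("black", "d.brown"): 0.8, ("black", "auburn"): 0.7,
     ("black", "red"): 0.7, ("d.brown", "auburn"): 0.7,
     ("d.brown", "red"): 0.7, ("auburn", "red"): 0.8},
)

def threshold(value_sets, s):
    """Minimum similarity between any two values held together by one object.
    value_sets holds one set of attribute values per object of the class."""
    return min(s[x, y] for d in value_sets for x in d for y in d)

precise = [{"black"}, {"red"}, {"auburn"}]  # strictly atomic values
imprecise = [{"black"}, {"black", "d.brown"}, {"auburn", "red", "black"}]

print(threshold(precise, S), threshold(imprecise, S))
```

The atomic case yields a threshold of 1, and the threshold drops as objects hold less similar values together, in line with the remarks above.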
Merging Objects
For objects $o_i$ and $o_i'$, assume that $\forall a_{kj}$ the domain of $a_{kj}$ is noncomposite:
$$o_i'' = \mathrm{Merge}(o_i, o_i') = (i'',\ ,\ \langle \mu_{o''}(C_i), \mu_{o''}(C_{i+1}), \ldots, \mu_{o''}(C_n) \rangle)$$
where $o_{kj}'' = (i_{kj}'', \{i_{kj}, i_{kj}'\})$ and $\mu_{o''}(C_m) = f(\mu_o(C_m), \mu_{o'}(C_m))$ $\forall m$, $m = 1, \ldots, n$, such that $\forall\, \mathrm{val}(i_{kj}), \mathrm{val}(i_{kj}') \in d_{ij} \cup d_{ij}'$:
$$\min[ s(\mathrm{val}(i_{kj}), \mathrm{val}(i_{kj}')) ] > \mathrm{Level}(D_j) \quad \text{and} \quad \mathrm{Level}(D_j) \le \mathrm{Thresh}(D_j)$$
The merge operator permits a reorganization of the objects belonging to a class scheme by grouping them according to the similarity of one attribute object to another. As in the definition of threshold, the definition can be extended to composite objects.
Two objects in an OODBMS can be nonredundant even if they are shallow equal. By introducing fuzziness into the model, however, we weaken this property. Two objects that are shallow equal are redundant, as are objects exhibiting deep equality. But equality alone does not determine redundancy, and the following is the characteristic of redundancy: two objects $o_i$ and $o_i'$ are redundant iff $\forall j$, $j = 1, 2, \ldots, m$, and $\mathrm{Level}(D_j)$ given a priori, $\forall\, \mathrm{val}(i_{kj}), \mathrm{val}(i_{kj}') \in d_{ij} \cup d_{ij}'$:
$$\min[ s(\mathrm{val}(i_{kj}), \mathrm{val}(i_{kj}')) ] > \mathrm{Level}(D_j)$$
This property of redundancy (Buckles & Petry, 1982) is directly responsible for the property of value abstraction exhibited by the fuzzy database. It also ensures that the results of database operations are unique.
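The redundancy test and the merge operator might be sketched as follows. This is an illustration under several assumptions: objects are reduced to a mapping from attribute names to sets of atomic values, the membership-combining function f defaults to `max` (the text leaves f open), and all names and data are hypothetical.

```python
def similarity(values, pairs):
    """Build a symmetric, reflexive similarity lookup from listed pairs."""
    s = {(v, v): 1.0 for v in values}
    for (a, b), degree in pairs.items():
        s[a, b] = s[b, a] = degree
    return s

S = similarity(["black", "d.brown", "red"],
               {("black", "d.brown"): 0.8, ("black", "red"): 0.7,
                ("d.brown", "red"): 0.7})

def redundant(o1, o2, s, level):
    """True if, for every attribute, the least similar pair of values in the
    union of both value sets is still more similar than the given level."""
    for attr in o1:
        union = o1[attr] | o2[attr]
        if not min(s[x, y] for x in union for y in union) > level[attr]:
            return False
    return True

def merge(o1, mu1, o2, mu2, s, level, f=max):
    """Merge two redundant objects; f combining memberships is an assumption."""
    if not redundant(o1, o2, s, level):
        return None
    merged = {attr: o1[attr] | o2[attr] for attr in o1}
    memberships = [f(a, b) for a, b in zip(mu1, mu2)]
    return merged, memberships

LEVEL = {"hair": 0.75}  # hypothetical level value given a priori
o1, o2, o3 = {"hair": {"black"}}, {"hair": {"d.brown"}}, {"hair": {"red"}}

print(merge(o1, [1.0, 0.9], o2, [0.8, 1.0], S, LEVEL))
print(merge(o1, [1.0, 0.9], o3, [1.0, 1.0], S, LEVEL))
```

The sketch does not check that the level stays at or below the threshold of the domain; a fuller version would enforce that constraint before merging.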
Other Fuzzy Object-Oriented Approaches
For OODBs, Zicari (1990) considered issues of incompleteness, albeit without use of fuzzy concepts. In particular, incomplete data in an object are handled by the introduction of explicit null values in a similar manner to the relational and nested relational models. Several researchers have been developing fuzzy OODB approaches and studying related issues for a number of years (de Clauwe, 1997; Lee et al., 1999; Pasi & Yager, 1999; Bordogna et al., 2000; de Tre et al., 2000; Marin et al., 2000; Cao, 2001; Koyuncu & Yazici, 2003; Ma, 2000, 2004). Significant applications of fuzzy object modeling are in the areas of complex spatial data and GIS (George et al., 1992; Morris & Petry, 1998; Cross & Firat, 2000).
Generalization in Fuzzy OODB
The starting point for all generalization approaches must be the most frequently encountered attribute values: single-valued nonnumeric and numeric data values. We will consider extensively the issues related to generalization for single-valued data and then show how this may extend to structured data and class hierarchy issues.
Attribute Generalization and Concept Hierarchies
For the purpose of attribute-oriented generalization, the concept hierarchy is critical and, in an environment of fuzzy data, may lead to different interpretations for generalization. Each concept hierarchy reflects background knowledge about the domain to be generalized. These hierarchies should permit gradual, similarity-based aggregation of attribute values in the objects. Typically, a hierarchy is built in a bottom-up manner, progressively increasing the abstraction of the generalization concepts at each new level. Creation of new concept levels in generalization hierarchies is accompanied by an increase of the concept abstraction and a decrease of cardinality (each higher level includes fewer data descriptors, but the descriptors have more general meanings). Hierarchical grouping (Han, 1995) was based on tree-like generalization hierarchies, where each of the concepts at the lower level of the generalization hierarchy was allowed to have just one abstract concept at the level directly above it. Fuzzy ISA hierarchies were later applied to data summarization (Lee & Kim, 1997), allowing a single concept (attribute value) to partially belong to more than one of the concepts placed at the next abstract level (direct abstracts). However, this and a similar approach (Raschia & Mouaddib, 2002) lack certain properties (exact count/vote propagation) that we find are needed in attribute-oriented generalization. Because of the nature of fuzzy OODBs, we can restructure the original data (by merging objects considered to be identical at a certain α-cut level, according to a given similarity relation) in order to begin attribute-oriented generalization from a desired level of detail, the initial set $G_0$. This approach, when removing unnecessary detail, must be applied with caution.
When merging objects according to their equivalence at the given similarity level (e.g., by using queries with a high threshold level), we are not able to keep track of the number of original objects merged into one object. This may result in a significant change of balance among the objects in a class and lead to erroneous information (not reflecting reality) presented later in the form of support and confidence of the extracted knowledge. This problem, which we refer to as a count dilemma in the
count propagation, can easily be avoided by performing extraction of the initial working class $G_0$ at a detailed level (i.e., α = 1.0), where only identical values are merged (e.g., Bleached and Light Blond will be unified), but no considerable number of objects would be lost as the result of such redundancy removal.
Another issue that must be emphasized is an exact count propagation dilemma (also derived from the principle of count propagation). When generalizing data for data mining purposes, we have to preserve the number of objects and the relationships between them in identical proportions at each level of generalization. In other words, we have to assure that each object from the original class will be counted once at each of the levels of the generalization hierarchy. This leads to the two following properties, which must be maintained at each level of the generalization hierarchy:
1. The set of concepts at each level of the hierarchy should cover all of the attribute values that occurred in the original database (so we are guaranteed not to lose the number of objects when generalizing their values).
2. Never allow any attribute value (or its abstract) to be counted more or less than once at each level of the generalization hierarchy. (When we allow a concept to partially belong to more than one of its direct abstracts, we have to check each time that the sum of fractional memberships is equal to 1.0.) This aspect is especially important when we plan to apply attribute-oriented generalization as a pre-analysis tool, to compress the initial data set to a form more appropriate for the application of computationally complex data mining algorithms (e.g., association rules mining).
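The two properties lend themselves to a mechanical check on each level of a hierarchy. The sketch below assumes one level is stored as a mapping from each concept to its direct abstracts with fractional memberships; the `check_level` function and the sample memberships are hypothetical.

```python
import math

def check_level(level, original_values):
    """Check one hierarchy level against the two properties.
    level maps each concept to {direct abstract: membership}.
    Returns the uncovered values and the concepts whose memberships
    do not sum to 1.0 (together with the offending sums)."""
    missing = set(original_values) - set(level)  # property 1: coverage
    bad_sums = {}
    for concept, memberships in level.items():   # property 2: counted once
        total = sum(memberships.values())
        if not math.isclose(total, 1.0):
            bad_sums[concept] = total
    return missing, bad_sums

# Hypothetical level: "red" is not covered, and "blond" is over-counted.
level = {
    "black": {"DARK": 1.0},
    "blond": {"LIGHT": 0.8, "DARK": 0.4},
}
missing, bad_sums = check_level(level, ["black", "blond", "red"])
print(missing, bad_sums)
```

A level passes when both returned collections are empty.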
For the purpose of further analysis, we distinguish three basic types of generalization hierarchies:
1. Crisp concept hierarchy (Han, 1995; Hilderman et al., 1999): Here each attribute variable (concept) at each level of the hierarchy can have only one direct abstract (its direct generalization) to which it fully belongs. There is no consideration of the degree of relationship, e.g., {master of arts, master of science, doctorate} ⊂ graduate, {freshman, sophomore, junior, senior} ⊂ undergraduate. This is shown in the tree in Figure 1.
2. Fuzzy concept hierarchy (Lee & Kim, 1997; Raschia & Mouaddib, 2002): The hierarchy of concepts here reflects the degree to which one concept belongs to its direct abstract, and more than one direct abstract of a single concept is allowed. Because exact count propagation is not guaranteed, such a hierarchy seems to be more appropriate for simplified data summarization, or for the cases when subjective results are to be emphasized (when we purposely want to modify the roles or influences of certain objects). Utilization of the four popular text editors could be generalized as follows (Lee & Kim, 1997). We denote fuzzy generalization of concept a to its direct abstract b with membership degree c as a ≺ b | c:
• First level of abstraction: {emacs ≺ editor | 1.0; emacs ≺ documentation | 0.1; vi ≺ editor | 1.0; vi ≺ documentation | 0.3; word ≺ documentation | 1.0; word ≺ spreadsheet | 0.1; wright ≺ spreadsheet | 1.0}
• Second level of hierarchy: {editor ≺ engineering | 1.0; documentation ≺ engineering | 1.0; documentation ≺ business | 1.0; spreadsheet ≺ engineering | 0.8; spreadsheet ≺ business | 1.0}
• Third level of hierarchy: {engineering ≺ any | 1.0; business ≺ any | 1.0}
3. Consistent fuzzy concept hierarchy (recently proposed in Angryk & Petry, 2003): Each degree of membership is normalized to preserve an exact count propagation for each object when being generalized.
Figure 1. Crisp concept hierarchy (ANY generalizes undergraduate and graduate; undergraduate generalizes freshman, sophomore, junior, and senior; graduate generalizes M.A., M.S., and Ph.D.)
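The difference between the second and third hierarchy types shows up when counts are pushed one level upward. The sketch below uses the first-level memberships quoted above from Lee & Kim (1997); the usage counts, the function names, and the normalization by the sum of each concept's memberships are assumptions made for illustration.

```python
def propagate(counts, level):
    """Push counts one level up; each concept splits its count by membership."""
    out = {}
    for concept, n in counts.items():
        for abstract, mu in level[concept].items():
            out[abstract] = out.get(abstract, 0.0) + n * mu
    return out

def normalize(level):
    """Rescale each concept's memberships so that they sum to 1.0."""
    return {concept: {a: mu / sum(m.values()) for a, mu in m.items()}
            for concept, m in level.items()}

FIRST_LEVEL = {
    "emacs":  {"editor": 1.0, "documentation": 0.1},
    "vi":     {"editor": 1.0, "documentation": 0.3},
    "word":   {"documentation": 1.0, "spreadsheet": 0.1},
    "wright": {"spreadsheet": 1.0},
}
counts = {"emacs": 10, "vi": 10, "word": 10, "wright": 10}  # hypothetical

raw = propagate(counts, FIRST_LEVEL)
consistent = propagate(counts, normalize(FIRST_LEVEL))
print(sum(counts.values()), sum(raw.values()), sum(consistent.values()))
```

With the unnormalized memberships the 40 original objects are counted as roughly 45 after one step, while the normalized hierarchy keeps the total at 40 (up to floating-point rounding).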
Extraction of Concept Hierarchies from Similarity Relations
Here we consider the nature of similarity relations as a mechanism for attribute-oriented generalization. Commonly, the generalization of concepts in data mining is based on two types of ontological relations: (1) “Part-Of” (e.g., wheels, oil, oil filter, and brake pads sold by Wal-Mart could be generalized to “auto-service items”) and (2) “Is-A” (e.g., red, auburn, ruby, and scarlet could be described in general as “reddish colors”). “Part-Of” emphasizes the similarity of concepts to their abstract, while “Is-A” accentuates the similarity occurring between the values from a lower level of abstraction and then tries to define a descriptor fitting their character. In practice we may find loose hybrids of these relations, because the structure of the generalization hierarchy strongly depends on the character of the
data mining task or personal preferences of the analyst. Each of these relations among concepts can be reflected in a similarity relation, because the user or data-mining analyst can be allowed to modify the values in the similarity table in the individual’s user view of the database to represent the similarity between the concepts (attribute values) in the context of interest.
The existence of a similarity relation modeled for a particular domain can lead to the extraction of a crisp concept hierarchy, allowing attribute-oriented generalization. Let $S_\alpha$ be the α-cut of the similarity relation S presented in Table 1. It can be shown (Zadeh, 1970) that if S is a similarity relation on a given domain $D_j$ (which is a single attribute in our case), then $\forall \alpha \in (0,1]$ each $S_\alpha$ creates equivalence classes in the domain $D_j$. Now, let $\Pi_\alpha$ denote the equivalence class partition induced on domain $D_j$ by $S_\alpha$. Clearly, $\Pi_{\alpha'}$ is a refinement of $\Pi_\alpha$ if α' ≥ α. A nested sequence of partitions $\Pi_{\alpha_1}, \Pi_{\alpha_2}, \ldots, \Pi_{\alpha_k}$ may be represented diagrammatically in the form of a partition tree. The nested sequence of partitions in the form of a tree has a structure identical to the crisp concept hierarchy for data mining generalization purposes (Figure 2). The increase of abstraction in the partition tree is denoted by decreasing values of α; lack of abstraction during generalization (0-abstraction level at the bottom of the generalization hierarchy) corresponds to the 1-cut of the similarity relation (α = 1.0), and can be denoted as $S_{1.0}$.
An advantage of attribute-oriented generalization with OODBs using similarity relations is that such a hierarchy is implicit in the object-oriented fuzzy model and can be extracted automatically, even by a user who has no background knowledge about the particular domain. Experienced analysts not satisfied with an existing similarity relation may then define their own similarity tables in user views to better reflect their knowledge about the attribute values.

Table 1. Proximity table for a domain HAIR COLOR

              Black   Dark Brown   Auburn   Red   Blond   Bleached
Black          1.0       0.8        0.7     0.7    0.5      0.5
Dark Brown     0.8       1.0        0.7     0.7    0.5      0.5
Auburn         0.7       0.7        1.0     0.8    0.5      0.5
Red            0.7       0.7        0.8     1.0    0.5      0.5
Blond          0.5       0.5        0.5     0.5    1.0      0.8
Bleached       0.5       0.5        0.5     0.5    0.8      1.0
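The extraction of a partition tree from the α-cuts of a similarity relation can be sketched in a few lines. The matrix is transcribed from Table 1 and taken to be symmetric (so the Black/Red entry is 0.7 in both directions); the `partition` function and variable names are hypothetical.

```python
VALUES = ["black", "d.brown", "auburn", "red", "blond", "bleached"]
ROWS = [
    [1.0, 0.8, 0.7, 0.7, 0.5, 0.5],
    [0.8, 1.0, 0.7, 0.7, 0.5, 0.5],
    [0.7, 0.7, 1.0, 0.8, 0.5, 0.5],
    [0.7, 0.7, 0.8, 1.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5, 1.0, 0.8],
    [0.5, 0.5, 0.5, 0.5, 0.8, 1.0],
]
S = {(a, b): ROWS[i][j]
     for i, a in enumerate(VALUES) for j, b in enumerate(VALUES)}

def partition(values, s, alpha):
    """Equivalence classes induced by the alpha-cut of a similarity relation.
    Because the relation is transitive, comparing a value with a single
    representative of each class is sufficient."""
    classes = []
    for v in values:
        for cls in classes:
            if s[v, cls[0]] >= alpha:
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes

# Decreasing alpha corresponds to increasing abstraction.
tree = {alpha: partition(VALUES, S, alpha) for alpha in (1.0, 0.8, 0.7, 0.5)}
for alpha, classes in tree.items():
    print(alpha, classes)
```

Each partition in `tree` refines the one obtained at the next lower α, which is the nesting that makes the sequence usable as a crisp concept hierarchy.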
Figure 2. Partition tree of domain HAIR COLOR for similarity relation (Table 1); the abstraction level increases from α = 1.0 through α = 0.8 and α = 0.7 to α = 0.5
The only difference between the partition tree in Figure 2 and a crisp concept hierarchy is the lack of abstract concepts used as labels characterizing the sets of generalized (grouped) concepts. In our example, we could generalize the values blond and bleached to one common descriptor BLONDISH, auburn and red to REDDISH, and black and dark brown to DARKISH (to maintain consistency of the naming convention at the first level of abstraction). At the next level of the generalization hierarchy, we can keep the concept BLONDISH, because there is no change in its components; however, according to the taxonomy presented in Figure 2, the concepts DARKISH and REDDISH should be generalized and should have a new descriptor, which we call DARK to emphasize the change. A term ANY is usually placed at the highest level of the concept hierarchy, to emphasize that the name describes all values possibly occurring in the particular domain.
When defining abstract names for generalized sets of attribute values, we need to remember that a lower cut of the similarity relation (smaller values of α) represents a higher abstraction of generalization descriptors. Due to the nested character of partitions resulting from α-cuts of a similarity relation, to specify a complete set of abstract descriptors it is sufficient to choose one value of the attribute per equivalence class partition at each level of the hierarchy, represented by α in Table 2. This is sufficient to build the generalization hierarchy in Figure 3.
Because the similarity relation can generate only a nested sequence of equivalence partitions via a decrease in similarity level, we cannot extract a fuzzy concept hierarchy from the similarity table. The disjoint character of equivalence classes generated from the similarity relation does not allow any concept in the
hierarchy to have more than one direct abstract at any level of the generalization hierarchy. A similarity table can be utilized to form a crisp generalization hierarchy. Such a hierarchy can be successfully applied as a foundation for the development of a fuzzy concept hierarchy. Data-mining analysts can extend the crisp hierarchy with additional edges to represent partial membership of the lower-level concepts in their direct abstract descriptors. Depending on the assigned memberships, reflecting preferences of the user, they can create consistent or inconsistent fuzzy concept hierarchies.

Table 2. Abstract descriptors, for the generalization hierarchy in Figure 2, where abstraction level is represented by value of α

Original Attribute Value   Abstraction Level   Abstract Descriptor
Black                      0.8                 DARKISH
Red                        0.8                 REDDISH
Blond                      0.8                 BLONDISH
Black                      0.7                 DARK
Blond                      0.7                 BLONDISH
Black                      0.5                 ANY

Figure 3. Crisp generalization hierarchy formed using Tables 1 and 2 (α = 1.0: BLACK, D.BROWN, AUBURN, RED, BLOND, BLEACHED; α = 0.8: DARKISH, REDDISH, BLONDISH; α = 0.7: DARK, BLONDISH; α = 0.5: ANY)
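The labeling scheme of Table 2, with one representative value per equivalence class and level, could be applied as in the sketch below. The partitions are written out by hand from the partition tree of HAIR COLOR; the `generalize` function and the sample data are hypothetical.

```python
from collections import Counter

# Equivalence classes of HAIR COLOR at each similarity level.
PARTITIONS = {
    0.8: [{"black", "d.brown"}, {"auburn", "red"}, {"blond", "bleached"}],
    0.7: [{"black", "d.brown", "auburn", "red"}, {"blond", "bleached"}],
    0.5: [{"black", "d.brown", "auburn", "red", "blond", "bleached"}],
}

# Table 2: (abstraction level, representative value) -> abstract descriptor.
DESCRIPTORS = {
    (0.8, "black"): "DARKISH", (0.8, "red"): "REDDISH",
    (0.8, "blond"): "BLONDISH",
    (0.7, "black"): "DARK", (0.7, "blond"): "BLONDISH",
    (0.5, "black"): "ANY",
}

def generalize(value, alpha):
    """Find the class containing value, then the label of its representative."""
    for cls in PARTITIONS[alpha]:
        if value in cls:
            for representative in cls:
                if (alpha, representative) in DESCRIPTORS:
                    return DESCRIPTORS[alpha, representative]
    raise KeyError(value)

hair = ["black", "d.brown", "auburn", "bleached", "blond", "red"]
summary = dict(Counter(generalize(v, 0.7) for v in hair))
print(summary)
```

Because the classes at each level are disjoint, every value receives exactly one descriptor, so the object counts are preserved at every level.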
Utilizing Similarity Relations to Define Abstract Concepts
A similarity relation can be interpreted in terms of fuzzy similarity classes S(x) (Zadeh, 1970), where the membership of attribute variables in the class S(x) is equal to the similarity level between these variables and the fuzzy similarity class. In other words, the grade of membership of y in the fuzzy class S(x), denoted by $\mu_{S(x)}(y)$, is xSy. Based on this consideration, one can define abstract concepts by choosing their basic representative attribute values (i.e., typical representative specializers) and then using a similarity table to extract a more precise definition of such abstract classes. For such extraction, we assume a certain level of similarity (α), which should be interpreted as a level of precision reflected in our abstract concept definition. Typically, the more abstract the concepts to be used in data generalization, the less certain experts are at assigning particular lower-level concepts to them; often, some values can be easily generalized to abstracts, but others may raise doubts among experts. In analyzing the problem of imprecise information, it was noted (Dubois & Prade, 1991) that each attribute has a domain (allowed values),
a range (actually occurring values), and a typical range (most common values), and we apply this classification to the generalization process. With an abstract concept we can usually identify its typical direct specializers, the elements clearly belonging to it (e.g., we would probably all agree that black hair can be generalized to the descriptor DARK with 100% accuracy). This can be represented as the core of the fuzzy set (abstract concept). However, there are also lower-level concepts that cannot be definitely assigned to only one of their direct abstracts (e.g., assigning blond fully to the abstract concept LIGHT hair is problematic because there are many people with dark blond hair). We term such cases possible direct specializers: concepts in the group of lower-level concepts characterized by the given abstract descriptor (fuzzy set) with membership µ ≤ 1. These are the support of a fuzzy set and are interpreted as the range of the abstract concept.
Now we define each abstract concept as a set of its typical original attribute values, with the level of doubt about its other possible specializers reflected by the value of α. Then we select the fuzzy similarity class created from the α-cut of the similarity relation for these predefined typical specializers and analyze whether it fits our needs. For instance, define the abstract concept LIGHT hair by the attribute variable bleached with the level of similarity α = 0.8 to spread the range of this abstract descriptor (LIGHT is predefined as the similarity class $\mathrm{BLEACHED}_{0.8}$). From the similarity relation presented in Table 1, we can derive:
$$\mathrm{LIGHT} = \mathrm{BLEACHED}_{0.8} = \{\text{bleached} \mid 1.0;\ \text{blond} \mid 0.8\}$$
Of course, each of the abstract concepts can be defined by more than one typical representative element (in such a case the similarity classes are combined with a union (MAX) operator, although we may also choose an intersection operator, as best fits our preferences). Assume the descriptor DARK to be principally defined by the following original values of the HAIR COLOR domain: black, d.brown, and auburn.
Assuming the similarity level to be 0.7, we would obtain:
$$\mathrm{DARK} = \mathrm{MAX}(\mathrm{BLACK}_{0.7};\ \mathrm{D.BROWN}_{0.7};\ \mathrm{AUBURN}_{0.7}) = \{\text{black} \mid 1.0;\ \text{d.brown} \mid 1.0;\ \text{auburn} \mid 1.0;\ \text{red} \mid 0.8\}$$
Using both of these abstract concepts, with the assumption that only DARK and LIGHT colors occur at the given level of HAIR COLOR generalization, we construct the fuzzy generalization hierarchy (Figure 4).
Figure 4. Simplified fuzzy generalization hierarchy for the attribute HAIR COLOR (ANY generalizes DARK and LIGHT; black, d.brown, and auburn belong to DARK with membership 1.0 and red with 0.8; blond belongs to LIGHT with membership 0.8 and bleached with 1.0)
The hierarchy in Figure 4 is called a simplified fuzzy concept hierarchy, because the fractional memberships of low-level concepts to their abstract descriptors make it similar to the fuzzy concept hierarchy described previously. Each of the original attribute values belongs to only one direct abstract, creating a simplified (crisp-hierarchy-like) structure.
For a contrasting example, define an abstract concept BLACKISH as the α-cut from the similarity table for black at the level 0.7:
$$\mathrm{BLACKISH} = \mathrm{BLACK}_{0.7} = \{\text{black} \mid 1.0;\ \text{d.brown} \mid 0.8;\ \text{auburn} \mid 0.7;\ \text{red} \mid 0.7\}$$
Simultaneously introduce the abstract class BROWNISH at the same α-level:
$$\mathrm{BROWNISH} = \mathrm{D.BROWN}_{0.7} = \{\text{black} \mid 0.8;\ \text{d.brown} \mid 1.0;\ \text{auburn} \mid 0.7;\ \text{red} \mid 0.7\}$$
We can derive the fuzzy concept hierarchy and even modify the generalization model to become consistent through the normalization of derived memberships:
$$\mathrm{BLACKISH} = \mathrm{BLACK}_{0.7} = \{\text{black} \mid \tfrac{1.0}{1.8};\ \text{d.brown} \mid \tfrac{0.8}{1.8};\ \text{auburn} \mid \tfrac{0.7}{1.4};\ \text{red} \mid \tfrac{0.7}{1.4}\} \approx \{\text{black} \mid 0.6;\ \text{d.brown} \mid 0.4;\ \text{auburn} \mid 0.5;\ \text{red} \mid 0.5\}$$
$$\mathrm{BROWNISH} = \mathrm{D.BROWN}_{0.7} = \{\text{black} \mid \tfrac{0.8}{1.8};\ \text{d.brown} \mid \tfrac{1.0}{1.8};\ \text{auburn} \mid \tfrac{0.7}{1.4};\ \text{red} \mid \tfrac{0.7}{1.4}\} \approx \{\text{black} \mid 0.4;\ \text{d.brown} \mid 0.6;\ \text{auburn} \mid 0.5;\ \text{red} \mid 0.5\}$$
Despite the formally correct appearance, this mechanism may be inappropriate. We characterized two new generalization concepts (BLACKISH and BROWNISH) with a low level of imprecision (each had only one typical direct specializer), simultaneously choosing a relatively high degree of abstraction (α = 0.7) when extracting α-cuts from the similarity relation. This resulted in two fuzzy similarity
classes (BLACK_0.7 and D.BROWN_0.7) that overlapped and led to the consistent fuzzy concept hierarchy in Figure 5 (derived through the normalization of membership degrees).

Extracting two fuzzy classes from the similarity table at a similarity level where they are considered equivalent (black and d.brown belong to the same equivalence class of the partition at similarity level 0.7) is formally possible but often not semantically meaningful. This situation may occur when the abstract concepts are characterized incorrectly at the particular level of generalization (which is the case here) or when the similarity relation represents the similarity between these concepts from a perspective that is not compatible with the context represented in the particular generalization hierarchy. It makes no sense to define two or more general concepts at a level of abstraction so high that they are interpreted as identical. This rationale is reflected in the distribution of memberships in the consistent fuzzy concept hierarchy (Figure 5), where both of the introduced abstract concepts have almost identical compositions of direct specializers.

Some guidelines are needed when characterizing abstract concepts via their typical direct specializers and extracting their full definition (the range of possible direct specializers) from a similarity table:

1. We need to ensure that the intuitively assumed value of α extracts the cut (subset) of attribute values that corresponds closely to the definition of the abstract descriptor we are looking for. The strategy for choosing the most appropriate α-cut level when extracting abstract concept definitions follows from the guideline of minimal generalization (the minimal concept tree ascension strategy described in the second section). Based on this strategy, we recommend always choosing the definition extracted at the highest possible level of similarity (largest α) at which all predefined typical components of the desired abstract descriptor are already included (that is, where they first occur together).
[Figure 5 omitted. Recoverable labels: BLACK, D.BROWN, AUBURN, RED, BROWNISH, DARKISH; membership degrees on the links: 0.4, 0.5, 0.6; axis: ABSTRACTION LEVEL.]
Figure 5. Consistent fuzzy concept hierarchy for the attribute HAIR COLOR
2. The problem of selecting appropriate representative elements without external knowledge about a particular attribute remains; however, it can now be supported by the analysis of the values stored in the similarity table. Choosing typical values and then extracting a detailed definition with all possible components of the desired abstract concepts from the similarity table seems to be easier than describing generalized components in detail.
3. Moreover, we should be aware that if the low-level concepts predefined as typical components of a particular abstract descriptor do not occur in a common similarity class, then the contexts of the generalized descriptor and of the similarity relation are not in agreement, and revision of the similarity table (or of the abstract concept) is recommended.
4. We cannot directly impose a restriction that all abstract concepts in the generalization hierarchy be extracted from the similarity relation at the same level of similarity. Moreover, the definitions extracted for the example presented in Figure 4 show that this situation is acceptable. However, when using this approach, we should generally not place abstract descriptors that overlap with others at the same level of the generalization hierarchy. This can easily occur when trying to place an abstract concept defined via the original concepts on a given level of the generalization hierarchy when it is already represented on that level by a generalized concept derived from the equivalence class partition with a higher similarity level. We have to remember that the abstract concepts derived from the similarity relation have a nested character, and placing one concept alongside another may lead to partially overlapping partitions (because one is actually a refinement of the other), which contradicts the character of a similarity relation.
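The first guideline above (pick the largest α at which all typical components fall into one similarity class) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity table `SIM`, the value `blond`, and the function names are hypothetical and are not the chapter's data or notation.

```python
from itertools import combinations

# Hypothetical attribute domain and similarity table (illustrative values only,
# chosen to be max-min transitive; this is NOT the chapter's similarity table).
DOMAIN = ["black", "d.brown", "auburn", "red", "blond"]
SIM = {
    ("black", "d.brown"): 0.7,
    ("black", "auburn"): 0.5,
    ("black", "red"): 0.5,
    ("d.brown", "auburn"): 0.5,
    ("d.brown", "red"): 0.5,
    ("auburn", "red"): 0.8,
    ("black", "blond"): 0.0,
    ("d.brown", "blond"): 0.0,
    ("auburn", "blond"): 0.0,
    ("red", "blond"): 0.0,
}


def sim(a, b):
    """Symmetric, reflexive lookup in the similarity table."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))


def similarity_class(seed, alpha):
    """Values linked to seed through pairs with similarity >= alpha."""
    members, frontier = {seed}, [seed]
    while frontier:
        current = frontier.pop()
        for value in DOMAIN:
            if value not in members and sim(current, value) >= alpha:
                members.add(value)
                frontier.append(value)
    return members


def minimal_generalization(typical):
    """Largest alpha whose similarity class contains every typical value."""
    typical = set(typical)
    seed = sorted(typical)[0]
    levels = sorted({sim(a, b) for a, b in combinations(DOMAIN, 2)} | {1.0},
                    reverse=True)
    for alpha in levels:
        members = similarity_class(seed, alpha)
        if typical <= members:
            return alpha, members
    return 0.0, set(DOMAIN)


alpha, members = minimal_generalization({"black", "auburn"})
print(alpha, sorted(members))
```

If the returned α is very low (or zero), the typical values never share a meaningful similarity class, which is the situation the third guideline warns about.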
The approach described here seems to allow us to form only flat (one-level) generalization hierarchies or to derive the generalized concepts at the first level of abstraction in the concept hierarchy. Each abstract concept defined with this method is a generalization of original attribute values and therefore cannot be placed at a higher level of the concept hierarchy. However, there is no obstacle preventing these concepts from being further generalized. The inability to derive multilevel hierarchical structures does not prevent this approach from being appropriate, and actually convenient, for rapid data summarization or for something we term “selective attribute-oriented generalization.” To summarize a given data set, we may prefer not to perform gradual (hierarchical) generalization but to replace it with a one-level hierarchy covering the whole domain of attribute values. Such an appropriately built “flat hierarchy” would represent the majority of dependencies between the original low-level concepts, which are to be generalized, by the propagation of fractions of counts
coming from each attribute value, instead of having to perform detailed hierarchical generalization.

In selective generalization, we generalize all attribute values from a specific point of view, which is dictated by the character of the data-mining task. Assume that we are interested in association rules regarding only people who have dark hair. Using the similarity relation, we derive the following:

DARKISH = MAX(BLACK_0.7; D.BROWN_0.7) = {black|1.0; d.brown|1.0; auburn|0.7; red|0.7}

This reflects the following interpretation: All people who have black or dark brown hair are considered to have DARKISH hair, and 70% of redheads and people with auburn hair have it in a dark shade. This is sufficient to explain the difference between selective generalization and the application of data selection when building the initial data-mining class G0. In both cases, we omit all objects whose hair color lies entirely outside DARKISH; however, in the case of selective generalization, 70% of each count represented by each object with red or auburn hair color remains. This is obviously not equivalent to extracting all objects with the values “red” and “auburn” and then randomly choosing 70% of them for further generalization. With selective generalization, we do not omit the objects but decrease their influence to an appropriate representation of their importance for the given data-mining problem.

We should finally point out that consistent fuzzy hierarchies are not appropriate tools for selective attribute-oriented generalization. In this case, we do not want to normalize the count values in order to preserve exact counts; instead, we want to preserve the unbalanced relation between the objects, as this reflects dependencies occurring in real-life data. Ordinary fuzzy hierarchies seem to be the most appropriate for such purposes.

Although we focused on nonnumeric data in this discussion of fuzzy concept hierarchies, the generalization of numeric attributes can be performed in a similar manner.
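The count scaling used in selective generalization can be illustrated with a short Python sketch. The DARKISH memberships are the ones quoted in the text; the tuple counts, the value `blond`, and the function name are hypothetical and serve only as an example.

```python
# DARKISH as quoted in the text; values not listed have membership 0.
DARKISH = {"black": 1.0, "d.brown": 1.0, "auburn": 0.7, "red": 0.7}

# Hypothetical number of objects per hair color (illustrative data only).
counts = {"black": 10, "d.brown": 5, "auburn": 10, "red": 20, "blond": 8}


def selective_generalization(value_counts, concept):
    """Scale each value's count by its membership in the abstract concept.

    Values with membership 0 are dropped; all others keep a fraction of
    their count instead of being sampled or removed.
    """
    weighted = {}
    for value, count in value_counts.items():
        membership = concept.get(value, 0.0)
        if membership > 0.0:
            weighted[value] = membership * count
    return weighted


weighted = selective_generalization(counts, DARKISH)
darkish_count = sum(weighted.values())
print(weighted, darkish_count)
```

Note that no normalization is applied here: the weighted counts need not add up to the original number of objects, which matches the remark that ordinary (not consistent) fuzzy hierarchies suit this task.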
Of course, the numeric hierarchy can be based on similarity relationships for fuzzy numbers, such as those already developed for fuzzy databases (Buckles & Petry, 1984; Petry, 1996), and used as described above for nonnumeric data. In the case of numeric data, it is also possible to analyze the characteristics of the data distribution, in which case predefined concept hierarchies may not be necessary. For example, consider an income range study in which the incomes can be clustered into several groups, {<20K, 20–35K, 35–45K, 45–50K, >50K}, based on some statistical clustering tool. Obviously, further clustering can be
done on these groupings to form a multilevel hierarchy. Linguistic terms can be assigned to the groups, {very low, low, medium, high, very high}, to provide labels in the hierarchy for generalization. This may be a crisp hierarchy, but it is also possible to formulate a fuzzy hierarchy by techniques such as fuzzy agglomerative clustering (Yager, 2000).
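As a rough illustration of the crisp case, the income groups and linguistic labels from the example can be applied with a simple lookup. The boundary handling (each lower bound treated as inclusive) and the function names are assumptions, since the text does not say which group a boundary value belongs to.

```python
from bisect import bisect_right
from collections import Counter

# Group boundaries (in thousands) and labels from the example in the text.
# Assumption: a boundary value such as 35 falls into the upper group.
BOUNDS = [20, 35, 45, 50]
LABELS = ["very low", "low", "medium", "high", "very high"]


def income_label(income_k):
    """Map an income (in thousands) to its crisp linguistic label."""
    return LABELS[bisect_right(BOUNDS, income_k)]


def generalize_incomes(incomes_k):
    """Replace each income by its label and count the occurrences."""
    return Counter(income_label(x) for x in incomes_k)


# Hypothetical incomes, in thousands.
summary = generalize_incomes([12, 27, 31, 40, 47, 80])
print(summary)
```

A fuzzy variant would let an income near a boundary contribute fractional counts to two neighboring labels; that extension is not shown here.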
Generalization of Structured Data Values and Class Hierarchies

In general, we may have complex structured data, such as set- and list-valued data or data with nested structures. First, let us consider an attribute that may be multi- or set-valued. Each value in a set can first be generalized into its higher-level concept. For example, if we have the multivalued attribute “skills” for an employee, we might have the set of values {German, Programming, Pilot}. If the next level in a concept hierarchy were to classify skills as mental or physical, then we would have the set {(Physical Skills, count_p), (Mental Skills, count_m)}, where count_p is 1 and count_m is 2, but each is scaled as appropriate depending on the type of concept hierarchy being utilized.

For more complex data, we still base the approach on set-valued data as above. A list-valued attribute can be generalized in the same manner for the list elements, except that a generalized form of the list order must be used in the generalization process. For structured data, we can consider the same approach but must evaluate alternatives for generalizing the structure. When generalizing individual attribute values, we may maintain the shape of the structure or provide some generalization of the structure, such as flattening it or removing low-level values and summarizing them.

Recall also that we have a fuzzy class hierarchy:

(o_i, µ(C_i)) ⊆_s (C_i, µ(C_{i+1})) ⊆_s (C_{i+1}, µ(C_{i+2})) ⊆_s ... ⊆_s (C_n, µ(C_{n+1})),

so that when we generalize the object o_i, we must account for the degree of membership in its particular class. This can be done by scaling the object’s count, o_i.count, by the membership µ(C_i) for the current class of o_i. If the generalization of o_i moves up through the hierarchy, then the appropriate weighting must be taken into account for o_i.count.
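The set-valued example and the membership scaling can be combined in a small Python sketch. The skill-to-concept mapping follows the example in the text; the membership value 0.8 and the function name are hypothetical, and the weighting needed when moving further up the class hierarchy is left out because the text does not specify it.

```python
from collections import Counter

# One-level concept hierarchy for "skills", following the example in the text.
SKILL_PARENT = {
    "German": "Mental Skills",
    "Programming": "Mental Skills",
    "Pilot": "Physical Skills",
}


def generalize_set(values, parent, membership=1.0):
    """Generalize a set-valued attribute and count the resulting concepts.

    Each value contributes `membership` to its parent concept, where
    `membership` stands for the object's degree of membership in its class.
    """
    counts = Counter()
    for value in values:
        counts[parent[value]] += membership
    return dict(counts)


skills = {"German", "Programming", "Pilot"}

# Object that fully belongs to its class.
crisp = generalize_set(skills, SKILL_PARENT)

# Object that belongs to its class with a hypothetical degree of 0.8.
scaled = generalize_set(skills, SKILL_PARENT, membership=0.8)
print(crisp, scaled)
```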
Conclusions

We considered in detail the issues relative to concept hierarchies for attribute generalization, as the use of a concept hierarchy is the essential component of the generalization process. As we have seen, there are several approaches that can be taken depending on the exact intention of the data-mining application. This allows one to be more flexible in dealing with fuzzy objects in the similarity-based fuzzy OODB model we described, in particular, due to the ability to create hierarchies from the given similarity relationships for the data domains.

There are several directions that can be profitably followed in this area for OODBs that we have not considered to date. Two of particular interest that we are currently studying are the issues of generalization of methods and the use of aggregation as a structuring mechanism. As an application area, the problem of generalization of multimedia data, especially spatial data (Ladner, Petry, & Cobb, 2003), in a fuzzy OODB is of particular interest. Also, we have been considering the extension of fuzzy hierarchy development in a database utilizing proximity relationships (Angryk & Petry, 2003) and plan on extending the fuzzy OODM to accommodate generalization via proximity relations.
Acknowledgments

We would like to thank the Naval Research Laboratory’s Base Program, Program Element No. 0602435N, for sponsoring this research.
References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (pp. 207–216). New York: ACM Press.
Angryk, R., & Petry, F. (2003). Consistent fuzzy concept hierarchies for attribute generalization. In Proceedings IASTED International Conference on Information and Knowledge Sharing (IKS 2003) (pp. 158–193).
Angryk, R., & Petry, F. (2003). Data mining fuzzy databases using attribute-oriented generalization. In Proceedings of the IEEE International Conference Data Mining Workshop on Foundations and New Directions in Data Mining (pp. 8–15). Melbourne, FL.
Au, W., & Chan, K. (2003). Mining fuzzy association rules in a bank-account database. IEEE Transactions on Fuzzy Systems, 11(2), 238–248.
Bertino, E., & Martino, L. (1991). Object-oriented database management systems: Concepts and issues. IEEE Computer, 24, 65–81.
Bezdek, J. (1974). Cluster validity with fuzzy sets. Journal of Cybernetics, 3, 58–72.
Bordogna, G., Leporati, A., Lucarella, D., & Pasi, G. (2000). The fuzzy object-oriented database management system. In G. Bordogna & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 209–236). Heidelberg: Physica-Verlag.
Bosc, P., & Pivert, O. (2001). On some fuzzy extensions of association rules. In Proceedings of IFSA-NAFIPS 2001 (pp. 1104–1109). Piscataway, NJ: IEEE Press.
Buckles, B., & Petry, F. (1982). A fuzzy representation for relational data bases. International Journal of Fuzzy Sets and Systems, 7, 213–226.
Buckles, B., & Petry, F. (1984). Extending the fuzzy database with fuzzy numbers. Information Sciences, 34, 45–55.
Cao, T. (2001). Uncertain inheritance and recognition as probabilistic default reasoning. International Journal of Intelligent Systems, 16, 781–803.
Carter, C., & Hamilton, H. (1998). Efficient attribute-oriented generalization for knowledge discovery from large databases. IEEE Transactions on Knowledge and Data Engineering, 10(2), 193–208.
Chaudhri, A., & Loomis, M. (Eds.). (1998). Object databases in practice. New York: Prentice Hall.
Chen, G., Wei, Q., & Kerre, E. (2000). Fuzzy data mining: Discovery of fuzzy generalized association rules. In G. Bordogna & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 45–66). Heidelberg: Physica-Verlag.
Cross, V., & Firat, A. (2000). Fuzzy objects for geographical information systems. International Journal of Fuzzy Sets and Systems, 113, 19–36.
Cubero, J., Medina, J., Pons, O., & Vila, M. (1999). Data summarization in relational databases through fuzzy dependencies. Information Sciences, 121(3–4), 233–270.
de Caluwe, R. (Ed.). (1997). Fuzzy and uncertain object-oriented databases: Concepts and models. Singapore: World Scientific.
de Graaf, J., Kosters, W., & Witteman, J. (2001). Interesting fuzzy association rules in quantitative databases. In Principles of Data Mining and Knowledge Discovery, LNAI 2168 (pp. 140–151). Heidelberg: Springer-Verlag.
de Tre, G., de Caluwe, R., & Van der Cruyssen, B. (2000). A generalized object-oriented database model. In G. Bordogna & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 155–182). Heidelberg: Physica-Verlag.
Delgado, M., Marin, N., Sanchez, D., & Vila, M. (2003). Fuzzy association rules: General model and applications. IEEE Transactions on Fuzzy Systems, 11(2), 214–225.
Dubois, D., & Prade, H. (2000). Fuzzy sets in data summaries - outline of a new approach. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1035–1040). Madrid, Spain.
Dubois, D., Prade, H., & Rossazza, J. (1991). Vagueness, typicality and uncertainty in class hierarchies. International Journal of Intelligent Systems, 6, 167–183.
Feelders, A., Daniels, H., & Holsheimer, M. (2000). Methodological and practical aspects of data mining. Information and Management, 37, 271–281.
Feng, L., & Dillon, T. (2003). Using fuzzy linguistic representations to provide explanatory semantics for data warehouses. IEEE Transactions on Knowledge and Data Engineering, 15(1), 86–102.
George, R., Buckles, B., Petry, F., & Yazici, A. (1992). Uncertainty modeling in object-oriented geographical information systems. In 1992 Proceedings of Conference on Database & Expert System Applications (pp. 77–86). Heidelberg: Springer-Verlag.
George, R., Buckles, B., & Petry, F. (1993). Modeling class hierarchies in the fuzzy object-oriented data model. International Journal of Fuzzy Sets and Systems, 60, 259–272.
Gyenesei, A. (2001a). A fuzzy approach for mining quantitative association rules. Acta Cybernetica, 15, 305–320.
Gyenesei, A. (2001b). Interestingness measures for fuzzy association rules. In Principles of Data Mining and Knowledge Discovery, LNAI 2168 (pp. 152–164). Heidelberg: Springer-Verlag.
Han, J. (1995). Mining knowledge at multiple concept levels. In Proceedings of the Fourth International Conference on Information and Knowledge Management (pp. 19–24). New York: ACM Press.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Diego, CA: Academic Press.
Han, J., Nishio, S., & Kawano, H. (1994). Knowledge discovery in object-oriented and active databases. In F. Fuchi & T. Yokoi (Eds.), Knowledge building and knowledge sharing (pp. 221–230). Singapore: IOS Press.
Han, J., Nishio, S., Kawano, H., & Wang, W. (1998). Generalization-based data mining in object-oriented databases using an object-cube model. Data and Knowledge Engineering, 25(1–2), 55–97.
Hilderman, R., Hamilton, H., & Cercone, N. (1999). Data mining in large databases using domain generalization graphs. Journal of Intelligent Information Systems, 13(3), 195–234.
Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87, 1575–1599.
Kacprzyk, J. (1999). Fuzzy logic for linguistic summarization of databases. In Proceedings of the Eighth International Conference on Fuzzy Systems (pp. 813–818). Seoul, Korea.
Kacprzyk, J., & Zadrozny, S. (2000). On combining intelligent querying and data mining using fuzzy logic concepts. In G. Bordogna & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 67–81). Heidelberg: Physica-Verlag.
Khoshafian, S., & Copeland, G. (1986). Object identity. In Proceedings of the OOPSLA ’86 Conference (pp. 406–416). New York: ACM Press.
Kim, W. (1989). A model of queries for object-oriented databases. In Proceedings of 15th International Conference on Very Large Databases (pp. 45–54).
Koyuncu, M., & Yazici, A. (2003). IFOOD: An intelligent fuzzy object-oriented database architecture. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1137–1154.
Kuok, C., Fu, A., & Wong, H. (1998). Mining fuzzy association rules in databases. ACM SIGMOD Record, 27, 41–46.
Ladner, R., Petry, F., & Cobb, M. (2003). Fuzzy set approaches to spatial data mining of association rules. Transactions on GIS, 7(1), 123–138.
Laurent, A., Bouchon-Meunier, B., Doucet, A., Gancarski, S., & Marsala, C. (2000). Fuzzy data mining from multidimensional databases. Studies in Fuzziness and Soft Computing, 54, Proceedings of ISCI (pp. 245–256).
Lee, D., & Kim, M. (1997). Database summarization using fuzzy ISA hierarchies. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 27(1), 68–78.
Lee, J., Xue, N., Hsu, K., & Yang, J. (1999). Modeling imprecise requirements with fuzzy objects. Information Sciences, 118, 101–119.
Lee, K. (2001). Mining generalized fuzzy quantitative association rules with fuzzy generalization hierarchies. In Proceedings of IFSA-NAFIPS 2001 (pp. 2977–2982). Piscataway, NJ: IEEE Press.
Ma, Z., Zhang, W., Ma, W., & Chen, G. (2001). Conceptual design of fuzzy object-oriented databases using extended entity–relationship model. International Journal of Intelligent Systems, 16, 697–711.
Ma, Z., Zhang, W., & Ma, W. (2004). Extending object-oriented databases for fuzzy information modeling. To appear in Information Systems.
Marín, N., Vila, M., & Pons, O. (2000). Fuzzy types: A new concept of type for managing vague structures. International Journal of Intelligent Systems, 15, 1061–1085.
Morris, A., Petry, F., & Cobb, M. (1998). Fuzzy object-oriented database modeling of spatial data. In Proceedings IPMU Conference (pp. 604–611). Paris: EDK Press.
Pasi, G., & Yager, R. (1999). Calculating attribute values using inheritance structures in fuzzy object-oriented data models. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(4), 556–564.
Petry, F. (1996). Fuzzy databases: Principles and applications. Boston, MA: Kluwer Academic Publishers.
Raschia, G., & Mouaddib, N. (2002). SAINTETIQ: A fuzzy set-based approach to database summarization. Fuzzy Sets and Systems, 129, 137–162.
Shu, J., Tsang, E., & Yeung, D. (2001). Query fuzzy association rules in relational databases. In Proceedings of IFSA-NAFIPS 2001 (pp. 2989–2993). Piscataway, NJ: IEEE Press.
Yager, R. (1991). On linguistic summaries of data. In G. Piatetsky-Shapiro & W. Frawley (Eds.), Knowledge discovery in databases (pp. 347–363). Boston, MA: MIT Press.
Yager, R. (2000). Intelligent control of the hierarchical agglomerative clustering process. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 30(6), 835–845.
Zadeh, L. (1970). Similarity relations and fuzzy orderings. Information Sciences, 3, 177–200.
Zhang, W. (1999). Mining fuzzy quantitative association rules. In Proceedings of IEEE International Conference on Tools with Artificial Intelligence (pp. 99–102). Piscataway, NJ: IEEE Press.
Zicari, R. (1990). Incomplete information in object-oriented databases. SIGMOD RECORD, 19, 33–40.
Chapter IV
FRIL++ and Its Applications

J. M. Rossiter
University of Bristol, UK & Bio-Mimetic Control Research Center, The Institute of Physical and Chemical Research (RIKEN), Japan

T. H. Cao
Ho Chi Minh City University of Technology, Vietnam
Abstract

We introduce a deductive probabilistic and fuzzy object-oriented database language, called FRIL++, which can deal with both probability and fuzziness. Its foundation is a logic-based probabilistic and fuzzy object-oriented model where a class property (i.e., an attribute or a method) can contain fuzzy set values, and uncertain class membership and property applicability are measured by lower and upper bounds on probability. Each uncertainly applicable property is interpreted as a default probabilistic logic rule, which is defeasible, and probabilistic default reasoning on fuzzy events is proposed for uncertain property inheritance and class recognition. The design, implementation, and basic features of FRIL++ are presented. FRIL++ can be used as both a modeling and a programming language, as demonstrated by its applications to machine learning, user modeling, and modeling with words herein.
Introduction

For modeling real-world problems and constructing intelligent systems, the integration of different methodologies and techniques has been the focus of significant interdisciplinary research effort. The advantage of such a hybrid system is that the strengths of its components are combined and compensate for each other’s weaknesses. In particular, object orientation provides a hierarchical data abstraction scheme and an information hiding and inheritance mechanism; probabilistic/fuzzy reasoning provides measures and rules for representing and reasoning with uncertainty and imprecision in the real world; and logic programming provides a declarative way for problem specification and well-founded semantics for formal reasoning.

However, research on combining all three modeling and computing paradigms appears to be sporadic. Eiter et al. (2001) developed an algebra to handle object bases with uncertainty, where conditional probabilities for an object of a class being a member of its subclasses are given, and membership of an object to a class is expressed by a probability value, but fuzzy values are not allowed in class properties. Meanwhile, many fuzzy object-oriented models have been developed, such as those of Bordogna et al. (1999), George et al. (1993), Itzkovich and Hawkes (1994), Rossazza et al. (1997), and Van Gyseghem and De Caluwe (1997), but they are not deductive. Yazici and George (1999) present a deductive fuzzy object-oriented model that, however, does not address uncertain applicability of properties. In Dubitzky et al. (1999), each property of a concept is assumed to have a probability degree for it occurring in exemplars of that concept. However, the method therein for computing a membership degree of an object to a concept, based on matching the object’s properties with the uncertainly applicable properties of the concept, is in our view not justifiable. Also, the work does not address the problem of how inheritance is performed under membership and applicability uncertainty. Recently, Blanco et al. (2001) and De Tré (2001) sketched general models to manage different sources of imprecision and uncertainty, including probabilistic ones, on various levels of an object-oriented database model. However, no foundation was laid for integrating probability theory and fuzzy logic in cases where probability is used to represent uncertainty. Cross (2003) reviewed existing proposals and presented recommendations for the application of fuzzy set theory in a flexible generalized object model.

In this chapter, we summarize the main features of a logic-based probabilistic and fuzzy object-oriented model where a class property can contain fuzzy sets
interpreted as families of probability distributions, and uncertain class membership and property applicability are measured by lower and upper bounds on probability. On the basis of this model, we present the development of FRIL++, which extends FRIL (Baldwin et al., 1995) with object-oriented features, as a modeling and programming language for probabilistic and fuzzy object-oriented deductive databases and knowledge bases, in the same way as predicate logic programming languages (e.g., Datalog) have been used for classical deductive databases and knowledge bases. Various applications of FRIL++ are then demonstrated. The next section presents the logic-based probabilistic and fuzzy object-oriented model. In the following section, we introduce probabilistic default reasoning and its application to fuzzy events as a suitable approach to uncertain property inheritance and class recognition. We then present our solutions for uncertain inheritance of attributes, uncertain inheritance of methods, and uncertain recognition of classes. Subsequent sections present the implementation and the basic features of FRIL++. In the final two sections, we present our application of FRIL++ to machine learning, user modeling, and modeling with words. Finally, we conclude the chapter and suggest future research.
Probabilistic and Fuzzy Object-Oriented Model

As in the classical object-oriented model, a class is represented by a finite set of properties. A property is either an attribute or a method. The model we are introducing is logic-based, and attributes and methods are represented by Horn-like clauses.

In the classical object-oriented model, each object is certainly a member of a class, and all properties of a class certainly apply to its objects. However, in the real world, such membership and applicability can be uncertain. Moreover, attribute values can be more imprecise than ones expressible by intervals. Arguing for flexible modeling, Van Gyseghem and De Caluwe (1997) introduced the notion of fuzzy property as an intermediate between the two extreme notions of required property and optional property. Each fuzzy property of a class is associated with possibility degrees of applicability of the property to the class. Recently, Dubitzky et al. (1999) addressed the issue by contrasting the prototype concept model with the classical model. A severe defect of the classical concept model is noted by the fact that there is no commonly agreed set of defining (i.e., necessary and sufficient) properties for many natural, scientific, artificial, and
ontological concepts. Rather, each property of a concept is assumed to have a probability degree for it occurring in exemplars of that concept.

Here, we propose that uncertain class membership and property applicability be represented by support pairs defining probability lower and upper bounds, as in FRIL (Baldwin et al., 1995), a logic programming language that handles both probability and fuzziness. Specifically, in this probabilistic and fuzzy object-oriented model, each attribute in a class C has the following form:

ψ [l, u]

where ψ is a fuzzy atom, that is, a predicate with argument values that can be fuzzy sets, and l, u ∈ [0, 1] are interpreted as l ≤ Pr(ψ | C) ≤ u. We assume that Pr(ψ | ¬C) is unknown. Similarly, each method has the following form:

ψ ← φ [l1, u1] [l2, u2]

where ψ is a fuzzy atom, φ is a conjunction of fuzzy atoms, and l1, u1, l2, u2 ∈ [0, 1] are interpreted as l1 ≤ Pr(ψ | φ, C) ≤ u1 and l2 ≤ Pr(ψ | ¬φ, C) ≤ u2. We also assume that Pr(ψ | φ, ¬C) and Pr(ψ | ¬φ, ¬C) are unknown.

For a class hierarchy in a probabilistic and fuzzy object-oriented system, we assume that a class is totally subsumed by any of its superclasses, or, in other words, a class totally subsumes any of its subclasses. This assumption is discussed in detail in Cao (2001). The totally subsuming subclass relation imposes a constraint on membership degrees of an object to classes as stated in the following assumptions (Cao & Creasy, 2000):

1. If an object is a member of a class with some positive characteristic degree, then it is a member of any superclass of that class with the same degree.
2. If an object is a member of a class with some negative characteristic degree, then it is a member of any subclass of that class with the same degree.
As a consequence of this subsumption assumption, if an object is a member of a class with a support pair [l, u], then it is a member of any superclass of that class with the support pair [l, 1], and a member of any subclass of that class with the support pair [0, u]. This is in agreement with Rossazza et al. (1997), for instance,
who stated that the membership degree of an object to a class is at least equal to its membership degree to a subclass of that class. In fact, if C1 is a subclass of C2, then Pr(C1) ≤ Pr(C2).
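The consequence stated above (a support pair [l, u] for a class becomes [l, 1] for its superclasses and [0, u] for its subclasses) can be sketched as follows. This is plain Python rather than FRIL++ syntax, and the class names, the `PARENT` table, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportPair:
    """Lower and upper probability bounds [lower, upper] within [0, 1]."""
    lower: float
    upper: float

    def __post_init__(self):
        if not 0.0 <= self.lower <= self.upper <= 1.0:
            raise ValueError(f"invalid support pair [{self.lower}, {self.upper}]")


# Hypothetical class hierarchy: class -> direct superclass (None for the root).
PARENT = {"ANIMAL": None, "BIRD": "ANIMAL", "PENGUIN": "BIRD", "SPARROW": "BIRD"}


def superclasses(cls):
    """All proper superclasses of cls, nearest first."""
    result = []
    current = PARENT[cls]
    while current is not None:
        result.append(current)
        current = PARENT[current]
    return result


def subclasses(cls):
    """All proper subclasses of cls."""
    return [c for c in PARENT if cls in superclasses(c)]


def propagate(cls, pair):
    """Derive membership support pairs implied by total subsumption."""
    memberships = {cls: pair}
    for sup in superclasses(cls):
        memberships[sup] = SupportPair(pair.lower, 1.0)  # [l, 1]
    for sub in subclasses(cls):
        memberships[sub] = SupportPair(0.0, pair.upper)  # [0, u]
    return memberships


memberships = propagate("BIRD", SupportPair(0.6, 0.8))
for name, pair in memberships.items():
    print(f"{name}: [{pair.lower}, {pair.upper}]")
```

Classes that are neither superclasses nor subclasses of the given class receive no derived bound in this sketch.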
Probabilistic Default Reasoning on Fuzzy Events

A well-known fundamental problem in object-oriented modeling is that of multiple inheritance, that is, how to combine the same property inherited from different classes. For example, one can have the property fly[.9, .95] in a class BIRD, expressing that 90% to 95% of birds can fly. At the same time, one can also have fly[0, .05] in a class PENGUIN, expressing that at most 5% of penguins can fly. Given that PENGUIN is a subclass of BIRD, the problem is that a penguin has two support pairs for its property fly, namely, [.9, .95] from BIRD and [0, .05] from PENGUIN, which are inconsistent with each other because [.9, .95] ∩ [0, .05] = ∅. One may say that, in this case, [0, .05] overrides [.9, .95], but such a simple solution is not adequate. For instance, how would we deal with the case when an object is not certainly a penguin, or when such support pairs come from classes neither of which is a subclass of the other?

For the general case, there are two extreme solutions to the problem. The most pessimistic one is to assume that no given Pr(p | C) is applicable to a specific object, and that Pr(p | C, E), where E is a set of evidence, has to be used instead. The most optimistic one is to assume that any Pr(p | C) remains valid when applied to a specific object, and thus multiple answers are combined by conjunction. The drawback of the most pessimistic solution is that, in general, we have no knowledge of Pr(p | C, E). For instance, to obtain it from Pr(p | C) using the total probability theorem

Pr(p | C) = Pr(p | C, E) × Pr(E | C) + Pr(p | C, ¬E) × Pr(¬E | C),

one must know at least Pr(E | C). Meanwhile, the drawback of the most optimistic solution is that it often leads to inconsistency. Between these two extreme approaches lies default reasoning (Geffner & Pearl, 1992).
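The inconsistency produced by the most optimistic solution is easy to reproduce: combining support pairs by conjunction amounts to intersecting intervals. The following Python sketch uses the BIRD and PENGUIN pairs from the text; the function name and the second pair of intervals are hypothetical.

```python
def intersect(pair_a, pair_b):
    """Intersect two support pairs given as (lower, upper) tuples.

    Returns None when the intervals do not overlap, that is, when the
    inherited bounds are inconsistent.
    """
    lower = max(pair_a[0], pair_b[0])
    upper = min(pair_a[1], pair_b[1])
    return (lower, upper) if lower <= upper else None


fly_from_bird = (0.9, 0.95)    # fly[.9, .95] in BIRD
fly_from_penguin = (0.0, 0.05)  # fly[0, .05] in PENGUIN

conflict = intersect(fly_from_bird, fly_from_penguin)
print(conflict)  # no common value exists, so the result is None

# Two hypothetical pairs that do overlap combine into a tighter interval.
compatible = intersect((0.6, 0.9), (0.8, 1.0))
print(compatible)
```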
The basic idea of default reasoning is to consider a set of rules as defeasible, so that when they are inconsistent with particular evidence, only selected consistent subsets of the set are used for inference. The selection of consistent subsets relies on a priority ordering among the default rules and on a preference ordering, based on that priority ordering, among the subsets of the set. An early work applying default reasoning to inheritance and recognition is that of Shastri (1989), which is based on the principle of maximum entropy for resolving conflicting information. That work, however, has the shortcomings that inheritance is performed only with certain membership of objects in classes, and that recognition just selects the class considered to be the best match for an object rather than providing different membership degrees of the object to different classes. Also, only class attributes, not class methods, are considered therein.

Recently, Lukasiewicz (2000) extended classical default reasoning to probabilistic default reasoning and showed that the latter is intractable in the general case. The computational complexity is mainly due to checking consistency and performing global inference on a probabilistic knowledge base. So, in applying that framework to uncertain inheritance and recognition for probabilistic and fuzzy object-oriented systems, we propose an approximation using Jeffrey’s rule (Jeffrey, 1965) and its inverse for a weaker notion of consistency and for local inference, in order to reduce the computational complexity.

A probabilistic default theory is defined to be a pair (T, D), where T is a set of formulas to be always satisfied, and D is a set of defaults. Each formula in T or D has the form (ψ | φ)[l, u], expressing l ≤ Pr(ψ | φ) ≤ u. When φ = true, we simply write ψ[l, u], and when l = u = 1, we may write (ψ | φ) only.

The main characteristics of a default reasoning system are its priority ordering and preference ordering. A priority ordering, denoted by ≺, is an irreflexive and transitive binary relation on D. Given a model M, let D_M = {d ∈ D | M satisfies d}. A model M is said to be preferred to a model M* iff (i.e., if and only if) D_M* ≠ D_M and ∀d* ∈ D_M*\D_M ∃d ∈ D_M\D_M*: d* ≺ d. A model M is called a preferred model iff there is no model preferred to M. A subset D* of D is said to be in conflict with a default (ψ | φ)[l, u] iff T ∪ (D* ∪ {(ψ | φ)[l, u]}) ∪ {φ} is inconsistent. A priority ordering is said to be admissible iff every subset D* ⊆ D in conflict with a default d ∈ D contains a default d* such that d* ≺ d.

A formula F is a default consequence of an evidence set E iff, for every admissible priority ordering, every preferred model of T ∪ E is a model of F. The admissibility of a priority ordering guarantees that if (ψ | φ)[l, u] ∈ D and E = {φ} (i.e., only φ is known), then (ψ | φ)[l, u] is a default consequence of E.

As proven in Cao (2001), this definition of default consequence can be equivalently restated in terms of preferred default subsets instead of preferred models. For every D1, D2 ⊆ D, D1 is said to be preferred to D2 iff T ∪ D1 ∪ E is consistent, and either (1) T ∪ D2 ∪ E is inconsistent, or (2) D1 ≠ D2 and ∀d ∈ D2\D1 ∃d* ∈ D1\D2: d ≺ d*. For every D1 ⊆ D, D1 is called a preferred default subset of D iff there is no D2 ⊆ D preferred to D1; in particular, D1 is preferred to D2 if D2 ⊂ D1 and T ∪ D1 ∪ E is consistent. Then a formula F is a default consequence of E iff, for every admissible priority ordering and every preferred default subset D* of D, F is a logical consequence of T ∪ D* ∪ E.
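To illustrate the definition of preferred default subsets, the following Python sketch enumerates all subsets of a tiny default set by brute force. It is only a toy under stated assumptions: the consistency test for T ∪ D* ∪ E is supplied by the caller as a plain predicate rather than derived from a probabilistic knowledge base, a single priority ordering is fixed in advance, and all names (is_preferred, preferred_subsets, bird_fly, penguin_fly) are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Set, Tuple

Subset = FrozenSet[str]


def all_subsets(defaults: List[str]) -> List[Subset]:
    """Enumerate every subset of the defaults (exponential; toy sizes only)."""
    return [frozenset(c)
            for r in range(len(defaults) + 1)
            for c in combinations(defaults, r)]


def is_preferred(d1: Subset, d2: Subset,
                 consistent: Callable[[Subset], bool],
                 lower: Set[Tuple[str, str]]) -> bool:
    """True iff d1 is preferred to d2; (d, e) in lower means d has lower priority than e."""
    if not consistent(d1):
        return False
    if not consistent(d2):          # condition (1)
        return True
    if d1 == d2:
        return False
    # condition (2): every default only in d2 is dominated by one only in d1
    return all(any((d, e) in lower for e in d1 - d2) for d in d2 - d1)


def preferred_subsets(defaults: List[str],
                      consistent: Callable[[Subset], bool],
                      lower: Set[Tuple[str, str]]) -> List[Subset]:
    """Subsets to which no other subset is preferred."""
    subsets = all_subsets(defaults)
    return [s for s in subsets
            if not any(is_preferred(o, s, consistent, lower) for o in subsets)]


# Toy case: the evidence says the object is a penguin, so the two defaults
# cannot both hold; the more specific PENGUIN default has higher priority.
defaults = ["bird_fly", "penguin_fly"]
lower = {("bird_fly", "penguin_fly")}
consistent = lambda s: not {"bird_fly", "penguin_fly"} <= s

result = preferred_subsets(defaults, consistent, lower)
```

In this toy case the only preferred default subset is the one containing the PENGUIN default alone, which matches the intuition that the more specific class wins when the defaults conflict.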
The practical significance of this latter definition is that one needs to consider only the preferred default subsets, and deduction on them, in order to obtain default consequences. Specifically, if P1, P2, ..., Pn are all the preferred default subsets of D and, for every i from 1 to n, Fi is a logical consequence of T ∪ Pi ∪ E, then F1 ∨ F2 ∨ ... ∨ Fn is a default consequence of E. In particular, with Fi ≡ ψ[li, ui] for every i from 1 to n, ψ[l, u] is a default consequence of E, where [l, u] = ∪i=1,n [li, ui], that is, l = min i=1,n {li} and u = max i=1,n {ui}.

For fuzzy events characterized by fuzzy sets, in this work, we apply the voting model interpretation of fuzzy sets (Baldwin et al., 1995; Gaines, 1978), whereby, given a fuzzy set A on a domain U, each voter has a subset of U as his or her own crisp definition of the concept that A represents. The membership function value µA(u) is then the proportion of voters whose crisp definitions include u. As such, A defines a probability distribution on the power set of U across the voters, and thus a fuzzy proposition “x is A” defines a family of probability distributions of the variable x on U. Fuzzy events are said to be consistent with each other iff the intersection of their characterizing fuzzy sets is a normal fuzzy set (i.e., one with a maximal membership function value of 1). Baldwin et al. (1995, 1996) describe the conditioning operations over fuzzy sets and the tractable calculation of the expected fuzzy set used in this default reasoning framework.
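Two of the ingredients above lend themselves to a short Python sketch: combining the support pairs obtained from several preferred default subsets (l as the minimum lower bound, u as the maximum upper bound), and reading a membership value as a proportion of voters. The function names and the sample voter data are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set, Tuple


def hull(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine support pairs [l_i, u_i] into [min l_i, max u_i]."""
    pairs = list(pairs)
    return (min(l for l, _ in pairs), max(u for _, u in pairs))


def voting_membership(votes: List[Set[int]], element: int) -> float:
    """Proportion of voters whose crisp definition includes the element."""
    return sum(1 for v in votes if element in v) / len(votes)


# Four hypothetical voters, each giving a crisp set of heights (in cm)
# that they would call "tall".
votes = [{180, 185}, {175, 180, 185}, {185}, {170, 175, 180, 185}]
tall_membership = {h: voting_membership(votes, h) for h in (170, 175, 180, 185)}

# Support pairs derived from two preferred default subsets.
combined = hull([(0.2, 0.4), (0.3, 0.7)])
```

Here the resulting fuzzy set is normal, because every voter accepts the height 185.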
Property Inheritance and Class Recognition

In the classical object-oriented model, without exceptions, a class fully inherits all the properties of its superclasses, and thus an object certainly has all the properties of the classes of which it is a member. In the uncertain object-oriented model, due to uncertain applicability of a property and uncertain membership of an object in a class, inheritance naturally becomes uncertain.

The problem of uncertain inheritance can be formalized in the framework of default reasoning as follows. For a particular attribute named ψ, suppose that there are n classes C1, C2, ..., Cn with attributes ψ(A1)[l1, u1], ψ(A2)[l2, u2], ..., ψ(An)[ln, un], respectively, where A1, A2, ..., An are fuzzy sets on the same domain. Then one has the default theory (T, D), where T = {(Ci | Cj) | Cj is a subclass of Ci, 1 ≤ i, j ≤ n} and D = {(ψ(Ai) | Ci)[li, ui] | 1 ≤ i ≤ n}. Also, suppose an evidence set E = {Ci[αi, βi] | 1 ≤ i ≤ n} ∪ {ψ(A0)[l0, u0]}, where each [αi, βi] is a support pair for an object of discourse O being a member of Ci, while ψ(A0)[l0, u0] is a prior attribute given to O. The problem is to derive A such that ψ(A) being applicable to O is a default consequence of E. We assume that E is consistent with T, whereby, if Ci[αi, βi], Cj[αj, βj] ∈ E and (Ci | Cj) ∈ T, then αj ≤ βi, in accordance with the constraint Pr(Cj) ≤ Pr(Ci) mentioned previously.

As presented in the preceding section, default reasoning with respect to a default theory comprises the following main steps:

1. Determine admissible priority orderings on the set of defaults.
2. For each admissible priority ordering, compute preferred default subsets.
3. For each preferred default subset, derive a logical consequence.
As shown in Lukasiewicz (2000), all three steps are intractable in the probabilistic case. The computational complexity is mainly due to checking consistency and performing global inference on a probabilistic knowledge base. In applying that framework to uncertain inheritance for the uncertain object-oriented model, we propose an approximation for default consequences correspondingly as follows:

1. Consider only one priority ordering based on the class specificity ordering.
2. Use a weaker notion of consistency for computing preferred default subsets.
3. Apply local inference using Jeffrey’s rule for deriving logical consequences.

Details are explained below.
Let D be partitioned into D0, D1, ..., Dk such that, for every i and j from 1 to n, if Cj is a subclass of Ci, (ψ(Ai) | Ci)[li, ui] ∈ Ds, and (ψ(Aj) | Cj)[lj, uj] ∈ Dt, then s < t. Intuitively, D0 comprises the defaults for ψ of the classes that are not subclasses of any other; D1 comprises the defaults for ψ of the classes that are the immediate subclasses of those classes; and so on. The priority ordering ≺ is then defined such that d ≺ d* iff d ∈ Ds, d* ∈ Dt, and s < t.

For every i from 1 to n, Jeffrey’s rule gives:

Pr(ψ(Ai)) = Pr(ψ(Ai) | Ci) · Pr(Ci) + Pr(ψ(Ai) | ¬Ci) · Pr(¬Ci)

with li ≤ Pr(ψ(Ai) | Ci) ≤ ui, αi ≤ Pr(Ci) ≤ βi, and Pr(¬Ci) = 1 - Pr(Ci). On the assumption that only 0 ≤ Pr(ψ(Ai) | ¬Ci) ≤ 1 is known, one obtains:

li·αi ≤ Pr(ψ(Ai)) ≤ ui·αi + (1 - αi)

The lower bound is attained with Pr(ψ(Ai) | ¬Ci) = 0 and Pr(Ci) = αi; the upper bound is attained with Pr(ψ(Ai) | ¬Ci) = 1 and Pr(Ci) = αi, because ui·Pr(Ci) + (1 - Pr(Ci)) does not increase as Pr(Ci) grows. That is, O inherits ψ(Ai)[li·αi, ui·αi + (1 - αi)] from each Ci, which can be transformed into ψ(Bi), where Bi is the expected fuzzy set of Ai[li·αi, ui·αi + (1 - αi)]. We note that, in general, lower and upper bounds of Pr(ψ(Ai)) also depend on βi, but not in this case, when Pr(ψ(Ai) | ¬Ci) is unknown.

Let B0 be the expected fuzzy set of A0[l0, u0], and, for every i from 1 to n, let Ai* = Bi ∩ B0. Our notion of weak consistency is now introduced as follows. Let D* be a subset of D. Without loss of generality, assume that D* = {(ψ(Ai) | Ci)[li, ui] | 1 ≤ i ≤ m ≤ n}. Then T ∪ D* ∪ E is said to be w-consistent wrt (i.e., with respect to) ψ iff ∩i=1,m Ai* is a normal fuzzy set. For computing the preferred default subsets of D, instead of considering the subsets of D that are consistent with T and E, one now considers those that are w-consistent wrt ψ with T and E. As such, the preferred default subsets of D can be obtained in the following two steps:

1. Find the largest (wrt ⊆) consistent subsets of {Ai* | 1 ≤ i ≤ n}, that is, subsets in which the intersection of the fuzzy sets is a normal fuzzy set.
2. Compare those consistent subsets to select the ones to which none of the others is preferred, based on the priority ordering on D defined above.
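The two steps above can be sketched in Python on small discrete fuzzy sets. This is a brute-force illustration only, not the algorithm of Dubois et al. (2000): fuzzy sets are represented as dictionaries from domain elements to membership values, intersection is taken as the pointwise minimum, union as the pointwise maximum, and the integer levels stand for the partition index of each default. All names and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

FuzzySet = Dict[str, float]


def intersect(sets: List[FuzzySet]) -> FuzzySet:
    """Pointwise minimum over the common domain."""
    domain = set.intersection(*(set(f) for f in sets))
    return {x: min(f[x] for f in sets) for x in domain}


def is_normal(fs: FuzzySet) -> bool:
    """A fuzzy set is normal if some element has membership 1."""
    return any(abs(m - 1.0) < 1e-9 for m in fs.values())


def maximal_w_consistent(named: Dict[str, FuzzySet]) -> List[FrozenSet[str]]:
    """Step 1: largest subsets whose intersection is normal (brute force)."""
    names, found = sorted(named), []
    for size in range(len(names), 0, -1):
        for combo in combinations(names, size):
            s = frozenset(combo)
            if any(s < big for big in found):
                continue  # a proper subset of an already accepted subset
            if is_normal(intersect([named[n] for n in combo])):
                found.append(s)
    return found


def select_preferred(subsets: List[FrozenSet[str]],
                     level: Dict[str, int]) -> List[FrozenSet[str]]:
    """Step 2: keep the subsets to which no other subset is preferred."""
    def preferred(d1, d2):
        return d1 != d2 and all(
            any(level[d] < level[e] for e in d1 - d2) for d in d2 - d1)
    return [s for s in subsets if not any(preferred(o, s) for o in subsets)]


# Hypothetical sets A1*, A2*, A3* on a three-element domain.
stars = {
    "A1": {"low": 1.0, "mid": 0.5, "high": 0.0},
    "A2": {"low": 1.0, "mid": 1.0, "high": 0.2},
    "A3": {"low": 0.0, "mid": 0.3, "high": 1.0},
}
level = {"A1": 0, "A2": 1, "A3": 0}  # A2 comes from a more specific class

candidates = maximal_w_consistent(stars)
chosen = select_preferred(candidates, level)

# The inherited fuzzy set A is the union (pointwise maximum) of the
# intersections obtained from the chosen subsets.
parts = [intersect([stars[n] for n in s]) for s in chosen]
inherited = {x: max(p[x] for p in parts) for x in parts[0]}
```

In this example the candidates are {A1, A2} and {A3}; the first is preferred because A3 has lower priority than A2, so the inherited fuzzy set is the intersection of A1* and A2*.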
The multiple-inherited attribute for O is then ψ(A), where A is the union of the intersection fuzzy sets obtained from the preferred default subsets. The reason for taking only the largest consistent subsets in Step 1 is that, as noted previously, a consistent set of defaults is always preferred to its proper subsets. For this step, we employ the algorithm in Dubois et al. (2000), which has computational complexity O(n²); that work also shows that the maximal number of the consistent subsets is n. For the second step, as shown in Cao (2001), each comparison takes time proportional to the sizes of the two involved subsets, while the number of comparisons is of the square order of the number of the consistent subsets. Because the maximal size and the maximal number of the consistent subsets are n, the computational complexity of this step is O(n³). Thus, the overall computational complexity of the above multiple inheritance procedure is O(n³).

The proposal for uncertain inheritance of attributes presented above can be extended to uncertain inheritance of methods as follows. Let C1, C2, ..., Cn be the classes that contain methods with the same head ψ. For each i from 1 to n, let the set of those methods in Ci be {ψ(Aiq) ← φiq [liq1, uiq1][liq2, uiq2] | 1 ≤ q ≤ mi}, and denote ∪q=1,mi {(ψ(Aiq) | φiq, Ci)[liq1, uiq1], (ψ(Aiq) | ¬φiq, Ci)[liq2, uiq2]} by Si. We now consider each Si as an elementary default. Then one has the default theory (T, D), where T = {(Ci | Cj) | Cj is a subclass of Ci, 1 ≤ i, j ≤ n}, and D = {Si | 1 ≤ i ≤ n}. Also, suppose an evidence set E = {Ci[αi, βi] | 1 ≤ i ≤ n} ∪ S0, where S0 = ∪q=1,m0 {(ψ(A0q) | φ0q)[l0q1, u0q1], (ψ(A0q) | ¬φ0q)[l0q2, u0q2]}. Here each [αi, βi] is a support pair for an object of discourse O being a member of Ci, while S0 gives prior methods to O.

For a priority ordering ≺, D is also partitioned into D0, D1, ..., Dk in a similar way as in the case of uncertain inheritance of attributes. That is, for every i and j from 1 to n, if Cj is a subclass of Ci, Si ∈ Ds, and Sj ∈ Dt, then s < t; and S ≺ S* iff S ∈ Ds, S* ∈ Dt, and s < t.

Suppose that ψ(A) ← φ [l1, u1] [l2, u2] is a method in class C and [α, β] is a support pair for an object of discourse O being a member of C. Jeffrey’s rule gives:

Pr(ψ(A)) = Pr(ψ(A) | φ, C) · Pr(φ, C) + Pr(ψ(A) | ¬φ, C) · Pr(¬φ, C) + Pr(ψ(A) | φ, ¬C) · Pr(φ, ¬C) + Pr(ψ(A) | ¬φ, ¬C) · Pr(¬φ, ¬C)

One then obtains the lower bound x and the upper bound y for Pr(ψ(A)), as proved in Cao (2001), as follows:

1. x = max{l2·α, l1·α - (l1 - l2)·(1 - Pr(φ)min)} if l2 ≤ l1, or x = max{l1·α, l2·α - (l2 - l1)·Pr(φ)max} otherwise.
2. y = 1 - max{(1 - u1)·α, (1 - u2)·α - (u1 - u2)·Pr(φ)max} if u2 ≤ u1, or y = 1 - max{(1 - u2)·α, (1 - u1)·α - (u2 - u1)·(1 - Pr(φ)min)} otherwise.
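The bounds x and y can be transcribed directly into a small Python sketch, together with the bounds Pr(φ)min and Pr(φ)max for a conjunctive body as stated in the text that follows this list. The function names are hypothetical. Using the lower bounds of the conjuncts for Pr(φ)min and their upper bounds for Pr(φ)max is an assumption made here for interval-valued conjuncts.

```python
from typing import List, Tuple


def conjunction_bounds(lowers: List[float],
                       uppers: List[float]) -> Tuple[float, float]:
    """Pr(phi)min and Pr(phi)max for phi = phi_1 and ... and phi_k.

    Assumption: lowers[i] <= Pr(phi_i) <= uppers[i].
    """
    k = len(lowers)
    pr_min = max(0.0, sum(lowers) - (k - 1))
    pr_max = min(uppers)
    return pr_min, pr_max


def method_bounds(l1: float, u1: float, l2: float, u2: float,
                  alpha: float, pr_min: float, pr_max: float
                  ) -> Tuple[float, float]:
    """Lower bound x and upper bound y of Pr(psi(A)), as in the list above."""
    if l2 <= l1:
        x = max(l2 * alpha, l1 * alpha - (l1 - l2) * (1 - pr_min))
    else:
        x = max(l1 * alpha, l2 * alpha - (l2 - l1) * pr_max)
    if u2 <= u1:
        y = 1 - max((1 - u1) * alpha, (1 - u2) * alpha - (u1 - u2) * pr_max)
    else:
        y = 1 - max((1 - u2) * alpha, (1 - u1) * alpha - (u2 - u1) * (1 - pr_min))
    return x, y


# When both support pairs coincide, the body plays no role and the bounds
# reduce to the attribute case [l*alpha, u*alpha + (1 - alpha)].
x_same, y_same = method_bounds(0.9, 1.0, 0.9, 1.0, 0.8, 0.3, 0.6)

# With full membership and a body that certainly holds, the first
# support pair is recovered.
x_true, y_true = method_bounds(0.9, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0)

body_min, body_max = conjunction_bounds([0.8, 0.7], [0.9, 1.0])
```

The first check is a useful sanity test: a method whose two support pairs are equal behaves like an attribute with that support pair.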
Then the combination of ψ(A) obtained from different methods in different classes can also be carried out as a multiple inheritance of an attribute, as presented previously. For the computation of Pr(φ)min and Pr(φ)max in the above expressions, suppose that φ is a conjunction of ϕ1, ϕ2, ..., ϕk. One has:

Pr(φ)min = max{0, Pr(ϕ1) + Pr(ϕ2) + ... + Pr(ϕk) - (k - 1)}
Pr(φ)max = min{Pr(ϕ1), Pr(ϕ2), ..., Pr(ϕk)}

For every i from 1 to k, suppose that ϕi = ϕi(Ai) is in φ and ϕi(Bi) is the final multiple-inherited attribute for O with respect to ϕi, where Ai and Bi are fuzzy sets on the same domain. Then lower and upper bounds of Pr(ϕi), from which Pr(φ)min and Pr(φ)max can be evaluated, are the lower and upper bounds of the conditional probability Pr(Ai | Bi) as introduced previously.

The uncertain recognition problem can be regarded as the inverse of the uncertain inheritance problem. It can be stated as follows: given an object having a set of properties associated with support pairs, derive support pairs for that object being a member of the classes having that set of properties. Default reasoning can also be applied to combine the derived support pairs to be consistent with the subclass relation between the classes. In this case, we consider only uncertain recognition based on attributes. Cao (2001) described uncertain class recognition within the proposed default reasoning framework.
Implementation of FRIL++

The probabilistic and fuzzy object-oriented model presented above provides a formal basis for the design and implementation of FRIL++ (Baldwin et al., 2000; Cao et al., 2002; Cao et al., 2001; Rossiter et al., 2000), the object-oriented extension of FRIL (Baldwin et al., 1995), a PROLOG-like logic programming language dealing with both probability and fuzziness. Like any other object-oriented system, a FRIL++ system is associated with a class hierarchy. Besides particular classes for the domain of the system, there is a special class, namely, FRIL++, which is common to all FRIL++ systems. The class FRIL++ is at the top of a class hierarchy and contains all FRIL++ built-in predicates, which can be inherited by all classes in a FRIL++ system. As in Moss (1994), objects are also treated as classes situated at the bottom of a FRIL++ class hierarchy, so that they can have their own properties, which may not be defined in any class. The reason for this is that, in reality, a class can describe only a finite set of common properties of a group of objects, which may have other properties. Furthermore, in FRIL++, an object can change not only in the values of its properties, but also in the properties themselves, which can be added or deleted, as happens in the real world.

In McCabe (1992), object-oriented logic programs were translated into normal logic programs of a logic programming system, such as Prolog, to be executed by the theorem prover of the system. In order to employ FRIL’s probabilistic and fuzzy theorem prover, we follow this approach in the implementation of FRIL++ by writing a compiler, in FRIL, that translates a FRIL++ source program into a FRIL target program to be executed by FRIL. Following McCabe (1992), the execution of an object-oriented logic program is considered as having two phases, namely, the label phase and the body phase.
In the label phase, the system determines the actual classes with definitions for the currently called property that are to be executed. Then, once those classes are determined, the system enters the body phase to execute the property as defined in the bodies of the classes. Corresponding to the label phase and the body phase are the label clauses and body clauses of a target program, which is a normal logic program translated from an object-oriented logic program. The label clauses perform inheritance, providing entry points to the definitions that are to be executed for a property call. Meanwhile, the body clauses are the translation of the definitions of class properties.

However, due to uncertain class membership and uncertain property applicability, there are important differences between the classical object-oriented model and the uncertain one, which is outside the scope of McCabe (1992):

1. In the classical model, an object as an instance of a class inherits properties only from that class or its superclasses. In the uncertain model, because an object can be a partial member of a class, it can partially (i.e., with uncertainty degrees) inherit properties from any class.
2. In the classical model, a property of a class is fully applicable to every object of the class. In the uncertain model, that applicability can be uncertain, and, moreover, an associated uncertainty degree is not determinable at translation time if the membership degree of an object in a class can change at run time.
These differences affect the translation of an uncertain object-oriented logic program, as compared with that of a classical one, for both the label clauses and the body clauses. In the classical object-oriented model, with overriding inheritance, an object or a class does not inherit properties from its superclasses if it has its own definitions of those properties. In the uncertain case, from the point of view of default reasoning, the properties from the superclasses could still be inherited as long as they are consistent with those defined in the object or the class. On the one hand, for a FRIL++ program to behave in the same way as a classical object-oriented program when there is no uncertainty involved, we adopt overriding inheritance as a default. That is, a property (possibly associated with a support pair) in an object or a class is assumed to override properties of the same name in its superclasses. On the other hand, we provide FRIL++ with built-in predicates for combining multiple-inherited properties in a user-defined way, including the default reasoning one presented above.

In the uncertain case, the uncertain membership of an object in a class raises a new issue regarding overriding inheritance. Specifically, if an object is not a full member of a class, then the question is whether a property that the object inherits from the class would override properties of the same name in superclasses of the class. In FRIL++, we assume that overriding inheritance is effective only with full membership, i.e., with the support pair (1 1). As such, if the membership degree of an object in a class can change at run time, overriding inheritance is not determinable at translation time. Therefore, in order to gain execution efficiency at run time, we distinguish static objects from dynamic objects: the membership degrees of the former in classes cannot be changed after they are created, so overriding inheritance for them can be determined at translation time.
Basic Features of FRIL++

FRIL’s syntax is Lisp-like, with the list as the primary data and program structure. A FRIL atom, i.e., a predicate, has the following form:

(predicate-name arg1 arg2 ... argN)

where the values of arg1, arg2, ..., argN can be fuzzy sets. The form of a FRIL clause is as follows:

(h-atom b-atom1 b-atom2 ... b-atomN) : supp

where h-atom is the head and b-atom1, b-atom2, ..., b-atomN are the body of the clause. Meanwhile, supp is either (l1 u1) or ((l1 u1) (l2 u2)), representing support pairs for the clause; the default values of (l1 u1) and (l2 u2) are (1 1) and (0 1), respectively.

A FRIL++ program, which contains class definitions and logical clauses, has the same list format as a FRIL program. A FRIL++ class definition contains the following sections: superclass declaration, constant declaration, part declaration, and property definition. The superclass section declares the immediate superclasses of the class. The constant section declares the constant labels and their values associated with the class, which can be inherited or overridden as can class properties. The part section declares the identifiers and classes of the objects to be included as parts of an instance of the class. The property section defines the properties (i.e., attributes and methods) of the class, which are represented by logical clauses, as presented previously. Constants, parts, and properties can have one of the visibility modifiers public, protected, or private, as in C++ (Stroustrup, 1997), with public as the default.

For an example, we use the following simple class hierarchy:

PERSON
    TALLMAN
        TALLNOTSLIMMAN
        TALLNOTFATMAN
The definitions of the classes are written in the following FRIL++ program:

((public class Person extends (Universal))
  (constants
    (tall [0:0 1.5:0 1.8:1 2.5:1])
    (notSlim [0:1 16:1 22:0 28:1 45:1])
    (notFat [0:1 22:1 28:0 45:0]))
  (properties
    ((height _))
    ((weight _))
    ((bodyMassIndex B) (height H) (times H H H2) (weight W) (times B H2 W))
    ((Person H W) (setprop ((height H))) (setprop ((weight W))))
  ))

((public class TallMan extends (Person))
  (properties
    ((handsome)) : (.9 1)
    ((isa TallMan) (height H) (match tall H))
  ))

((public class TallNotSlimMan extends (TallMan))
  (properties
    ((handsome)) : (0 .5)
    ((isa TallNotSlimMan) (isa TallMan) (bodyMassIndex B) (match notSlim B))
  ))

((public class TallNotFatMan extends (TallMan))
  (properties
    ((isa TallNotFatMan) (isa TallMan) (bodyMassIndex B) (match notFat B))
  ))

((public class MainClass extends (Universal))
  (properties
    ((main)
      (new John ((Person 1.75 70)))
      (qs ((John.handsome)))
      (new Bill ((Person 1.75 85)))
      (qs ((Bill.handsome))))
  ))

In the class PERSON, the constant section declares the fuzzy sets that define the linguistic labels tall, notSlim, and notFat. Here, [0:0 1.5:0 1.8:1 2.5:1] represents the fuzzy set on [0, 2.5] with a membership function that takes value 0 on [0, 1.5], value 1 on [1.8, 2.5], and is linearly increasing on [1.5, 1.8]; [0:1 16:1 22:0 28:1 45:1] represents the fuzzy set on [0, 45] with a membership function that takes value 1 on [0, 16] and [28, 45], is linearly decreasing on [16, 22], and linearly increasing on [22, 28]; [0:1 22:1 28:0 45:0] represents the fuzzy set on [0, 45] with a membership function that takes value 1 on [0, 22], value 0 on [28, 45], and is linearly decreasing on [22, 28].

The property bodyMassIndex defines the body mass index of a person given his or her height and weight. The property Person is a constructor for initializing the properties of a new object of the class PERSON. The isa properties in the classes TALLMAN, TALLNOTSLIMMAN, and TALLNOTFATMAN are methods for computing support pairs for an object being a member of these classes. There, match is a FRIL++ built-in predicate that computes the conditional probability of its first argument given its second argument. We note that in FRIL++, isa properties are placed in their respective classes just for better readability of a program. Logically, however, they belong to the universal class, as mentioned previously, in which every object has full membership (1 1).
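The piecewise-linear reading of the fuzzy set notation can be checked with a short Python sketch. The functions parse_fuzzy_set and membership are hypothetical helpers, not FRIL++ predicates; treating values outside the listed domain as taking the nearest endpoint value, and evaluating the conjunction in the isa clause as a product, are assumptions made here for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (domain value, membership value)


def parse_fuzzy_set(text: str) -> List[Point]:
    """Parse the bracketed notation, e.g. '[0:0 1.5:0 1.8:1 2.5:1]'."""
    pairs = text.strip("[] ").split()
    return [(float(x), float(m)) for x, m in (p.split(":") for p in pairs)]


def membership(points: List[Point], x: float) -> float:
    """Linear interpolation between consecutive points."""
    if x <= points[0][0]:
        return points[0][1]   # assumption: clamp below the domain
    if x >= points[-1][0]:
        return points[-1][1]  # assumption: clamp above the domain
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    raise ValueError("points must be sorted by domain value")


tall = parse_fuzzy_set("[0:0 1.5:0 1.8:1 2.5:1]")
not_slim = parse_fuzzy_set("[0:1 16:1 22:0 28:1 45:1]")

# A person of height 1.75 and weight 70, as created in the main property.
height, weight = 1.75, 70.0
bmi = weight / (height * height)

tall_degree = membership(tall, height)        # about 0.833
not_slim_degree = membership(not_slim, bmi)   # about 0.143
tall_not_slim_degree = tall_degree * not_slim_degree  # product: an assumption
```

Under the product assumption, the resulting degree of about 0.119 agrees with the lower support for this person being a member of TALLNOTSLIMMAN reported later in the text.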
The property ((handsome)) : (.9 1) in the class TALLMAN expresses that “At least 90% of tall men are handsome.” Meanwhile, ((handsome)) : (0 .5) in the class TALLNOTSLIMMAN expresses that “At most 50% of men who are tall and not slim are handsome.”
The property main in the class MAINCLASS provides the entry point for executing a FRIL++ program. In this example, John is created as a person of height 1.75 and weight 70, and a support pair for him being a handsome man is computed by the FRIL++ built-in support query qs. Similarly, Bill is created as a person of height 1.75 and weight 85, and a support pair for him being a handsome man is computed. As such, John is a member of TALLMAN and TALLNOTSLIMMAN with the support pairs [.833, 1] and [.119, 1], respectively, and thus inherits handsome[.75, 1] from TALLMAN and handsome[0, .941] from TALLNOTSLIMMAN. So the support pair for John being handsome is [.75, 1] ∩ [0, .941] = [.75, .941]. Meanwhile, Bill is a member of TALLMAN and TALLNOTSLIMMAN with the support pairs [.833, 1] and [.799, 1], respectively, and thus inherits handsome[.75, 1] from TALLMAN and handsome[0, .601] from TALLNOTSLIMMAN. In this case, because [.75, 1] ∩ [0, .601] = [] and ((handsome)) : (0 .5) in TALLNOTSLIMMAN is assumed to have a higher priority than ((handsome)) : (.9 1) in TALLMAN, the support pair for Bill being handsome is [0, .601], using default reasoning.
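The figures in this example can be reproduced with a short Python sketch that applies the inherited support pair [l·α, u·α + (1 - α)] derived earlier and then combines the two inherited pairs. The function names are hypothetical, and the combination rule shown covers only this two-class case, in which the subclass default is taken to have higher priority when the pairs conflict.

```python
from typing import Tuple

Pair = Tuple[float, float]


def inherited_pair(l: float, u: float, alpha: float) -> Pair:
    """Support pair inherited from a class with property support [l, u],
    given lower membership support alpha of the object in that class."""
    return (l * alpha, u * alpha + (1 - alpha))


def combine(super_pair: Pair, sub_pair: Pair) -> Pair:
    """Intersect the two pairs; on conflict, keep the subclass pair."""
    lower = max(super_pair[0], sub_pair[0])
    upper = min(super_pair[1], sub_pair[1])
    return (lower, upper) if lower <= upper else sub_pair


# Membership supports taken from the text (lower bounds only are needed).
john_tall, john_tall_not_slim = 0.833, 0.119
bill_tall, bill_tall_not_slim = 0.833, 0.799

john = combine(inherited_pair(0.9, 1.0, john_tall),
               inherited_pair(0.0, 0.5, john_tall_not_slim))
bill = combine(inherited_pair(0.9, 1.0, bill_tall),
               inherited_pair(0.0, 0.5, bill_tall_not_slim))
```

Up to rounding, john evaluates to the pair [.75, .941] and bill to [0, .601], in agreement with the values stated above.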
FRIL++ for Machine Learning

Machine learning has become an important area of artificial intelligence, which allows computers to acquire knowledge automatically or semiautomatically, i.e., to learn from experience, in order to perform a particular task well. Fuzzy set theory and fuzzy logic have been applied in this area, using soft partitions defined by fuzzy sets on attribute domains, enhancing the transparency of the acquired knowledge and the performance of existing machine-learning algorithms that use crisp partitions (e.g., Baldwin et al., 1998). Briefly, the better transparency is due to the use of linguistic labels for partitions, while the better performance is due to the tolerance of soft partitions in learning processes.

However, as shown by theoretical results, there is no best learning algorithm for all tasks. That was the motivation of Kohavi et al. (1996) when developing MLC++ to help choose appropriate algorithms for a particular task, by comparing different ones and creating new algorithms, and especially by combining existing ones. It exploits the advantages of the object-oriented methodology, which are information encapsulation and hiding organized in class hierarchies, using C++ to build a library of different components of a machine-learning system. In particular, learning algorithms are categorized into classes to be compared or combined with each other. Inspired by that work, we used FRIL++ to develop a similar system for fuzzy machine learning.
In Mitchell (1997), machine learning is described as a process of constructing computer programs from training experience. Using “knowledge” as an umbrella term, we view such computer programs and experience as knowledge represented in different forms. A machine learner is then a kind of knowledge processor, namely, an inducer, that induces knowledge in a high-level form, such as rule bases, from knowledge in a low-level form, such as relational data tables. A tester of machine learning is another kind of knowledge processor that operates on the knowledge induced by a machine learner to evaluate its learning performance. Therefore, a machine-learning process can be viewed as involving objects of three classes, KNOWLEDGEBASE, INDUCER, and TESTER, with their main properties as illustrated in Figure 1. The class KNOWLEDGEBASE can be divided into subclasses as depicted in Figure 2. Meanwhile, INDUCER can be placed in the hierarchy of knowledge processor classes in Figure 3, with different subclasses of fuzzy logic-based, Bayesian network-based, neural network-based, and support vector machine-based machine-learning algorithms. This object-oriented view allows us to incrementally develop a toolkit for machine learning in particular and knowledge processing in general.
Figure 1. Three main classes of objects in machine learning
[Diagram: INDUCER (Induction Parameters, Induction Method); KNOWLEDGEBASE (Content, Input/Output, Edit, Query); TESTER (Testing Parameters, Testing Method)]
Figure 2. Hierarchy of knowledge bases
[Diagram: class hierarchy rooted at KNOWLEDGEBASE, with nodes RELATIONALKB, DEDUCTIVEKB, and GRAPHKB, and the further classes DATATABLE, RULEBASE, DECISIONTREE, and CONCEPTUALGRAPH]
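Figure 3. Hierarchy of knowledge processors
[Diagram: class hierarchy rooted at KNOWLEDGEPROCESSOR, with nodes INDUCER, DEDUCER, and ABDUCER; the inducer classes FLBASED, BNBASED, NNBASED, and SVMBASED; and the further classes DATABROWSER and FUZZYID3]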
We used FRIL++ to implement a particular class of fuzzy machine-learning techniques called data browser (Baldwin & Martin, 1995), as shown in Figure 3, which learns fuzzy rules from relational data tables. The four main classes of the data browser are DATATABLE, RULEBASE, DATABROWSER, and TESTER. Objects of the class DATATABLE are relational data tables used as training or testing data sets, while those of the class RULEBASE are sets of fuzzy rules learned from relational data. The class DATABROWSER implements the fuzzy machine-learning technique that computes frequency distributions of the given values of input attributes with respect to an output attribute in a training data table, and then converts those frequency distributions into fuzzy sets (Baldwin et al., 1995) for the antecedents of the corresponding fuzzy rule. Meanwhile, the class TESTER implements a procedure for testing a learned fuzzy rule base against a testing data table.

The implemented data browser was demonstrated on two well-known benchmark problems in machine learning: the ellipse and the face problems. The ellipse problem is to learn which points are inside an ellipse and which are outside, based on their two-dimensional coordinates. The data browser approach is to learn that by producing two fuzzy rules for the inside and outside points, respectively, in the following forms:

ellipse_point is inside ← (x_coordinate is A) ∧ (y_coordinate is B)
ellipse_point is outside ← (x_coordinate is A) ∧ (y_coordinate is B)

where A and B are fuzzy sets on the respective partitions of the x and y coordinates. In this example, there are 121 training instances and 127 testing instances, and the domain [-1.5, 1.5] of the x and y coordinates is partitioned into 10 equal triangular fuzzy sets with an overlapping degree of 0.5. The obtained accuracy is 96.85%.
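One way to picture such a partition is the following Python sketch. It assumes that "equal triangular fuzzy sets with an overlapping degree of 0.5" means evenly spaced triangles whose neighbours cross at membership 0.5; this reading, and the function name triangular_partition, are assumptions made for illustration and are not taken from the data browser implementation.

```python
from typing import Callable, List, Tuple


def triangular_partition(lo: float, hi: float, n: int
                         ) -> Tuple[List[float], Callable[[float], List[float]]]:
    """Build n evenly spaced triangular fuzzy sets on [lo, hi].

    Each triangle peaks at one of n equally spaced points and falls to 0
    at the neighbouring peaks, so adjacent sets cross at membership 0.5.
    """
    step = (hi - lo) / (n - 1)
    peaks = [lo + i * step for i in range(n)]

    def memberships(x: float) -> List[float]:
        return [max(0.0, 1.0 - abs(x - p) / step) for p in peaks]

    return peaks, memberships


peaks, mu = triangular_partition(-1.5, 1.5, 10)

# Memberships of a sample coordinate in the 10 fuzzy sets; inside the
# domain at most two neighbouring sets are non-zero and they sum to 1.
sample = mu(0.2)

# Halfway between the first two peaks, both neighbours have degree 0.5.
halfway = mu((peaks[0] + peaks[1]) / 2)
```

With such a soft partition, a coordinate value contributes to the frequency counts of two neighbouring fuzzy sets rather than falling into exactly one crisp bin.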
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
FRIL++ and Its Applications 131
The face problem is to learn which faces are male and which are female, based on measurements of 18 attributes of human faces. In this example, there are 138 training instances and 30 testing instances. The attribute domains are partitioned into 20 equal triangular fuzzy sets with an overlapping degree of 0.5. The obtained accuracy is 83.33%.

The following FRIL++ code shows the structures and main properties of the above-mentioned classes of the data browser, and the main class for running the ellipse and face examples:

((public class DataTable extends (RelationalKB))
  (public (parts
    /* A data table is associated with an attribute schema of class AttributeSchema */
    (schema AttributeSchema) ))
  (private (properties
    /* The content of a data table is a list of instances, each of which
       corresponds to a row in the table. Each instance is an object of class Instance */
    ((instance _rowIndex _instObj)) ))
  (public (properties
    /* The number of rows of a data table */
    ((num_row _naturalNumber))
    /* To get an instance of a data table */
    ((get_instance INSTANCE) …. )
    /* To display a data table */
    ((display) …. )
    /* Constructor: constructs a data table from a data file */
    ((DataTable DATA_FILE) …. ) )))

((public class RuleBase extends (DeductiveKB))
  (public (parts
    /* A rule base is associated with an attribute schema of class AttributeSchema */
    (schema AttributeSchema) ))
  (private (properties
    /* The content of a rule base is a list of rules. Each rule is an object of class Rule */
    ((rule _index _ruleObj)) ))
  (public (properties
    /* The number of rules in a rule base */
    ((num_rule _naturalNumber))
    /* To get a rule in a rule base */
    ((get_rule RULE) …. )
    /* To display a rule base */
    ((display) …. ) )))

((public class DataBrowser extends (FlBased))
  (public (properties
    /* To induce a rule base from a data table, given one output (or categorizing)
       attribute and a list of input attributes */
    ((induce DATA_TABLE (OUT_ATTR | IN_ATTR_LIST) RULE_BASE) …. ) )))

((public class Tester extends (Universal))
  (public (properties
    /* To test a rule base on a data table with respect to a given output attribute */
    ((test RULE_BASE DATA_TABLE OUT_ATTR) …. ) )))

((public class MainClass extends (Universal))
  (public (properties
    ((main)
      /* To create a data browser */
      (new myDataBrowser ((DataBrowser)) )
      /* To create a tester */
      (new myTester ((Tester)) )
      /* To run the ellipse example */
      (ellipse_exe)
      /* To run the face example */
      (face_exe) )
    ((ellipse_exe)
      /* To create a training data table */
      (new ellipseTrainTable ((DataTable "ellipse_train")) )
      /* To display the created training data table */
      (ellipseTrainTable.display)
      /* To specify output and input attributes */
      (eq _outAttr ellipse_point)
      (eq _inAttrList (x_coordinate y_coordinate) )
      /* To generate fuzzy set labels and partition input attribute domains */
      (forall ((List.member X _inAttrList))
              ((gensym "ELABEL" S)
               (ellipseTrainTable.(schema).get_attribute X A)
               (A.partition triangle S 10 0.5)) )
      /* To induce fuzzy rules */
      (myDataBrowser.induce ellipseTrainTable (_outAttr | _inAttrList) _ellipseRuleBase)
      /* To display the induced rule base */
      (_ellipseRuleBase.display)
      /* To create a testing data table */
      (new ellipseTestTable ((DataTable "ellipse_test")) )
      /* To display the created testing data table */
      (ellipseTestTable.display)
      /* To test the induced rule base */
      (myTester.test _ellipseRuleBase ellipseTestTable _outAttr))
    ((face_exe)
      /* To create and display a training data table */
      (new faceTrainTable ((DataTable "face_train")) )
      (faceTrainTable.display)
      /* To specify output and input attributes */
      (eq _outAttr class)
      (eq _inAttrList (attribute1 attribute2 attribute3 attribute4 attribute5 attribute6
                       attribute7 attribute8 attribute9 attribute10 attribute11 attribute12
                       attribute13 attribute14 attribute15 attribute16 attribute17 attribute18) )
      /* To generate fuzzy set labels and partition input attribute domains */
      (forall ((List.member X _inAttrList))
              ((gensym "FLABEL" S)
               (faceTrainTable.(schema).get_attribute X A)
               (A.partition triangle S 20 0.5)) )
      /* Learning */
      (myDataBrowser.induce faceTrainTable (_outAttr | _inAttrList) _faceRuleBase)
      (_faceRuleBase.display)
      /* Testing */
      (new faceTestTable ((DataTable "face_test")) )
      (myTester.test _faceRuleBase faceTestTable _outAttr) ) ))).
FRIL++ for User Modeling

In recent years, user modeling has become a major topic of academic and commercial research. This focus has been driven by a combination of two factors: first, the construction of huge databases of information about our daily lives; and second, the desire of organizations to use these data to understand the people they deal with and, hence, to improve their services. In this section, we present a new approach to incremental user recognition in fuzzy environments, where user classification is updated within an object-oriented epistemological model. First, we examine and generalize the FILUM approach to flexible user modeling (Martin, 2000). We then extend Einhorn and Hogarth’s anchor and adjustment method (Hogarth & Einhorn, 1992), derived from a study of human behavior, from the point-value representation of belief and evidence to the case where belief and evidence are imprecise, expressed by subintervals of [0, 1].
User Recognition Problem

There are two main questions to ask when modeling users:

1. How do we generate the appropriate user models?
2. How do we classify a user into appropriate models?
The problem of user recognition centers on the temporal aspect of user behavior. We have a set of known user types $\{U_1, \ldots, U_n\}$, whose behaviors we know and to which we provide a corresponding set of services. An unknown user $u$ at time $t$ behaves in the fashion $b_t$, where a behavior is commonly the outcome of some crisp or fuzzy choice, such as whether or not to buy expensive wine. We wish to determine the similarity of $u$ to each of $U_1, \ldots, U_n$ in order to provide the appropriate service to $u$ at time $t$, and we must repeat this process as $t$ increases. In an object-oriented environment, we construct a hierarchy of $n$ user classes $\{C_1, \ldots, C_n\}$ and try to determine the support $S_t(u \in C_m)$ for user $u$ belonging to user class $C_m$ at time $t$. This support is some function $f$ of the current behavior $b_t$ and the history of behaviors $\{b_1, \ldots, b_{t-1}\}$, as shown more generally in Equation 1.

$$S_t(u \in C_m) = f(\{b_1, \ldots, b_t\}) \tag{1}$$
We can solve this problem at time $t$ if we have the whole behavior series up to $t$. Unfortunately, at time $t + 1$, we will have to do the whole calculation again. Where $t$ is very large, the storage of the whole behavior series and the cost of the support calculation may be too expensive. An alternative approach is to view the support $S_t(u \in C_m)$ as some belief in the statement “user $u$ belongs to class $C_m$”; this belief is updated whenever a new behavior is encountered. This belief updating approach is more economical in space, because the whole behavior series no longer needs to be stored. It is also more efficient in computation, because we now must calculate some function $g$ of just the previous support $S_{t-1}(u \in C_m)$ and the latest behavior $b_t$. This belief updating approach is shown more generally in Equation 2.

$$S_t(u \in C_m) = g(S_{t-1}(u \in C_m), b_t) \tag{2}$$
In this section, we examine the case where belief is represented by a support pair, which is a subinterval of [0, 1].
Simple User Recognition Example

Let us take the example where we classify food consumers into one of the classes CANDYEATER, COOKIEEATER, or CAKEEATER. We may wish to represent these consumer classes in a FRIL++ class hierarchy, as shown in Figure 4. In the same way, a hierarchy can be constructed for the food these consumers eat, as shown in Figure 5. The consumer classes also define the prototypical behaviors of these consumers through the following statements:

• a candy-eater eats lots of candy most of the time
• a cookie-eater eats lots of cookies most of the time
• a cake-eater eats lots of cake most of the time

Figure 4. A consumer class hierarchy (classes shown: PERSON, CONSUMER, CANDYEATER, COOKIEEATER, CAKEEATER)

Figure 5. A food class hierarchy (classes shown: FOOD, SWEETFOOD, CANDY, COOKIE, CAKE)

In a simple representation, we could use a conditional probability interval to represent the “most” qualifier. For example, if we find that the statement “eats lots of candy most of the time” is true for eight or more cases out of every 10 candy-eaters, then we can assign an interval [0.8, 1] to the conditional probability Pr(eats | lots of candy). This approach gives us the following FRIL++ class definition for the class CandyEater:

((public class CandyEater extends (Consumer) )
  (public (properties
    ((eats X) (X.isa Candy) (X.quantity lots )) : (0.8 1) )))

Now consider a new food consumer $u$ who decides whether or not to eat food $x$; we wish to determine $u$’s membership in the classes CANDYEATER, COOKIEEATER, and CAKEEATER. The only information we have is the decision that $u$ made with respect to eating food $x$. We can determine memberships by comparing $u$’s decision with the decision that would be made by a prototypical member of each of the classes CANDYEATER, COOKIEEATER, and CAKEEATER, given food $x$. Food $x$ may be an uncertain member of any or all of the classes CANDY, COOKIE, and CAKE. For example, if $x$ is a sweet iced biscuit, then $x$ is clearly a member of the class COOKIE but may also have nonzero membership in the class CANDY. The remainder of this section is concerned with the case where we wish to update $u$’s membership in the classes CANDYEATER, COOKIEEATER, and CAKEEATER as $u$ chooses whether or not to eat each item of food in the ordered stream $x_1, \ldots, x_n$.
Belief Updating for User Recognition

When a new behavior is encountered, it is interpreted as some evidence for or against the statement “user u belongs to class Cm.” When updating beliefs in response to new evidence, we can evaluate the evidence in two ways. Either we take the evidence to be absolute and update our beliefs to a degree defined entirely by the new evidence, or we take the evidence in the context of our current beliefs and update our beliefs relatively. In this section, we examine the FILUM updating method, which is an absolute belief updating model, and Einhorn and Hogarth’s anchor and adjustment belief revision, which is relative.
Generalized FILUM User Recognition

The FILUM flexible incremental learning approach (Martin, 2000) relies on a moving average to calculate the current support $S_{n+1}$ for a hypothesis, given the previous support $S_n$ and new evidence $x_{n+1}$, as shown in Equation 3. Note that $s(x_{n+1})$ is the support for the given hypothesis provided by evidence $x_{n+1}$:

$$S_{n+1} = \frac{n S_n + s(x_{n+1})}{n + 1} \tag{3}$$

This approach is notable for its inflexibility with regard to the weight given to new evidence. That is, new evidence always has a weight of $1/(n + 1)$, and current belief has a weight of $n/(n + 1)$. A more flexible generalization that can be used to give a higher or lower weighting to new evidence is shown in Equation 4:

$$S_{n+1} = \frac{n^{\lambda} S_n + n^{1-\lambda} s(x_{n+1})}{n^{\lambda} + n^{1-\lambda}} \tag{4}$$

where $\lambda$ would typically lie in the interval [0, 1]. If $\lambda = 1$, we have Equation 3, where current belief is $n$ times as important as new evidence. If $\lambda = 0$, we have an expression that weights new evidence $n$ times as important as current belief. This flexibility may be important in cases where we know that users change their behavior often and must therefore be reclassified quickly.

The advantage of the FILUM approach is its simplicity. It also updates support where evidence is presented as either a support pair or a point value. Disadvantages include the inflexibility of the model and the large primacy bias.
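The updates in Equations 3 and 4 can be sketched in Python as follows. This is only an illustration: the function name filum_update is hypothetical, and applying the update separately to the lower and upper bounds of a support pair is an assumption of this sketch, since the text does not spell out how support pairs are handled.

```python
from typing import Tuple

Support = Tuple[float, float]  # support pair (lower, upper), both in [0, 1]


def filum_update(s_n: Support, evidence: Support, n: int, lam: float = 1.0) -> Support:
    """Generalized FILUM update (Equation 4) applied to each bound of a support pair.

    n is the number of pieces of evidence seen so far (n >= 1).
    lam = 1 gives the plain moving average of Equation 3;
    lam = 0 weights new evidence n times as heavily as current belief.
    """
    w_old = n ** lam          # weight of the current belief
    w_new = n ** (1.0 - lam)  # weight of the new evidence
    total = w_old + w_new
    lower = (w_old * s_n[0] + w_new * evidence[0]) / total
    upper = (w_old * s_n[1] + w_new * evidence[1]) / total
    return (lower, upper)


# Same belief and evidence, two extreme weightings
slow = filum_update((0.6, 0.8), (1.0, 1.0), n=3, lam=1.0)  # moving average
fast = filum_update((0.6, 0.8), (1.0, 1.0), n=3, lam=0.0)  # evidence-heavy
```

A point-valued support can be passed as a degenerate pair such as (0.7, 0.7). With lam = 1 the influence of each new observation shrinks as n grows, which is the primacy bias noted above.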
Anchor and Adjustment Belief Revision

If we are to classify human users, it would seem prudent to look at how humans might perform this classification task. Hogarth and Einhorn have done much work on models of belief updating that bear some relation to human behavior (Einhorn & Hogarth, 1985; Hogarth & Einhorn, 1992). They suggested that the strength of current belief can have a major effect on how new evidence updates that belief. For example, the stronger the belief a person has in the trustworthiness of a friend, the greater the reduction in this belief when the friend commits an act of dishonesty. The typical pattern of behavior is shown in Figure 6. Here the negative evidence $e^-$ has two differing effects depending on how large the belief was before $e^-$ was presented. Likewise, there are two differing effects from the same positive evidence $e^+$.

Figure 6. Order effects in anchor and adjustment (belief plotted against time t for sequences of positive evidence e+ and negative evidence e-)

The anchor and adjustment belief revision model by Hogarth and Einhorn (1992) updates a belief given new evidence through two processes. Equation 5a shows how belief $S_k$ is updated given new negative evidence. Equation 5b shows how the same belief $S_k$ is updated given new positive evidence.

$$S_k = S_{k-1} + \alpha S_{k-1} (s(x_k) - R) \quad \text{for } s(x_k) \le R \tag{5a}$$

$$S_k = S_{k-1} + \beta (1 - S_{k-1}) (s(x_k) - R) \quad \text{for } s(x_k) > R \tag{5b}$$

$R$ is a reference point for determining if the support $s(x_k)$ for evidence $x_k$ is positive or negative, and typically $R = 0$ or $R = S_{k-1}$. The constants $\alpha$ and $\beta$ define how sensitive the model is to negative or positive evidence, respectively, where $0 \le \alpha \le 1$ and $0 \le \beta \le 1$.

Figure 7. Order effects in interval anchor and adjustment (belief plotted against time t for sequences of positive evidence e+ and negative evidence e-)
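Equations 5a and 5b translate directly into a few lines of Python. The sketch below is illustrative only: the function name anchor_adjust is hypothetical, the reference point defaults to the previous belief (one of the two typical choices mentioned above), and the default sensitivities of 0.3 are simply example values.

```python
from typing import Optional


def anchor_adjust(prev: float, support: float,
                  alpha: float = 0.3, beta: float = 0.3,
                  ref: Optional[float] = None) -> float:
    """One step of point-valued anchor and adjustment (Equations 5a and 5b).

    prev    -- previous belief S_{k-1} in [0, 1]
    support -- support s(x_k) given by the new evidence
    ref     -- reference point R; defaults to prev (assumption of this sketch)
    """
    r = prev if ref is None else ref
    if support <= r:
        # Negative evidence (5a): the drop is proportional to the current belief
        return prev + alpha * prev * (support - r)
    # Positive evidence (5b): the rise is proportional to the remaining room 1 - prev
    return prev + beta * (1.0 - prev) * (support - r)


# The same negative evidence reduces a strong belief more than a weak one
drop_strong = 0.8 - anchor_adjust(0.8, 0.0)
drop_weak = 0.4 - anchor_adjust(0.4, 0.0)
```

Comparing drop_strong with drop_weak reproduces the order effect described for Figure 6: the stronger the prior belief, the larger the reduction caused by the same negative evidence.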
Anchor and Adjustment with Interval Supports

Because belief and support in our uncertain environment can be presented as a support pair, we must consider the implications of an interval representation for the anchor and adjustment model. For a piece of evidence $e$ with the associated support pair $[l, u]$, we can view $l$ as the positive evidence associated with $e$ and $1 - u$ as the negative evidence associated with $e$. The general principle is that, given a current belief $[n, p]$ and a piece of evidence with support $[l, u]$, belief increases by an amount proportional to $1 - p$ and belief decreases by an amount proportional to $n$. We can apply Equations 5a and 5b to the support pair to yield Equations 6a to 6d, where $S^-$ and $S^+$ are the lower bound and the upper bound of belief, respectively:

$$S^-_k = S^-_{k-1} + \alpha S^-_{k-1} (s^-(x_k) - R^-) \quad \text{for } s^-(x_k) \le R^- \tag{6a}$$

$$S^-_k = S^-_{k-1} + \beta (1 - S^-_{k-1}) (s^-(x_k) - R^-) \quad \text{for } s^-(x_k) > R^- \tag{6b}$$

$$S^+_k = S^+_{k-1} + \alpha S^+_{k-1} (s^+(x_k) - R^+) \quad \text{for } s^+(x_k) \le R^+ \tag{6c}$$

$$S^+_k = S^+_{k-1} + \beta (1 - S^+_{k-1}) (s^+(x_k) - R^+) \quad \text{for } s^+(x_k) > R^+ \tag{6d}$$

Note that $R^-$ is a reference point for determining if the lower bound of the presented evidence is positive or negative with respect to the lower bound of belief, and $R^+$ is the corresponding reference point for the upper bound of belief. Here, we choose $R^- = S^-_{k-1}$ and $R^+ = S^+_{k-1}$, where $0 \le \alpha \le 1$ and $0 \le \beta \le 1$.

Figure 7 shows the order effects of this interval belief updating model. The precise effects of negative evidence $e^-$ and positive evidence $e^+$ are determined by $\alpha$ and $\beta$, respectively. The effect of new evidence is dependent on the most recent belief only, and not on $t$. This is a known characteristic of the anchor and adjustment model. This recency behavior contrasts with the primacy bias of the FILUM approach.

This new interval version of Hogarth and Einhorn’s belief updating model has a number of advantages over the FILUM method. Recency characteristics allow the anchor and adjustment model to reclassify users quickly. The order effects of this model are related to human behavior, and this seems to be an important consideration when we are recognizing human users. In addition, this method allows us to control the effects of positive and negative evidence separately. This last feature may be especially important in medical user modeling applications, where false-negative classifications have far more serious consequences than false-positive classifications.
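The interval update can be sketched by applying the point-valued rule to each bound of the support pair, with the reference points set to the previous bounds as chosen above. The function names interval_anchor_adjust and track are hypothetical, and the sketch does not enforce that the lower bound stays below the upper bound; it only shows the mechanics of updating over a stream of evidence.

```python
from typing import Iterable, Tuple

Support = Tuple[float, float]  # support pair (lower, upper)


def interval_anchor_adjust(belief: Support, evidence: Support,
                           alpha: float = 0.3, beta: float = 0.3) -> Support:
    """One step of interval anchor and adjustment, bound by bound."""

    def step(prev: float, s: float) -> float:
        r = prev  # reference point taken as the previous bound
        if s <= r:
            return prev + alpha * prev * (s - r)      # negative evidence
        return prev + beta * (1.0 - prev) * (s - r)   # positive evidence

    return (step(belief[0], evidence[0]), step(belief[1], evidence[1]))


def track(belief: Support, stream: Iterable[Support],
          alpha: float = 0.3, beta: float = 0.3) -> Support:
    """Fold a stream of evidence into a belief, keeping only the latest belief."""
    for ev in stream:
        belief = interval_anchor_adjust(belief, ev, alpha, beta)
    return belief


# 60 supporting observations followed by 15 contradicting ones
before_change = track((0.0, 1.0), [(1.0, 1.0)] * 60)
after_change = track(before_change, [(0.0, 0.0)] * 15)
```

Only the latest support pair is stored between observations, and the belief built up over 60 steps is largely undone within 15, which illustrates the recency behavior discussed above.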
Iterated Prisoner’s Dilemma (IPD) Problem in FRIL++

The n-player iterated prisoner’s dilemma problem (Axelrod, 1985) is a good test bed for user recognition because it produces streams of behavior that result from user interactions in pairs. The problem is most easily understood by looking at the noniterated problem with n = 2. Two prisoners are due to be sentenced. Each can choose to cooperate with the other or to defect. If both players cooperate, they will both serve one year. If they both defect, they will both serve three years. If they choose to behave differently, then the defecting player will serve zero years but the cooperating player will serve five years. The iterated problem simply continues the game after each round. A wide range of strategies is possible, ranging from trusting behavior (always cooperate) to defecting behavior (always defect) and including more complex strategies such as conditional cooperation (cooperating unless the opponent’s last m behaviors were defect).

The n-player prisoner’s dilemma is a difficult problem from which to identify user classes, because a single player p interacts with unknown and randomly selected partners. As a result, the behavior stream generated by p is not determined exclusively by the class of p. If we were to construct a class hierarchy of prisoners in FRIL++, it could resemble Figure 8. The subclasses of prisoner are the classes that define prototypical prisoners and their behaviors. The goal of user recognition in this problem is to determine the class of an unknown prisoner from the unknown prisoner’s past and current behaviors. The behaviors of these prototypical prisoners are described in Table 1. An example of a FRIL++ class definition for a prototypical prisoner is given in Rossiter et al. (2001a).

A population of 10 prisoners is created, and a game of 75 rounds is initiated.
Each round involves picking pairs of prisoners at random from the population until none is left and, for each pair, recording the behaviors the prisoners exhibit (defect, cooperate, etc.). From the past history of each player, and using the techniques described earlier (with α = β = 0.3), the players are classified into the five behavior classes. The winning class is taken as the class in which minimum membership (i.e., the lower bound of the membership interval) is greatest. If the winning class matches the actual class in Table 2, then the classification is recorded as a success. To recreate the situation where user behavior changes, the behaviors of all 10 prisoners are changed after 60 rounds, as shown in the third column of Table 2. After this point, the game is continued for 15 rounds.

Figure 8. A class hierarchy for the prisoner’s dilemma problem (classes shown: PERSON, PRISONER, COOPERATIVE, UNCOOPERATIVE, TITFORTAT, RANDOM, RESPD)

Table 1. Behavior classes

Behavior         Description
---------------  -------------------------------------------------------
Cooperative      Always cooperate with opponent
Uncooperative    Always defect against opponent
Tit-for-tat      Cooperate unless last opponent defected
Random           Equal random chance of defect or cooperate
Respd            Defect unless the last six opponents chose to cooperate

Table 2. The prisoner population

Individual   Behavior before 60th round   Behavior after 60th round
----------   --------------------------   -------------------------
1            Random                       Cooperative
2            Random                       Uncooperative
3            Cooperative                  Tit-for-tat
4            Cooperative                  Respd
5            Uncooperative                Random
6            Uncooperative                Respd
7            Tit-for-tat                  Cooperative
8            Tit-for-tat                  Random
9            Respd                        Tit-for-tat
10           Respd                        Uncooperative

Table 3. Classification results

                    Interval anchor and adjustment   FILUM
------------------  ------------------------------   -----
Before 60th round   63.6%                            57.3%
After 60th round    63.3%                            22.2%

We compare classification results using the interval anchor and adjustment belief updating method with the FILUM method described in Martin (2000). The whole process is repeated five times, and the mean of the results is taken. As can be seen from Table 3, classification results before the 60th round (the point of behavior change) are similar between the two methods. After the 60th round, however, there is a marked difference in the results, with a large fall in the performance of the FILUM approach. These results show the primacy effects present in the FILUM method and the recency effects characteristic of the interval anchor and adjustment approach. We highlight these effects as important points to consider when implementing user recognition in any specific user modeling application.

Results from the iterated prisoner’s dilemma test bed suggest that the recency bias of the anchor and adjustment approach is more suitable to the problem of object-oriented user modeling, where the behaviors of users change over time. Future work in this area will consider the cases where user behavior is represented by fuzzy sets, for example, “a user buys a large number of inexpensive items.” More investigation is also needed in determining ranges for the values of $R^-$ and $R^+$ in the interval anchor and adjustment approach.
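The five prototypical behaviors of Table 1, the random pairing performed in each round, and the rule for picking the winning class can be sketched in Python as below. This is not the FRIL++ program used in the experiment: all names (choose, play_round, winning_class) are hypothetical, and the handling of players with fewer than six recorded opponents under the Respd behavior (they defect) is an assumption of the sketch.

```python
import random
from typing import Dict, List, Tuple

C, D = "cooperate", "defect"


def choose(behavior: str, seen: List[str], rng: random.Random) -> str:
    """Pick a move from the moves of previously met opponents (oldest first)."""
    if behavior == "cooperative":
        return C
    if behavior == "uncooperative":
        return D
    if behavior == "tit-for-tat":
        return D if seen and seen[-1] == D else C
    if behavior == "random":
        return rng.choice([C, D])
    if behavior == "respd":
        # Assumption: with fewer than six recorded opponents, the player defects
        return C if len(seen) >= 6 and all(m == C for m in seen[-6:]) else D
    raise ValueError(f"unknown behavior: {behavior}")


def play_round(behaviors: List[str], history: List[List[str]],
               rng: random.Random) -> Dict[int, str]:
    """Pair all players at random, record each player's move, update histories."""
    order = list(range(len(behaviors)))
    rng.shuffle(order)
    pairs = list(zip(order[0::2], order[1::2]))
    moves = {}
    for a, b in pairs:
        moves[a] = choose(behaviors[a], history[a], rng)
        moves[b] = choose(behaviors[b], history[b], rng)
    for a, b in pairs:
        history[a].append(moves[b])  # a remembers what its opponent did
        history[b].append(moves[a])
    return moves


def winning_class(supports: Dict[str, Tuple[float, float]]) -> str:
    """Return the class whose support pair has the greatest lower bound."""
    return max(supports, key=lambda name: supports[name][0])
```

A full experiment would call play_round 75 times, turn each recorded move into evidence for or against each class, and feed that evidence to a belief updating rule; those steps are left out because the text does not specify how moves are scored against the prototypes.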
FRIL++ for Modeling with Words

In this section, we discuss how uncertain object-oriented logic programming may be used for the implementation of object-oriented modeling with words (Rossiter et al., 2001b). We consider modeling with words to be an extension of computing with words. Where computing with words can be thought of as performing operations using linguistic labels, we interpret modeling with words to mean the generation of models using linguistic labels. In modeling with words, the modeling process as well as the final model can be based upon a calculus of linguistic labels.

The goal of modeling with words is the generation of linguistic models from a combination of data and background information. Commonly, the background information can be elicited from domain-specific experts in the form of linguistic rules. An important feature of modeling with words is that the models generated must in some way be insightful. By “insightful,” we mean that some useful information can be gained by examining the model without having to apply the model to any classification or prediction problem.

Modeling with words uses simple linguistic variables and sentences to build models that can be interpreted by all, including those with no technical training. This is in contrast with many conventional machine-learning paradigms, where insight into the learned model is restricted by the representation, which is typically numeric (e.g., x = 0.98), comparative (e.g., n < p), or algebraic (e.g., z = an + bn²). These representations are comprehensible to those experts trained to understand them, but are frequently incomprehensible to nonexperts. We might say that numeric, comparative, and algebraic representations result in “black box” models, which require some degree of technical skill to interpret. With linguistic models, on the other hand, some of the blackness of the model is cleared and the goal is to produce “glass box,” or transparent, models.

A typical approach to modeling with words involves modeling individual words as information granules, as proposed by Zadeh (1996). The modeling of granular information can also be modulated by studies into computing with perceptions (Zadeh, 1999). The resulting granules correspond to a vocabulary of words that can then be used for modeling with words. Unfortunately, the restrictions of granular computation and computing with perceptions result in a vocabulary that is also restricted. As a result of this restricted vocabulary, the models generated are not, in fact, perfectly transparent. Rather, the models are “grey,” “murky,” or “foggy” in nature. Clearly, this is less than ideal and may result in some reduction in model comprehension. Even so, we can say that a murky insight into a model is better than no insight. In other words, we will accept the restriction imposed by this restricted vocabulary in order to at least gain some insight into the linguistic model, and hence, the problem domain.

The restricted vocabulary described above enables us to create simple linguistic sentences such as “the tree is tall.” In the real world, however, humans find it natural to classify real-world concepts into taxonomical hierarchies, or at least into a set of related ontological specifications. We therefore propose extending modeling with words with taxonomical (and, hence, ontological) information. Consider the simple hierarchy of trees in Figure 9. In our extended modeling with words framework, we can now create slightly more sophisticated linguistic sentences such as “the tree is a tall evergreen,” where “evergreen” is an ontological specification that is more specific than that described by the class “trees.”

Figure 9. A tree hierarchy (classes shown: TREE, DECIDUOUS, EVERGREEN)

The suggested extension of modeling with words with taxonomical information implies that our modeling environment contains taxonomical concepts, or more specifically, class hierarchies. When considering taxonomical hierarchies, it is common to think of a class definition as a theory. Because we can introduce uncertainty into a theory in the form of linguistic terms, a class definition can be thought of as a linguistic construct as well as a taxonomical construct. To this end, we propose an object-oriented framework for modeling with words using linguistic descriptors and taxonomical hierarchies. This object-oriented approach to modeling with words enables rich models to be generated, while at the same time promoting the compactness and efficiency of the resulting models.
Object-Oriented Modeling with Words

An object-oriented approach to modeling with words has the following features.

Clear Representation

A hierarchical representation of classes reflects our natural taxonomic view of the real world. Take, for example, the scientific classification of all living organisms. The top-most superclass is called ORGANISM; the next level in the hierarchy defines the domains EUKARYA, EUBACTERIA, and ARCHAEA; the next level defines the kingdoms (e.g., ANIMALIA); the next defines the phyla (e.g., VERTEBRATE); and so on until we reach the species MAN. We apply class hierarchies to all parts of our lives, even when we do not have specific scientific knowledge such as in the previous example. For example, we may classify trees into LARGETREE and SMALLTREE. We may then split LARGETREE into QUITELARGETREE and VERYLARGETREE. The important thing to see here is that the linguistic terms commonly used in computing with words (large, small, very large, etc.) may also be integral to class descriptions.
Scalability of Knowledge Representation

Modeling with words has been successful in many small-scale toy problems. The question now arises: how can modeling with words be scaled to larger, real-world problems? In object-oriented modeling with words, we naturally have a measure of scale, namely, our perspective of the hierarchy. If we build a model that has hundreds of classes in its hierarchy, we can focus on the level of the hierarchy that gives the linguistic description of the model we are looking for. Summarizing the model can be done at as many levels as there are levels in the hierarchy. A complex summary involves knowledge from lower down the hierarchy, while a more general summary involves knowledge from the top of the hierarchy. Figure 10 illustrates the perspective projection of classes from the top and bottom of the complex hierarchy on the left onto the simple hierarchy of trees on the right. In this example, we summarized the complex relationships between the classes TREE, SPRUCE, and OAK.
Figure 10. Perspective and scalability (perspective projection of a complex hierarchy onto a simple hierarchy; classes shown: TREE, SPRUCE, OAK)

Power of Inheritance, Overriding, Encapsulation, and Information Hiding

From the knowledge representation point of view, inheritance helps reduce unnecessary repetition in the class hierarchy. This aids model conciseness. Overriding, on the other hand, enables us to form a richer hierarchical model, and in extreme cases, to implement forms of nonmonotonic reasoning. Take, for example, the hierarchy of birds. We might say that all birds can fly. Yet if we define a subclass of bird called penguin, we find that no penguin can fly. A nonmonotonic contradiction exists, because penguins inherit the ability to fly from the bird superclass, and yet penguins cannot fly. Here we need the concept of overriding to mitigate the contradiction. The problem of nonmonotonic inference is discussed in more detail in preceding sections.

Encapsulation (the grouping of methods and attributes within a class) and information hiding (restricting access to properties within a class) are features that can make the final model more robust when it is used in practice. These are programming aids that are useful in modeling with words, where the models are produced to solve real-world problems.
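As a loose analogy in Python (not FRIL++, and without its probabilistic default reasoning), the bird and penguin example can be written with class attributes holding support pairs. The attribute names can_fly and has_feathers and the chosen support values are illustrative assumptions only.

```python
class Bird:
    # Support pairs (lower, upper) for each property; values are illustrative
    can_fly = (1.0, 1.0)       # "all birds can fly"
    has_feathers = (1.0, 1.0)  # hypothetical second property, inherited unchanged


class Penguin(Bird):
    # Overriding replaces the inherited property and removes the contradiction
    can_fly = (0.0, 0.0)       # "penguins cannot fly"


pingu = Penguin()
summary = {"can_fly": pingu.can_fly, "has_feathers": pingu.has_feathers}
```

The subclass restates only the property that differs; everything else is inherited, which is the conciseness benefit attributed to inheritance above.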
Uncertain Classes and Objects

Any object-oriented system for modeling with words needs to be able to represent concepts using words. The system needs to model the uncertainty that is inherent in the way humans use words. To this end, a class consists of a set of properties, each of which can involve some degree of uncertainty. Properties can be methods (they do things) or attributes (they represent facts). Attributes can be defined by fuzzy sets representing words, probability values, or any other established uncertainty representation. Methods, on the other hand, may call upon uncertain attributes and may thus define uncertain actions.

Given a vocabulary of words, a suitable calculus based on these words, and the uncertain object-oriented techniques described in this chapter, it is clear that we can implement object-oriented modeling with words in FRIL++. We propose FRIL++ as a useful tool for object-oriented modeling with words. This approach seeks to combine modeling with words with uncertain class hierarchies to give a richer and more powerful mechanism for the representation of high-level expert knowledge and the induction of insightful models from data.
Conclusions

We introduced a logic-based probabilistic and fuzzy object-oriented model in which each class property is represented by a fuzzy rule weighted by probability lower and upper bounds. We then proposed probabilistic default reasoning on fuzzy events as a suitable approach to uncertain property inheritance and class recognition problems. The intractable steps of general probabilistic default reasoning are reduced to polynomial-time ones, using Jeffrey's rule and its inverse for a weaker notion of consistency and for local inference.

On the formal basis of this model, we designed and implemented FRIL++ as the object-oriented extension of FRIL, a logic programming language dealing with both probability and fuzziness. We presented the basic features of FRIL++ with an example, and showed the important differences between the translation of a probabilistic and fuzzy object-oriented logic program and that of a classical one, which are due to uncertain class membership and property applicability. FRIL++ can thus be used as a modeling and programming language for probabilistic and fuzzy object-oriented deductive databases and knowledge bases, in the same way as predicate logic programming languages have been used for classical deductive databases and knowledge bases.

In particular, we presented the application of FRIL++ to machine learning, user modeling, and modeling with words. For machine learning, FRIL++ has been used to build a library of classes of fuzzy machine learning algorithms so that they can be compared or combined with each other, as there is no best learning algorithm for all tasks. For user modeling, prototypical user classes can be modeled as FRIL++ classes whose properties can be inherited by a user with uncertainty degrees that depend on the user's membership in those classes. For modeling with words, we propose FRIL++ as a good language for object-oriented development and implementation. We are also revising FRIL++, optimizing the compiler and adding more utilities to the language, in order to make it a powerful tool for modeling and constructing intelligent systems.
References

Axelrod, R. (1985). The evolution of cooperation. New York: Basic Books.
Baldwin, J. F., & Martin, T. P. (1995). Refining knowledge from uncertain relations: A fuzzy data browser based on fuzzy object-oriented programming in FRIL. In Proceedings of the Fourth IEEE International Conference on Fuzzy Systems (pp. 27–34).
Baldwin, J. F., Lawry, J., & Martin, T. P. (1996). A note on probability/possibility consistency for fuzzy events. In Proceedings of the Sixth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 521–526).
Baldwin, J. F., Lawry, J., & Martin, T. P. (1996). Efficient algorithms for semantic unification. In Proceedings of the Sixth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 527–532).
Baldwin, J. F., Lawry, J., & Martin, T. P. (1998). The application of generalised fuzzy rules to machine learning and automated knowledge discovery. International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, 6, 459–487.
Baldwin, J. F., Martin, T. P., & Pilsworth, B. W. (1995). FRIL: Fuzzy and evidential reasoning in artificial intelligence. Hertfordshire, United Kingdom: Research Studies Press.
Baldwin, J. F., Cao, T. H., Martin, T. P., & Rossiter, J. M. (2000). Towards soft computing object-oriented logic programming. In Proceedings of the Ninth IEEE International Conference on Fuzzy Systems (pp. 768–773).
Blanco, I., Marín, N., Pons, O., & Vila, M. A. (2001). Softening the object-oriented database model: Imprecision, uncertainty & fuzzy types. In Proceedings of the First International Joint Conference of the International Fuzzy Systems Association and the North American Fuzzy Information Processing Society (pp. 2323–2328).
Bordogna, G., Pasi, G., & Lucarella, D. (1999). A fuzzy object-oriented data model managing vague and uncertain information. International Journal of Intelligent Systems, 14, 623–651.
Cao, T. H. (2001). Uncertain inheritance and recognition as probabilistic default reasoning. International Journal of Intelligent Systems, 16, 781–803.
Cao, T. H., & Creasy, P. N. (2000). Fuzzy types: A framework for handling uncertainty about types of objects. International Journal of Approximate Reasoning, 25, 217–253.
Cao, T. H., Rossiter, J. M., Martin, T. P., & Baldwin, J. F. (2002). On the implementation of FRIL++ for object-oriented logic programming with uncertainty and fuzziness. In B. Bouchon-Meunier et al. (Eds.), Technologies for constructing intelligent systems, studies in fuzziness and soft computing (Vol. 90, pp. 393–406). Heidelberg: Physica-Verlag.
Cao, T. H., Rossiter, J. M., Martin, T. P., & Baldwin, J. F. (2001). Inheritance and recognition in uncertain and fuzzy object-oriented models. In Proceedings of the First International Joint Conference of the International Fuzzy Systems Association and the North American Fuzzy Information Processing Society (pp. 2317–2322).
Cross, V. V. (2003). Defining fuzzy relationships in object models: Abstraction and interpretation. International Journal of Fuzzy Sets and Systems, 140, 5–27.
De Tré, G. (2001). An algebra for querying a constraint defined fuzzy and uncertain object-oriented database model. In Proceedings of the First International Joint Conference of the International Fuzzy Systems Association and the North American Fuzzy Information Processing Society (pp. 2138–2143).
Dubitzky, W., Büchner, A. G., Hughes, J. G., & Bell, D. A. (1999). Towards concept-oriented databases. Data & Knowledge Engineering, 30, 23–55.
Dubois, D., Fargier, H., & Prade, H. (2000). Multiple-sources information fusion: A practical inconsistency-tolerant approach. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1047–1054).
Einhorn, H. J., & Hogarth, R. M. (1985). Ambiguity and uncertainty in probabilistic inference. Psychological Review, 93, 433–461.
Eiter, T., Lu, J. J., Lukasiewicz, T., & Subrahmanian, V. S. (2001). Probabilistic object bases. ACM Transactions on Database Systems, 26, 264–312.
Gaines, B. R. (1978). Fuzzy and probability uncertainty logics. Journal of Information and Control, 38, 154–169.
Geffner, H., & Pearl, J. (1992). Conditional entailment: Bridging two approaches to default reasoning. Artificial Intelligence, 53, 209–244.
George, R., Buckles, B. P., & Petry, F. E. (1993). Modelling class hierarchies in the fuzzy object-oriented data model. International Journal for Fuzzy Sets and Systems, 60, 259–272.
Hogarth, R. M., & Einhorn, H. J. (1992). Order effects in belief updating: The belief-adjustment model. Cognitive Psychology, 24, 1–55.
Itzkovich, I., & Hawkes, L. W. (1994). Fuzzy extension of inheritance hierarchies. International Journal for Fuzzy Sets and Systems, 62, 143–153.
Jeffrey, R. (1965). The logic of decision. New York: McGraw-Hill.
Kohavi, R., Sommerfield, D., & Dougherty, J. (1996). Data mining using MLC++: A machine learning library in C++. In Tools with Artificial Intelligence (pp. 234–245). Washington: IEEE Computer Society Press.
Lukasiewicz, T. (2000). Probabilistic default reasoning with conditional constraints. Proceedings of the Eighth International Workshop on Non-Monotonic Reasoning, Special Session on Uncertainty Frameworks in Non-Monotonic Reasoning.
Martin, T. P. (2000). Incremental learning of user models: An experimental testbed. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1419–1426).
McCabe, F. G. (1992). Logic and objects. New York: Prentice Hall.
Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.
Moss, C. (1994). Prolog++: The power of object-oriented and logic programming. Reading, MA: Addison-Wesley.
Rossazza, J.-P., Dubois, D., & Prade, H. (1997). A hierarchical model of fuzzy classes. In R. De Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 21–61). Singapore: World Scientific.
Rossiter, J. M., Cao, T. H., Martin, T. P., & Baldwin, J. F. (2000). A FRIL++ compiler for soft computing object-oriented logic programming. In Proceedings of the Sixth International Conference on Soft Computing (pp. 340–345).
Rossiter, J. M., Cao, T. H., Martin, T. P., & Baldwin, J. F. (2001a). User recognition in uncertain object-oriented user modelling. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems.
Rossiter, J. M., Cao, T. H., Martin, T. P., & Baldwin, J. F. (2001b). Object-oriented modelling with words. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Workshop on Modelling with Words.
Shastri, L. (1989). Default reasoning in semantic networks: A formalization of recognition and inheritance. Artificial Intelligence, 39, 283–355.
Stroustrup, B. (1997). The C++ programming language (3rd ed.). Reading, MA: Addison-Wesley.
Van Gyseghem, N., & De Caluwe, R. (1997). The UFO database model: Dealing with imperfect information. In R. De Caluwe (Ed.), Fuzzy and uncertain object-oriented databases: Concepts and models (pp. 123–185). Singapore: World Scientific.
Yazici, A., & George, R. (1999). Fuzzy database modelling. Studies in fuzziness and soft computing (Vol. 26). Heidelberg: Physica-Verlag.
Zadeh, L. A. (1996). Fuzzy logic = computing with words. IEEE Transactions on Fuzzy Systems, 4, 103–111.
Zadeh, L. A. (1999). From computing with numbers to computing with words: From manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45, 105–119.
SECTION II
Chapter V
Fuzzy Information Modeling with the UML

Zongmin Ma
Université de Sherbrooke, Canada
Abstract

Computer applications in nontraditional areas have placed new requirements on conceptual data modeling, and several conceptual data models have been proposed as tools for database design. However, information in real-world applications is often vague or ambiguous, and relatively little research has been done on modeling imprecision and uncertainty in conceptual data models. The UML (Unified Modeling Language) is a set of object-oriented modeling notations and is a standard of the Object Management Group (OMG). It can be applied in many areas of software engineering and knowledge engineering. Increasingly, the UML is being applied to data modeling. In this chapter, different levels of fuzziness are introduced into the class of the UML, and the corresponding graphical representations are given. The class diagrams of the UML can thereby model fuzzy information.
Introduction

One of the major areas of research in databases has been the continuous effort to enrich existing database models with a more extensive collection of semantic concepts. Databases have developed from hierarchical and network databases to relational databases. As computer technology moves into nontraditional applications such as CAD/CAM, knowledge-based systems, multimedia, and Internet systems, many feel the limitations of relational databases in these data-intensive application systems. Therefore, some nontraditional data models, such as the entity-relationship (ER) data model (Chen, 1976), the object-oriented data model, and the logic data model, have been proposed as tools for modeling databases.

One of the semantic needs not adequately addressed by traditional models is that of uncertainty. Traditional models assume the database model to be a correct reflection of the world being captured and assume that the data stored is known, accurate, and complete. It is rarely the case in real life that all or most of these assumptions are met. Different models have been proposed to handle different categories of data quality (or lack thereof). Five basic kinds of imperfection have been identified: inconsistency, imprecision, vagueness, uncertainty, and ambiguity (Bosc & Prade, 1993).

Inconsistency is a kind of semantic conflict that arises when some aspect of the real world is irreconcilably represented more than once in a database or in several different databases. Inconsistency has traditionally been applied to data. In the context of multidatabases, where multiple sources are integrated, attention was given to inconsistency at the modeling level. Imprecision and vagueness are two closely related qualities. They both relate to the context in which the value attributed to an attribute (or the interpretation assigned to a concept) is known to come from a given interval (or set of values), but we do not know exactly which one to choose at present. In general, vague information is represented by linguistic values. Uncertainty refers to those situations in which we can apportion some, but not all, of our belief to the fact that an attribute took a given value or a group of values. Random uncertainty, described using probability theory, is not considered in this chapter. Finally, ambiguity means that some elements of the model lack complete semantics, leading to several possible interpretations.

Generally, several different kinds of imperfection coexist with respect to the same piece of information. A large number of models have been proposed to handle uncertainty and vagueness. Most of these models are based on the same paradigms. Vagueness and uncertainty are generally modeled with fuzzy sets and possibility theory (Zadeh, 1965, 1978). Many of the existing approaches dealing with imprecision and uncertainty are based on the theory of fuzzy sets. Fuzzy information has been extensively investigated in the context of the relational model (Buckles & Petry,
1982; Ma, Zhang, & Ma, 1999; Prade & Testemale, 1984; Raju & Majumdar, 1988). Recent efforts have extended these results to object-oriented databases by introducing the related notions of classes, generalization/specialization, and inheritance (Bordogna, Pasi, & Lucarella, 1999; Cross, Caluwe, & Vangyseghem, 1997; Cross & Firat, 2000; Dubois, Prade, & Rossazza, 1991; George, Srikanth, Petry, & Buckles, 1996; Gyseghem & Caluwe, 1998; Lee et al., 1999; Ma, Zhang, & Ma, 2004; Marín, Vila, & Pons, 2000; Marín et al., 2003). However, most of this research focuses on modeling uncertainty at the data level; fewer results exist for uncertainty at the conceptual model level. This is especially true for modeling uncertain information in object-oriented data models.

The UML (Booch, Rumbaugh, & Jacobson, 1998; OMG, 2001) is a set of object-oriented modeling notations that was standardized by the Object Management Group (OMG). The power of the UML can be applied to many areas of software engineering and knowledge engineering (Mili, Shen, et al., 2001). The complete development of relational and object relational databases from business requirements can be described by the UML. Databases have traditionally been described by notations called entity-relationship (ER) diagrams, using a graphic representation that is similar but not identical to that of the UML. Using the UML for database design has many advantages over the traditional ER notations (Naiburg, 2000). The UML is based largely upon the ER notations and includes the ability to capture all information that is captured in a traditional data model. The additional compartment in the UML for methods or operations makes it possible to capture items like triggers, indexes, and the various types of constraints directly as part of the diagram. Because this information is modeled rather than stored in tagged values, it is visible on the modeling surface and is more easily communicated to everyone involved. So, increasingly, the UML is being applied to data modeling (Ambler, 2000a, 2000b; Blaha & Premerlani, 1999; Naiburg, 2000). More recently, the UML was used to model XML conceptually (Conrad, Scheffiner, & Freytag, 2000).

Note that while the UML reflects some of the best object-oriented modeling experiences available, it lacks some necessary semantics. One such gap is the need to handle imprecise and uncertain information. To our knowledge, the fuzzy UML data model has not been addressed in the literature, although imprecise and uncertain information exists in knowledge engineering and database systems and has been studied extensively. In this chapter, different levels of fuzziness are introduced into the class in the UML, and the corresponding graphical representations are given. The class diagrams of the UML can thereby model fuzzy information. The contribution of this chapter is an object-oriented conceptual modeling methodology for fuzzy information modeling.
The remainder of this chapter is organized as follows. The second section gives basic knowledge concerning fuzzy set and possibility distribution theories as well as knowledge of the UML class model. The fuzzy extension to class model in the UML is presented in the third section. The fourth section discusses related work, and the last section concludes this chapter.
Basic Knowledge

Fuzzy Set and Possibility Distribution

The concept of fuzzy sets was originally introduced by Zadeh (1965). Let $U$ be a universe of discourse. A fuzzy value on $U$ can be characterised by a fuzzy set $F$ in $U$. A membership function $\mu_F: U \to [0, 1]$ is defined for the fuzzy set $F$, where $\mu_F(u)$, for each $u \in U$, denotes the degree of membership of $u$ in the fuzzy set $F$. Thus, the fuzzy set $F$ is described as follows:

\[ F = \{\mu_F(u_1)/u_1,\ \mu_F(u_2)/u_2,\ \ldots,\ \mu_F(u_n)/u_n\} \]

where the pair $\mu_F(u_i)/u_i$ represents the value $u_i$ and its membership degree $\mu_F(u_i)$. The membership function $\mu_F(u)$ can be interpreted as a measure of the possibility that the value of a variable $X$ is $u$. A fuzzy set is equivalently represented by its associated possibility distribution $\pi_X$ (Zadeh, 1978):

\[ \pi_X = \{\pi_X(u_1)/u_1,\ \pi_X(u_2)/u_2,\ \ldots,\ \pi_X(u_n)/u_n\} \]

Here, $\pi_X(u_i)$, $u_i \in U$, denotes the possibility that $u_i$ is true. Let $\pi_X$ and $F$ be the possibility distribution representation and the fuzzy set representation for a fuzzy value, respectively. It is apparent that $\pi_X = F$ is true (Raju & Majumdar, 1988).
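The two representations can be made concrete with a small Python sketch over a finite universe of discourse. The fuzzy value `young`, its degrees, and the helper names are assumptions introduced only for illustration.

```python
from typing import Dict

# A fuzzy set over a finite universe, stored as element -> membership degree.
FuzzySet = Dict[int, float]


def membership(fuzzy_set: FuzzySet, u: int) -> float:
    """Return the membership degree of u; unlisted elements have degree 0."""
    return fuzzy_set.get(u, 0.0)


def possibility_distribution(fuzzy_set: FuzzySet) -> Dict[int, float]:
    """Read the fuzzy set as a possibility distribution: pi_X(u) = mu_F(u)."""
    return dict(fuzzy_set)


def show(fuzzy_set: FuzzySet) -> str:
    """Format the set in the degree/element notation used in the text."""
    pairs = ", ".join(f"{mu}/{u}" for u, mu in fuzzy_set.items())
    return "{" + pairs + "}"


# Hypothetical fuzzy value "young" over ages in years.
young: FuzzySet = {20: 1.0, 25: 0.8, 30: 0.3}

if __name__ == "__main__":
    print(show(young))
    print(membership(young, 25), membership(young, 50))
```

Because the possibility distribution is numerically identical to the membership function, the conversion is a plain copy; only the interpretation of the degrees changes.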
UML Class Model

UML provides a collection of models to capture the many aspects of a software system. From the database modeling point of view, the most relevant model is the class model. The building blocks in this class model are classes and relationships. We briefly review these building blocks.

Classes

Being the descriptor for a set of objects with similar structure, behavior, and relationships, a class represents a concept within the system being modeled. Classes have data structure, behavior, and relationships to other elements. A class is drawn as a solid-outline rectangle with three compartments separated by horizontal lines. The top name compartment holds the class name and other general properties of the class (including stereotype); the middle list compartment holds a list of attributes; the bottom list compartment holds a list of operations. Either or both of the attribute and operation compartments may be suppressed. A separator line is not drawn for a missing compartment. If a compartment is suppressed, no inference can be drawn about the presence or absence of elements in it. Figure 1 shows a class.

Figure 1. The class icon (compartments: class name, attributes, operations)
Relationships

Another main structural component of the UML class diagram is the relationship, which represents a connection between classes or class instances. The UML supports a variety of relationships:

1. Aggregation and composition: An aggregation captures a whole–part relationship between an aggregate, a class that represents the whole, and a constituent part. An open diamond is used to denote an aggregation relationship. The class touched by the open diamond is the aggregate class, denoting the "whole." Figure 2 shows an aggregation relationship. Composition is a special case of aggregation in which the constituent parts depend directly on the whole and cannot exist independently of it. Composition mainly applies to attribute composition. A composition relationship is represented by a black diamond.

Figure 2. Simple aggregation relationship (aggregate class Car with constituent parts Engine, Interior, and Chassis)

2. Generalization: Generalization is used to define a relationship between classes in order to build a taxonomy of classes: one class is a more general description of a set of other classes. The generalization relationship is depicted by a triangular arrowhead that points to the superclass. One or more lines proceed from the base of the arrowhead, connecting it to the subclasses. Figure 3 shows a generalization relationship.

Figure 3. Simple generalization relationship (superclass Vehicle with subclasses Car and Truck)

3. Association: Associations are relationships that describe connections among class instances. An association is a more general relationship than aggregation or generalization. A role may be assigned to each class taking part in an association, making the association a directed link. An association relationship is expressed by a line with an arrowhead drawn between the participating classes. Figure 4 shows an association relationship.

Figure 4. Simple association relationship (association "installing" between CD Player and Car)

4. Dependency: A dependency indicates a semantic relationship between two classes. It relates the classes themselves and does not require a set of instances for its meaning. It indicates a situation in which a change to the target class may require a change to the source class in the dependency. A dependency is shown as a dashed arrow between two classes. The class at the tail of the arrow depends on the class at the arrowhead. Figure 5 shows a dependency relationship.

Figure 5. Simple dependency relationship (classes Dependent and Employee)
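For readers who think in code, the relationships reviewed above have rough counterparts in an object-oriented language. The following Python sketch reuses the class names from Figures 2 to 4; the Garage class, the method names, and the returned string are hypothetical, and the mapping from UML to code is one possible reading rather than a normative one.

```python
from typing import List


class Engine:
    """Constituent part; it is created outside the Car (aggregation)."""


class CDPlayer:
    """Linked to a Car through the 'installing' association."""


class Vehicle:
    """Superclass of the generalization hierarchy."""


class Car(Vehicle):
    """Subclass of Vehicle and aggregate ('whole') of its parts."""

    def __init__(self, engine: Engine) -> None:
        # Aggregation: the part is passed in and can outlive the whole.
        self.engine = engine
        # Association: links to independent CDPlayer instances.
        self.cd_players: List[CDPlayer] = []

    def install(self, player: CDPlayer) -> None:
        self.cd_players.append(player)


class Truck(Vehicle):
    """Second subclass of Vehicle."""


class Garage:
    """Hypothetical class that depends on Car: it only uses Car in a method."""

    def service(self, car: Car) -> str:
        return f"serviced a car with {len(car.cd_players)} CD player(s)"


if __name__ == "__main__":
    car = Car(Engine())
    car.install(CDPlayer())
    print(Garage().service(car))
```

A composition would differ from the aggregation shown here in that the Car would create its own parts and they would not be shared or kept alive outside it.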
UML Modeling of Fuzzy Data

In this section, we extend the UML class diagrams to model fuzzy data. Because the constructs of the UML class diagram are classes and relationships, these constructs are extended on the basis of fuzzy sets.
Fuzzy Class

Objects with the same properties are gathered into classes that are organized into hierarchies. Theoretically, a class can be considered from two different viewpoints:

1. An extensional class, where the class is defined by the list of its object instances
2. An intensional class, where the class is defined by a set of attributes and the admissible values of the attributes
A class may therefore be fuzzy for several reasons. First, some objects are fuzzy objects that have similar properties. A class defined by these objects may be fuzzy, and the objects belong to the class with a membership degree in [0, 1]. Second, when a class is intensionally defined, the domain of an attribute may be fuzzy, and a fuzzy class is formed. Third, the subclass produced from a fuzzy class by means of specialization, and the superclass produced from some classes (at least one of which is fuzzy) by means of generalization, are also fuzzy.

Following in the footsteps of Zvieli and Chen (1986), we define three levels of fuzziness. In the context of classes, the three levels of fuzziness are defined as follows:

1. Fuzziness in the extent to which the class belongs to the data model, as well as fuzziness in the content (in terms of attributes) of the class
2. Fuzziness related to whether some instances are instances of a class; even though the structure of a class is crisp, it is possible that an instance belongs to the class only with a degree of membership
3. Fuzziness in the attribute values of the instances of the class; an attribute in a class defines a value domain, and when this domain is a fuzzy subset or a set of fuzzy subsets, the attribute values are fuzzy
In order to model the first level of fuzziness, i.e., an attribute or a class with a degree of membership, the attribute or class name should be followed by the words WITH mem DEGREE, where 0 ≤ mem ≤ 1 indicates the degree to which the attribute belongs to the class or the class belongs to the data model (Gyseghem & Caluwe, 1998; Marín, Vila, & Pons, 2000). For example, "Employee WITH 0.6 DEGREE" and "Office Number WITH 0.8 DEGREE" are a class and an attribute with the first level of fuzziness, respectively. Generally, an attribute or a class is not declared when its degree is 0. In addition, "WITH 1.0 DEGREE" can be omitted when the degree of an attribute or a class is 1.

Attribute values might also be fuzzy. In order to model the third level of fuzziness, a keyword FUZZY is introduced and is placed in front of the attribute.

In the second level of fuzziness, we must indicate the degree of membership to which an instance of the class belongs to the class. For this purpose, an additional attribute, whose domain is [0, 1], is introduced into the class to represent the instance membership degree. We denote this special attribute by µ. In order to distinguish a class with the second level of fuzziness, we use a dashed-outline rectangle to denote such a class.

Figure 6 shows a fuzzy class Ph.D. student. Here, attribute Age may take fuzzy values; that is, its domain is fuzzy. Ph.D. students may or may not have their own offices. It is not known for sure whether class Ph.D. student has attribute Office, but we know that Ph.D. students have offices with high possibility, say 0.8. So attribute Office uncertainly belongs to the class Ph.D. student. This class has fuzziness at the first level, and we use "WITH 0.8 DEGREE" to describe this fuzziness in the class definition. In addition, we may not be able to determine whether an object is an instance of the class, because the class is fuzzy. So an additional attribute µ is introduced into the class for this purpose.

Figure 6. A fuzzy class (class Ph.D. student with attributes ID, Name, FUZZY Age, Office WITH 0.8 DEGREE, and µ)
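The three notational devices (WITH mem DEGREE, the FUZZY keyword, and the attribute µ) can be collected in a small data structure. The Python sketch below is only one possible encoding: the names FuzzyAttribute, FuzzyClass, instance_membership, and declaration_lines are assumptions, not part of the UML extension itself. It rebuilds the textual content of the Ph.D. student class of Figure 6.

```python
from dataclasses import dataclass, field
from typing import List


def _check_degree(degree: float) -> None:
    # A degree of 0 means "not declared", so only (0, 1] is accepted here.
    if not 0.0 < degree <= 1.0:
        raise ValueError("degree must lie in (0, 1]")


@dataclass
class FuzzyAttribute:
    name: str
    degree: float = 1.0   # first level: degree to which it belongs to the class
    fuzzy: bool = False   # third level: the attribute may take fuzzy values

    def __post_init__(self) -> None:
        _check_degree(self.degree)

    def declaration(self) -> str:
        text = f"FUZZY {self.name}" if self.fuzzy else self.name
        if self.degree < 1.0:  # "WITH 1.0 DEGREE" is omitted
            text += f" WITH {self.degree} DEGREE"
        return text


@dataclass
class FuzzyClass:
    name: str
    attributes: List[FuzzyAttribute] = field(default_factory=list)
    degree: float = 1.0                # first level: class in the data model
    instance_membership: bool = False  # second level: adds the attribute µ

    def __post_init__(self) -> None:
        _check_degree(self.degree)

    def declaration_lines(self) -> List[str]:
        header = self.name
        if self.degree < 1.0:
            header += f" WITH {self.degree} DEGREE"
        lines = [header] + [a.declaration() for a in self.attributes]
        if self.instance_membership:
            lines.append("µ")
        return lines


phd_student = FuzzyClass(
    name="Ph.D. student",
    attributes=[
        FuzzyAttribute("ID"),
        FuzzyAttribute("Name"),
        FuzzyAttribute("Age", fuzzy=True),
        FuzzyAttribute("Office", degree=0.8),
    ],
    instance_membership=True,
)

if __name__ == "__main__":
    print("\n".join(phd_student.declaration_lines()))
```

In a diagramming tool, the instance_membership flag would also select the dashed-outline rectangle; the sketch records it only as a Boolean.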
Fuzzy Generalization

The concept of subclassing is one of the basic building blocks of the object model. A new class, called a subclass, is produced from another class, called a superclass, by inheriting some attributes and methods of the superclass, overriding some attributes and methods of the superclass, and defining some new attributes and methods. Because a subclass is a specialization of the superclass, any object belonging to the subclass must belong to the superclass. This characteristic can be used to determine whether two classes have a subclass-superclass relationship.

However, classes may be fuzzy. A class produced from a fuzzy class must be fuzzy. If the former is still called the subclass and the latter the superclass, the subclass-superclass relationship is fuzzy. In other words, a class is a subclass of another class with a membership degree in [0, 1]. Correspondingly, we have the following method for determining a subclass-superclass relationship:

1. For any (fuzzy) object, the membership degree to which it belongs to the subclass is less than or equal to the membership degree to which it belongs to the superclass.
2. The membership degree to which it belongs to the subclass is greater than or equal to the given threshold.

The subclass is then a subclass of the superclass with a membership degree equal to the minimum of the membership degrees to which these objects belong to the subclass. Formally, let $A$ and $B$ be (fuzzy) classes and $\beta$ be a given threshold. We say $B$ is a subclass of $A$ if

\[ (\forall e)\ (\beta \le \mu_B(e) \le \mu_A(e)) \]

The membership degree that $B$ is a subclass of $A$ should be

\[ \min_{\mu_B(e) \ge \beta} \mu_B(e) \]

Here, $e$ is an object instance of $A$ and $B$ in the universe of discourse, and $\mu_A(e)$ and $\mu_B(e)$ are the membership degrees of $e$ to $A$ and $B$, respectively.

It should be noted, however, that in the above fuzzy generalization relationship we assume that classes $A$ and $B$ can only have the second level of fuzziness. It is possible that classes $A$ and $B$ are classes with a membership degree, namely, with the first level of fuzziness. Assume that we have two classes $A$ and $B$ as follows:
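The extensional test for classes with only the second level of fuzziness can be sketched as follows. The Python code is a minimal illustration under two assumptions that the text leaves open: the quantifier ranges over the objects listed as instances of B, and an object not listed for A has membership degree 0 in A. The function name and the sample degrees are hypothetical.

```python
from typing import Dict, Optional


def subclass_degree(
    mu_a: Dict[str, float],
    mu_b: Dict[str, float],
    beta: float,
) -> Optional[float]:
    """Degree to which B is a subclass of A, or None if the test fails.

    mu_a and mu_b map object identifiers to membership degrees in A and B.
    The test requires beta <= mu_B(e) <= mu_A(e) for every instance e of B.
    """
    if not mu_b:
        return None  # no instances: the extensional test cannot be applied
    for obj, degree_b in mu_b.items():
        degree_a = mu_a.get(obj, 0.0)  # assumption: unlisted means degree 0
        if not (beta <= degree_b <= degree_a):
            return None
    # All instances passed, so the degree is the minimum membership in B.
    return min(mu_b.values())


if __name__ == "__main__":
    mu_a = {"e1": 0.9, "e2": 0.8, "e3": 0.7}
    mu_b = {"e1": 0.7, "e2": 0.8}
    print(subclass_degree(mu_a, mu_b, beta=0.6))
```

Raising the threshold above the smallest membership degree in B makes the test fail, which shows how strongly the outcome depends on the choice of β.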
A WITH degree_A DEGREE
B WITH degree_B DEGREE

Then $B$ is a subclass of $A$ if

\[ (\forall e)\ (\beta \le \mu_B(e) \le \mu_A(e)) \wedge (\beta \le \mathit{degree\_B} \le \mathit{degree\_A}) \]

That is, $B$ is a subclass of $A$ only if two requirements are met. First, the membership degrees of all objects to $A$ and $B$ must be greater than or equal to the given threshold, and the membership degree of any object to $A$ must be greater than or equal to the membership degree of this object to $B$. Second, the membership degrees of $A$ and $B$ themselves must be greater than or equal to the given threshold, and the membership degree of $A$ must be greater than or equal to the membership degree of $B$.

Consider a fuzzy superclass $A$ and its fuzzy subclasses $B_1, B_2, \ldots, B_n$ with instance membership degrees $\mu_A, \mu_{B_1}, \mu_{B_2}, \ldots, \mu_{B_n}$, respectively, which may have the degrees of membership degree_A, degree_B1, degree_B2, …, and degree_Bn, respectively. Then the following relationship is true:

\[ (\forall e)\ (\max(\mu_{B_1}(e), \mu_{B_2}(e), \ldots, \mu_{B_n}(e)) \le \mu_A(e)) \wedge (\max(\mathit{degree\_B1}, \mathit{degree\_B2}, \ldots, \mathit{degree\_Bn}) \le \mathit{degree\_A}) \]

It can be seen that we can assess fuzzy subclass-superclass relationships by utilizing the membership degrees of objects to the classes. Clearly, such an assessment is based on the extensional viewpoint of a class. When classes are defined from the intensional viewpoint, there is no object available, and the method given above cannot be used. In that case, we can use the inclusion degree of a class with respect to another class to determine the relationship between a fuzzy subclass and its superclass. The notion of inclusion degree was originally developed in Ma, Zhang, and Ma (1999) for the assessment of data redundancy in fuzzy relational databases. In Ma, Zhang, and Ma (2004), the inclusion degree is extended to evaluate the membership degree of an object to a class and, further, the relationship between a fuzzy subclass and its superclass.

Formally, let $A$ and $B$ be (fuzzy) classes, and let the degree to which $B$ is a subclass of $A$ be denoted by $\mu(A, B)$. For a given threshold $\beta$, we say $B$ is a subclass of $A$ if

\[ \mu(A, B) \ge \beta \]
The membership degree to which B is a subclass of A is clearly µ (A, B).
Now let us consider the situation in which classes A or B are classes with membership degree, namely, with the first level of fuzziness. Assume that we have two classes A and B as follows:
A WITH degree_A DEGREE
B WITH degree_B DEGREE
Then B is a subclass of A if
(µ (A, B) ≥ β) ∧ (β ≤ degree_B ≤ degree_A)
This means that B is a subclass of A only if, in addition to the requirement that the inclusion degree µ (A, B) must be greater than or equal to the given threshold, the membership degrees of A and B are greater than or equal to the given threshold, and the membership degree of A is greater than or equal to the membership degree of B.
The inclusion degree of a (fuzzy) subclass with respect to the (fuzzy) superclass can be calculated from the inclusion degrees of the attribute domains of the subclass with respect to the attribute domains of the superclass, together with the weights of the attributes. The methods for evaluating the inclusion degree of fuzzy attribute domains, and further evaluating the inclusion degree of a subclass with respect to the superclass, were developed in Ma, Zhang, and Ma (2004). It should be noted that the relationship between subclass and superclass with the first level of fuzziness was not discussed in that work (Ma, Zhang, & Ma, 2004).
In subclass-superclass hierarchies, a critical issue is multiple inheritance of class. Ambiguity arises when more than one of the superclasses have common attributes and the subclass does not explicitly declare the class from which the attribute is inherited. In this case, the subclass inherits the conflicting attribute according to the weights of that attribute with respect to the corresponding superclasses (Liu & Song, 2001; Ma, Zhang, & Ma, 2004).
It should also be noted that in a fuzzy multiple inheritance hierarchy, the subclass has different degrees with respect to different superclasses, which is not the case in classical object-oriented database systems.
To represent a fuzzy generalization relationship, a dashed triangular arrowhead is used. Figure 7 shows a fuzzy generalization relationship. Classes Young Student and Young Faculty are both classes with the second level of fuzziness; that is, they may have instances (objects) that belong to them with a membership degree. These two classes can be generalized into class Youth, a class with the second level of fuzziness.
Figure 7. A fuzzy generalization relationship (Youth generalizes Young Student and Young Faculty)
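The extensional subclass test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's implementation: the function name `is_fuzzy_subclass`, the dictionary representation of membership functions, and the sample objects are hypothetical, and the quantifier is assumed to range over the objects recorded for B.

```python
from typing import Hashable, Mapping


def is_fuzzy_subclass(
    mu_a: Mapping[Hashable, float],
    mu_b: Mapping[Hashable, float],
    beta: float,
    degree_a: float = 1.0,
    degree_b: float = 1.0,
) -> bool:
    """Extensional test: is B a fuzzy subclass of A for threshold beta?

    mu_a / mu_b map each object to its membership degree in A / B.
    degree_a / degree_b are the class-level degrees (first level of
    fuzziness); 1.0 means the class carries no first-level fuzziness.
    """
    # Class-level condition: beta <= degree_B <= degree_A.
    if not (beta <= degree_b <= degree_a):
        return False
    # Instance-level condition: beta <= mu_B(e) <= mu_A(e) for every e.
    # Assumption: e ranges over the objects recorded for B, and an object
    # missing from mu_a is treated as having membership 0.0 in A.
    return all(beta <= m_b <= mu_a.get(e, 0.0) for e, m_b in mu_b.items())


# Hypothetical membership degrees for Youth (A) and Young Student (B).
mu_youth = {"ann": 0.9, "bob": 0.8, "eve": 0.7}
mu_young_student = {"ann": 0.9, "bob": 0.6}
print(is_fuzzy_subclass(mu_youth, mu_young_student, beta=0.5))  # True
```

Raising the threshold to 0.7 makes the test fail in this example, because one object belongs to Young Student only with degree 0.6.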
Fuzzy Aggregation
An aggregation captures a whole-part relationship between an aggregate and a constituent part. These constituent parts can exist independently. Therefore, every instance of an aggregate can be projected into a set of instances of constituent parts. Let A be an aggregation of constituent parts B1, B2, …, and Bn. For e ∈ A, the projection of e to Bi is denoted by e↓Bi. Then we have (e↓B1) ∈ B1, (e↓B2) ∈ B2, …, (e↓Bn) ∈ Bn.
A class aggregated from fuzzy constituent parts must be fuzzy. If the former is still called an aggregate, the aggregation is fuzzy. In this case, a class is an aggregation of constituent parts with a membership degree in [0, 1]. Correspondingly, a fuzzy aggregation relationship holds if the following two conditions are satisfied:
1. For any (fuzzy) object, the membership degree to which it belongs to the aggregate is less than or equal to the membership degree to which its projection to each constituent part belongs to the corresponding constituent part.
2. The membership degree to which it belongs to the aggregate is greater than or equal to the given threshold.
The aggregate is then an aggregation of the constituent parts with a membership degree equal to the minimum of the membership degrees to which the projections of these objects to the constituent parts belong to the corresponding constituent parts. Let A be a fuzzy aggregation of fuzzy class sets B1, B2, …, and Bn, with instance membership degrees µA, µB1, µB2, ..., and µBn, respectively. Let β be a given threshold. Then,
(∀e) (e ∈ A ∧ β ≤ µA (e) ≤ min (µB1 (e↓B1), µB2 (e↓B2), ..., µBn (e↓Bn)))
That is, a fuzzy class A is the aggregate of a group of fuzzy classes B1, B2, …, and Bn if, for any (fuzzy) instance object, the membership degree to which it belongs to class A is less than or equal to the membership degree to which its projection to each Bi (1 ≤ i ≤ n) belongs to class Bi, and is also greater than or equal to the given threshold. The membership degree to which A is an aggregation of class sets B1, B2, …, and Bn should be
min_{µBi (e↓Bi) ≥ β} (µBi (e↓Bi)) (1 ≤ i ≤ n)
Here, e is an object instance of A.
Now let us consider the first level of fuzziness in the above-mentioned classes A, B1, B2, …, and Bn, namely, they are fuzzy classes with membership degrees. Let
A WITH degree_A DEGREE,
B1 WITH degree_B1 DEGREE,
B2 WITH degree_B2 DEGREE,
……
Bn WITH degree_Bn DEGREE.
Then A is an aggregate of B1, B2, …, and Bn if
(∀e) (e ∈ A ∧ β ≤ µA (e) ≤ min (µB1 (e↓B1), µB2 (e↓B2), ..., µBn (e↓Bn)) ∧ degree_A ≤ min (degree_B1, degree_B2, …, degree_Bn))
Here β is a given threshold.
It should be noted that the assessment of fuzzy aggregation relationships given above is based on the extensional viewpoint of class. Clearly, these methods cannot be used if the classes are defined with the intensional viewpoint, because no objects are available. In the following, we present how to determine a fuzzy aggregation relationship using the inclusion degree. Let A be a fuzzy aggregation of fuzzy class sets B1, B2, …, and Bn, and let β be a given threshold. Also let the projection of A to Bi be denoted by A↓Bi. Then,
min (µ (B1, A↓B1), µ (B2, A↓B2), ..., µ (Bn, A↓Bn)) ≥ β
Here µ (Bi, A↓Bi) (1 ≤ i ≤ n) is the degree to which Bi semantically includes A↓Bi. The membership degree to which A is an aggregation of B1, B2, …, and Bn is min (µ (B1, A↓B1), µ (B2, A↓B2), ..., µ (Bn, A↓Bn)).
Furthermore, the expression above can be extended to the situation in which A, B1, B2, …, and Bn may have the first level of fuzziness, namely, they may be fuzzy classes with membership degrees. Let β be a given threshold and
A WITH degree_A DEGREE
B1 WITH degree_B1 DEGREE
B2 WITH degree_B2 DEGREE
……
Bn WITH degree_Bn DEGREE
Then A is an aggregate of B1, B2, …, and Bn if
(min (µ (B1, A↓B1), µ (B2, A↓B2), ..., µ (Bn, A↓Bn)) ≥ β) ∧ (degree_A ≤ min (degree_B1, degree_B2, …, degree_Bn))
A dashed open diamond is used to denote a fuzzy aggregation relationship. A fuzzy aggregation relationship is shown in Figure 8. A car is aggregated from an engine, an interior, and a chassis. In Figure 8, the engine is old, and we have a fuzzy class Old Engine with the second level of fuzziness. Class Old Car, aggregated from classes Interior and Chassis and fuzzy class Old Engine, is therefore a fuzzy class with the second level of fuzziness.
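As an illustration of the extensional aggregation test, the following Python sketch checks the two conditions and computes the aggregation degree. It rests on several assumptions that are not part of the chapter: the names `is_fuzzy_aggregate` and `aggregation_degree` are hypothetical, each object of A is assumed to come with the list of membership degrees µBi (e↓Bi) of its projections, and the aggregation degree is read as the minimum over all projection memberships that reach the threshold.

```python
from typing import Hashable, Mapping, Optional, Sequence


def is_fuzzy_aggregate(
    mu_a: Mapping[Hashable, float],
    proj_mu: Mapping[Hashable, Sequence[float]],
    beta: float,
    degree_a: float = 1.0,
    part_degrees: Sequence[float] = (),
) -> bool:
    """Extensional test: is A a fuzzy aggregate of its constituent parts?

    mu_a maps each object e of A to mu_A(e); proj_mu maps e to the
    membership degrees of its projections, (mu_B1(e|B1), ..., mu_Bn(e|Bn)).
    """
    # Class-level condition (first level of fuzziness), checked only when
    # part degrees are supplied: degree_A <= min(degree_B1, ..., degree_Bn).
    if part_degrees and degree_a > min(part_degrees):
        return False
    # Instance-level condition: beta <= mu_A(e) <= min_i mu_Bi(e|Bi).
    return all(beta <= mu_a[e] <= min(proj_mu[e]) for e in mu_a)


def aggregation_degree(
    proj_mu: Mapping[Hashable, Sequence[float]], beta: float
) -> Optional[float]:
    """Minimum projection membership among those that reach beta."""
    candidates = [m for degrees in proj_mu.values() for m in degrees if m >= beta]
    return min(candidates) if candidates else None


# Hypothetical Old Car instances; parts are (Old Engine, Interior, Chassis).
mu_old_car = {"car1": 0.7, "car2": 0.6}
projections = {"car1": (0.8, 1.0, 1.0), "car2": (0.6, 1.0, 1.0)}
print(is_fuzzy_aggregate(mu_old_car, projections, beta=0.5))  # True
print(aggregation_degree(projections, beta=0.5))              # 0.6
```

Interior and Chassis are crisp in this example, so their projection memberships are 1.0 and only the fuzzy part Old Engine constrains the result.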
Figure 8. A fuzzy aggregation relationship (Old Car aggregated from Old Engine, Interior, and Chassis)
Fuzzy Association
Two levels of fuzziness can be identified in the association relationship. The first level of fuzziness means that an association relationship fuzzily exists between two associated classes, namely, this association relationship occurs with a degree of possibility. It is also possible that, although the association relationship definitely exists between the two classes, it is not known for certain whether two class instances, one belonging to each of the associated classes, have the given association relationship. This is the second level of fuzziness in the association relationship, and it arises because an instance belongs to a given class with a membership degree. The two levels of fuzziness mentioned above may occur in an association relationship simultaneously. That means that two classes have a fuzzy association relationship at the class level and, at the same time, the instances of these two classes may have a fuzzy association relationship at the instance level.
We can place the words WITH mem DEGREE (0 ≤ mem ≤ 1) after the role name of an association relationship to represent the first level of fuzziness in the association relationship. We use a double line with an arrowhead to denote the second level of fuzziness in the association relationship. Figure 9 shows two levels of fuzziness in fuzzy association relationships. In part (a), it is uncertain whether the CD player is installed in the car, and the possibility is 0.8. Classes CD Player and Car have the association relationship installing with a 0.8 membership degree. In part (b), it is certain that the CD player is installed in the car, and the possibility is 1.0. Classes CD Player and Car have the association relationship installing with a 1.0 membership degree. At the level of instances, however, the instances of classes CD Player and Car may or may not have the association relationship installing. In part (c), the two kinds of fuzzy association relationships in parts (a) and (b) arise simultaneously.
Figure 9. Fuzzy association relationships: (a) installing WITH 0.8 DEGREE between CD Player and Car; (b) installing between CD Player and Car; (c) installing WITH 0.8 DEGREE between CD Player and Car
It has been shown above that three levels of fuzziness can occur in classes. Classes with the second level of fuzziness generally result in the second level of fuzziness in the association, if this association definitely exists (that is, there is no first level of fuzziness in the association). Let A and B be two classes with the second level of fuzziness. Then an instance e of A has membership degree µA (e), and an instance f of B has membership degree µB (f). Assume that the association relationship between A and B, denoted ass (A, B), is one without the first level of fuzziness. It is clear that the association relationship between e and f, denoted ass (e, f), is one with the second level of fuzziness, i.e., with a membership degree, which can be calculated by the following:
µ (ass (e, f)) = min (µA (e), µB (f))
The first level of fuzziness in the association relationship can be indicated explicitly by the designers, even if the corresponding classes are crisp. Assume that A and B are two crisp classes and ass (A, B) is the association relationship with the first level of fuzziness, denoted ass (A, B) WITH degree_ass DEGREE. In this case, µA (e) = 1.0 and µB (f) = 1.0. Then,
µ (ass (e, f)) = degree_ass
Classes with the first level of fuzziness generally result in the first level of fuzziness of the association, if the fuzziness of this association is not indicated explicitly. Let A and B be two classes with only the first level of fuzziness, denoted A WITH degree_A DEGREE and B WITH degree_B DEGREE, respectively. Then the association relationship between A and B, denoted ass (A, B), is one with the first level of fuzziness, namely, ass (A, B) WITH degree_ass DEGREE. Here degree_ass is calculated by the following:
degree_ass = min (degree_A, degree_B)
For the instance e of A and the instance f of B, in which µA (e) = 1.0 and µB (f) = 1.0, we have:
µ (ass (e, f)) = degree_ass = min (degree_A, degree_B)
Finally, let us focus on a situation in which the classes have both the first level and the second level of fuzziness, and there is an association relationship with the first level of fuzziness between these two classes, which is explicitly indicated. Let A and B be two classes with the first level of fuzziness, denoted A WITH degree_A DEGREE and B WITH degree_B DEGREE, respectively. Let ass (A, B) be the association relationship with the first level of fuzziness between A and B, which is explicitly indicated with WITH degree_ass DEGREE. Also, let the instance e of A have membership degree µA (e), and the instance f of B have membership degree µB (f). Then we have:
µ (ass (e, f)) = min (µA (e), µB (f), degree_A, degree_B, degree_ass)
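The cases discussed for fuzzy associations all reduce to a single minimum once every degree that is absent is treated as 1.0. The sketch below illustrates this reading; the function name `association_degree`, its keyword arguments, and the input check are assumptions made for the example and do not come from the chapter.

```python
def association_degree(
    mu_a_e: float = 1.0,
    mu_b_f: float = 1.0,
    degree_a: float = 1.0,
    degree_b: float = 1.0,
    degree_ass: float = 1.0,
) -> float:
    """Membership degree of ass(e, f) between instance e of A and f of B.

    Every argument defaults to 1.0, meaning "crisp at this level", so the
    special cases in the text fall out of the same formula:
      min(mu_A(e), mu_B(f), degree_A, degree_B, degree_ass)
    """
    degrees = (mu_a_e, mu_b_f, degree_a, degree_b, degree_ass)
    # Assumed sanity check: all degrees must lie in the unit interval.
    if any(not 0.0 <= d <= 1.0 for d in degrees):
        raise ValueError("all degrees must lie in [0, 1]")
    return min(degrees)


# Second-level fuzziness only: min(mu_A(e), mu_B(f)).
print(association_degree(mu_a_e=0.8, mu_b_f=0.7))        # 0.7
# Crisp classes, association explicitly fuzzy: degree_ass.
print(association_degree(degree_ass=0.8))                # 0.8
# First-level fuzzy classes, no explicit degree: min(degree_A, degree_B).
print(association_degree(degree_a=0.9, degree_b=0.6))    # 0.6
```

Using 1.0 as the neutral default keeps one formula for all cases, since 1.0 never changes the result of a minimum over values in [0, 1].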
Fuzzy Dependency
Let us now focus on the fuzzy dependency relationship between the source class and the target class. The dependency relationship is only related to the classes and does not require a set of instances for its meaning. Therefore, the second-level fuzziness and the third-level fuzziness in a class do not affect the dependency relationship. A fuzzy dependency relationship is a dependency relationship with a degree of possibility. Just like the fuzzy association relationship above, the fuzzy dependency relationship can be indicated explicitly by the designers or be implied by the source class, based on the fact that the target class is decided by the source class.
Assume that the source class is fuzzy, with the first level of fuzziness. Then the target class must also be fuzzy, with the first level of fuzziness. The degree of possibility that the target class is decided by the source class is the same as the membership degree of the source class. For source class Employee WITH 0.85 DEGREE, for example, the target class Employee Dependent should be Employee Dependent WITH 0.85 DEGREE. The dependency relationship between Employee and Employee Dependent should be fuzzy, with a 0.85 degree of possibility. Notice that, unlike the fuzzy association relationship, only one level of fuzziness can be identified in a dependency relationship, and it is implied by the first level of fuzziness of the source class if it is not given explicitly.
Figure 10. Fuzzy dependency relationship (between Employee WITH 0.5 DEGREE and Dependent WITH 0.5 DEGREE)
Figure 11. A fuzzy UML data model (classes Employee, Dependent, Young Employee, Middle Employee, Old Employee, Car, New Car, Old Car, Engine with attributes ID, Turbo, and FUZZY Size, Interior with attributes ID, Dashboard, and Seat, and Chassis with attribute ID; associations using and liking WITH 0.9 DEGREE)
Because the fuzziness of a dependency relationship is denoted implicitly by the first level of fuzziness of the source class, a dashed line with an arrowhead can still be used to denote the fuzziness in the dependency relationship. Figure 10 shows a fuzzy dependency relationship.
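The way a dependency inherits its fuzziness from the source class can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `derive_dependency`; in particular, letting an explicitly given degree override only the dependency degree (and not the target class degree) is an assumption of this sketch, since the text does not spell out that case.

```python
from typing import Dict, Optional, Union


def derive_dependency(
    source_name: str,
    source_degree: float,
    target_name: str,
    explicit_degree: Optional[float] = None,
) -> Dict[str, Union[str, float]]:
    """Derive the target class and dependency degree from the source class.

    The target class takes over the first-level degree of the source class.
    The dependency degree is the explicit one if the designer gave it
    (assumed override), otherwise the source class degree.
    """
    dependency_degree = (
        source_degree if explicit_degree is None else explicit_degree
    )
    return {
        "source": f"{source_name} WITH {source_degree} DEGREE",
        "target": f"{target_name} WITH {source_degree} DEGREE",
        "dependency_degree": dependency_degree,
    }


# The example from the text: Employee WITH 0.85 DEGREE.
result = derive_dependency("Employee", 0.85, "Employee Dependent")
print(result["target"])             # Employee Dependent WITH 0.85 DEGREE
print(result["dependency_degree"])  # 0.85
```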
An Illustrative Example
In Figure 11, we give a simple fuzzy UML data model utilizing some notations introduced in this chapter. Class Car is a superclass, and New Car and Old Car are its two fuzzy subclasses, namely, they may have fuzzy instances. Similarly, class Employee has three fuzzy subclasses: Young Employee, Middle Employee, and Old Employee. Classes Employee and Car have a fuzzy association relationship using, which has fuzziness at the second level. In turn, fuzzy classes Young Employee and New Car have a fuzzy association relationship liking, which has fuzziness at the first level. In addition, class Car is aggregated from three classes: Engine, Chassis, and Interior. Class Engine has three attributes. The attributes ID and Turbo have crisp values, whereas Size is a fuzzy attribute that can take a fuzzy value. Classes Chassis and Interior are crisp classes, and they have no fuzziness at any of the three levels.
Related Work
By using fuzzy set theory, Zvieli and Chen (1986) introduced three levels of fuzziness in the ER model, corresponding to three levels of database abstraction: schema (metadata), instance (data), and value (data element). At the first level, entity sets, relationships, and attribute sets may be fuzzy; that is, they have an associated membership degree in the model. The second level is related to the
fuzzy occurrences of entities and relationships. Such fuzziness means that instances have a membership degree with respect to the entity and the relationship. The third level concerns the fuzzy attribute values of special entities and relationships. Consequently, ER algebra was fuzzily extended to manipulate fuzzy data. Based on a fuzzy ER model (Chaudhry, Moyne, & Rundensteiner, 1999), a methodology for the design and development of fuzzy relational databases was proposed through the rules developed for mapping a fuzzy ER schema to fuzzy relational databases. Based on fuzzy set theory and the fuzzy ER model (Chen & Kerre, 1998), several major notions in the EER model were extended, including fuzzy extensions to generalization/specialization and shared subclass/category, as well as fuzzy selective inheritance and fuzzy inheritance for derived attributes. The full fuzzy extension to EER and the graphical representations were presented in Ma, Zhang, Ma, and Chen (2001). In particular, the formal approach to mapping a fuzzy EER model to a fuzzy object-oriented database schema was provided in Ma, Zhang, Ma, and Chen (2001). In addition to the ER/EER model, the IFO data model (Abiteboul & Hull, 1987) is a mathematically defined conceptual data model that incorporates the fundamental principles of semantic database modeling within a graph-based representational framework. The extensions of IFO to deal with fuzzy information were proposed in the literature (Vila, Cubero, Medina, & Pons, 1996; Yazici, Buckles, & Petry, 1999). In Vila, Cubero, Medina, and Pons (1996), several types of imprecision and uncertainty, such as the values without semantic representation, the values with semantic representation and disjunctive meaning, the values with semantic representation and conjunctive meaning, and the representation of uncertain information, were incorporated into the attribute domain of the object-based data model.
However, some major concepts in object-based modeling, e.g., superclass-subclass and class inheritance, were not discussed. In addition to the attribute-level uncertainty, the uncertainty was considered to be at the object and class level in Vila, Cubero, Medina, and Pons (1996). In Yazici, Buckles, and Petry (1999), two levels of uncertainty, namely, the level of attribute values and the level of entity instances, were considered, and the ExIFO model was developed on this basis. In addition, the mapping that transforms the ExIFO model into fuzzy nested relational databases was provided in Yazici, Buckles, and Petry (1999). It should be pointed out, however, that fuzzy extensions to EER and IFO in the literature (Chen & Kerre, 1998; Ma, Zhang, Ma, & Chen, 2001; Vila, Cubero, Medina, & Pons, 1996; Yazici, Buckles, & Petry, 1999) took into account only the second level of fuzziness of entity/class when the major notions in these data models were extended. This chapter differs from this literature in the following ways: first, several new notions in the UML, such as association and dependency, were extended; and second, the first level of fuzziness and second level
of fuzziness in entity/class were completely considered when the major notions in the UML were extended.
Conclusions
We present a fuzzy extended UML to cope with fuzzy as well as complex objects in the real world at a conceptual level. Different levels of fuzziness are introduced into the class diagram of the UML, and the corresponding graphical representations are developed. It is not difficult to see that the classical UML is essentially a subset of the fuzzy UML. When there is not any fuzziness in the universe of discourse, the fuzzy UML can be reduced to the classical UML. The focus of this chapter is on fuzzy data modeling in the UML. As we know, the UML can be used for knowledge modeling, and knowledge may generally be imprecise and uncertain. In future work, we will concentrate on the study of class operations, constraints, and rules in fuzzy UML modeling. In addition, mapping the fuzzy UML data model into object-oriented databases will be interesting.
References
Abiteboul, S., & Hull, R. (1987). IFO: A formal semantic database model. ACM Transactions on Database Systems, 12(4), 525–565.
Ambler, S. W. (2000a). The design of a robust persistence layer for relational databases. Retrieved from the World Wide Web: http://www.ambysoft.com/persistenceLayer.pdf
Ambler, S. W. (2000b). Mapping objects to relational databases. Retrieved from the World Wide Web: http://www.AmbySoft.com/mappingObjects.pdf
Baldwin, J. F., Cao, T. H., Martin, T. P., & Rossiter, J. M. (2000). Toward soft computing object-oriented logic programming. In Proceedings of the Ninth IEEE International Conference on Fuzzy Systems (pp. 768–773).
Blaha, M., & Premerlani, W. (1999). Using UML to design database applications. Retrieved from the World Wide Web: http://www.therationaledge.com/rosearchitect/mag/archives/9904/f8.html
Booch, G., Rumbaugh, J., & Jacobson, I. (1998). The Unified Modeling Language user guide. Reading, MA: Addison-Wesley.
Bordogna, G., Pasi, G., & Lucarella, D. (1999). A fuzzy object-oriented data model for managing vague and uncertain information. International Journal of Intelligent Systems, 14, 623–651.
Bosc, P., & Prade, H. (1993). An introduction to fuzzy set and possibility theory based approaches to the treatment of uncertainty and imprecision in database management systems. In Proceedings of the Second Workshop on Uncertainty Management in Information Systems: From Needs to Solutions.
Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational database. Fuzzy Sets and Systems, 7(3), 213–226.
Cao, T. H. (2001). Uncertain inheritance and recognition as probabilistic default reasoning. International Journal of Intelligent Systems, 16, 781–803.
Chaudhry, N. A., Moyne, J. R., & Rundensteiner, E. A. (1999). An extended database design methodology for uncertain data management. Information Sciences, 121(1–2), 83–112.
Chen, G. Q., & Kerre, E. E. (1998). Extending ER/EER concepts towards fuzzy conceptual data modeling. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems, 2, 1320–1325.
Chen, P. P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9–36.
Conrad, R., Scheffiner, D., & Freytag, J. C. (2000). XML conceptual modeling using UML. In Proceedings of the 19th International Conference on Conceptual Modeling (pp. 558–571).
Cross, V., & Firat, A. (2000). Fuzzy objects for geographical information systems. Fuzzy Sets and Systems, 113, 19–36.
Cross, V., Caluwe, R., & Vangyseghem, N. (1997). A perspective from the Fuzzy Object Data Management Group (FODMG). In Proceedings of the 1997 IEEE International Conference on Fuzzy Systems, 2, 721–728.
Dubois, D., Prade, H., & Rossazza, J. P. (1991). Vagueness, typicality, and uncertainty in class hierarchies. International Journal of Intelligent Systems, 6, 167–183.
George, R., Srikanth, R., Petry, F. E., & Buckles, B. P. (1996). Uncertainty management issues in the object-oriented data model. IEEE Transactions on Fuzzy Systems, 4(2), 179–192.
Gyseghem, N. V., & Caluwe, R. D. (1998). Imprecision and uncertainty in UFO database model. Journal of the American Society for Information Science, 49(3), 236–252.
Lee, J., Xue, N. L., Hsu, K. H., & Yang, S. J. (1999). Modeling imprecise requirements with fuzzy objects. Information Sciences, 118, 101–119.
Liu, W. Y., & Song, N. (2001). The fuzzy association degree in semantic data models. Fuzzy Sets and Systems, 117(2), 203–208.
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (1999). Assessment of data redundancy in fuzzy relational databases based on semantic inclusion degree. Information Processing Letters, 72(1–2), 25–29.
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (2004). Extending object-oriented databases for fuzzy information modeling. Information Systems, 29(5), 421–435.
Ma, Z. M., Zhang, W. J., Ma, W. Y., & Chen, G. Q. (2001). Conceptual design of fuzzy object-oriented databases using extended entity-relationship model. International Journal of Intelligent Systems, 16, 697–711.
Marín, N., Medina, J. M., Pons, O., Sánchez, D., & Vila, M. A. (2003). Complex object comparison in a fuzzy context. Information and Software Technology, 45(7), 431–444.
Marín, N., Vila, M. A., & Pons, O. (2000). Fuzzy types: A new concept of type for managing vague structures. International Journal of Intelligent Systems, 15, 1061–1085.
Mili, F., Shen, W., Martinez, I., Noel, Ph., Ram, M., & Zouras, E. (2001). Knowledge modeling for design decisions. Artificial Intelligence in Engineering, 15, 153–164.
Naiburg, E. (2000). Database modeling and design using Rational Rose 2000. Retrieved from the World Wide Web: http://www.therationaledge.com/rosearchitect/mag/current/spring00/f5.html
OMG. (2001). Unified Modeling Language (UML), version 1.4. Retrieved from the World Wide Web: http://www.omg.org/technology/documents/formal/uml.htm
Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115–143.
Raju, K. V. S. V. N., & Majumdar, K. (1988). Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Transactions on Database Systems, 13(2), 129–166.
Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1996). A conceptual approach for dealing with imprecision and uncertainty in object-based data models. International Journal of Intelligent Systems, 11, 791–806.
Yazici, A., Buckles, B. P., & Petry, F. E. (1999). Handling complex and uncertain information in the ExIFO and NF2 data models. IEEE Transactions on Fuzzy Systems, 7(6), 659–676.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3–28.
Zvieli, A., & Chen, P. P. (1986). Entity-relationship modeling and fuzzy databases. In Proceedings of the 1986 IEEE International Conference on Data Engineering (pp. 320–327).
SECTION III
Chapter VI
A Framework to Build Fuzzy Object-Oriented Capabilities Over an Existing Database System
Fernando Berzal, University of Granada, Spain
Nicolás Marín, University of Granada, Spain
Olga Pons, University of Granada, Spain
M. Amparo Vila, University of Granada, Spain
Abstract
Fuzzy object-oriented database models allow the representation, storage, and retrieval of complex imperfect information according to the object-oriented data paradigm. This chapter describes both a framework and an architecture that can be used to develop fuzzy object-oriented capabilities using the conventional features of the object-oriented data paradigm. We present a framework composed of a set of classical classes, which gives
support to fuzzily described complex objects. We also explain how to deal with fuzzy extensions of object-oriented features using the conventional object-oriented features as a basis. This proposal can be used to build a fuzzy object-oriented database system by taking an existing database system as a base and minimizing the development effort.
Introduction
In the last decade, an important group of database researchers focused its studies on the adaptation of existing data models to imperfect information management, most using the Fuzzy Subset Theory, which has proven to be a good tool for handling this kind of information. At the same time, the object-oriented data paradigm increased in popularity among programmers and designers, mainly due to its powerful modeling capabilities. Most of the commercial database management systems that allow the manipulation of objects belong to the following two categories:
1. Object-oriented database management systems (OODBMSs) (Berler et al., 2000)
2. Object-relational database management systems (ORDBMSs) (Stonebraker et al., 1999)
On the one hand, object-oriented databases are designed to work easily with object-oriented programming languages such as Java, C#, and C++. OODBMSs use the same model as object-oriented programming languages. In spite of the difficulties and complexity involved in this approach, some commercial products can be found (like O2®, ObjectStore®, Objectivity®, and Versant®), although they represent only a small part of the market. On the other hand, ORDBMSs span object and relational technology. Many of the traditional relational products now incorporate the object-relational framework (like Oracle® and Postgres®).
Nowadays, most of the development efforts in the software world use the object-oriented data paradigm to represent and manipulate their data. When these applications are related to soft computing, fuzzy modeling and representation capabilities are required. In the world of databases, this fact has motivated the study and development of fuzzy object-oriented database modeling tools. They arise from the combination of object-oriented and fuzzy concepts in order to permit the representation of complex imperfect information (Kuo et al., 2001; Caluwe, 1997).
Background
The beginning of the study of fuzziness in object-oriented models is closely related to advanced semantic data models (Ruspini, 1986; Zvieli et al., 1986; Vanderberghe et al., 1991). After these initial steps, many relevant works can be found in the literature:
1. J.-P. Rossazza et al. (1998) introduced a hierarchical model of fuzzy classes, explaining important notions (e.g., the typicality concept) by means of fuzzy sets.
2. George et al. (1993) began to use similarity relationships in order to model attribute value imperfection. The work of George et al. was completed by Koyuncu et al. (2003), resulting in IFOOD, an intelligent fuzzy object-oriented model.
3. G. Bordogna et al. (1994) introduced an extended graphical notation to represent fuzzy object-oriented information.
4. N. Van Gyseghem et al. (1998) developed the UFO model, one of the most complete proposals that can be found in the literature.
Other relevant works in this area can be consulted (Na et al., 1996, 1996b; Baldwin et al., 2000, 2000b; Cao, 2001). Some define complex algebraic models, while others are focused on the logic world. Even the entity/relationship model is being studied as a design tool for object-oriented databases (Ma et al., 2001).
Motivation and Organization

Different trends exist in this research area. Proposals vary from new fuzzy object-oriented data models to adaptations of the classical object-oriented model that allow the storage of fuzzy information. From our point of view, the main drawback of the first kind of proposal is that a new fuzzy object-oriented data model requires a new system that lets users interact with it, and developing specific systems from scratch is a hard task that may not be profitable from a commercial point of view. (In fact, OODBMSs, which have a wider set of intended users, have had many commercial problems.) In contrast, the approaches belonging to the second trend need only the implementation of a translation layer over an existing database system. Performance would probably be lower, but the development effort would be much lower too.

Recently, we developed a proposal to represent fuzzy information in an object-oriented data model. Our research is mainly motivated by the need for an easy-to-use transparent mechanism to develop applications dealing with fuzzy information. Following our proposal, programmers and designers should be able to directly use new structures developed to store fuzzy information without the need for any special treatment, without altering the underlying programming platform, and with as much transparency as possible. Our proposal allows the programmer to handle data imperfection in an important set of the situations where it can appear in an object-oriented software development effort. This proposal resulted in the implementation of a framework that can be used in two ways:

1. Programmers and designers can directly use our proposal over an existing conventional database system.
2. The proposal could be the basis for the development of a fuzzy object-oriented database system built over an existing conventional database system.
As we will see in this chapter, the underlying system must include some object-oriented capabilities among its characteristics. In fact, though existing OODBMSs are a good choice, some advanced ORDBMSs could also be used, like the latest versions of Oracle RDBMS.

This chapter is devoted to the explanation of our proposal and is organized in the following sections: Fuzziness and Object Orientation describes the main features of our proposal for dealing with fuzziness in an object-oriented context; A Supporting Framework explains how to deal with fuzziness using classical object-oriented concepts as the basis of the discussion; A FOODBS Architecture presents an architecture that can be used to develop a system able to store fuzzy information in a classical object-oriented system using the framework described in the previous parts of the chapter. Some concluding remarks and future work trends end the chapter.
Fuzziness and Object Orientation

Fuzziness and object orientation can be combined from different points of view. We can consider the case of fuzzily described objects, that is, those objects with attribute values that are fuzzy values. On the other hand, we can consider the fuzzification of different concepts of the object-oriented data model, such as the concept of type or the concept of inheritance. The following two subsections describe these two matters.
Objects Fuzzily Described

Considering that an object is fuzzily described implies considering that its state is fuzzy, that is, that its attributes have fuzzy values. The capability of handling fuzzy attribute values must be built into the system, taking into account different semantic interpretations of the domains where these values are defined. Models exist (George et al., 1993; Yazici et al., 1998) that make a clear distinction between conjunctive and disjunctive semantics of fuzzy attribute values. Models also exist (Bordogna et al., 1994) that allow labels to be used in attribute values, where these labels are possibility distributions defined over a reference set of possible values. We take these theoretical considerations as a basis and present a complete way to treat imperfect information at this level: the expressiveness of linguistic labels is used to set imprecise values, but we consider the different semantics that these labels may have according to the characteristics of the domain in which they are defined.
Imprecise Attribute Values

The reasons why an attribute value can be ill-defined may differ, from poor knowledge of the actual datum to an imprecision that is inherent in the domain of the attribute. Suppose that we want to describe rooms in our database. Let us consider the following sentences that describe a given room:

1. The room is big.
2. The room is of high quality.
Both sentences are expressed with some lack of precision. We use linguistic labels to express the imprecise values in each of these sentences, but each label matches a different semantic pattern. An underlying basic domain exists below the label used in the first sentence (positive real numbers). In contrast, it is not easy to find such an underlying domain for the label high of the last sentence.
Labels without Semantic Representation

This case brings together those sentences similar to sentence (2) “The room is of high quality.” In those situations, the domain of the attribute is a set of linguistic labels (e.g., high, regular, low). We cannot define the semantics of the labels by means of fuzzy sets built over an underlying domain. The imprecision embedded in each of the labels forces us to use resemblance relationships to compare values of the domain, instead of the classical equality. (For example, Table 1 contains the definition of a resemblance relation for quality labels.)

Table 1. Quality attribute values

            High    Regular    Low
High        1       0.8        0
Regular             1          0.8
Low                            1
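The resemblance relation of Table 1 can be sketched in Java as a small symmetric lookup table. This is only an illustrative sketch: the class name LabelResemblance and its methods are hypothetical and are not part of the framework described in this chapter, and the lowercase label strings are an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a symmetric resemblance relation over linguistic labels. */
final class LabelResemblance {
    private final Map<String, Double> degrees = new HashMap<>();

    // Order-independent key, so that S(x, y) == S(y, x).
    private static String key(String x, String y) {
        return x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
    }

    /** Stores the resemblance degree for an unordered pair of labels. */
    void set(String x, String y, double degree) {
        degrees.put(key(x, y), degree);
    }

    /** Returns S(x, y): 1 for identical labels, the stored degree otherwise (0 if none). */
    double resemblance(String x, String y) {
        if (x.equals(y)) {
            return 1.0; // reflexivity
        }
        return degrees.getOrDefault(key(x, y), 0.0);
    }

    /** The relation of Table 1 for the quality labels (label spellings assumed). */
    static LabelResemblance qualityTable() {
        LabelResemblance s = new LabelResemblance();
        s.set("high", "regular", 0.8);
        s.set("regular", "low", 0.8);
        s.set("high", "low", 0.0);
        return s;
    }
}
```

Only the upper triangle of the table has to be stored, because the order-independent key makes the relation symmetric and the diagonal is handled by the reflexivity check.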
Labels with Semantic Representation

We saw the way to represent the quality of a room. Let us now see how to represent the extension of the room. In this case, the attribute has a well-defined basic domain (usually an interval of the real numbers), and the labels that stand for ill-defined values can easily be described by means of fuzzy subsets defined over this basic domain. In fact, the actual value will be one of the values of the support set. That is, the semantics of the label is a possibility distribution over the values of the underlying basic domain.

Consider, for example, the attribute extension of the class Room. The basic underlying domain of this attribute is the interval [0, ∞), and we add to the domain the set of labels {small, middle-size, big}, whose definitions are represented in Figure 1. In this case, we can also use the concept of resemblance relationship to compare labels. Nevertheless, this relationship must be built as an extension of the classical equality that holds in every set, because we have an underlying basic domain with values that must be taken into account. If D stands for the domain, B stands for the basic underlying domain, and L stands for the set of labels, Equation 1 shows a possible resemblance relationship:
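\[
\mu_S(x, y) =
\begin{cases}
1 & \text{if } (x = y) \wedge (x, y \in B) \\
0 & \text{if } (x \neq y) \wedge (x, y \in B) \\
\mu_l(z) & \text{if } ((x = l \in L) \wedge (y = z \in B)) \vee ((y = l \in L) \wedge (x = z \in B)) \\
\sup_{z \in B} (\mu_x(z) \otimes \mu_y(z)) & \text{otherwise}
\end{cases}
\tag{1}
\]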
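The four cases of Equation 1 can be sketched in Java as follows. The sketch rests on several assumptions that are not taken from the chapter: labels are modeled as trapezoidal membership functions, the t-norm ⊗ is the minimum, and the supremum is approximated by sampling a regular grid. The class and method names are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the resemblance relation of Equation 1. */
final class DisjunctiveResemblance {

    /** Trapezoidal membership function; assumes a < b <= c < d. */
    static DoubleUnaryOperator trapezoid(double a, double b, double c, double d) {
        return z -> {
            if (z <= a || z >= d) return 0.0;
            if (z < b) return (z - a) / (b - a);   // rising edge
            if (z <= c) return 1.0;                // kernel
            return (d - z) / (d - c);              // falling edge
        };
    }

    /** Cases 1 and 2: both values belong to the basic domain B. */
    static double crispVsCrisp(double x, double y) {
        return x == y ? 1.0 : 0.0;
    }

    /** Case 3: a label l compared with a basic value z gives mu_l(z). */
    static double labelVsCrisp(DoubleUnaryOperator label, double z) {
        return label.applyAsDouble(z);
    }

    /**
     * Case 4: sup over z of min(mu_x(z), mu_y(z)), approximated on a
     * regular grid of steps + 1 points over [lo, hi] (an assumption).
     */
    static double labelVsLabel(DoubleUnaryOperator x, DoubleUnaryOperator y,
                               double lo, double hi, int steps) {
        double best = 0.0;
        for (int i = 0; i <= steps; i++) {
            double z = lo + (hi - lo) * i / steps;
            best = Math.max(best, Math.min(x.applyAsDouble(z), y.applyAsDouble(z)));
        }
        return best;
    }
}
```

The grid approximation is the simplest option; for trapezoidal labels the supremum could also be obtained analytically from the intersection of the two edges.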
Figure 1. Labels for attribute extension
Fuzzy Collections

We now know how to deal with disjunctive fuzzy sets of values. However, we may have to use fuzzy collections of values in order to express some information about the object we want to represent in the system. These collections of values have a conjunctive interpretation and, thus, need special treatment. For example, consider that we want to represent the set of students who attend their lessons in a given room. We can relate each student to a room, taking into account the proportion of the day he or she spends in this room attending lessons. Accordingly, the set of students of a room may be expressed as follows:

µ(st1)/st1 + µ(st2)/st2 + … + µ(stn)/stn

where sti is a student, and µ(sti) is the degree to which the student belongs to the room. If we want to represent this kind of fuzzy value in our system, we also need suitable operators to compare the fuzzy values, taking into account that, now, the semantics of the fuzzy set are conjunctive. Conjunctive fuzzy set comparison is often done by means of the concept of inclusion:

A = B if and only if (A ⊆ B) ∧ (B ⊆ A)

To compute the inclusion degree of a fuzzy set A in a fuzzy set B, we can use the following operator (Rossaza et al., 1998):
\[
N(B \mid A) = \min_{u \in U} \{ I(\mu_A(u), \mu_B(u)) \}
\]

where I stands for a fuzzy implication operator, and U is the reference set where A and B are defined. The implication operator can be chosen in accordance with the properties we want the inclusion degree to fulfill. For example, you can use the following:

\[
I(x, y) =
\begin{cases}
1 & \text{if } x \le y \\
y / x & \text{otherwise}
\end{cases}
\]
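As an illustration, the inclusion degree N(B|A) with the implication operator given above can be computed as in the following Java sketch. Representing a fuzzy set as a map from elements to membership degrees, with absent elements having degree 0, is an assumption of the example, as are the class and method names.

```java
import java.util.Map;

/** Hypothetical sketch of the inclusion degree N(B|A). */
final class FuzzyInclusion {

    /** Implication operator from the text: 1 if x <= y, y / x otherwise. */
    static double implication(double x, double y) {
        return x <= y ? 1.0 : y / x;
    }

    /**
     * N(B|A) = min over u in U of I(muA(u), muB(u)).
     * Elements outside the support of A contribute I(0, .) = 1,
     * so it is enough to iterate over the entries of A.
     */
    static <T> double inclusion(Map<T, Double> a, Map<T, Double> b) {
        double degree = 1.0;
        for (Map.Entry<T, Double> entry : a.entrySet()) {
            double muB = b.getOrDefault(entry.getKey(), 0.0);
            degree = Math.min(degree, implication(entry.getValue(), muB));
        }
        return degree;
    }
}
```

For instance, with A = 1/s1 + 0.8/s2 and B = 1/s1 + 0.6/s2 + 0.5/s3, the sketch gives N(B|A) = min(1, 0.6/0.8) = 0.75, while N(A|B) = 0 because s3 belongs to B but not to A.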
It frequently happens that the elements of U are fuzzily described objects. In these situations, for a given element in the set A, it is not clear which element of B has to be taken in order to compare the membership degrees. To perform comparisons among this kind of fuzzy collections, we proposed (Marín et al., 2003) the following set of operators:

1. An inclusion operator (Θ), which takes into account resemblance between the elements being compared (⊗ stands for a t-norm):

\[
\Theta_S(B \mid A) = \min_{x \in U} \max_{y \in U} \theta_{A,B,S}(x, y)
\]

where

\[
\theta_{A,B,S}(x, y) = \otimes(I(\mu_A(x), \mu_B(y)), \mu_S(x, y))
\]

2. A generalized resemblance operator (ℑ), which considers both inclusion directions and which can be weighted with a cardinality ratio (Φ):

\[
\Im_{S,\otimes}(A, B) = \otimes(\Theta_S(B \mid A), \Theta_S(A \mid B))
\]

\[
\Phi(A, B) =
\begin{cases}
1 & \text{if } A = \emptyset \wedge B = \emptyset \\
\dfrac{\min(|A|, |B|)}{\max(|A|, |B|)} & \text{otherwise}
\end{cases}
\]
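The operators above can be sketched in Java as follows. This is a minimal sketch under stated assumptions: fuzzy sets are maps holding only elements with positive membership, the t-norm ⊗ is the minimum, the implication is the one given earlier (1 if x ≤ y, y/x otherwise), and the cardinality |A| is taken to be the sum of the membership degrees. None of these choices, nor the class and method names, are prescribed by the chapter.

```java
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of the operators Theta, generalized resemblance, and Phi. */
final class GeneralizedResemblance {

    static double implication(double x, double y) {
        return x <= y ? 1.0 : y / x;
    }

    /** Theta_S(B|A) = min over x of max over y of min(I(muA(x), muB(y)), S(x, y)). */
    static <T> double inclusion(Map<T, Double> a, Map<T, Double> b,
                                ToDoubleBiFunction<T, T> s) {
        double result = 1.0;
        for (Map.Entry<T, Double> x : a.entrySet()) {
            double best = 0.0;
            for (Map.Entry<T, Double> y : b.entrySet()) {
                double theta = Math.min(
                        implication(x.getValue(), y.getValue()),
                        s.applyAsDouble(x.getKey(), y.getKey()));
                best = Math.max(best, theta);
            }
            result = Math.min(result, best);
        }
        return result;
    }

    /** Generalized resemblance: t-norm (min) of both inclusion directions. */
    static <T> double resemblance(Map<T, Double> a, Map<T, Double> b,
                                  ToDoubleBiFunction<T, T> s) {
        return Math.min(inclusion(a, b, s), inclusion(b, a, s));
    }

    /** Phi(A, B), with |A| assumed to be the sum of membership degrees. */
    static double cardinalityRatio(Map<?, Double> a, Map<?, Double> b) {
        double ca = 0.0;
        for (double d : a.values()) ca += d;
        double cb = 0.0;
        for (double d : b.values()) cb += d;
        if (Math.max(ca, cb) == 0.0) {
            return 1.0; // both sets are empty
        }
        return Math.min(ca, cb) / Math.max(ca, cb);
    }
}
```

Iterating only over the stored entries is equivalent to ranging over the whole reference set U, because elements with membership 0 in A never lower the minimum and elements with membership 0 in B never raise the maximum.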
Figure 2. Problem of object comparison

Room1 =? Room2

Room1: (0.5) Quality: high; (0.8) Extension: 30 m2; (1) Floor: 4; (1) Students: 1/stu1 + 1/stu2 + 0.8/stu3 + 0.5/stu4
Room2: Quality: regular; Extension: big; Floor: high; Students: 1/stu1 + 1/stu5 + 0.75/stu3 + 0.6/stu6

Student1: (1) Name: John; (0.75) Age: young; (0.75) Height: 1.85 m
Student2: Name: Peter; Age: young; Height: 1.7 m
Student3: Name: Mary; Age: middle-aged; Height: short
Student4: Name: Tom; Age: 24; Height: tall
Student5: Name: Peter; Age: 25; Height: medium
Student6: Name: Tom; Age: young; Height: 1.9 m
Comparison of Fuzzily Described Objects

We have seen how to deal with imprecision when it appears in attribute values. Linguistic labels and resemblance measures from fuzzy set theory are the tools that allow us to handle fuzzily described objects. At this point, we can compare a pair of attribute values if they are defined in standard basic domains or in the set of imprecise domains we described in the previous paragraphs. But how can we compare two complex objects of a given class when they are fuzzily described?

The example illustrated in Figure 2 will help us to introduce the problem of object comparison. The figure depicts the information of two objects of a given class Room. Every room is described by its quality and extension (as we considered in previous examples), as well as the floor each room is on. The set of students who attend their lessons in each room is fuzzy, and students are also fuzzily described by name, age, and height. Notice that the description of the objects belonging to class Room is imprecise for the following reasons:

1. The quality is expressed by an imprecise label.
2. The attributes extension and floor can be expressed using a numerical value or an imprecise label.
3. The set of students is fuzzy: the membership degrees are computed from the percentage of time each student spends receiving lessons in each room.
To compare both rooms, we need to compare every pair of attribute values. To do that, we know how to handle resemblance in basic domains (quality, extension, and floor) and how to compare fuzzy collections of fuzzily described objects (set of students). Nevertheless, we need to solve two extra problems:

1. We need to use recursion in order to deal with complex objects (objects with attribute values that are also objects). That is, to compute the resemblance between rooms, we have to compare students. During this process, we have to deal with the possible presence of cycles in the data graph (i.e., computing the resemblance between objects o1 and o2 may, through their attributes, require the resemblance between o1 and o2 itself).
2. We need to aggregate the resemblance information that we collected by studying particular attributes. Then we must compute an overall resemblance degree for the objects as a whole.
Taking into account the ideas presented in Marín et al. (2003), we can define the calculus of the resemblance between two objects o1 and o2 of a given class C, whose type is made up of the set of attributes StrC = {a1, a2, ..., an}, by means of a function FE:

\[
FE : F_C \times O(F_C) \times O(F_C) \times P(P_2(O(F_C))) \times P(P_2(O(F_C))) \rightarrow [0, 1]
\]

where FC is the family of all the classes, and O(FC) is the set of all the class instances. P stands for the power set, and P2 represents those members of the power set whose cardinality is 2 (i.e., pairs of objects in our context). The calculus of FE(C, o1, o2, Ωvisited, Ωapprox) involves the recursive computation described below. There are two basic cases:

1. When the identity equality holds between the objects: if o1 = o2, then FE = 1.
2. When a known resemblance relation is defined in the class: if there exists a resemblance relation S defined in C, then FE = µS(o1, o2). As a particular example, when we compare two fuzzy sets of objects, we can use a generalized resemblance degree (Marín et al., 2003) that recursively compares the elements in the sets. That is, if o1 and o2 are fuzzy sets, then:

\[
FE = \Im_{FE_\Omega, \otimes}(o_1, o_2) = \otimes(\Theta_{FE_\Omega}(o_2 \mid o_1), \Theta_{FE_\Omega}(o_1 \mid o_2))
\]

where

\[
\Theta_{FE_\Omega}(o \mid o') = \min_{x \in Spp(o')} \max_{y \in Spp(o)} \{ I(\mu_{o'}(x), \mu_o(y)) \otimes FE(C_D, x, y, \Omega_{visited}, \Omega_{approx}) \}
\]

where CD stands for the class that is the reference universe of the sets, I is an implication operator, Spp(o) is the support set of o, and ⊗ is a t-norm.

A third case provides a general recursive model that applies an aggregation operator over recursive calls that compute the resemblance between pairs of attribute values. When aggregating, not all the attributes have the same importance (w_ai weights the importance of the attribute ai). The aggregation is founded on the semantics of a quantifier Q (by using oQ, the orness of Q). If {o1, o2} ∉ Ωvisited, then FE = VQ(W, R), where R contains the resemblance values r_ai = FE(C_ai, o1.ai, o2.ai, Ωvisited ∪ {{o1, o2}}, Ωapprox) if defined (C_ai is the domain class of the attribute ai), W contains the weights for the attributes ai, and VQ is Vila’s aggregation operator (Vila et al., 1995), which is defined as:

\[
V_Q(W, R) = o_Q \max_{i : r_{a_i} \in R} \{ w_{a_i} \wedge r_{a_i} \} + (1 - o_Q) \min_{i : r_{a_i} \in R} \{ r_{a_i} \vee (1 - w_{a_i}) \}
\]

The fourth and fifth cases use the variables Ωvisited and Ωapprox to deal with the existence of cycles:

1. The first time that the pair {o1, o2} produces a cycle, which is detected because {o1, o2} is already in Ωvisited, the pair is inserted into Ωapprox in order to compute an approximation that focuses only on nonproblematic attributes (those that do not lead to cycles): FE = FE(C, o1, o2, ∅, Ωapprox ∪ {{o1, o2}}).
2. If the pair of objects is already in Ωapprox (i.e., its resemblance is currently being approximated), then we do not calculate a resemblance value, and the function FE is undefined. That is, when {o1, o2} ∈ Ωvisited ∧ {o1, o2} ∈ Ωapprox, FE is undefined.

The above function is a resemblance relation, because the properties of the operators used in each of the basic cases are those of a resemblance relation [see Marín et al. (2003) for a more in-depth study].
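The recursive scheme above, including the handling of cycles, can be sketched in Java as follows. The sketch is deliberately simplified and its simplifications are assumptions, not part of the model: attribute values are either other complex objects or basic values compared with classical equality (so the fuzzy set case is omitted), all attribute weights are 1, and both objects are assumed to have the same attribute names. The class names FuzzyNode and ObjectResemblance are hypothetical. An empty OptionalDouble stands for an undefined value of FE.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.OptionalDouble;
import java.util.Set;

/** Hypothetical complex object: attribute values are basic values or other FuzzyNodes. */
final class FuzzyNode {
    final Map<String, Object> attributes = new LinkedHashMap<>();
}

final class ObjectResemblance {

    /** Vila's aggregation V_Q(W, R) with orness oQ; min and max play the roles of AND and OR. */
    static double aggregate(List<Double> w, List<Double> r, double orness) {
        double max = 0.0;
        double min = 1.0;
        for (int i = 0; i < r.size(); i++) {
            max = Math.max(max, Math.min(w.get(i), r.get(i)));
            min = Math.min(min, Math.max(r.get(i), 1.0 - w.get(i)));
        }
        return orness * max + (1.0 - orness) * min;
    }

    static OptionalDouble fe(Object o1, Object o2, Set<Set<Object>> visited,
                             Set<Set<Object>> approx, double orness) {
        if (o1 == o2) {
            return OptionalDouble.of(1.0);                       // case 1: identity
        }
        if (!(o1 instanceof FuzzyNode) || !(o2 instanceof FuzzyNode)) {
            // case 2 (simplified): classical equality as the known resemblance
            return OptionalDouble.of(Objects.equals(o1, o2) ? 1.0 : 0.0);
        }
        FuzzyNode a = (FuzzyNode) o1;
        FuzzyNode b = (FuzzyNode) o2;
        Set<Object> pair = Set.of(a, b);                         // unordered pair {o1, o2}
        if (!visited.contains(pair)) {                           // case 3: aggregate attributes
            Set<Set<Object>> nextVisited = new HashSet<>(visited);
            nextVisited.add(pair);
            List<Double> w = new ArrayList<>();
            List<Double> r = new ArrayList<>();
            for (String name : a.attributes.keySet()) {
                OptionalDouble ri = fe(a.attributes.get(name), b.attributes.get(name),
                                       nextVisited, approx, orness);
                if (ri.isPresent()) {                            // undefined values are skipped
                    w.add(1.0);                                  // assumed weight
                    r.add(ri.getAsDouble());
                }
            }
            return r.isEmpty() ? OptionalDouble.empty()
                               : OptionalDouble.of(aggregate(w, r, orness));
        }
        if (!approx.contains(pair)) {                            // case 4: first cycle detected
            Set<Set<Object>> nextApprox = new HashSet<>(approx);
            nextApprox.add(pair);
            return fe(a, b, new HashSet<>(), nextApprox, orness);
        }
        return OptionalDouble.empty();                           // case 5: undefined
    }
}
```

For two objects p and q that reference each other through an attribute, the first recursive visit to {p, q} triggers the fourth case, and the approximation is then computed from the remaining attributes only, because the second visit returns an undefined value that is left out of R.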
Fuzzy Object-Oriented Concepts

In the previous section, we studied how to deal with objects that have fuzzy attribute values. However, this is not the only level where fuzziness may appear in an object-oriented context. Many proposals can be found in the literature (Kuo et al., 2001; Caluwe, 1997) where object-oriented concepts are softened to obtain fuzzy object-oriented models. The addition of fuzziness can be considered at different levels of the object-oriented model (Caluwe, 1997):

1. Attribute values
2. Relationships among objects
3. Class extents
4. Inheritance relationships
5. Definition of the type of a class

Let us describe how fuzziness can be added at these levels in order to improve the modeling capability of the object-oriented model.
Explicit Uncertainty in Attributes Values

In the previous section, we studied the representation of imprecise attribute values of different kinds. Nevertheless, there exists a close relationship between imprecision and uncertainty. For example, when we say that the age of a student is young, we use an imprecise value to express the age. However, there is an implicit uncertainty about the age of the student: we do not know exactly the value of the age. This implicit uncertainty is well represented by the possibility distribution used to express the semantics of the label young. But there may be situations where, as well as an implicit uncertainty, we have to deal with an explicit uncertainty. For example, consider the following sentences:

1. It is sure that the student is young.
2. It is very possible that the student is young.
3. It is probable that the student is young.
We can use different scales to express the explicit uncertainty that affects an attribute value: probability (or possibility) measures defined within the [0,1] interval, linguistic labels of probability (or possibility) whose semantic representation is a disjunctive fuzzy set, certainty measures, evidence, etc. Although imprecision and explicit uncertainty can both be expressed, when dealing with them we have to take into account that each can be converted into the other (Gonzalez et al., 1999).
Semantically Enhanced Relationships among Objects

Semantic data models usually offer two ways of connecting objects (both of which can be directly translated to the object-oriented data model):

1. Attribute values, which involve a functional approach: Using this alternative, we relate a class to the classes that are its attribute domains.
2. The aggregation construct, which is used to model those relationships that explicitly need the definition of a class in order to represent them.
It should be noted that assuming any kind of imperfection in the functional connection between an object and one of its attribute values is usually equivalent to considering that the attribute domain is affected by imperfection. We studied in previous paragraphs how to deal with imprecision and uncertainty in connection with this first way of relating objects. If the programmer decides to represent aggregations as classes, we should take into account the following considerations:

1. We may want to represent the fact that we have only partial knowledge about whether a given group of objects (usually a pair) is related. That is, we want to associate a truth value with the relationship.
2. We may want to consider that the connection among the objects admits degrees of importance. This situation arises when not all the relationship instances have the same strength. In these situations, as suggested by Bordogna et al. (1994), we can use numerical or linguistic values to express this strength.
Notice that the strength has a semantic interpretation different from the one given to uncertainty. We know that the objects are related, but we consider different strengths in their connection. Moreover, both semantic nuances can be used at the same time, if required.
Fuzzy Class Extents

When designing the schema of a given application, we may want to use fuzzy extents in the classes; that is, the semantics of our application may require a gradual expression of the membership of an object in its class (for example, we can use this membership degree to express to what extent the object is compliant with a prototypical object of the class). The membership degree normally takes values within the [0,1] interval, turning the set of objects that make up the class (i.e., the class extent) into a fuzzy set of objects. There are many proposals in the literature that suggest different ways of computing the membership degree of a given object in a certain class when this degree must be inferred from its attribute values in relation to the archetypical or expected ones for the class. Most of these approaches are founded on concepts such as inclusion or typicality (Rossaza et al., 1998; Bordogna et al., 1994).

The presence of imperfection around the objects not only leads to the gradual membership of the objects but also generates important problems in the classification process. Before inserting an object in our database, we have to answer two relevant questions:

1. What class best represents the object?
2. Does this object already exist in the database?
The answers to these questions are not trivial and could lead us to situations in which we are not sure about an object of a given class. In such situations, the gradual membership is substituted by an uncertain membership.
Softening the Inheritance Relationships Level

Specialization processes create subclasses from an existing class in one of the following ways:

1. By constraining the description of a property, i.e., the attribute domain of an existing class (e.g., RedCar is a subclass of Car).
2. By specifying an additional set of properties (e.g., Employee is a subclass of Person).
Both kinds of specializations could lead to imprecise structures by considering flexible ways for characterizing the corresponding subclasses. Therefore, it may be interesting to add a degree in the inheritance connection between two classes.
Uncertainty and Precision Levels in the Schema

The presence of uncertainty in the definitions of the structures that characterize the schema of a given problem must be avoided. Indeed, one of the most important aims of a designer is to eliminate uncertainty in the schema, trying to find the hierarchy of classes that best represents the problem being modeled. In contrast, knowledge of the problem could lead us to manage different levels of precision in the structures.

The structure associated with a given class can be viewed as a set of attributes or properties, with a series of associated ranges. This concept of structure, which we call crisp structure, fulfills a large proportion of the needs related to types when the class hierarchy of a given application is being designed. However, there are other problems for which this concept of structure is not suitable, and a softening process is needed. Examples of these problems are the representation of concepts with different levels of precision, semistructured or unstructured data management, or the handling of incomplete information. These kinds of problems require the use of more expressive and powerful techniques to define the structure of a certain class of objects. In Marín et al. (2000), we presented a new concept of type that assists in solving some of these problems. Let us now look at a brief summary of this concept and its most important characteristics.

Fuzzy Types

Our new concept of type is founded on the idea of fuzzy data structure. A fuzzy structure is a fuzzy set defined over the set of all the possible attributes in the model. Taking this definition into account, a fuzzy type is a type with a structural part S that is a fuzzy structure. The support set of the fuzzy structure associated with the type is the set of attributes that can be used to characterize the type at any moment. The kernel set contains the basic attributes of the type, while each of the α-cuts of the fuzzy structure defines a precision degree with which the type can be considered.
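The notions of support set, kernel set, and α-cut of a fuzzy structure can be illustrated with the following Java sketch. Representing the fuzzy structure as a map from attribute names to membership degrees in (0, 1] is an assumption of the example, and the class FuzzyStructure is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a fuzzy structure: a fuzzy set of attribute names. */
final class FuzzyStructure {
    // attribute name -> membership degree, assumed to lie in (0, 1]
    private final Map<String, Double> membership;

    FuzzyStructure(Map<String, Double> membership) {
        this.membership = new HashMap<>(membership);
    }

    /** Alpha-cut: attributes whose membership degree is at least alpha. */
    Set<String> alphaCut(double alpha) {
        Set<String> cut = new TreeSet<>();
        for (Map.Entry<String, Double> entry : membership.entrySet()) {
            if (entry.getValue() >= alpha) {
                cut.add(entry.getKey());
            }
        }
        return cut;
    }

    /** Kernel set: attributes with membership degree 1. */
    Set<String> kernel() {
        return alphaCut(1.0);
    }

    /** Support set: every attribute with a positive membership degree. */
    Set<String> support() {
        return new TreeSet<>(membership.keySet());
    }
}
```

For a structure such as 1/name + 0.8/age + 0.3/hobby, the kernel set is {name}, the 0.5-cut is {age, name}, and the support set contains the three attributes.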
So far, in the object-oriented model, every instance of a class could reference any of the attributes of the class (instance variables). However, with our new kind of type, an instance of a given class may not incorporate certain attributes, depending on the α-cut of the class structure with which it was created. Each one of the methods defined in a class must have an associated precision level (as is the case with the attributes or instance variables) that indicates the minimum precision that an instance must have to incorporate the method in its behavior. This level of precision depends on the attributes and other methods referenced in the code of the method.

The change proposed in the concept of type involves modifications to the idea of instantiation. In order to create a new object of a given class, we must be able to choose the α-cut of properties of the type that will be used to represent it. To do that, the model has a generic method new(α) (with α ∈ (0, 1]), called fuzzy constructor. The receiver of this method can be any class C, while the parameter is the level α of the structure of this class C needed to represent the new object. The effect of sending the message new(α) to a class C with structural component S and behavior component B consists of creating an object incorporating the set Sα of attributes. The set Bα of methods defines the behavior of this object.

The inheritance mechanism H must enable part of the class structure and behavior to be inherited by its subclasses. As we have done with the instantiation mechanism, we add a threshold to indicate what proportion of the properties we want to be inherited. Two different forms of inheritance can be considered:

1. Incorporating inherited attributes and methods into the kernel set of the structural and behavior components of the subclass, respectively: In this way, the vagueness of the inherited properties is eliminated. This type of inheritance will be called inheritance without propagation, Hcrisp.
2. Keeping the vagueness, by inheriting both properties and methods affected by the corresponding membership degree: This type of inheritance will be called inheritance with propagation, Hfuzzy.
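The fuzzy constructor and the two forms of inheritance described above can be sketched in Java as follows. This is only one possible reading, stated here as an assumption: the threshold is interpreted as the level of an α-cut of the parent structure, and only attributes (not methods) are considered. The class FuzzyTypeSketch and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of new(alpha) and of inheritance with and without propagation. */
final class FuzzyTypeSketch {

    /** new(alpha): the attribute slots of an instance created at level alpha (values unset). */
    static Map<String, Object> newInstance(Map<String, Double> structure, double alpha) {
        Map<String, Object> instance = new TreeMap<>();
        for (Map.Entry<String, Double> entry : structure.entrySet()) {
            if (entry.getValue() >= alpha) {
                instance.put(entry.getKey(), null); // slot exists, no value yet
            }
        }
        return instance;
    }

    /**
     * Inherits the attributes of the parent's alpha-cut at the given threshold.
     * propagate == false: degrees become 1 (inheritance without propagation).
     * propagate == true: degrees are kept (inheritance with propagation).
     */
    static Map<String, Double> inherit(Map<String, Double> parent, double threshold,
                                       boolean propagate) {
        Map<String, Double> inherited = new TreeMap<>();
        for (Map.Entry<String, Double> entry : parent.entrySet()) {
            if (entry.getValue() >= threshold) {
                inherited.put(entry.getKey(), propagate ? entry.getValue() : 1.0);
            }
        }
        return inherited;
    }
}
```

With the parent structure 1/name + 0.8/age + 0.3/hobby and a threshold of 0.5, inheritance without propagation yields 1/name + 1/age, whereas inheritance with propagation yields 1/name + 0.8/age.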
A Supporting Framework

As we mentioned in the introduction, our research is mainly motivated by the need for an easy-to-use transparent mechanism to develop applications dealing with fuzzy information. Following our proposal, programmers and designers should be able to directly use new structures developed to store fuzzy information without the need for any special treatment, without altering the underlying programming platform, and with as much transparency as possible. Let us explain how.
Allowing the Use of Fuzzily Described Objects

This section introduces the class hierarchy that we developed in order to support the model described in the previous section. The hierarchy is developed using classical object-oriented concepts and allows for the management and comparison of fuzzily described complex objects. Following the principles that guided our research, let us describe the way this theoretical approach can be implemented in a modern programming platform, so that programmers can easily design their classes to handle imprecise objects and compare the fuzzy objects of these classes with a minimum of effort.

We used the reflection capability that many modern programming languages offer to develop a framework that user-defined classes can use to compare objects. Reflection is a feature of many modern programming languages [e.g., Java (java.sun.com) and C# (www.microsoft.com)]. This feature allows an executing program to examine or “introspect” itself and manipulate internal properties of the program. For example, it is possible for a class to obtain the names of all its members and display them.

Because our final aim is to allow the programmer to define classes and perform fuzzy comparisons (within queries, for example) without having to write complex specific code for each class, we can define a generic FuzzyObject class that will serve as a base class for the definition of any class whose objects need fuzzy representation and comparison capabilities. We can avoid duplicating code in different classes if we write a generic fuzzyEquals method in the FuzzyObject class. Taking into account that the fuzzyEquals method requires access to the fields of the particular object, the only way we can implement such a general version of this operator is through reflection. Just by extending this general FuzzyObject, the programmer can define his or her own classes to represent fuzzy objects.
Our framework, as depicted in Figure 3, also includes some classes to represent common kinds of domains for imprecision, such as linguistic labels without underlying representation (DomainWithoutRepresentation), domains where labels are possibility distributions over an underlying basic domain (DisjunctiveDomain and its subclasses to represent labels with finite support set, basic domain values, and functional representations of labels with infinite support set, like trapezoidal ones), and, finally, fuzzy collections of fuzzy objects (ConjunctiveDomain). These classes define their own fuzzyEquals logic to handle the different cases we previously discussed in the chapter.
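As an illustration of a functional label representation, the sketch below evaluates a trapezoidal membership function. The class name TrapezoidalLabel and the reading of the four parameters as the corners a <= b <= c <= d of the trapezoid are assumptions; the framework's own TrapezoidalObject class may be organized differently.

```java
// Hypothetical trapezoidal label; the corner names a <= b <= c <= d are assumptions.
class TrapezoidalLabel {
    final String name;
    final double a, b, c, d;

    TrapezoidalLabel(String name, double a, double b, double c, double d) {
        if (!(a <= b && b <= c && c <= d)) {
            throw new IllegalArgumentException("corners must satisfy a <= b <= c <= d");
        }
        this.name = name;
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    // Membership degree of x: 0 outside [a,d], 1 on [b,c], linear on the two slopes.
    double membership(double x) {
        if (x < a || x > d) {
            return 0.0;
        }
        if (x < b) {
            return (x - a) / (b - a);   // rising edge; b > a holds on this branch
        }
        if (x <= c) {
            return 1.0;                 // plateau
        }
        return (d - x) / (d - c);       // falling edge; d > c holds on this branch
    }
}
```

Under these assumptions, a label built with the corners (0, 0, 23, 33) gives a degree of 1 to the value 20 and a degree of 0.5 to the value 28.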
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
194 Berzal, Marín, Pons, & Vila
Figure 3. A framework to deal with fuzzily described complex objects [class diagram: the root class FuzzyObject declares fuzzyEquals(fuzzyObject); the classes DomainWithoutRepresentation, DisjunctiveDomain, ConjunctiveDomain, DisjunctiveFiniteObject, ConjunctiveFiniteObject, TrapezoidalObject, and BasicObject each provide their own fuzzyEquals(fuzzyObject)]
To illustrate how this framework can be used when writing a soft computing application on one of the foremost programming platforms, consider the following Java code for the example of rooms and students (Figure 2).
1. To represent the classrooms:
public class Room extends FuzzyObject {
    // Instance variables
    public Quality quality;
    public Extension extension;
    public Floor floor;
    public StudentCollection students;

    // Constructor
    public Room (Quality quality, Extension extension, Floor floor, StudentCollection students) {
        this.quality = quality;
        this.extension = extension;
        this.floor = floor;
        this.students = students;
    }

    // Field importance
    public static float fieldImportance (String fieldname) {
        String fields[] = new String[] { "quality", "extension", "floor", "students" };
        float importance[] = new float[] { 0.5f, 0.8f, 1.0f, 1.0f };
        // The remainder of this method is truncated in the source; the lookup below
        // is a reconstruction (assumption) that returns the importance of the named
        // field, or 0 when the name is unknown.
        for (int i=0; i<fields.length; i++)
            if (fields[i].equals(fieldname))
                return importance[i];
        return 0.0f;
    }
}
A Framework to Build Fuzzy Object-Oriented Capabilities 195
The fieldImportance method is specified to set the attribute importances (although it could be omitted if the user gives the same importance to all of them). The imprecise attributes of the room can be easily implemented by extending the classes provided by our framework without having to worry about the fuzzyEquals implementation.
2. The imprecise room quality is an object of a class (Quality) that extends DomainWithoutRepresentation, without having to add any special code.
3. The extension and floor attributes are both particular cases of DisjunctiveDomain and, as such, they can be basic values, trapezoids, or finitely described labels. The programmer only has to extend the class DisjunctiveDomain, again without having to write specialized code.
4. Finally, the set of students is a fuzzy collection of students, where the fuzzy collection StudentCollection inherits from ConjunctiveFiniteObject, and students are defined in the same way as a classroom.
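The way field importances might enter the comparison can be sketched as follows. The aggregation operator actually used by the framework is not given in this section, so the weighted mean below is only a stand-in, and the class and method names (WeightedResemblance, aggregate) are hypothetical.

```java
import java.util.Map;

// Hypothetical helper; the weighted mean is an assumed aggregation, not the chapter's operator.
class WeightedResemblance {

    // Combines per-field resemblance degrees, weighting each field by its importance.
    // Fields without an explicit importance are given the weight 1.0.
    static double aggregate(Map<String, Double> fieldDegrees, Map<String, Double> importance) {
        double weighted = 0.0;
        double total = 0.0;
        for (Map.Entry<String, Double> entry : fieldDegrees.entrySet()) {
            double weight = importance.getOrDefault(entry.getKey(), 1.0);
            weighted += weight * entry.getValue();
            total += weight;
        }
        // With no fields (or only zero weights) there is nothing that tells the objects apart.
        return total == 0.0 ? 1.0 : weighted / total;
    }
}
```

With such a scheme, a mismatch on a low-importance field such as quality (importance 0.5 in the Room class) lowers the overall degree less than a mismatch on floor (importance 1.0).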
The following code shows the creation of both rooms, once the classes mentioned above are defined:
// Label definitions for students
Age young = new Age (new Label("young"), 0, 0, 23, 33 );
Age middle = new Age (new Label("middle-aged"), 23, 33, 44, 48 );
Height shortHeight = new Height (new Label("short"), 0, 0, 150, 160 );
Height mediumHeight = new Height (new Label("medium"), 150, 160, 170, 180 );
Height tall = new Height (new Label("tall"), 170, 180, 300, 300 );

// Student definition
Student student1 = new Student ("John", young, new Height(185) );
Student student2 = new Student ("Peter", young, new Height(170) );
Student student3 = new Student ("Mary", middle, shortHeight );
Student student4 = new Student ("Tom", new Age(24), tall );
Student student5 = new Student ("Peter", new Age(25), mediumHeight );
Student student6 = new Student ("Tom", young, new Height(190) );

// Label definitions for rooms:
// highQuality, mediumQuality, highFloor... (as above)

// Sets of students
Vector vector1 = new Vector();
vector1.add (new MembershipDegree (1.0f, student1 ) );
vector1.add (new MembershipDegree (1.0f, student2 ) );
vector1.add (new MembershipDegree (0.8f, student3 ) );
vector1.add (new MembershipDegree (0.5f, student4 ) );
StudentCollection set1 = new StudentCollection ( vector1 );

Vector vector2 = new Vector();
vector2.add (new MembershipDegree (1.0f, student1 ) );
vector2.add (new MembershipDegree (1.0f, student5 ) );
vector2.add (new MembershipDegree (0.75f, student3 ) );
vector2.add (new MembershipDegree (0.6f, student6 ) );
StudentCollection set2 = new StudentCollection ( vector2 );

// Room definitions
Room room1 = new Room ( highQuality, new Extension(30), new Floor(4), set1 );
Room room2 = new Room ( mediumQuality, big, highFloor, set2 );
We can compare the rooms by invoking their fuzzyEquals method, as in
System.out.println("room1 fvs room2=" + room1.fuzzyEquals(room2));
where the comparison returns an approximate value of 0.81. Thus, we encapsulated fuzzy object comparisons in our framework classes so that programmers can now freely compare imprecisely described objects without having to code any comparison logic.
The capability of comparing fuzzily described objects is the basis for querying in the system. Every query describes a pattern, and we have to find the objects in the database that match this pattern. The same set of operators used to develop the fuzzyEquals method can be directly used to compare real objects with object patterns.
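The idea of a query as a pattern can be sketched as a selection over a collection of objects. The helper below is hypothetical: the name PatternQuery, the acceptance threshold, and the use of a caller-supplied resemblance function in place of fuzzyEquals are assumptions made so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical query helper; the threshold-based selection is an assumption.
class PatternQuery {

    // Returns the objects whose resemblance to the pattern reaches the threshold,
    // in their original order.
    static <T> List<T> select(List<T> objects, T pattern,
                              ToDoubleBiFunction<T, T> resemblance, double threshold) {
        List<T> result = new ArrayList<>();
        for (T candidate : objects) {
            if (resemblance.applyAsDouble(candidate, pattern) >= threshold) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With the framework's classes, the resemblance function would simply delegate to the fuzzyEquals method of the candidate object.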
Allowing the Use of Fuzzy Object-Oriented Concepts
The previous section introduced a framework that allows programmers and designers to deal with fuzzily described objects. Following the idea that guides our proposal, this section explains how to deal with fuzzy object-oriented concepts using as a basis a conventional object-oriented system or an advanced object-relational system.
Support for Fuzzy Extensions of Object-Oriented Concepts
We have two alternatives when dealing with the fuzzy extensions of object-oriented features that we described in the previous section:
1. To develop a new system that implements fuzzy object-oriented features intrinsically
2. To represent new fuzzy extensions using classical object-oriented structures as the basis
The first alternative implies the implementation of a whole database system, while the second implies the implementation of an interface that translates fuzzy concepts into classical ones that are managed by an underlying classical database system. Moreover, interested users who have to use some fuzzy features to solve their problems can directly use the proposed classical structures without needing any special software.
As we saw in the previous section, we can consider the following list of fuzzy extensions of classical object-oriented concepts:
1. Explicit uncertainty in attribute values
2. Semantically enhanced relationships among objects
3. Fuzzy class extents
4. Fuzzy inheritance connections
5. Fuzzy type definitions
All of these characteristics can be directly translated into classical object-oriented structures. Table 2 summarizes how to deal with these fuzzy extensions according to our proposal (Blanco et al., 2001).
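Two of the translations described in Table 2 can be sketched in plain Java: pairing an attribute value with a degree of certainty, and adding a membership attribute to a class whose extent is fuzzy. The class names Uncertain and Employee and their fields are hypothetical and serve only as an illustration of the idea.

```java
// Hypothetical wrapper: an attribute value together with the certainty attached to it.
final class Uncertain<T> {
    final T value;
    final double certainty;   // measured on the scale [0,1]

    Uncertain(T value, double certainty) {
        if (certainty < 0.0 || certainty > 1.0) {
            throw new IllegalArgumentException("certainty must lie in [0,1]");
        }
        this.value = value;
        this.certainty = certainty;
    }
}

// Hypothetical class with a fuzzy extent: each instance stores its own membership degree.
class Employee {
    final String name;
    final Uncertain<Integer> age;   // explicit uncertainty in an attribute value
    final double membership;        // degree to which the object belongs to the class extent

    Employee(String name, Uncertain<Integer> age, double membership) {
        this.name = name;
        this.age = age;
        this.membership = membership;
    }
}
```

Both structures are ordinary classes, so an underlying classical database system can store them without any extension; a linguistic scale could replace the numeric degree in either place.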
A FOODBS Architecture
In this section, we present an architecture that can be used to develop a system able to store fuzzy information in a classical object-oriented system using the model described in the previous parts of the chapter. According to the principle that guided our approach, all the proposed extensions of the object-oriented data model are built by means of structures that can be directly translated into a set of standard classes. This feature allows us to decrease the development effort needed to implement a fuzzy object-oriented database system with the capabilities we propose.
Let us briefly examine our development strategy. Figure 4 depicts a simplification of the ANSI/SPARC standard database architecture with a slight modification. External views are organized in such a way that the user can transparently manage data imperfection. This is the fuzzy view of the system. At the same time, the conceptual schema is divided into two different layers: the upper layer contains fuzzy schemata definitions, while the lower layer holds the corresponding classical object-oriented representation needed to support these fuzzy schemata. The internal schema is that of the classical database system being used as the basis for the fuzzy database system.
The strategy discussed leads us to an architecture organized into three levels (see Figure 5):
1. The classical database management system will provide most of the management functions and will store the objects created using the user-defined schemata.
Table 2. Fuzzy concepts and object orientation
Fuzzy concept: Explicit uncertainty in attribute values
Classical implementation: To consider that some kind of uncertainty is associated with some attribute implies that the attribute domain is an aggregation of the actual attribute domain and the scale where this lack of certainty is measured.
Fuzzy concept: Semantically enhanced relationships among objects
Classical implementation: If we want to represent the fact of having partial knowledge about whether given object groups (usually a pair) are related, we can include in the class a new attribute standing for the belief in the corresponding aggregation, using the appropriate truth scales for dealing with explicit uncertainty. If we want to represent that the connection among the objects admits degrees of importance, we can use numerical or linguistic values to express this strength. For example, we can consider the set {"high", "medium", "low"} of labels, with each label represented as a disjunctive fuzzy set in [0,1]. That is, we can add to the class an extra attribute that expresses this importance or strength in the connection.
Fuzzy concept: Fuzzy class extents
Classical implementation: We only have to add an extra attribute to the class that we want to extend in a fuzzy way. The domain of this attribute could be:
- The interval [0,1]
- A set of linguistic labels that express membership, defined over the aforementioned interval
Fuzzy concept: Fuzzy inheritance connections
Classical implementation: Some important models (Rossazza et al., 1998; George et al., 1993) consider that the superclass–subclass relationship can admit the use of degrees, founded on the idea of inclusion or matching between typical subclass attribute values and typical superclass attribute values. This characteristic can be represented in a classical object-oriented model by means of static variables that express these connection degrees using suitable scales.
Fuzzy concept: Fuzzy type definitions
Classical implementation: This new way of considering the type definition can be easily modeled over a traditional object-oriented model, using the concept of 1-ramified hierarchy of classes (Marín et al., 2001). A 1-ramified hierarchy of classes is defined as a series of classes $C_1, \ldots, C_{i-1}, C_i, C_{i+1}, \ldots, C_n$ verifying the following properties:
- For any $i \in 1..n-1$, $Sub\{C_i\} = \{C_{i+1}\}$ ($Sub\{C_i\}$ stands for the set of subclasses of $C_i$).
- For any $i \in 2..n$, $Sup\{C_i\} = \{C_{i-1}\}$ ($Sup\{C_i\}$ stands for the set of superclasses of $C_i$).
- A finite sequence of values $\alpha_i$ exists, associated with the hierarchy, such that $\alpha_1 = 1$, $\alpha_n > 0$, and $\alpha_i > \alpha_{i+1}$.
Each class of the hierarchy is used to represent an α-cut of the type being defined.
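The 1-ramified hierarchy can be pictured as a simple chain of classes, each one holding the α value of the cut it represents. The sketch below is only an illustration: the class names, the attributes placed at each level, and the particular α values are assumptions, and the small checker merely restates the three conditions on the sequence of α values.

```java
// Hypothetical three-level chain: each class has exactly one subclass and one superclass,
// and carries the alpha value of the cut it represents in a static variable.
class PhotoCore {
    static final double ALPHA = 1.0;
    public String title;
}

class PhotoDetailed extends PhotoCore {
    static final double ALPHA = 0.7;
    public String quality;
}

class PhotoFull extends PhotoDetailed {
    static final double ALPHA = 0.4;
    public String age;
}

class AlphaChain {
    // Checks alpha_1 = 1, alpha_n > 0, and alpha_i > alpha_{i+1} for consecutive levels.
    static boolean isValid(double[] alphas) {
        if (alphas.length == 0 || alphas[0] != 1.0 || alphas[alphas.length - 1] <= 0.0) {
            return false;
        }
        for (int i = 0; i + 1 < alphas.length; i++) {
            if (alphas[i] <= alphas[i + 1]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, the topmost class holds the attributes that belong to the type with full degree, and each subclass adds the attributes that enter at the next, lower α level.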
Figure 4. New layer to deal with fuzziness
Figure 5. System architecture
2. The conceptual fuzziness handler will augment the classical system capabilities to allow imperfect data manipulation.
3. The interface will communicate with the previous level, hiding the underlying complexity, and will allow users to develop their fuzzy object-oriented databases.
Metadata and general data persistence depend on two storage areas:
1. A metadata catalog will store the fuzzy schemata defined by the user.
2. A classical database will support the storage and management of user application objects.
A Prototype
In order to experiment with the architecture and framework described above, a prototype was developed. FoodBi is a graphical system that allows the creation and management of fuzzy object-oriented schemata. By means of this interface, the user can build a hierarchy of classes with fuzzy types, using, at the same time, suitable attribute domains for imperfection handling. This prototype uses Java as the target object-oriented language and Oracle 9i (an advanced object-relational DBMS) as the DBMS back-end.
An Example: Class Inspector
The core part of FoodBi is devoted to facilitating the creation of classes with extended characteristics. The information the user is asked for when defining a new class is as follows:
1. General metadata that describe the class: identifier, kind of extent (crisp or fuzzy), description, and so on;
2. Set of attributes that characterize its structural component (which can be fuzzy);
3. Set of methods that make up its behavioral component (which can also be fuzzy); and
4. A model of inheritance, using the proposed fuzzy inheritance extensions.
Figure 6 illustrates the FoodBi class inspector when defining the structural part of a class Image, which is organized in three levels of precision and has some attributes that may have imprecise values (age and quality). The information provided by the user when defining an attribute determines the way in which fuzziness will be handled:
1. In the case of attributes with imprecise values, the user can build labeled domains by choosing among different semantics: with or without underlying basic domain, disjunctive, conjunctive, etc.
2. If the attribute value can be affected by explicit uncertainty, the user can attach to the attribute domain a set of linguistic labels or the [0,1] interval in order to express this explicit uncertainty.
3. The user can even grade the relationship expressed by the attribute, combining the attribute domain with a suitable linguistic domain for expressing strength values.
Figure 6. Class inspector
Once the class description is completed, FoodBi translates it into a set of standard Java classes that implement it, following the guidelines of the fuzzy object-oriented model presented in previous sections of this chapter.
Conclusions
In this chapter, we studied several suitable strategies for representing the different kinds of imperfection that may arise when a database is being designed in an object-oriented paradigm, according to the level at which these imperfections may occur.
As part of our proposal, we demonstrated how to implement reusable fuzzy comparison capabilities in modern programming platforms through the use of reflection and theoretical results that help us apply fuzzy techniques in object-oriented models.
We also presented an architecture for the development of a fuzzy object-oriented database management system. This architecture is founded on the idea of minimizing the development effort needed to obtain data imperfection management capabilities. As the new structures needed to support data imperfection are implemented using standard object-oriented techniques, we can use an existing classical database system as the basis for our fuzzy one. This way, we only have to develop an upper layer on top of the classical system, avoiding the effort required by the implementation of a whole new system.
A prototype was developed to verify the viability of our proposals. The theoretical approach is currently being extended in order to deal with queries: in fact, the fuzzyEquals method described in the chapter is being used as the basis for performing object queries (Marín et al., 2004). The prototype is currently the basis for two main development efforts:
1. Toward its completion as a fuzzy object-oriented data management system; and
2. Toward the achievement of a general object-oriented class library that can be used to manage fuzzy information without the need for any additional interface.
Acknowledgment
This work was partially supported by the Spanish “Comisión Interministerial de Ciencia y Tecnología” under grants TIC2003-08687-C02-02 and TIC2002-04021-C02-02.
References
Baldwin, J. F., Cao, T. H., Martin, T. P., & Rossiter, J. M. (2000). Toward soft computing object-oriented logic programming. In Proceedings of the Ninth IEEE International Conference on Fuzzy Systems (pp. 768–773).
Baldwin, J. F., Cao, T. H., Martin, T. P., & Rossiter, J. M. (2000b). Implementing Fril++ for uncertain object-oriented logic programming. In Proceedings of the Eighth IEEE International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 496–503).
Berler, M., Eastman, J., Jordan, D., Russell, C., Schadow, O., Stanienda, T., & Velez, F. (2000). The object data standard: ODMG 3.0. New York: Morgan Kaufmann Publishers.
Blanco, I. J., Marín, N., Pons, O., & Vila, M. A. (2001). Softening the object-oriented database-model: Imprecision, uncertainty, and fuzzy types. In Proceedings of IFSA/NAFIPS World Congress.
Bordogna, G., Lucarella, D., & Pasi, G. (1994). A fuzzy object oriented data model. In Proceedings of FUZZ-IEEE (pp. 313–317).
Caluwe, R. de. (1997). Fuzzy and uncertain object-oriented databases: Concepts and models. Advances in fuzzy systems - applications and theory (Vol. 13). Singapore: World Scientific.
Cao, T. H. (2001). Uncertain inheritance and recognition as probabilistic default reasoning. International Journal of Intelligent Systems, 16, 781–803.
Cubero, J. C., Marín, N., Medina, J. M., Pons, O., & Vila, M. A. (2004). Fuzzy object management in an object-relational framework. In Proceedings of IPMU (pp. 1767–1774).
George, R., Buckles, B. P., & Petry, F. E. (1993). Modelling class hierarchies in the fuzzy object-oriented data model. Fuzzy Sets and Systems, 60, 259–272.
Gonzalez, A., Pons, O., & Vila, M. A. (1999). Dealing with uncertainty and imprecision by means of fuzzy numbers. International Journal of Approximate Reasoning, 21, 233–256.
Gyseghem, N. Van, & Caluwe, R. de. (1998). Imprecision and uncertainty in the UFO database model. Journal of the American Society for Information Science, 49, 236–252.
Koyuncu, M., & Yazici, A. (2003). IFOOD: An intelligent fuzzy object-oriented database architecture. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1137–1154.
Kuo, J.-Y., Lee, J., & Xue, N.-L. (2001). A note on current approaches to extend fuzzy logic to object oriented modeling. International Journal of Intelligent Systems, 16, 807–820.
Ma, Z. M., Zhang, W. J., Ma, W. Y., & Chen, C. Q. (2001). Conceptual design of fuzzy object-oriented databases using extended entity-relationship model. International Journal of Intelligent Systems, 16, 697–711.
Marín, N., Pons, O., & Vila, M. A. (2001). A strategy for adding fuzzy types to an object-oriented database system. International Journal of Intelligent Systems, 16, 863–880.
Marín, N., Medina, J. M., Pons, O., Sánchez, D., & Vila, M. A. (2003). Complex object comparison in a fuzzy context. Information and Software Technology, 45, 431–444.
Marín, N., Pons, O., & Vila, M. A. (2000). Fuzzy types: A new concept of type for managing vague structures. International Journal of Intelligent Systems, 15, 1061–1085.
Na, S. L., & Park, S. (1996). Management of fuzzy objects with fuzzy attribute values in new fuzzy object oriented data model. In Proceedings of the Second International Workshop on FQAS (pp. 19–40).
Na, S. L., & Park, S. (1996b). A fuzzy association algebra based on fuzzy object oriented data model. In Proceedings of the 20th International Conference on Compsac (pp. 624–630).
Rossazza, J.-P., Dubois, D., & Prade, H. (1998). A hierarchical model of fuzzy classes. In Fuzzy and uncertain object-oriented databases. Concepts and models, Advances in fuzzy systems - applications and theory (Vol. 13, pp. 21–61).
Ruspini, E. H. (1986). Imprecision and uncertainty in the entity-relationship model. In H. Prade, & C. V. Negoita (Eds.), Fuzzy logic and knowledge engineering (pp. 18–28). Heidelberg: Verlag TÜV Rheinland.
Stonebraker, M., & Brown, P. (1999). Object/relational DBMSs: Tracking the next great wave. New York: Morgan Kaufmann Publishers.
Vandenberghe, R. M., & Caluwe, R. de. (1991). An entity-relationship approach to the modeling of vagueness in databases. In Proceedings of ECSQAU - Symbolic and quantitative approaches to uncertainty (pp. 338–343).
Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1995). The generalized selection: An alternative way for the quotient operations in fuzzy relational databases. In B. Bouchon-Meunier, R. Yager, & L. Zadeh (Eds.), Fuzzy logic and soft computing. Singapore: World Scientific Press.
Yazici, A., George, R., & Aksoy, D. (1998). Design and implementation issues in the fuzzy object-oriented data model. Journal of Information Sciences, 108, 241–260.
Zvieli, A., & Chen, P. P. (1986). Entity-relationship modeling and fuzzy databases. In Proceedings of the Second International Conference on Data Engineering - IEEE (pp. 18–28).
206
Helmer
Chapter VII
Index Structures for Fuzzy Object-Oriented Database Systems
Sven Helmer
Universität Mannheim, Germany
Abstract
This chapter gives an overview of indexing techniques suitable for fuzzy object-oriented databases (FOODBSs). First, typical query patterns used in FOODBSs are identified, namely, single-valued, set-valued, navigational, and type hierarchy access. The description of the patterns does not follow a particular fuzzy object-oriented data model but is kept general enough to be used in different FOODBS contexts. Second, for each query pattern, index structures are presented that support the efficient evaluation of these queries. These range from standard index structures (like B-trees) to sophisticated access methods (like Join Index Hierarchies). Due to space constraints, an explanation of the basic techniques is given rather than an exhaustive description. However, the interested reader is supplied with a broad list of references for further reading. Finally, a summary and outlook conclude the chapter.
Index Structure for Fuzzy Object-Oriented Database Systems 207
Introduction
One important technique used to accelerate the associative access in database management systems (DBMS) is the use of index structures. When searching for data, we want to avoid the worst case, i.e., having to scan through the whole database and test every data object, because this is inefficient. Index structures help here as they allow fast access to data by content. Due to the semantic richness of object-oriented DBMSs, we have different methods for indexing than, e.g., in relational DBMSs. Adding fuzziness increases the number of possibilities even further. Unfortunately, publications on indexing in fuzzy object-oriented DBMSs are few and far between. Although indexing in advanced DBMSs (e.g., object-oriented, spatial, image, temporal, or XML databases) is an established research topic (for overviews see Bertino, 1997; Liu, 1996; Luk, 2002; Manolopoulos, 1999; Mueck, 1997), indexing in fuzzy databases has not yet received much attention.
This chapter is organized as follows. First, we give a brief introduction to the concepts of object-oriented DBMSs needed in the remainder of the chapter. Next, we give an overview of the different aspects of accessing data in fuzzy object-oriented DBMSs. In the next section, we investigate several index structures supporting these access patterns. We then express our opinion on future trends in the area of access methods for FOODBS systems. Finally, in the last section, we conclude with a brief summary.
Preliminaries
Storage Hierarchy
In every computing system, and therefore in every DBMS, we have several layers of storage (Figure 1). Generally, the higher a memory type is positioned in this hierarchy, the faster, costlier, and smaller it becomes. The differences between the levels are usually several orders of magnitude. We divide this hierarchy into three subcategories: primary, secondary, and tertiary storage. Primary storage consists of CPU registers, cache memory, and main memory; secondary storage comprises the disk level; and tertiary storage includes the tape level. We restrict ourselves to the levels that are most important for index structures in DBMSs: main memory and disks.
Figure 1. Levels of storage hierarchy
Object Model
Now we present a brief introduction to a (nonfuzzy) object-oriented database model. For a detailed definition see the standard by the Object Data Management Group (ODMG) (Cattell, 2000). We introduce fuzziness to this model in the next section when describing the access patterns.
Central to the object-oriented model are objects, which are database entities described by their identities, their types, and their states. The identity of an object is defined by a unique object identifier (OID), which never changes during the lifetime of the object. Each object is also an instance of a certain type (this also does not change for an object). The type determines the behavior and structure of an object. The behavior is constituted by a set of operations the object is able to execute. The structure, in turn, is described by a set of attributes and the possible relationships the object can enter into with other objects. Attributes are not restricted to domains with atomic values but are allowed to be collections, like sets, lists, or tuples. At each point in time, an object has an internal state. The state of an object is defined by the values of its attributes and the current relationships it sustains.
A type can inherit its basic structure and behavior from another type and extend this structure and behavior. In this case, we speak of inheritance: a subtype inherits properties from a supertype. All objects belonging to a type (and all its subtypes) are combined in an extent of this type. Another important feature of an object model is substitutability, i.e., an object can be used at any place in which an object of one of its supertypes is used. Last but not least, there is polymorphism. A polymorphic operation is defined for a set of types, not only for a single type. In this way, types that may otherwise be unrelated can show the same behavior. For example, the operator “+” (addition) has to be implemented differently for integers than it does for floats, but it has the same semantics.
Classification of Access Patterns
We have to distinguish between several different ways of representing and accessing data in fuzzy object-oriented DBMSs. The access methods presented later will reflect this, i.e., it will not be possible to support all query patterns with a single index structure. As many different fuzzy object models have been developed in recent years, we try to keep the description of the data representation general enough to demonstrate the applicability of different index structures in the context of FOODBSs. We differentiate between the following access patterns:
1. Single-valued attributes associated with a degree of uncertainty
2. Multivalued attributes that are described by fuzzy sets or possibility distributions
3. Navigational access via paths, i.e., objects are linked together with pointers [not all fuzzy object-oriented data models support fuzzy associations between objects; those that do include the models by Bordogna (1994), Na (1996), and Yazici (1997, 1998)]
4. Access via type hierarchies, i.e., queries may refer to specific types or a subhierarchy of types [again, not all fuzzy object-oriented models support fuzzy type hierarchies; those that do include the models by Bordogna (1994), George (1992), Na (1996), and Yazici (1997, 1998)]
Single-Valued Attributes
For our first access pattern, we are going to look at single-valued attributes that have a grade of certainty (usually ranging from 0 to 1) attached to their values. This grade reflects the level of belief in this value and is based on certainty theory (Durkin, 1994; Shortliffe, 1975). Assume that we have a database for the administration of a university. We could have a type called Staff that holds the data for employees:
class Staff {
    attribute String Name;
    attribute String Position : degree;
    attribute Integer Age : degree;
}
The addition of the clause : degree after the attributes Position and Age tells us that the values of these two attributes can be uncertain. So, if we are unsure whether a person works as an assistant professor, we can store the value "Assistant Professor" (0.6) in the attribute Position. (Note that this approach could also be modeled in crisp object models by adding another attribute holding the corresponding degree for each attribute that can contain uncertain data.) Possible queries in this context would be: “Give me the names of all staff that work as an assistant professor with at least a degree of 0.7” or “Give me the names and positions of all persons who are younger than 30 with a certainty of 0.4.” This approach is popularly applied for inexact reasoning in expert systems. As a matter of fact, the expert system MYCIN provided the basis for certainty theory.
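The remark about crisp object models can be made concrete with a small sketch in which every uncertain attribute gets a companion field holding its degree, together with the first example query evaluated by a plain scan. The names StaffRecord, StaffQueries, and namesByPosition are hypothetical, and the scan stands in for the index-supported evaluation that this chapter is concerned with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical crisp encoding of the Staff type: each uncertain attribute
// gets a companion field holding its degree of certainty.
class StaffRecord {
    final String name;
    final String position;
    final double positionDegree;
    final int age;
    final double ageDegree;

    StaffRecord(String name, String position, double positionDegree, int age, double ageDegree) {
        this.name = name;
        this.position = position;
        this.positionDegree = positionDegree;
        this.age = age;
        this.ageDegree = ageDegree;
    }
}

class StaffQueries {
    // Names of all staff holding the given position with at least the given degree.
    // This is a full scan; an index would avoid testing every object.
    static List<String> namesByPosition(List<StaffRecord> staff, String position, double minDegree) {
        List<String> names = new ArrayList<>();
        for (StaffRecord record : staff) {
            if (record.position.equals(position) && record.positionDegree >= minDegree) {
                names.add(record.name);
            }
        }
        return names;
    }
}
```

A staff member stored as "Assistant Professor" with degree 0.6 would not qualify for the query with a minimum degree of 0.7, whereas one stored with degree 0.9 would.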
Set-Valued Attributes
A more flexible approach than the previous one is to represent the value of an attribute by means of a (disjunctive) fuzzy set. Look at the following example (again, we use a general notation for fuzzy attributes):
class Staff {
    attribute String Name;
    fuzzy attribute String Position;
    fuzzy attribute Integer Age;
}
Figure 2. Examples for fuzzy sets: (a) possible positions (Res. Assis., Assis. Prof., Assoc. Prof., Full Prof.; membership axis from 0 to 1.0); (b) young age (age axis from 20 to 80; membership axis from 0 to 1.0)
Now the two attributes Position and Age are declared as fuzzy. What does this mean? If we want to express that it is perfectly possible that a person works as a research assistant or assistant professor, maybe is even an associate professor, but probably not a full professor, we can describe this fact by the fuzzy
set in Figure 2(a). Describing the age of this person as young could be done with the fuzzy set described by the membership function in Figure 2(b).
Querying on fuzzy sets is more flexible but also more complex than querying on single-valued attributes. One popular approach is based on the possibility theory by Prade and Testemale (1984), so we concentrate on this technique and describe it briefly in the following. We want to fetch all objects (with a fuzzy attribute A) that satisfy a query condition µa°θ, meaning A θ a is satisfied, where θ is a (fuzzy) comparison operator and a is a (fuzzy) constant, represented by µθ and µa, respectively. As the values of A (and the query condition) can be fuzzy, there is some uncertainty as to whether a data item satisfies the condition or not. Two fuzzy measures are used to express this degree of uncertainty. One is the possibility measure defined as follows:
\[ \forall X \in P(\Omega): \quad \Pi(X) = \max_{\omega \in X} \pi_{A(o_i)}(\omega) \tag{1} \]
where Ω is the domain of attribute A, while P(Ω) denotes the power set of Ω. The value of attribute A of object oi is described by a possibility function πA(oi) on Ω (which basically is a normalized fuzzy set, i.e., at least one item has a membership degree of 1.0). Associated with each possibility measure is a necessity measure N(X):
\[ \forall X \in P(\Omega): \quad N(X) = 1 - \Pi(\overline{X}) \tag{2} \]
where \overline{X} denotes the complement of X in Ω.
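The possibility that the value of attribute A of data item oi belongs to the set of values determined by θ and a is equal to
\[ \Pi(a \circ \theta \mid A(o_i)) = \max_{\omega \in \Omega} \min\bigl(\mu_{a \circ \theta}(\omega), \pi_{A(o_i)}(\omega)\bigr) \tag{3} \]
with
\[ \mu_{a \circ \theta}(\omega) = \max_{\omega' \in \Omega} \min\bigl(\mu_{\theta}(\omega, \omega'), \mu_{a}(\omega')\bigr) \tag{4} \]
The necessity of belonging to this set is equal to
\[ N(a \circ \theta \mid A(o_i)) = \min_{\omega \in \Omega} \max\bigl(\mu_{a \circ \theta}(\omega), 1 - \pi_{A(o_i)}(\omega)\bigr) \tag{5} \]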
Let us also present an example query for this access pattern. We want to find all persons who are approximately young [see Figure 2(b) for the fuzzy set “young”]. The comparison operator for “approximately equal to” could be defined similarly to the one found in Prade (1984):
\[ \mu_{\approx}(\omega, \omega') = \begin{cases} 1 - \dfrac{|\omega - \omega'|}{5} & \text{for } |\omega - \omega'| \le 5 \\ 0 & \text{else} \end{cases} \tag{6} \]
This formula assumes that Ω is represented by a range of numbers (as in this example, an age). The comparator determines the degree of similarity between age ω and age ω'. As we are working with a constant fuzzy value (µyoung), we can calculate the query condition µyoung°≈(ω) beforehand:
\[ \mu_{young \circ \approx}(\omega) = \max_{\omega' \in \Omega} \min\bigl(\mu_{\approx}(\omega, \omega'), \mu_{young}(\omega')\bigr) \]
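As a minimal sketch of how expressions (3) to (6) could be evaluated on a discrete domain, consider the following Python fragment. The domain of integer ages from 0 to 100 and the shape of mu_young are assumptions made for illustration only (the exact membership function of Figure 2(b) is not reproduced here); all function names are hypothetical.

```python
OMEGA = range(0, 101)  # assumed discrete domain of ages


def mu_young(age):
    """Hypothetical membership function for "young" (not the exact Figure 2(b))."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15


def mu_approx(w, w2):
    """Comparison operator "approximately equal to", expression (6)."""
    d = abs(w - w2)
    return 1 - d / 5 if d <= 5 else 0.0


def compose(mu_theta, mu_a):
    """Tabulate the query condition of expression (4) once, before the scan."""
    return {w: max(min(mu_theta(w, w2), mu_a(w2)) for w2 in OMEGA) for w in OMEGA}


def possibility(mu_cond, pi):
    """Expression (3); pi maps domain values to degrees, missing values count as 0."""
    return max(min(mu_cond[w], pi.get(w, 0.0)) for w in OMEGA)


def necessity(mu_cond, pi):
    """Expression (5)."""
    return min(max(mu_cond[w], 1 - pi.get(w, 0.0)) for w in OMEGA)


mu_cond = compose(mu_approx, mu_young)

pi_age = {28: 1.0, 29: 1.0, 30: 0.5}  # fuzzy age of one hypothetical object
p = possibility(mu_cond, pi_age)
n = necessity(mu_cond, pi_age)
```

For a normalized possibility distribution, the necessity degree never exceeds the possibility degree, which is why the two measures are usually reported as a pair.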
Navigational Access
Figure 3. Relationships between object types
Relationships between objects are described by references from one object to another. In Figure 3, we see a schema graph describing the fact that a department employs several people who are engaged in different projects. In a FOODBS system, the relationships may be fuzzy, i.e., each link from one object to another has a degree of uncertainty associated with it. Figure 4 shows an excerpt of an instantiation of the above schema. Looking at this example, we see that the person with identification s1 is certainly employed at the department d1, while we are not 100% sure that this person is working on project p1. A possible query in this context could be: “Give me all departments that probably (with a degree larger than 0.8) employ people who are almost surely (with a degree greater than 0.95) involved in the projects p3 or p4.”
Figure 4. Instantiation of the schema in Figure 3 (graph of departments d1 and d2, staff s1 to s4, and projects p1 to p5, with an uncertainty degree attached to each link)
Type Hierarchies
Figure 5. Type hierarchy (types shown: Staff, Administrative, Technical, Academic, Teaching, Research)
A query in an object-oriented database system may refer to objects of a certain type or to a certain type and all its subtypes. Look at the hierarchy of types
depicted in Figure 5. In the case of FOODBS systems, we may have objects that are not clearly assigned to a certain type. We do not look at how the membership grades are determined exactly but assume that we are able to compute them in some way. A typical query involving type hierarchies might be: “List the names of all academics who are older than 40 years. Make sure that the degree of membership to the class Academics or a subclass is at least 0.9.” As we will see, efficiently evaluating queries in which type hierarchies and other properties are mixed is not straightforward.
Index Structures for Access Patterns
After having introduced different access patterns in the last section, we now show how these queries can be supported by various index structures. The outline of this section follows that of the last section.
Single-Valued Attributes
Accesses to single-valued attributes are easiest to handle, as we can use the standard index structures of (relational) DBMSs. We present two of the most widely known index structures: B-trees and external hashing.
B-trees
Figure 6. B-tree
B-trees (Bayer, 1972) (or the more advanced B+-trees) are the standard index structures in relational database systems. They are balanced multiway trees, i.e., in contrast to binary trees, a node can have more than one key and more than two children (multiway), and all leaves are on the same level (balanced). The keys in a node N are sorted, and a subtree is assigned to each key. All keys in a subtree are less than the assigned key. All keys greater than the keys in node N are saved in an additional subtree (see Figure 6 for an example). In a database system, the nodes of a B-tree are mapped to pages in the secondary storage. A B-tree is much shallower than a binary tree, because the fan-out is much higher. For this reason and because of the balancing, only a few page accesses are necessary to find a key.
To increase branching even further, B+-trees are used. In B+-trees, all records are kept in the leaves; the inner nodes contain only reference keys. Normally, these keys are much smaller than the records. Thus, the level of branching is increased, and the height of the tree decreases. More details on B-trees and B+-trees can be found in standard textbooks on database systems (e.g., Silberschatz, 2001).
External Hashing
Figure 7. Extendible hashing (directory with global depth d = 3 and entries 000 to 111; buckets with local depths d' = 2 (h2 = 00), d' = 3 (h3 = 010), d' = 3 (h3 = 011), and d' = 1 (h1 = 1))
We describe an extendible hashing index here, as it is a typical representative of an external hashing scheme. An extendible hashing index is divided into two parts: a directory and buckets (for details, see also Fagin, 1979). In the buckets, we store the full hash keys of and pointers to the indexed data items. We determine the bucket into which a data item is inserted by looking at a prefix h_d of d bits of the hash key h. For each possible bit combination of the prefix, we find an entry in the directory pointing to the corresponding bucket. The directory has 2^d entries, where d is called global depth (see also Figure 7). When a bucket
overflows, it is split, and all its entries are divided among the two resulting buckets. In order to determine the new home of a data item, the length of the inspected hash key prefix has to be increased until at least two data items have different hash key prefixes. The size of the current prefix d' of a bucket is called local depth. If we notice after a split that the local depth d' of a bucket is larger than the global depth d, we have to increase the size of the directory. This is done by doubling the directory as often as needed to have a new global depth d equal to the local depth d'. For the bucket that was split, the new pointers are put into the directory. For the other buckets, the directory entries are copied.
B-trees and external hashing assume that we want to submit queries involving one attribute: “List the names of all persons that are 35 years old” or “Return all persons on whose age we are certain (degree = 1.0).” Usually, queries will combine attribute values with certainty degrees and will even use ranges: “I want to have a list of all persons older than 40 with a certainty degree of at least 0.8.” In such cases, B-trees and external hashing will not be efficient. We need multidimensional access methods like grid files or k-d trees (to name prominent representatives).
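The bucket splitting and directory doubling described for extendible hashing can be sketched as follows. This is a simplified in-memory illustration, not the scheme of Fagin (1979) in full: the 8-bit key width, the bucket capacity of two, and all class and method names are assumptions, and a bucket is split one prefix bit at a time (repeating the split if necessary) instead of jumping directly to the first distinguishing prefix length.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth d'
        self.items = []     # (hash key, reference) pairs


class ExtendibleHash:
    BITS = 8  # assumed width of a hash key in bits

    def __init__(self, capacity=2):
        self.capacity = capacity      # assumed bucket capacity
        self.depth = 0                # global depth d
        self.directory = [Bucket(0)]  # always holds 2**d entries

    def _prefix(self, h, d):
        """Return the first d bits of the hash key h."""
        return h >> (self.BITS - d)

    def lookup(self, h):
        bucket = self.directory[self._prefix(h, self.depth)]
        return [ref for key, ref in bucket.items if key == h]

    def insert(self, h, ref):
        bucket = self.directory[self._prefix(h, self.depth)]
        bucket.items.append((h, ref))
        # Split repeatedly until the bucket holding h no longer overflows.
        while len(bucket.items) > self.capacity and bucket.depth < self.BITS:
            self._split(bucket)
            bucket = self.directory[self._prefix(h, self.depth)]

    def _split(self, bucket):
        if bucket.depth == self.depth:
            # Double the directory: every entry is copied to two neighbours.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.depth += 1
        depth = bucket.depth + 1
        zero, one = Bucket(depth), Bucket(depth)
        for h, ref in bucket.items:
            target = one if self._prefix(h, depth) & 1 else zero
            target.items.append((h, ref))
        # Redirect only the entries of the split bucket; others stay untouched.
        for i, b in enumerate(self.directory):
            if b is bucket:
                bit = (i >> (self.depth - depth)) & 1
                self.directory[i] = one if bit else zero


table = ExtendibleHash()
keys = [0b00010000, 0b00100000, 0b01000000, 0b01100000, 0b10000000, 0b11000000]
for n, key in enumerate(keys):
    table.insert(key, f"o{n + 1}")
```

After these insertions the directory has global depth 2, while the bucket for keys starting with bit 1 still has local depth 1 and is therefore shared by two directory entries.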
Grid Files
Figure 8. Grid file (age scale from <20 to >60 and degree scale starting at <0.25, each mapped to cell indexes 0 to 3)
A grid file can be seen as a generalization of hashing to multiple dimensions (Nievergelt, 1984). Let us assume that we want to index the attribute Age with its corresponding degree of uncertainty. Figure 8 shows an example of a grid file for that case. The data space is partitioned into cells. The cells can share data pages as indicated by the dashed lines in Figure 8. For each dimension, we provide a linear scale that partitions the particular dimension in a uniform way, mapping the domain to an index. Accesses to the grid are done via these linear scales to
determine the correct cell index. Range queries pose no problems. We just have to be careful to eliminate false drops caused by the page sharing.
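A rough sketch of the lookup via linear scales, assuming scale boundaries of 20, 40, and 60 for the age and 0.25, 0.5, and 0.75 for the degree, could look as follows. The page-sharing rule, the sample records, and all names are hypothetical; splitting and merging of cells are left out.

```python
from bisect import bisect_right

AGE_SCALE = [20, 40, 60]          # assumed boundaries of the age scale
DEGREE_SCALE = [0.25, 0.5, 0.75]  # assumed boundaries of the degree scale


def cell_of(age, degree):
    """Map a point to its cell index via the two linear scales."""
    return bisect_right(AGE_SCALE, age), bisect_right(DEGREE_SCALE, degree)


# Assumed directory: cells (i, 2k) and (i, 2k + 1) share the data page (i, k).
page_of = {(i, j): (i, j // 2) for i in range(4) for j in range(4)}
pages = {}


def insert(age, degree, name):
    pages.setdefault(page_of[cell_of(age, degree)], []).append((age, degree, name))


def range_query(age_lo, age_hi, deg_lo, deg_hi):
    i_lo, j_lo = cell_of(age_lo, deg_lo)
    i_hi, j_hi = cell_of(age_hi, deg_hi)
    page_ids = {page_of[i, j]
                for i in range(i_lo, i_hi + 1)
                for j in range(j_lo, j_hi + 1)}
    # Shared pages may contain records outside the query range (false drops),
    # so every fetched record is checked again.
    return sorted(r for p in page_ids for r in pages.get(p, [])
                  if age_lo <= r[0] <= age_hi and deg_lo <= r[1] <= deg_hi)


for record in [(45, 0.9, "Bob"), (45, 0.6, "Eve"), (30, 0.9, "Ann"), (70, 0.85, "Dan")]:
    insert(*record)

# "All persons older than 40 with a certainty degree of at least 0.8"
older_than_40 = range_query(41, 120, 0.8, 1.0)
```

In this example, Eve's record lives on the same page as Bob's and is fetched along with it; the final comparison removes it from the result.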
K-d Trees
The original k-d tree is a generalization of a binary tree to many dimensions (Bentley, 1975). In an ordinary (balanced) binary tree, each node splits the remaining data objects beneath it roughly into two halves. All objects with values smaller than the node value are found to the left of the node, all those with greater values are found to the right of the node. At each level of a k-d tree, a different dimension is chosen to divide the data objects. In our running example, we would first split according to age, then according to the uncertainty degree, then age again, and so on. As binary trees are not well suited for secondary storage structures, several extensions and modifications to k-d trees were proposed, e.g., k-d B-trees (Robinson, 1981) and hBπ-trees (Evangelidis, 1995). (For a general overview of multidimensional access methods, see Gaede, 1998.)
Set-Valued Attributes
This query type is more flexible than the previous one on single-valued attributes. Therefore, this is the area where the most work has been done (Bosc, 1989, 1988; Boss, 1999; Helmer, 2001). (Additionally, all of these techniques can be used in other fuzzy DBMSs and are not restricted to object-oriented DBMSs.) The basic principle (as introduced by Prade, 1984) is to look at fuzzy attribute values in terms of possibility distributions. Simplifying, we can see a possibility distribution as a disjunctive normalized fuzzy set, i.e., at least one value from the domain has a membership degree of one. Expressions (3) and (5) are unwieldy in terms of calculating them efficiently. Bosc and Galibourg show in Bosc (1989) how to simplify the evaluation of these expressions. A data item oi belongs to the set of data items possibly satisfying the query, iff
\[ \Pi(a \circ \theta \mid A(o_i)) > 0 \;\Leftrightarrow\; L_{>0}(\pi_{A(o_i)}) \cap L_{>0}(\mu_{a \circ \theta}) \neq \emptyset \tag{7} \]
where L>0 are α-cuts of fuzzy sets. An α-cut of a fuzzy set F is defined as (0 ≤ α ≤ 1)
\[ L_{\alpha}(\mu_F) = \{\omega \in \Omega \mid \mu_F(\omega) \ge \alpha\} \tag{8} \]
We talk of strict α-cuts whenever
\[ L_{>\alpha}(\mu_F) = \{\omega \in \Omega \mid \mu_F(\omega) > \alpha\} \tag{9} \]
There are two special α-cuts, the core L1(µF) and the support L>0(µF) of a fuzzy set F. For more selective queries, an acceptance threshold α can be provided by the user. Determining qualifying data items then boils down to
\[ \Pi(a \circ \theta \mid A(o_i)) > \alpha \;\Leftrightarrow\; L_{\alpha}(\pi_{A(o_i)}) \cap L_{\alpha}(\mu_{a \circ \theta}) \neq \emptyset \tag{10} \]
\[ \Rightarrow\; L_{>0}(\pi_{A(o_i)}) \cap L_{\alpha}(\mu_{a \circ \theta}) \neq \emptyset \tag{11} \]
In both cases, the appropriate α -cut of µa°θ can be calculated beforehand, and the supports of the fuzzy sets describing the attribute values of the data items can be used in filtering data items during query evaluation. The case for the necessity measure is handled similarly. A data item oi belongs to the set of data items necessarily satisfying the query, iff
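\[ N(a \circ \theta \mid A(o_i)) > 0 \;\Leftrightarrow\; L_{1}(\pi_{A(o_i)}) \subseteq L_{>0}(\mu_{a \circ \theta}) \tag{12} \]
For an acceptance threshold of α we get
\[ N(a \circ \theta \mid A(o_i)) > \alpha \;\Leftrightarrow\; L_{1-\alpha}(\pi_{A(o_i)}) \subseteq L_{\alpha}(\mu_{a \circ \theta}) \tag{13} \]
\[ \Rightarrow\; L_{1}(\pi_{A(o_i)}) \subseteq L_{\alpha}(\mu_{a \circ \theta}) \tag{14} \]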
Hereafter, when searching for supports that intersect with the α-cut of the query predicate, we call this a nonempty intersection query. When looking for cores that are a subset of the query predicate, we call this a subset query. Queries using this principle are supported by indexing the cores and supports of the fuzzy sets, respectively. In the literature we find two different approaches. The first approach assumes that the cores and supports may contain an infinite number of elements from a (continuous) domain. However, we have to be able to describe the cores and supports by closed intervals (Bosc, 1989). The second approach assumes that the cores and supports contain a finite number of elements from a (discrete) domain. An advantage here is that we are not restricted to intervals. In the following, we are going to discuss index structures capable of supporting the interval-based approach and then continue with those for discrete values.
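For the discrete case, the two filter conditions (11) and (14) can be sketched with ordinary Python sets. The representation of a fuzzy set as a dictionary from values to membership degrees, the sample objects, and all function names are assumptions made for this illustration.

```python
def alpha_cut(mu, alpha, strict=False):
    """Alpha-cut (8) or strict alpha-cut (9) of a fuzzy set given as {value: degree}."""
    if strict:
        return {w for w, m in mu.items() if m > alpha}
    return {w for w, m in mu.items() if m >= alpha}


def support(mu):
    return alpha_cut(mu, 0.0, strict=True)


def core(mu):
    return alpha_cut(mu, 1.0)


def nonempty_intersection_query(objects, mu_query, alpha):
    """Candidates for possibly satisfying the query, following (11)."""
    q = alpha_cut(mu_query, alpha)  # computed once, before the scan
    return [oid for oid, pi in objects.items() if support(pi) & q]


def subset_query(objects, mu_query, alpha):
    """Candidates for necessarily satisfying the query, following (14)."""
    q = alpha_cut(mu_query, alpha)
    return [oid for oid, pi in objects.items() if core(pi) <= q]


# Hypothetical fuzzy values of the attribute Position
objects = {
    "o1": {"Research Assistant": 1.0, "Assistant Professor": 1.0,
           "Associate Professor": 0.5},
    "o2": {"Full Professor": 1.0, "Associate Professor": 0.3},
    "o3": {"Assistant Professor": 1.0},
}
mu_query = {"Assistant Professor": 1.0, "Associate Professor": 0.8}

possible = nonempty_intersection_query(objects, mu_query, 0.7)
necessary = subset_query(objects, mu_query, 0.7)
```

Note that o2 passes the nonempty intersection filter although its possibility degree is only min(0.8, 0.3) = 0.3; supports and cores act as filters, so the exact measures still have to be evaluated for the remaining candidates.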
Relational Interval Trees (RI-trees)
Figure 9. An interval tree (backbone nodes 1 to 7 with root 4; node 2 holds (1,3); node 3 holds (3,3); node 4 holds (2,5) and (3,7); node 6 holds (5,6) and (5,7); each node v keeps its intervals in the two lists Lv and Uv)
Before introducing the RI-tree, we will present a brief introduction of the underlying principle, the interval tree by Edelsbrunner (Preparata, 1993). The backbone of an interval tree is a balanced binary tree on the domain from which the endpoints of the intervals are taken (see Figure 9 for dividing up the domain). When inserting an interval i = (li, ui) with lower bound li and upper bound ui, we attach it to the highest node v in the tree for which li ≤ v ≤ ui. The intervals associated with a node are stored in two lists: Lv and Uv. In Lv all intervals are sorted in ascending order by li, and in Uv they are sorted in descending order by ui. [See also Figure 9 for an example after inserting the intervals (1,3), (2,5), (3,3), (3,7), (5,6), (5,7) into an interval tree.] The sorting of the two lists accelerates the query evaluation, as we will see in the following. When querying with an interval (or a point, which can be seen as an interval with li = ui), we traverse down from the root to a subset of leaves. Let λ and µ be the
lower and upper bounds of our query interval q, respectively. While descending down the tree, we have to distinguish three different cases. Figure 10 (taken from Kriegel, 2000) illustrates this for intersection queries. When v < λ, we have to check Uv for possibly intersecting intervals. As soon as we fail to find intersecting intervals, we can stop, as Uv is sorted by ui. We then continue by following the reference to the right child. When µ < v, we have to check Lv for possible query answers and continue down the left child of v. In case of λ ≤ v ≤ µ, we output all intervals in Lv (or Uv) and visit both children. Searching for subintervals of q (in the case of subset queries) is not hard to do either. We just have to look at the nodes for which λ ≤ v ≤ µ and search for candidates in Lv and Uv. We can utilize the ordering of the lists by searching them from back to front.
For the relational interval tree, the backbone is not actually materialized, as it has a regular structure. We create three different relations: i(v, li, ui) for the intervals, l(v, li) for the lists Lv, and u(v, ui) for the lists Uv (each with an appropriate index). Querying is done by computing the numbers of the visited nodes and submitting the corresponding range queries to the list relations l and u.
Figure 10. Querying on an interval tree
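Because the backbone is regular, the node to which an interval is attached can be computed arithmetically. The following sketch does this for the domain 1 to 7 of Figure 9 and evaluates an intersection query with the three cases described above. It is a simplified in-memory illustration: each node keeps one unsorted list instead of the sorted lists Lv and Uv, no relations are used, and the function names are hypothetical.

```python
def fork_node(lower, upper, root=4):
    """Highest backbone node v with lower <= v <= upper (integer domain 1..2*root-1)."""
    node, step = root, root // 2
    while not (lower <= node <= upper):
        node += step if node < lower else -step
        step //= 2
    return node


def intersection_query(tree, lam, mu, root=4):
    """All stored intervals that intersect the query interval [lam, mu]."""
    hits, stack = [], [(root, root // 2)]
    while stack:
        node, step = stack.pop()
        intervals = tree.get(node, [])
        if node < lam:
            # Intervals here start left of the query: compare upper bounds (list Uv).
            hits += [i for i in intervals if i[1] >= lam]
            children = [node + step]
        elif mu < node:
            # Intervals here end right of the query: compare lower bounds (list Lv).
            hits += [i for i in intervals if i[0] <= mu]
            children = [node - step]
        else:
            # The node lies inside the query, so all its intervals intersect.
            hits += intervals
            children = [node - step, node + step]
        if step:  # step == 0 means that we have reached a leaf
            stack += [(child, step // 2) for child in children]
    return sorted(hits)


tree = {}
for interval in [(1, 3), (2, 5), (3, 3), (3, 7), (5, 6), (5, 7)]:
    tree.setdefault(fork_node(*interval), []).append(interval)

answers = intersection_query(tree, 4, 5)
```

The resulting assignment of intervals to nodes agrees with Figure 9. In an RI-tree, the list comprehensions would be replaced by range queries on the relations l and u.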
G-trees
G-trees are a combination of grid files with B-trees using a clever partition numbering. In order to describe the index structure in an understandable way, we restrict ourselves to the two-dimensional case (which is also the case we need to index fuzzy data). Assume that each partition can hold no more than two data objects. We start with an initial partitioning as depicted in Figure 11(a) (taken from Kumar, 1994), where we split along the first dimension and number the partitions using the binary strings 0 and 1. After inserting some more objects, we have to split the partition 0 [see Figure 11(b)]. We do so along the second dimension, numbering the newly created partitions 00 and 01. As more overflows occur, we alternate between the two dimensions and number the partitions accordingly [see Figures 11 (c) and (d)]. This regular numbering scheme has several advantages, e.g., finding parent and child partitions is straightforward, as is finding complements of a partition (for details see Kumar, 1994).
Figure 11. Partitioning scheme in a G-tree: (a) partitions 0 and 1; (b) partitions 00, 01, and 1; (c) partitions 00, 010, 011, and 1; (d) partitions 00, 010, 0110, 0111, and 1
The partitions are indexed using a B-tree-like structure called G-tree. First, the binary numbers are converted to decimals in the following way. All binary numbers are brought to the same length by padding them with trailing 0s. In our example in Figure 11, we would have 0 (0000), 4 (0100), 6 (0110), 7 (0111), and 8 (1000). These numbers are inserted into the G-tree like into a B-tree. When searching the tree, we have to compute the relevant partition numbers and then look them up. When inserting and deleting objects, we have to adjust the partitioning scheme accordingly (for details see Kumar, 1994). Liu et al. adapted G-trees for fuzzy data by mapping fuzzy queries onto range searches (Liu, 1996). The intervals of supports (and cores) of fuzzy sets are mapped to two-dimensional space by considering the lower bound of the interval as x-value and the upper bound as y-value. Possible candidates for nonempty intersection queries are found by retrieving objects for which 0 ≤ x ≤ µ and λ ≤ y ≤ ∞. For subset queries, we need to check λ ≤ x ≤ ∞ and 0 ≤ y ≤ µ.
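The numbering of the partitions and the mapping of intervals to two-dimensional points can be illustrated with a short sketch. The linear scans below merely stand in for the range searches that a G-tree would perform; the sample intervals and all function names are assumptions.

```python
def partition_number(bits, width):
    """Pad a binary partition string with trailing 0s and read it as a number."""
    return int(bits.ljust(width, "0"), 2)


partitions = ["00", "010", "0110", "0111", "1"]  # Figure 11(d)
width = max(len(p) for p in partitions)
numbers = [partition_number(p, width) for p in partitions]


def intersection_candidates(points, lam, mu):
    """Intervals (x, y) that intersect [lam, mu]: 0 <= x <= mu and lam <= y."""
    return [oid for oid, (x, y) in points.items() if x <= mu and y >= lam]


def subset_candidates(points, lam, mu):
    """Intervals (x, y) contained in [lam, mu]: lam <= x and y <= mu."""
    return [oid for oid, (x, y) in points.items() if lam <= x and y <= mu]


# Hypothetical supports of fuzzy ages, stored as points (lower bound, upper bound)
points = {"o1": (20, 30), "o2": (35, 50), "o3": (24, 26)}

intersecting = intersection_candidates(points, 22, 32)
contained = subset_candidates(points, 22, 32)
```

The computed partition numbers are the keys that are inserted into the B-tree-like structure.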
General Two-Dimensional Indexes
The principle used in Liu (1996) for G-trees can also be applied to other two-dimensional index structures, like the aforementioned grid files and k-d trees. For example, an approach using a multilevel grid file was introduced by Yazici and Cibiceli (1999).
Signatures
We will now turn to index structures assuming a finite set of discrete values in the cores and supports of the fuzzy sets to be indexed. First, we give a brief review of the superimposed coding technique, and then we will discuss index structures built around this method. Superimposed coding is based on the idea of hashing values into random k-bit codes in a b-bit field and superimposing the codes for each value in a signature (Knuth, 1973). The fixed size b is called the signature length. We use signatures to represent the α-cuts of µa°θ and the supports and cores of the indexed fuzzy sets. There are two advantages to signatures. One is their constant length; keys of constant length are easier to manage than keys of variable length. The other advantage is the great speed with which signatures can be compared by using only bit operations.
Example 1: An example for encoding the core of the fuzzy set “position” from Figure 2 in an 8-bit signature with k = 2 is:
value                  bitcode
Research Assistant     1001 0000
Assistant Professor    0001 0010
Signature              1001 0010
We cannot assume that the signatures of distinct sets are distinct. Still,
\[ s \;\theta\; t \;\Rightarrow\; \mathrm{sig}(s) \;\theta\; \mathrm{sig}(t) \quad \text{for } \theta \in \{\subseteq, \cap\} \tag{15} \]
where s and t are arbitrary sets, s ∩ t is short for a nonempty intersection of s and t, and sig(s) θ sig(t) and |sig(s)| are defined as
\[ \mathrm{sig}(s) \subseteq \mathrm{sig}(t) \;:=\; \mathrm{sig}(s) \,\&\, \neg\,\mathrm{sig}(t) = 0 \]
\[ \mathrm{sig}(s) \cap \mathrm{sig}(t) \;:=\; |\mathrm{sig}(s) \,\&\, \mathrm{sig}(t)| \ge k \]
\[ |\mathrm{sig}(s)| \;:=\; \text{number of bits set in } \mathrm{sig}(s), \text{ also called the weight of } \mathrm{sig}(s) \]
with & denoting bitwise and and ¬ denoting bitwise complement. Hence, a pretest based on signatures can be fast because it involves only bit operations. Now, instead of comparing Lα(µa°θ) to the support or core of each πA(oi), we first compare the signature of Lα(µa°θ) to the signature of each support or core. During the evaluation of a query, if sig(L>0(πA(oi))) ∩ sig(Lα(µa°θ)) or sig(L1(πA(oi))) ⊆ sig(Lα(µa°θ)) holds, we call oi a drop. Additionally, if Π(a°θ | A(oi)) > α or N(a°θ | A(oi)) > α also holds, we have a right drop; otherwise, oi is a false drop. After determining all data items that are drops, we have to filter out the false drops. [The probabilities that data items turn out to be false drops have been studied intensively in Ishikawa (1993). We will not go into detail here.]
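A minimal sketch of superimposed coding and of the two signature tests might look as follows. The signature length of 64 bits, the use of SHA-256 to derive the k bit positions, and the function names are assumptions chosen for illustration; they are not prescribed by the cited work.

```python
import hashlib

B, K = 64, 2  # assumed signature length b and number of bits k set per value


def code(value):
    """Hash one value to a b-bit code with exactly k bits set."""
    positions, counter = set(), 0
    while len(positions) < K:
        digest = hashlib.sha256(f"{value}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % B)
        counter += 1
    return sum(1 << p for p in positions)


def sig(values):
    """Superimpose (bitwise or) the codes of all values of a set."""
    signature = 0
    for value in values:
        signature |= code(value)
    return signature


def weight(signature):
    return bin(signature).count("1")


def sig_subset(s, t):
    return (s & ~t) == 0


def sig_intersect(s, t):
    return weight(s & t) >= K


# The bit codes of Example 1, superimposed by hand
example = 0b10010000 | 0b00010010

core_sig = sig({"Research Assistant", "Assistant Professor"})
query_sig = sig({"Research Assistant", "Assistant Professor", "Associate Professor"})
```

The tests only guarantee the direction stated in (15): a true subset or nonempty intersection is never missed, but a positive signature test may still turn out to be a false drop.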
SSF/Compressed SSF
A sequential signature file (SSF) (Ishikawa, 1993) is a simple index structure. It consists of a sequence of pairs of signatures (of supports or cores, depending on the supported query type) and references to data items. During retrieval, the SSF is scanned and all data items oi with matching signatures are fetched and tested for false drops. Boss and Helmer (1999) showed that SSF and its compressed counterpart, compressed signature file (CSF), can be used to index fuzzy sets, and that this approach is faster than scanning all fuzzy sets. In the following section we will discuss how the usual ways of structuring indexes, namely, hierarchical organization and partitioning, are applied to signatures.
Hierarchical Signature Organization
Figure 12. A signature tree (ST) (leaf nodes hold pairs [sig(oi.A), ref(oi)] for the objects o1 to o9, three per leaf; inner node entries hold the superimposed signatures of a child node, e.g., sig(o1.A) | sig(o2.A) | sig(o3.A), where “|” denotes bitwise or)
A signature tree (ST) is a hierarchical version of a signature file. The internal structure of STs is similar to that of R-trees (Guttman, 1984). The leaf nodes of an ST (Deppisch, 1986; Hellerstein, 1994) contain pairs of signatures and references. So we find the same information in the leaf nodes of an ST as in an SSF. We can construct a single signature representing a leaf node by superimposing all signatures found in the leaf node (with a bitwise or-operation, denoted by “|”). This corresponds to uniting the sets in the leaf nodes. We call a union of sets from lower levels in the tree a bounding set. An inner node contains signatures of and references to each child node (Figure 12). Here, sig(Lx(πA(oi))) denotes the signature of the support (x = “>0”) or core (x = “1”). When we evaluate a query, we begin by searching the root for matching signatures. We recursively access all child nodes with signatures that match and work our way down to the leaf nodes. In inner nodes, a signature matches if sig(inner node) ∩ sig(Lα(µa°θ)). In leaf nodes, we check sig(L>0(πA(oi))) ∩ sig(Lα(µa°θ)) or sig(L1(πA(oi))) ⊆ sig(Lα(µa°θ)), depending on the query type.
An alternative for evaluating nonempty intersection queries is searching for all supersets of the singletons in Lα(µa°θ) and then forming the union of all retrieved answer sets. At first glance, this looks inefficient, as we have to start a subquery for each value in Lα(µa°θ). However, the performance of signature-based access methods for superset queries is significantly better than for nonempty intersection queries because the false drop rate can be kept much lower for superset queries.
Figure 13. Extendible signature hashing (ESH) (directory with global depth d = 3 and entries 000 to 111; buckets with local depths d' = 2 (h2(x) = 00), d' = 3 (h3(x) = 010), d' = 3 (h3(x) = 011), and d' = 1 (h1(x) = 1), each holding pairs [sig(oi.A), ref(oi)])
Partitioned Signature Organization
An extendible signature hashing index (ESH) (Helmer, 2003) is based on extendible hashing. As already mentioned, it is divided into two parts: a directory and buckets. In the buckets we store the signature/reference pairs of all data items (see Figure 13). We determine the bucket into which a signature/reference pair is inserted by looking at a prefix of d bits of a signature (where d is the global depth of the hash table). In order to find all subsets of Lα(µa°θ), we determine all buckets to be fetched. We do this by generating all subsets of sig(Lα(µa°θ)). Then we access the corresponding buckets sequentially (by ascending page number), taking care not to access a bucket more than once. Afterwards we check the full signatures and eliminate the false drops.
ESH has a disadvantage compared to the other signature-based index structures: we cannot evaluate nonempty intersection queries directly with this kind of index. This is due to the fact that we store partial signatures in the directory of ESH. Let part_d(sig(s)) denote the first d bits of the signature of set s. Then,
\[ \mathrm{sig}(s) \subseteq \mathrm{sig}(t) \;\Rightarrow\; \mathrm{part}_d(\mathrm{sig}(s)) \subseteq \mathrm{part}_d(\mathrm{sig}(t)) \]
but
\[ \mathrm{sig}(s) \cap \mathrm{sig}(t) \;\not\Rightarrow\; \mathrm{part}_d(\mathrm{sig}(s)) \cap \mathrm{part}_d(\mathrm{sig}(t)). \]
This means that we cannot deduce nonempty intersection by looking at partial signatures. We can, however, evaluate this kind of query similarly to the alternative technique used for ST by searching for all supersets of the singletons in Lα(µa°θ).
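Which directory entries have to be inspected for a subset query can be sketched by enumerating all bit patterns that are contained in the d-bit prefix of the query signature. The sketch assumes that the directory is addressed by the first d bits of a signature; the function name and the example values are hypothetical.

```python
def prefix_subsets(signature, length, depth):
    """Yield every bit pattern that is a subset of the first `depth` bits
    of a `length`-bit signature."""
    mask = signature >> (length - depth)
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask  # next smaller pattern contained in mask


# Query signature 1001 0010 (Example 1) with global depth d = 3: prefix 100
entries = sorted(prefix_subsets(0b10010010, 8, 3))
```

For the prefix 100, only the directory entries 000 and 100 have to be followed; the buckets they point to are then fetched once each, and the full signatures are compared.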
Inverted Files
Figure 14. Inverted file (directory with the values v1 to vn; each value vj has a list of entries of the form [ref(oijk), |oijk.A|])
An inverted file (see Figure 14) consists of a directory containing all distinct values in the domain Ω, and a list for each value consisting of the references to data items whose support or core of πA(oi) contains this value. For an overview on traditional inverted files, see Kitagawa (1996) and Sacks-Davis (1997). As done frequently, we can hold the search values of the directory in a B+-tree. Moreover, the lists are modified by storing the cardinality of the cores with each data item reference (denoted by |oixy.A| in Figure 14). This enables us to answer subset
queries efficiently by using the cardinalities as a quick pretest. The lists can also be compressed using, for example, lightweight compression techniques (Westmann, 2000). When evaluating a nonempty intersection query, we simply fetch the lists for all items in Lα(µa°θ) and form the union of the retrieved data items. When evaluating a subset query, we traverse all lists associated with the values in Lα(µa°θ). We count the number of occurrences for each reference appearing in a retrieved list. When the counter for a reference is not equal to the cardinality of its core, we eliminate that reference. We can do this because the reference then also appears in lists associated with values that are not in Lα(µa°θ), so the referenced core cannot be a subset of Lα(µa°θ). In cases of subset and nonempty intersection queries, we have to check whether the retrieved data items satisfy the query possibly (or necessarily) as the supports (and cores) serve only as filters.
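The counting technique for subset queries can be sketched with Python dictionaries standing in for the directory and the lists. For brevity, a single inverted file on the cores is used for both query types here, whereas nonempty intersection queries would normally run on an inverted file built from the supports; the sample data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build(cores):
    """Build the lists: value -> [(reference, cardinality of the core), ...]."""
    lists = defaultdict(list)
    for oid, core in cores.items():
        for value in core:
            lists[value].append((oid, len(core)))
    return lists


def nonempty_intersection_query(lists, query):
    """Union of the lists of all query values."""
    return {oid for value in query for oid, _ in lists.get(value, [])}


def subset_query(lists, query):
    """Keep a reference only if it was found once for every value of its core."""
    counts, cardinality = Counter(), {}
    for value in query:
        for oid, size in lists.get(value, []):
            counts[oid] += 1
            cardinality[oid] = size
    return {oid for oid, count in counts.items() if count == cardinality[oid]}


# Hypothetical cores of the attribute Position
cores = {
    "o1": {"Research Assistant", "Assistant Professor"},
    "o2": {"Assistant Professor"},
    "o3": {"Full Professor"},
}
lists = build(cores)
query = {"Assistant Professor", "Associate Professor"}  # stands for the alpha-cut

intersecting = nonempty_intersection_query(lists, query)
contained = subset_query(lists, query)
```

The core of o1 is found only once although it contains two values, so o1 is eliminated from the answer of the subset query.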
Paths
In this section, we investigate index structures for indexing paths in object-oriented DBMSs and show how they can be adapted to fuzzy object-oriented DBMSs. We are going to look at two index structures in particular: access support relations (ASRs) (Kemper, 1992) and join index hierarchies (Han, 1999) and their respective adaptations to fuzzy DBMSs.
Access Support Relations (ASRs)
ASRs relate objects to each other and may span over reference chains. These chains may even include collection-valued components, and, depending on the applications, several different variants of ASRs can be used. We start by describing binary ASRs, which encode paths of length one. Figure 15 shows the binary ASRs for the example in Figure 4. We dropped the objects d2, s3, s4, and p5 to keep the example manageable. We added several new objects to show how objects not taking part in all relationships are handled. Another important change is the inclusion of the uncertainty degrees to support fuzzy queries, which are not included in the original ASRs. When merging binary ASRs together (in order to support longer paths), we distinguish between four different extensions: canonical, left-complete, right-complete, and full. Let us illustrate the properties of these different extensions by means of our example.
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Helmer
Figure 15. Binary ASRs

Department.employs
d1   s1   1.0
d1   s2   0.8
d3   s8   0.2

Staff.worksIn
s1   p1   0.9
s1   p2   1.0
s2   p2   0.8
s2   p3   0.4
s2   p4   0.6
s6   p6   0.7
s7   p7   0.4

Project.name
p1   “Natix”
p2   “Timber”
p3   “Tamino”
p4   “Rainbow”
p6   “Galax”
p8   “IPSI-XQ”
Figure 16. Canonical extension
ASRcan: Department.employs.worksIn.name
d1   1.0   s1   0.9   p1   “Natix”
d1   1.0   s1   1.0   p2   “Timber”
d1   0.8   s2   0.8   p2   “Timber”
d1   0.8   s2   0.4   p3   “Tamino”
d1   0.8   s2   0.6   p4   “Rainbow”
Canonical extensions contain only information on complete paths, i.e., paths that start at department objects and end at the names of projects (Figure 16). Left-complete extensions include all paths starting at department objects but not necessarily ending at projects (Figure 17). Similarly, right-complete extensions contain paths that end at names of projects but do not necessarily go all the way back to department objects (Figure 18). Full extensions also comprise all partial paths (Figure 19). Usually we do not materialize all extensions but a mix of different extensions and decompositions. A decomposition of an ASR is a projection on relevant (consecutive) attributes of an extension. Those access relations that are materialized are indexed using B+-trees, speeding up navigational accesses considerably. For details on how to optimize ASRs for specific applications, see Kemper (1989).
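To make the four extensions concrete, the following sketch derives them from the binary ASRs of Figure 15. It rests on an assumption of ours, not on the original ASR definition: the full extension is built as a chain of full outer joins, and the other three extensions are obtained by filtering it on the first and last columns. The function names are hypothetical, and `None` stands for a missing path component.

```python
# Binary ASRs from Figure 15: (source, target, uncertainty degree).
EMPLOYS = [("d1", "s1", 1.0), ("d1", "s2", 0.8), ("d3", "s8", 0.2)]
WORKS_IN = [("s1", "p1", 0.9), ("s1", "p2", 1.0), ("s2", "p2", 0.8),
            ("s2", "p3", 0.4), ("s2", "p4", 0.6), ("s6", "p6", 0.7),
            ("s7", "p7", 0.4)]
NAMES = [("p1", "Natix"), ("p2", "Timber"), ("p3", "Tamino"),
         ("p4", "Rainbow"), ("p6", "Galax"), ("p8", "IPSI-XQ")]


def full_extension(employs, works_in, names):
    """Full outer join of the three binary ASRs; None marks a missing component."""
    rows = []
    for d, s, deg_ds in employs:
        links = [(p, deg) for s2, p, deg in works_in if s2 == s]
        if links:
            rows += [(d, deg_ds, s, deg_sp, p) for p, deg_sp in links]
        else:
            rows.append((d, deg_ds, s, None, None))
    employed = {s for _, s, _ in employs}
    rows += [(None, None, s, deg, p) for s, p, deg in works_in if s not in employed]
    name_of = dict(names)
    full = [row + (name_of.get(row[4]),) for row in rows]
    reached = {row[4] for row in rows}
    full += [(None, None, None, None, p, n) for p, n in names if p not in reached]
    return full


def left_complete(full):
    return [row for row in full if row[0] is not None]    # starts at a department


def right_complete(full):
    return [row for row in full if row[-1] is not None]   # ends at a project name


def canonical(full):
    return [row for row in left_complete(full) if row[-1] is not None]


FULL = full_extension(EMPLOYS, WORKS_IN, NAMES)
```

On this data, the sketch yields five canonical, six left-complete, seven right-complete, and nine full tuples.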
Index Structure for Fuzzy Object-Oriented Database Systems 229
Figure 17. Left-complete extension
ASRleft: Department.employs.worksIn.name
d1   1.0   s1   0.9   p1   “Natix”
d1   1.0   s1   1.0   p2   “Timber”
d1   0.8   s2   0.8   p2   “Timber”
d1   0.8   s2   0.4   p3   “Tamino”
d1   0.8   s2   0.6   p4   “Rainbow”
d3   0.2   s8   -     -    -
Figure 18. Right-complete extension
ASRright: Department.employs.worksIn.name
d1   1.0   s1   0.9   p1   “Natix”
d1   1.0   s1   1.0   p2   “Timber”
d1   0.8   s2   0.8   p2   “Timber”
d1   0.8   s2   0.4   p3   “Tamino”
d1   0.8   s2   0.6   p4   “Rainbow”
-    -     s6   0.7   p6   “Galax”
-    -     -    -     p8   “IPSI-XQ”
Join Index Hierarchies (JIHs) One problem with ASRs can be their sheer size for long paths, even when they are decomposed. Usually, only the endpoints of paths in an ASR are indexed, which makes updates on links in between costly (as we have to scan the whole relation).
Figure 19. Full extension
ASRfull: Department.employs.worksIn.name
d1   1.0   s1   0.9   p1   “Natix”
d1   1.0   s1   1.0   p2   “Timber”
d1   0.8   s2   0.8   p2   “Timber”
d1   0.8   s2   0.4   p3   “Tamino”
d1   0.8   s2   0.6   p4   “Rainbow”
d3   0.2   s8   -     -    -
-    -     s6   0.7   p6   “Galax”
-    -     -    -     p8   “IPSI-XQ”
-    -     s7   0.4   p7   -
Figure 20. Example for a join index hierarchy
JIH: Department Project
d1   p1   “Natix”
d1   p2   “Timber”
d1   p3   “Tamino”
d1   p4   “Rainbow”
JIHs generalize the decomposition principle by allowing the omission of intermediate objects in a path (Han, 1999). For example, we could have an index that jumps from department objects right to projects, skipping staff objects (see Figure 20). A complete JIH schema for our example can be seen in Figure 21(a) (d = Department, s = Staff, p = Project, n = Name). The lower part is the base JIH, which consists of all binary relationships. Due to space constraints, usually only part of a full JIH is materialized [see Figure 21(b)]. However, two difficulties have to be overcome. We have to guarantee the correctness of updates on intermediate links in paths and have to find a way to handle the intermediate uncertainty degrees.
Figure 21. Full JIH (a) vs. partial JIH (b)
Updates Look at part of the schema instantiation in Figure 22. Clearly, there are two paths from d1 to p2. When deleting one of them (e.g., d1 → s2, because s2 starts working at another department), we have to decide what to do with our relationship d1 → p2 in Figure 20. By looking at the JIH in Figure 20 alone, we cannot decide whether d1 → p2 should be deleted or not. Han et al. (1999) solved this problem by counting the number of links between each pair of objects. For the base JIH, this is trivial. For our example in Figure 22, we would store the following four tuples in the appropriate base JIH relations: (d1, s1, 1), (d1, s2, 1), (s1, p2, 1), and (s2, p2, 1). This is also done for the relations on higher levels, e.g., in the tuple (d1, p2, 2). When deleting a link in the base JIH, we propagate the change to the higher levels. In this case, we would subtract one from the counter for d1 → p2 and would know that d1 and p2 are still connected.
Figure 22. Multiple paths
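The counting scheme for updates can be sketched as follows. This is a minimal illustration, not the implementation of Han et al.: the class name and method names are hypothetical, only one derived level (Department to Project) is maintained, and no check is made that a deleted link actually exists.

```python
from collections import Counter


class CountingJIH:
    """Base relations plus one derived level, each tuple carrying a path counter."""

    def __init__(self):
        self.dept_staff = Counter()   # base JIH: (d, s) -> number of links
        self.staff_proj = Counter()   # base JIH: (s, p) -> number of links
        self.dept_proj = Counter()    # higher level: (d, p) -> number of paths

    def link_dept_staff(self, d, s, delta=1):
        """Insert (delta=1) or delete (delta=-1) a link d -> s and propagate."""
        self.dept_staff[(d, s)] += delta
        for (s2, p), n in self.staff_proj.items():
            if s2 == s:
                self.dept_proj[(d, p)] += delta * n

    def link_staff_proj(self, s, p, delta=1):
        """Insert (delta=1) or delete (delta=-1) a link s -> p and propagate."""
        self.staff_proj[(s, p)] += delta
        for (d, s2), n in self.dept_staff.items():
            if s2 == s:
                self.dept_proj[(d, p)] += delta * n

    def connected(self, d, p):
        return self.dept_proj[(d, p)] > 0


# The instantiation of Figure 22: two paths from d1 to p2.
jih = CountingJIH()
jih.link_dept_staff("d1", "s1")
jih.link_dept_staff("d1", "s2")
jih.link_staff_proj("s1", "p2")
jih.link_staff_proj("s2", "p2")
```

Deleting d1 → s2 with `jih.link_dept_staff("d1", "s2", delta=-1)` lowers the counter for (d1, p2) from 2 to 1, so the derived tuple is kept; only when the counter reaches zero are d1 and p2 no longer connected.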
Uncertainty Degrees The question remains how we handle the uncertainty degrees of the intermediate links we cut away on the higher levels of a JIH. (It is no problem to store them in the base JIH.) One possible solution is to allocate space in the levels above the base JIH in which to store all the intermediate uncertainty degrees in lists. However, we expect this to bloat the index significantly. A more elegant solution can be found if we are interested in an overall uncertainty degree of all paths. Assume that the function used to compute this overall uncertainty degree is reversible, like multiplying the degrees along each path and averaging over all paths. For example, in Figure 22, we would store the sum of the products of the uncertainty degrees (1.0 × 1.0 + 0.8 × 0.8 = 1.64) and the number of paths in the tuple (d1, p2, 2, 1.64). When deleting a path, the sum is reduced by the appropriate value, and the counter is decremented by one.
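The reversible aggregate can be sketched as a pair (number of paths, sum of products) attached to each higher-level tuple. The class below is a hypothetical illustration assuming the product/average function from the example; any other reversible function would need its own add and remove steps.

```python
import math


class PathAggregate:
    """Per-tuple summary: number of paths and sum of the per-path degree products."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add_path(self, degrees):
        self.count += 1
        self.total += math.prod(degrees)

    def remove_path(self, degrees):
        if self.count == 0:
            raise ValueError("no path left to remove")
        self.count -= 1
        self.total -= math.prod(degrees)

    def overall(self):
        """Average of the path products, or None if no path exists."""
        return self.total / self.count if self.count else None


# Tuple (d1, p2, 2, 1.64) from the example in Figure 22.
d1_p2 = PathAggregate()
d1_p2.add_path([1.0, 1.0])   # d1 -> s1 -> p2
d1_p2.add_path([0.8, 0.8])   # d1 -> s2 -> p2
```

Here the overall degree is 1.64 / 2 = 0.82. One caveat of this design is that repeated additions and subtractions on floating-point sums can accumulate rounding errors, so a real index would probably recompute the sum from the base JIH from time to time.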
Type Hierarchies In this section, we briefly present the conventional techniques used for type hierarchy indexing in non-FOODBS systems. In a second step, we show how to combine and extend these methods for FOODBS systems. One difficulty in indexing type hierarchies is that we can group the objects either by type or by key values. Each approach has its advantages and disadvantages, as we will see.
SC-trees An SC-tree (Kim, 1989) is straightforward. Basically, we build a separate B+-tree for each type. When querying a subhierarchy of our example in Figure 5, we determine all types included in the subhierarchy and evaluate a query on each corresponding B+-tree. When interested in all academics, we have to query the B+-trees for the classes Academic, Teaching, and Research.
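A rough sketch of the SC-tree idea follows. Several things here are assumptions made for illustration: a sorted list searched with `bisect` stands in for each B+-tree, the exact shape of the type hierarchy below Staff is guessed, and all object identifiers are made up.

```python
import bisect

# Assumed shape of the staff hierarchy (supertype -> direct subtypes).
SUBTYPES = {
    "Staff": ["Academic", "Administrative"],
    "Academic": ["Teaching", "Research"],
    "Administrative": ["Technical", "Nontechnical"],
}


def subhierarchy(root):
    """Return the root type together with all of its direct and indirect subtypes."""
    types, stack = [], [root]
    while stack:
        t = stack.pop()
        types.append(t)
        stack.extend(SUBTYPES.get(t, []))
    return types


class SCTree:
    """One sorted (key, oid) list per type, standing in for one B+-tree per type."""

    def __init__(self):
        self.trees = {}

    def insert(self, type_name, key, oid):
        bisect.insort(self.trees.setdefault(type_name, []), (key, oid))

    def range_query(self, root_type, low, high):
        hits = []
        for t in subhierarchy(root_type):          # one full search per type
            entries = self.trees.get(t, [])
            i = bisect.bisect_left(entries, (low,))
            while i < len(entries) and entries[i][0] <= high:
                hits.append(entries[i][1])
                i += 1
        return hits


ages = SCTree()
ages.insert("Academic", 28, "s1")
ages.insert("Research", 29, "s2")
ages.insert("Teaching", 55, "s3")
ages.insert("Technical", 30, "s4")
```

A query for all academics aged 25 to 50 searches three separate structures here, which is exactly the cost that the nested and combined indexes discussed next try to avoid.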
H-trees While an SC-tree maintains a set of isolated structures for each type, an H-tree (Low, 1992) nests these B+-trees to avoid a full search of each component. This means that the nodes of a superclass B+-tree may contain pointers to nodes of subclass B+-trees. There are two important rules for nesting the B+-trees. First, we have to make sure that the ranges of the nesting node and the nested node are compatible, so that we do not accidentally end up in a different part of the domain when traversing pointers to a different B+-tree. Second, all leaf nodes in a subclass B+-tree have to be reachable from the corresponding superclass B+-tree. Due to space constraints, we are not going to present the details on how this is done (for further explanations see Low, 1992).
CH-index A CH-index (Kim, 1989) uses a different approach than SC- or H-trees. Here, the objects are indexed using a single B+-tree structure, and the inner pages look like the inner pages of a regular B+-tree storing the values of the indexed attributes. The leaf pages look different, however. In the leaf pages, we distinguish between the different types of objects. Figure 23 shows a simplified view of a CH-index (for details see Kim, 1989) indexing the ages of staff members (with a path from the root node to a leaf). For each value (in a leaf page), we have a list for each type for which objects exist that have this value.
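The leaf organization of a CH-index can be approximated as follows. This is a simplified sketch under our own assumptions: a sorted list of distinct keys replaces the inner B+-tree pages, a dictionary per key replaces the per-type lists in a leaf page, and the class name and object identifiers are hypothetical.

```python
import bisect


class CHIndex:
    """Single key directory; each key stores one list of object ids per type."""

    def __init__(self):
        self.keys = []    # sorted distinct key values (stand-in for inner pages)
        self.leaf = {}    # key -> {type name -> [oid, ...]}

    def insert(self, type_name, key, oid):
        if key not in self.leaf:
            bisect.insort(self.keys, key)
            self.leaf[key] = {}
        self.leaf[key].setdefault(type_name, []).append(oid)

    def range_query(self, types, low, high):
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        hits = []
        for key in self.keys[lo:hi]:
            # Every type list stored under the key is visited, wanted or not.
            for type_name, oids in self.leaf[key].items():
                if type_name in types:
                    hits.extend(oids)
        return hits


ages = CHIndex()
ages.insert("Academic", 28, "s1")
ages.insert("Academic", 29, "s2")
ages.insert("Research", 29, "s3")
ages.insert("Technical", 29, "s4")
```

A single traversal answers a query over any set of types, but the loop over all type lists of a key shows the price: entries of types that the query does not ask for are still read.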
Figure 23. Example for a CH-index
CG-trees Depending on the size of the indexed type hierarchy, we have many entries in a leaf page of a CH-index that we are not interested in during query evaluation. For example, if we want to retrieve all academics, we can ignore objects of the types Staff, Administrative, Technical, and Nontechnical. Unfortunately, pointers to these objects are contained in the leaf pages of a CH-index. In a CG-tree, we have at most one pointer per type. Figure 24 shows the two lowest levels of a (slightly simplified) CG-tree (for implementation details, see Kim, 1989). The objects belonging to the type Academic are stored on the pages P1, P2, and P3, and those for type Research are stored on pages P4 and P5. As objects of different types are probably not distributed in the same way, pages can be shared. So, for example, if pages P1 and P2 are only lightly filled, they can be merged to one page that is shared between the two entries for Academic on the level above (for details on how to balance the leaf pages, see also Kilger, 1994).
Figure 24. CG-tree
Multikey Index The basic idea in using multikey indexes is to consider the type information as just another dimension describing an object. The main problem with this approach is the partial ordering of the types. We would like to impose a total order on the types in such a way that all queries regarding subhierarchies map to contiguous range queries. Assume that we want to retrieve all academics between the ages of 25 and 50. Figure 25(a) shows an optimal way to linearize all the types of the staff hierarchy, while Figure 25(b) shows a suboptimal solution. When the objects are optimally arranged by type on disk, we can (for all subtype hierarchies) retrieve all objects belonging to a certain subtype hierarchy via one sequential scan without gaps. Mueck and Polaschek gave an algorithm that finds an optimal linearization (if one exists) (Mueck, 1996, 1997). After linearizing the type hierarchy, we can use any standard multikey index structure. However, it is not always possible to find an optimal linearization in the case of multiple inheritance. This is also important in the context of FOODBS systems, because fuzzy membership of objects in classes may lead to similar problems.
Figure 25. Linearizing type hierarchies: (a) optimal and (b) suboptimal order of the types on the vertical axis, with age (20 to 80) on the horizontal axis
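Whether a given total order is optimal in the sense used above can be checked mechanically: every subhierarchy must occupy a contiguous range of positions. The sketch below only performs this check; it is not the algorithm of Mueck and Polaschek, which constructs such an order. The hierarchy shape and the two example orders are assumptions chosen for illustration.

```python
# Assumed shape of the staff hierarchy (supertype -> direct subtypes).
SUBTYPES = {
    "Staff": ["Academic", "Administrative"],
    "Academic": ["Teaching", "Research"],
    "Administrative": ["Technical", "Nontechnical"],
}


def subhierarchy(root):
    """Return the root type together with all of its direct and indirect subtypes."""
    types, stack = set(), [root]
    while stack:
        t = stack.pop()
        types.add(t)
        stack.extend(SUBTYPES.get(t, []))
    return types


def type_range(order, root):
    """Return (first, last) positions of the subhierarchy, or None if it has gaps."""
    positions = sorted(order.index(t) for t in subhierarchy(root))
    first, last = positions[0], positions[-1]
    return (first, last) if last - first + 1 == len(positions) else None


def is_optimal(order):
    """An order is optimal if every subhierarchy maps to one contiguous range."""
    return all(type_range(order, t) is not None for t in order)


# Two hypothetical linearizations of the seven types.
ORDER_A = ["Research", "Teaching", "Academic", "Nontechnical",
           "Technical", "Administrative", "Staff"]
ORDER_B = ["Research", "Teaching", "Nontechnical", "Technical",
           "Academic", "Administrative", "Staff"]
```

With `ORDER_A`, the query for all academics between 25 and 50 becomes the rectangle formed by the type positions 0 to 2 and the ages 25 to 50; with `ORDER_B`, the same query has to be split into two ranges because Academic is separated from its subtypes.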
Indexing Fuzzy Type Hierarchies The technique used for SC-trees can be adapted to fuzzy type hierarchies in a straightforward manner by exchanging the B+-trees for other data structures. However, SC-trees are the most inefficient of all presented indexes, as we have to do a full search for each subtype. H-trees are too closely tied to the structure of B+-trees, which are not necessarily ideal data structures for FOODBS systems. Multikey indexes perform well if we are able to linearize the type hierarchy well enough. We expect that this may be difficult to do for fuzzy type hierarchies. For these reasons, we opt for adapting the methods used in CH-indexes and CG-trees to fuzzy type hierarchies. Looking at Figure 8, we see that each cell in a grid file has a pointer to a data page (which it may share with other cells). Similar to CG-trees, we propose that cells should contain a pointer for each different object type that is present. As we expect that not all cells are filled evenly, objects of different types will probably share different data pages. There are still a couple of open questions regarding the optimization of this data structure, e.g., how exactly do we determine the cell sharing for each type, and how do we balance the data pages in this two-dimensional case? Another, more general problem affecting all indexes for fuzzy type hierarchies is the fact that we store objects belonging to more than one type redundantly for each type. Adding another level of indirection will solve this particular problem but will also add inefficiency.
Future Trends Developing new index structures and improving existing ones for FOODBS systems will remain a viable research topic in the future, as many open problems still exist. Let us name a few important ones here.
Multidimensional index structures have demonstrated their usefulness for FOODBS systems through their flexibility; however, they have a general problem with high-dimensional data (the so-called “curse of dimensionality”). Fortunately, when used for indexing fuzzy data, we are at the lower end of multidimensionality. This, and the fact that fuzzy data is an application for multidimensional index structures with special needs and requirements, makes us hopeful that further improvements can be found. Efficient support of navigational accesses to objects via paths still lacks satisfactory handling of uncertainty degrees. For example, the more efficient JIHs (compared to access support relations) have problems dealing with uncertainty degrees on intermediate links that have been cut away in the index. Indexing fuzzy type hierarchies is also not yet perfect. Details on how to optimize the index structures for certain applications are missing, and we have some redundancy in storing objects belonging to more than one type.
Conclusions Efficient data retrieval is a necessity for a database system in order to be accepted by end users. The history of database systems is full of examples to prove this. Relational systems were able to replace network and hierarchical database systems only after their performance had improved considerably. Non-FOODBS systems can only be found in niche applications today, as their performance and scalability could not keep up with relational systems. One issue today is the performance of native XML database systems, which still lags behind expectations. In our opinion, the fate of each new kind of database system will be partly decided by whether or not its performance improves significantly over time. This is also true for FOODBS systems. The task of improving their performance will not be easy, because in addition to the fuzzy components, the regular object-oriented components also need to be improved. One important step in improving the efficiency of a database system is the introduction of powerful index structures. Although a promising start has been made for FOODBS systems, this research area has not yet received enough attention. Especially in the area of path accesses and fuzzy type hierarchies, there are still plenty of opportunities left for future research.
References
Bayer, R., & McCreight, E. (1972). Organization and maintenance of large ordered indexes. Acta Informatica, 1, 173–189.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.
Bertino, E., Ooi, B. C., Sacks-Davis, R., Tan, K.-L., Zobel, J., Shidlovsky, B., & Catania, B. (1997). Indexing techniques for advanced database systems. Dordrecht: Kluwer Academic Publishers.
Bordogna, G., Lucarella, D., & Pasi, G. (1994). A fuzzy object oriented data model. In Proceedings of the Third IEEE Conference on Fuzzy Systems (pp. 313–318).
Bosc, P., & Galibourg, M. (1989). Indexing principles for a fuzzy database. Information Systems, 14(6), 493–499.
Bosc, P., Galibourg, M., & Hamon, G. (1988). Fuzzy querying with SQL: Extensions and implementation aspects. Fuzzy Sets and Systems, 28, 333–349.
Boss, B., & Helmer, S. (1999). Index structures for efficiently accessing fuzzy data including cost models and measurements. Fuzzy Sets and Systems, 108(1), 11–37.
Cattell, R., Barry, D. K., Berler, M., Eastman, J., Jordan, D., Russell, C., Schadow, O., Stanienda, T., & Velez, F. (Eds.). (2000). The Object Data Standard: ODMG 3.0. San Francisco: Morgan Kaufmann.
Deppisch, U. (1986). S-tree: A dynamic balanced signature index for office retrieval. In Proceedings of the 1986 ACM Conference on Research and Development in Information Retrieval (pp. 77–87).
Durkin, J. (1994). Expert systems: Design and development. Upper Saddle River, NJ: Prentice Hall.
Evangelidis, G., Lomet, D., & Salzberg, B. (1995). The hBΠ-tree: A modified hB-tree supporting concurrency, recovery and node consolidation. In Proceedings of the 21st VLDB Conference (pp. 551–561).
Fagin, R., Nievergelt, J., Pippenger, N., & Strong, H. R. (1979). Extendible hashing – a fast access method for dynamic files. ACM Transactions on Database Systems, 4(3), 315–344.
Gaede, V., & Günther, O. (1998). Multidimensional access methods. ACM Computing Surveys, 30(2), 170–231.
George, R., Buckles, B. P., & Petry, F. E. (1992). An object-oriented data model to represent uncertainty in coupled artificial intelligence-database systems. In M. P. Papazoglou, & J. Zeleznikow (Eds.), The next generation of information systems: From data to knowledge – A selection of papers presented at two IJCAI-91 workshops, Sydney, Australia, August 26, 1991 (Vol. 611 of Lecture Notes in Computer Science, pp. 37–48). Berlin: Springer.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD (pp. 47–57).
Han, J., Xie, Z., & Fu, Y. (1999). Join index hierarchy: An indexing structure for efficient navigation in object-oriented databases. ACM Transactions on Knowledge and Data Engineering, 11(2), 321–337.
Hellerstein, J. M., & Pfeffer, A. (1994). The RD-tree: An index structure for sets. Technical Report 1252. Madison: University of Wisconsin.
Helmer, S. (2001). Indexing fuzzy data. In Proceedings of the Joint Ninth IFSA World Congress and 20th NAFIPS International Conference (pp. 2120–2125).
Helmer, S., & Moerkotte, G. (2003). A performance study of four index structures for set-valued attributes of low cardinality. VLDB Journal, 12(3), 244–261.
Ishikawa, Y., Kitagawa, H., & Ohbo, N. (1993). Evaluation of signature files as set access facilities in OODBs. In Proceedings of the 1993 ACM SIGMOD (pp. 247–256).
Kemper, A., & Moerkotte, G. (1989). Access support in object bases. Technical Report 17/89. Karlsruhe: University of Karlsruhe.
Kemper, A., & Moerkotte, G. (1992). Access support relations: An indexing method for object bases. Information Systems, 17(2), 117–146.
Kilger, C., & Moerkotte, G. (1994). Indexing multiple sets. In Proceedings of 20th International Conference on Very Large Data Bases (pp. 180–191).
Kim, W., Kim, K.-C., & Dale, A. (1989). Indexing techniques for object-oriented databases. In W. Kim, & F. H. Lochovsky (Eds.), Object-oriented concepts, databases, and applications (pp. 371–394). Reading, MA: Addison-Wesley.
Kitagawa, H., & Fukushima, K. (1996). Composite bit-sliced signature file: An efficient access method for set-valued object retrieval. In Proceedings of the International Symposium on Co-operative Database Systems for Advanced Applications (CODAS) (pp. 388–395).
Knuth, D. E. (1973). The art of computer programming (Vol. 3): Sorting and searching. Reading, MA: Addison Wesley.
Kriegel, H.-P., Pötke, M., & Seidl, T. (2000). Managing intervals efficiently in object-relational databases. In Proceedings of the 26th VLDB Conference (pp. 407–418).
Kumar, A. (1994). G-tree: A new data structure for organizing multidimensional data. Transactions on Knowledge and Data Engineering, 6(2), 341–347.
Liu, C., Ouksel, A. M., Sistla, A. P., Wu, J., Yu, C. T., & Rishe, N. (1996). Performance evaluation of G-tree and its application in fuzzy databases. In CIKM ’96, Proceedings of the Fifth International Conference on Information and Knowledge Management (pp. 235–242).
Low, C. C., Ooi, B. C., & Lu, H. (1992). H-trees: A dynamic associative search index for OODB. In Proceedings of the 1992 ACM SIGMOD Conference (pp. 134–143).
Luk, R. W. P., Leong, H. V., Dillon, T. S., Chan, A. T. S., Croft, W. B., & Allan, J. (2002). A survey in indexing and searching XML documents. Journal of the American Society for Information Science and Technology, 53(6), 415–437.
Manolopoulos, Y., Theodoridis, Y., & Tsotras, V. J. (1999). Advanced database indexing. Dordrecht: Kluwer Academic Publishers.
Mueck, T. A., & Polaschek, M. L. (1996). Indexing type hierarchies with multikey structures. In Proceedings of the Seventh Workshop on Persistent Object Systems (POS) (pp. 184–193).
Mueck, T. A., & Polaschek, M. L. (1997). Index data structures in object-oriented databases. Dordrecht: Kluwer Academic Publishers.
Na, S., & Park, S. (1996). A fuzzy association algebra based on a fuzzy object oriented data model. In Proceedings of the 20th Computer Software and Applications Conference (COMPSAC ’96) (pp. 276–281).
Nievergelt, J., & Hinterberger, H. (1984). The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1), 38–71.
Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115–143.
Preparata, F. P., & Shamos, M. I. (1993). Computational geometry: An introduction. Berlin: Springer.
Robinson, J. T. (1981). The k-d B-tree. In Proceedings of the 1981 ACM SIGMOD (pp. 10–18).
Sacks-Davis, R., & Zobel, J. (1997). Text databases. In Indexing techniques for advanced database systems (pp. 151–184). Dordrecht: Kluwer Academic Publishers.
Shortliffe, E. H., & Buchanan, B. G. (1975). A model of inexact reasoning in medicine. Mathematical Biosciences, 23, 351–379.
Silberschatz, A., Korth, H. F., & Sudarshan, S. (2001). Database system concepts. New York: McGraw-Hill.
Westmann, T., Kossmann, D., Helmer, S., & Moerkotte, G. (2000). The implementation and performance of compressed databases. SIGMOD Record, 29(3), 55–67.
Yazici, A., & Cibiceli, D. (1999). An access structure for similarity-based fuzzy databases. Information Sciences, 115(1–4), 137–163.
Yazici, A., & Koyuncu, M. (1997). Fuzzy object-oriented database modeling coupled with fuzzy logic. Fuzzy Sets and Systems, 89(1), 1–26.
Yazici, A., George, R., & Aksoy, D. (1998). Design and implementation issues in the fuzzy object-oriented data model. Information Sciences, 108(1–4), 241–260.
Introducing Fuzziness 241
Chapter VIII
Introducing Fuzziness in Existing Orthogonal Persistence Interfaces and Systems
Miguel Ángel Sicilia, University of Alcalá, Spain
Elena García-Barriocanal, University of Alcalá, Spain
José A. Gutiérrez, University of Alcalá, Spain
Abstract Previous research has resulted in generalizations of the capabilities of OODB models and query languages to cope with imprecise and uncertain information in several ways, informed by previous research in fuzzy relational databases. As a result, a number of models and techniques to integrate fuzziness in its various facets in object data stores are available for researchers and practitioners, and even extensions to commercial systems have been implemented. Nonetheless, for those models and techniques to become widespread in industrial contexts, more attention should be paid to their integration with current database design and programming practices, so that the benefits of fuzzy extensions could be easily adopted and seamlessly integrated in current applications. This chapter attempts to provide some criteria to select the fuzzy extensions that
242 Sicilia, García-Barriocanal, & Gutiérrez
more seamlessly integrate in the current object storage paradigm known as orthogonal persistence, in which programming-language object models are directly stored, so that database design becomes mainly a matter of object design. Concrete examples and case studies are provided as practical illustrations of the introduction of fuzziness both at the conceptual and the physical levels of this kind of persistent system.
Introduction A number of research groups have investigated the problem of modeling fuzziness in the context of object-oriented databases (OODBs), e.g., De Caluwe (1998) and Ma, Zang, and Ma (2003), and some of their results include research implementations on top of commercial systems, e.g., those reported in Yazici, George, and Aksoy (1998) and in Schenker, Last, and Kandel (2001). Despite the considerable amount of significant research in the field, no commercial system is available today that supports fuzziness explicitly in its core physical or logical model, and existing database standards regarding object persistence sources, like those of the Object Data Management Group (ODMG) (Cattell, 2000) and Java™ Data Objects (JDO) (Russell et al., 2001), do not support vagueness or any other kind of generalized uncertainty information representation (Klir & Wierman, 1998) in their data models. One possible reason for this lack of integration of fuzziness in industrial practices may be found in the relative complexity of modeling with fuzzy mechanisms, which makes it difficult for average practitioners to fully understand and exploit the potential of fuzzy techniques. Studies coming from the field of psychology of programming, like those by Green and Petre (1996) and Kao and Archer (1997), may serve as points of departure to investigate how fuzziness affects the mental models of programmers and designers. In any case, further research is needed on how to extend existing (crisp) database programming technology to its fuzzy generalization in a way that is acceptable and “usable” for the average developer. In addition, some of these generalizations may eventually lead to reduced performance and other inefficiencies, precluding a priori their acceptability. This chapter aims at providing an overview of some of the issues regarding the situation just described, and at serving as a point of departure for further research in the area.
The rest of this chapter is structured as follows. The second section provides a brief review of existing research on extending OODB models, and the motivation for research on usability and acceptability of fuzzy constructs in orthogonal persistence systems and programming interfaces. The third section deals with the introduction of specific fuzzy constructs in orthogonal persistence systems, according to their similarities to existing crisp conceptual modeling elements. The fourth section briefly sketches some of the representational and physical storage issues that must be taken into account when introducing fuzzy constructs. Finally, some concrete illustrations of the issues are provided in the fifth section.
Background Several fuzzy OODB models and applications have been reported to date. Similarity-based models like the one described in Aksoy, Yazici, and George (1996) provide class definitions based on similar value ranges of instances. Models based on possibility theory (Dubois, Prade, & Rossazza, 1991) are able to represent vagueness and uncertainty in class hierarchies by introducing constraints in attribute values. Models like UFO (De Caluwe, 1998) provide a variety of representations for imperfect information, separating concerns for vagueness and for uncertainty. Other authors proposed fuzzy sets as first-class programming objects (Inoue, Yamamoto, & Yasunobu, 1991). Existing applications of fuzzy object databases include geographical information systems (Cross & Firat, 2000), applications to multimedia (Koprulu, Cicekli, & Yazici, 2003), and retrieval in image databases (Nepal, Ramakrishna, & Thom, 1999). Database models like FOOD (Yazici & Koyuncu, 1997) and FRIL++ (Cao & Rossiter, 2003) integrate with logics or deductive capabilities to provide support for fuzzy inference, but we will not deal with this issue here, because most current industrial applications do not include reasoning and are not based on a sort of knowledge representation formalism, in the sense given by Davis, Shrobe, and Szolovits (1993). Despite the fact that current approaches to uncertainty and imprecision in object databases are fairly diverse in their supporting mathematical frameworks and assumptions, for now, they are relegated to research systems for specific applications. In fact, fuzzy object models are not considered in standard modeling languages like the Unified Modeling Language (UML), and they are not supported by any kind of free or commercial persistence system. 
This situation is aggravated by the fact that object databases are currently considered “niche” technologies (Kim, 2003) that have not reached a state of wide industrial adoption, except for specialized applications like CAD/CAM, resulting in a lack of common physical and distribution architectures. Consequently, the case for fuzzy extensions to object databases requires the practical integration of research models in existing products and programming interfaces. Such pragmatically directed integration efforts should take as a point of departure the existing mindset shaped by the most-used object-oriented languages (like Java or C++) and database systems (converging on ODMG and, more recently, on JDO), considering consistency and ease of understanding as the primary concerns. Extensions to database or object design artifacts should first come in the form of strictly additive increments, so that the (crisp) semantics of the previous models remain unaffected for backward compatibility. But this is not always easy, because generalizations often require changes in basic model definitions, like those of existing extensions to ODMG type systems (De Tré & De Caluwe, 2003) and to UML basic cardinality definitions (Sicilia, García, & Gutiérrez, 2002). This chapter describes a concrete selection of basic fuzzy extensions and their rationales, along with some implementation concerns regarding their suitability in practical settings.
Introducing Fuzziness in Orthogonal Persistence Interfaces One view of fuzzy extensions to object database technology is that of providing the most comprehensive range of conceptual elements, so as to obtain the richest possible model for representing uncertainty and imprecision in their various facets (Smets, 1997). This view is mainly oriented toward obtaining mathematical models that integrate a large number of features and techniques in a single model. An example of such an integrated system in the fuzzy relational database arena is GEFRED (Medina, Pons, & Vila, 1994). But such an approach does not consider a priori the usability and adequacy of the extensions being included, from the perspective of technology adoption. An alternative view of extending object database models with fuzzy constructs is that of taking existing database concepts as points of departure, and selecting for inclusion first those fuzzy extensions that are closest to existing modeling concepts, in an attempt to assemble a set of extensions that integrate seamlessly with existing orthogonal persistence systems and programming practices. This latter approach, which has received little attention to date, is the one adopted in this chapter. The rest of this section therefore addresses general criteria for the introduction of fuzziness and general extensions to existing and widespread data modeling concepts.
Criteria for the Introduction of Fuzzy Constructs Here we are concerned with the selection of fuzzy extensions to the object database model that are closer to existing widespread object-oriented modeling
and programming concepts, instead of focusing on other kinds of technical considerations described elsewhere (Aksoy & Yazici, 1993). From a cognitive perspective, database models and associated programming models require the construction of mental models, and some assumptions are required to select fuzzy information artifacts. This perspective leads us to consider the usability of fuzzy constructs as the general criterion. Usability must be understood here as the extent to which a given fuzzy extension matches the existing concepts that are commonly dealt with by practitioners. This concept of usability must be broken down into more concrete attributes, which are discussed in what follows. According to the cognitive dimensions framework (Green, 2000), role-expressiveness is a dimension of information artifacts that refers to how easy it is to discover the rationale for structures. The study of visual programming languages (Green & Petre, 1996) also mentions the dimensions of closeness of mapping of the representation to the domain, and consistency, which states that similar semantics should be expressed in similar syntactical structures. These three dimensions can be adapted to become criteria for the introduction of extensions for fuzziness in object database models, taking as a point of departure the actual design and programming interfaces of OODBs. Imperative OODB application programming interfaces stay close to the semantics and syntax of the object-oriented programming languages in which they are embedded (see, for example, Atkinson et al., 1996), facilitating the construction of research prototypes that extend commercial systems by adding a software layer that acts as a proxy filter (Gamma et al., 1995) for the underlying nonfuzzy languages.
JDO, ODMG, and other, nonstandardized programming interfaces follow to some extent the principles of orthogonal persistence, so that the problem of introducing fuzziness can be viewed as the problem of “fuzzifying” common object-oriented design relationships and design tactics. This is the approach taken in this chapter, which focuses on widespread design and programming practices like UML (OMG, 1999) modeling and JDO- or ODMG-based programming. Consequently, the criteria considered for our purposes can be stated as follows:
1. The extensions must be consistent with existing OODB design or implementation elements. That is, they must be recognizable as generalized or decorated variants of well-known elements.
2. To enhance role-expressiveness, extensions that do not require the understanding of nontrivial mathematical properties or frameworks will be selected first.
3. The selected extensions at the conceptual level must not express a concrete imprecision or uncertainty handling procedure but only reflect properties that can be captured by average modelers from the domain being modeled.
This set of criteria may be considered controversial, but it represents a first attempt to come up with a framework to reason about fuzzy technology adoption in general. The criteria led us to adopt a method to design extensions that essentially proceeds by extending the main concepts in the UML and in related object database Application Programming Interfaces (APIs) with the simplest fuzzy counterpart. This is intended as a first step for adoption that would ideally be followed by subsequent assessment and redesign steps, all aimed at finally coming up with full-fledged fuzzy database models that incorporate all the expressive power currently contained in fuzzy models (De Caluwe, 1996).
From Fuzzy Conceptual Modeling to Fuzzy Databases: Extending the UML Currently, the UML is defined in the framework of a four-layer meta-modeling architecture. The meta-meta-model layer (M3) is a language for the specification of meta-models (oriented toward building repositories of modeling languages) and is loosely connected with the meta-model layer. In turn, the metamodel layer (M2) contains the essential definition of the UML modeling constructs. Levels M1 (user model layer) and M0 (user object layer) correspond with the definition of UML models, and instances of the elements in these models, respectively. Extensions to the UML are achieved at the M2 level, and, although this approach has been recently criticized (Atkinson & Kühne, 2000), the majority of the current extensions are carried out in that way. The relationship between layers in the UML architecture is conceived exclusively in terms of instance-of relationships. More specifically, elements at layer M1 are instances of elements at layer M2, and elements in the M0 layer are instances of both M1 and M2 layers (this is considered a loose meta-modeling approach). The main extension mechanism in the UML is the concept of Stereotype, which defines a virtual subclass of a UML metaclass, allowing for the definition of new meta-attributes and extended semantics. A profile is a stereotyped UML Package that contains a set of extensions. Tag definitions can be defined independently of any stereotype, in which case its tagged values can be attached to any ModelElement instance, as we require.
Fuzzifying Classes and Objects According to the UML 1.5 specification, a class “is the descriptor for a set of objects with similar structure, behavior, and relationships. The model is concerned with describing the intension of the class, that is, the rules that define it.”
This definition precludes approaches to fuzzy classes that are defined by extension or that allow for partial degrees of applicability for attributes, if maximum consistency with previous semantics is required. In addition, definition by intension is difficult to remove from current object-oriented programming languages. Consequently, the type of fuzziness selected provides a path for partial membership of instances, but with conventional attribute definitions. Practical examples of such kinds of models can be found in the literature (Sicilia, García, Díaz, & Aedo, 2002b). Class variants that vary in attribute definitions can be introduced by standard means through multiple classification via inheritance, interface implementation, or specialized design patterns, if necessary. Figure 1 shows an example UML diagram with a class in which instances are allowed partial membership. In most cases, membership is a function of the actual values of attributes, so that methods to specify the computation of such degrees have to be provided (e.g., by using some specific tagged values). Figure 1 also provides examples of fuzzy attributes. Attribute a is defined in the domain of a datatype AValueScale that can be used to represent standard fuzzy values. The details and forms of the membership functions and other properties could be represented at the conceptual level through UML tagged values that could eventually be used to generate database code. Attribute b is stereotyped with <<interval>>, denoting that its values can be given in interval form, and attribute c is stereotyped with <<poss>>, indicating that its values are possibilities. From the perspective of the developer, all these extensions are simply specialized data types, expressed through the conventional UML notation. Their implementation does not require specialized database structures, provided that interpretation and subsequent elaboration are kept as part of the class's responsibilities.
According to fuzzy class semantics, flexible inheritance imposes a constraint on the membership of instances. In concrete terms, if A is a subclass of B, the membership of any instances to A must not be greater than its membership in B, otherwise it would contradict the crisp case.
Figure 1. Example UML diagram with fuzziness at attribute and class levels. [Diagram: class «fuzzy» A with attributes a : AValueScale, «interval» b : double(idl), and «poss» c; «enumeration» AValueScale with literals very_low, low, medium, high, and very-high.]
$$\mu_A(x) \le \mu_B(x) \quad \forall A, B \in \mathit{Class}$$
The expression is only a special case of the fuzzy generalization-specialization (genspec) relationship, as described by Chen (1998). Stricter requirements may be enforced through the Object Constraint Language (OCL). In any case, the interpretation does not interfere with the conventional monotonic interpretation of inheritance, according to which subclassing is a way of extending, but never of constraining, some of the semantics of the subclasses. Note that the kind of fuzziness described for classes and inheritance is introduced at the M0 level. In addition, all the elements in a (static) UML model can be given a grade of belonging to the model. This concept is similar to that of the FuzzyEER at level L1 for entities, relationships, and attributes, so that, for example, the set of entities in a model can be given a membership grade (Chen, 1998, p. 64). This can be interpreted, for example, as “it is not completely sure what role element E plays in the context of the model.” This fuzziness at M1 has other interesting applications. For example, a numeric “distance” between classes and subclasses can be used in the construction of applications that consider conceptual structures (Sicilia, García, Aedo, & Díaz, 2003). Because all these M1-level elements are found in specialized, knowledge-based applications, we will not deal with them here.
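The subclass membership constraint stated above can be illustrated with a minimal Java sketch. The class FuzzyClassExtent and its methods are hypothetical names introduced only for this illustration (they belong to neither UML, ODMG, nor JDO), and the sketch assumes that unregistered instances have membership 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (an assumption of this sketch): keeps the membership
// grade of each instance in one fuzzy class.
class FuzzyClassExtent {
    private final Map<Object, Double> grades = new HashMap<>();

    // Records the membership grade of an instance; grades must lie in [0, 1].
    void setMembership(Object instance, double mu) {
        if (mu < 0.0 || mu > 1.0) {
            throw new IllegalArgumentException("membership must be in [0, 1]");
        }
        grades.put(instance, mu);
    }

    // Instances that were never registered are assumed to have membership 0.
    double membership(Object instance) {
        return grades.getOrDefault(instance, 0.0);
    }

    // Checks mu_A(x) <= mu_B(x) for every x registered in this extent (A),
    // where superExtent plays the role of the superclass B.
    boolean isConsistentSubclassOf(FuzzyClassExtent superExtent) {
        for (Map.Entry<Object, Double> e : grades.entrySet()) {
            if (e.getValue() > superExtent.membership(e.getKey())) {
                return false;
            }
        }
        return true;
    }
}
```

A check of this kind could be run when an instance's grade is updated; where such enforcement would live in a real persistence layer is left open here.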
Introducing Fuzzy Associations Associations are considered mathematical relations among instances. A crisp relation represents the presence or absence of interconnectedness between the elements of two or more sets. This concept is referred to as association when applied to object-oriented modeling. According to the UML, an association defines a semantic relationship between classifiers, and the instances of an association can be considered a set of tuples relating instances of these classifiers, where each tuple value may appear, at most, once. A binary association may involve one or two fuzzy relations (i.e., the unidirectional and bidirectional cases), although due to the semantic interpretation of associations, they are in many cases considered to convey the same information (i.e., the association between authors and books is interpreted in the same way despite the navigation direction). Fuzzy relations are generalizations of the concept of a crisp relation in which various degrees of strength of relation are allowed (Klir, 1988). A binary fuzzy relation R on X×Y is a fuzzy subset of that Cartesian product as denoted in Expression (1):
$$R = \{\, ((x, y), \mu_R(x, y)) \mid (x, y) \in X \times Y \,\} \tag{1}$$
All the relation concepts can be extended to the n-ary case, where
$$R(X_1, X_2, \ldots, X_n) \subset X_1 \times X_2 \times \cdots \times X_n \tag{2}$$
We will restrict ourselves to the binary case, because it is the most common case in database applications. Fuzzy associations can be represented as literal tuples between model elements that hold an additional value representing their membership grade to the association. This assumption implies some constraints in the implementation of bidirectional associations, because both association ends should be aware of updates on the other. Fuzzy associations are represented in UML models by simply adding a <<fuzzy>> stereotype, for the sake of maximum consistency, as first proposed in Gutierrez, Sicilia, and Garcia (2002). The interpretation of the association is expressed by additional substereotypes, but at the modeling and database representation level, the top stereotype could suffice in most common domain modeling situations. Additional restrictions on associations are represented, as usual, with OCL constraints. Because the use of fuzzy cardinalities would require a change in the UML meta-model, annotations on the “many” cardinality (denoted by the symbol *) can be used to specify them instead. In any case, cardinality restrictions do not affect physical representation but only update semantics, which are usually enforced by the application, even in the crisp case. An example of association design will be described later.
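The idea of tuples carrying a membership grade can be sketched in Java as follows. This is only an illustration under stated assumptions: the class FuzzyRelation and its methods are hypothetical, and a grade of 0 is assumed to mean that no tuple is stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: a binary fuzzy relation stored as (x, y) -> grade.
class FuzzyRelation<X, Y> {
    // An ordered pair (x, y) used as a map key.
    private static final class Pair {
        final Object x;
        final Object y;
        Pair(Object x, Object y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Pair
                    && Objects.equals(x, ((Pair) o).x)
                    && Objects.equals(y, ((Pair) o).y);
        }
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    private final Map<Pair, Double> links = new HashMap<>();

    // Sets mu_R(x, y); each tuple appears at most once, and a grade of 0
    // (an assumption of this sketch) removes the tuple.
    void put(X x, Y y, double mu) {
        if (mu < 0.0 || mu > 1.0) {
            throw new IllegalArgumentException("membership must be in [0, 1]");
        }
        Pair key = new Pair(x, y);
        if (mu == 0.0) {
            links.remove(key);
        } else {
            links.put(key, mu);
        }
    }

    // Returns mu_R(x, y), or 0 when the tuple is not stored.
    double grade(X x, Y y) {
        return links.getOrDefault(new Pair(x, y), 0.0);
    }

    // Number of tuples with a nonzero grade.
    int size() {
        return links.size();
    }
}
```

Keeping the whole relation in one object sidesteps the bidirectional update problem mentioned above, at the cost of losing direct navigation from each instance.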
Issues of Representation and Efficiency in Integrating Fuzziness in Object Sources Once a number of conceptual-level fuzzy extensions to the object model — as those described in the previous section — are selected, the feasibility of integrating such extended elements in existing database systems must be addressed. In this section, a number of concrete issues regarding the physical integration of fuzziness in existing systems are briefly sketched, and empirical techniques for their assessment will be considered. Of course, the collection of issues covered in what follows is not intended to be comprehensive but is to provide an overview of the kind of inquiry efforts required.
Fuzziness and Physical Storage Models in Object Bases Object databases are diverse in their models of physical storage, with architectures that range from server-based query resolutions, like that of CA-Jasmine, to models based in client caches of objects that distribute the workload of query processing to the client applications. The latter architectures put the burden of computations of membership values in the client, requiring special considerations for physical clustering, as will be illustrated later, in the context of a case study. Despite the fact that standards for object database access were proposed (e.g., ODMG or JDO), no common storage and distribution architecture currently exists. Consequently, the provision of fuzzy extensions must be carefully examined with regard to existing data architectures. One common feature of object databases is their navigational capabilities, which entails some concept of database object reference that generalizes the notion of pointer or reference of programming language objects. Such database references use a concrete form of indirection mechanism from secondary storage to principal memory (Tarr, 1995). This entails that in many cases, object databases tend to maintain objects in the same physical address, due to the cost of changing all the references to a given object when moving them. In addition, objects of the same class are frequently clustered together for performance reasons. Consequently, classification by extension depending on attribute values seems to interfere with storage models, so that models that retain intensional class definitions appear to integrate better with physical structures.
Representing Classes and Associations Through α -Cuts Membership degrees in fuzzy classes or degrees of participation in fuzzy associations are usually represented through infinite domains, e.g., the [0,1] interval. This entails that every object in a class or association may eventually be associated to a different membership degree, so that processing of collections of objects would entail time-consuming iterations. This is a cross-cutting concern of fuzziness in database systems, because fuzzy queries inherently require the sorting of query results by degree, or perhaps in some cases, the selection of a subset of results that satisfies a given requirement on membership degrees, e.g., a degree threshold for queries. Representations based on level-cuts have been proposed as a way to efficiently access fuzzy structures (Boss & Helmer, 1999). But in the case of orthogonal object persistence filters, the design of such structures has to be done at the class design level (Sicilia, Gutiérrez, & García, 2002). In the following section, a case study provides details about this approach.
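The cost described above can be made concrete with a minimal sketch, assuming a fuzzy set held as a plain map from elements to degrees (the class and method names are hypothetical). Selecting by threshold and ordering by degree both require visiting every element, which is the iteration that level-cut representations aim to avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: alpha-cut over an unindexed fuzzy set.
class AlphaCutScan {
    // Returns the elements whose degree is at least alpha, highest degree first.
    static <T> List<T> alphaCut(Map<T, Double> fuzzySet, double alpha) {
        List<Map.Entry<T, Double>> kept = new ArrayList<>();
        // Every entry must be inspected: O(n) per query.
        for (Map.Entry<T, Double> e : fuzzySet.entrySet()) {
            if (e.getValue() >= alpha) {
                kept.add(e);
            }
        }
        // Sorting by degree adds O(k log k) for the k retained elements.
        kept.sort(Map.Entry.<T, Double>comparingByValue().reversed());
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Double> e : kept) {
            result.add(e.getKey());
        }
        return result;
    }
}
```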
Techniques for Assessing Representation Adequacy Some previous work addressed the problem of benchmarking object databases that provide diverse read and update mechanisms (Hosking, 1995). Performance metrics can be broken down in the read and update categories. Read metrics are concerned with the mechanism of object faulting, that is, the check that the referred object is in memory for any pointer or reference use, leading eventually to data transfer from the server. Update metrics are related to the propagation of updates on objects to the server, according to the transactional semantics that are common to practically every object database system. In the latter case, eager or lazy approaches to updates can be implemented. In the case of dealing with fuzziness, the key performance determinant is the retrieval of collections of fuzzy objects and the possible combinations of membership values with standard fuzzy operators (conjunctive, disjunctive, negation, hedges, and the like). Consequently, conventional measurement techniques must be informed with attributes related to fuzziness, most notably including:
1. Extent cardinality for fuzzy classes
2. Fuzzy relation cardinality for fuzzy associations
3. Degree of granulation permitted for instances of fuzzy classes or links in fuzzy associations
The three elements can be used to make a choice for the underlying collections supporting them, which may eventually be changed dynamically, reflecting changes in the cardinality of the participating instances. Cardinalities of classes and associations become the raw data required to build benchmarking suites, but such suites must also consider how tolerant the queries of each given application are to low membership (relevance) of retrieved objects in general. This indicates that tolerance becomes a dimension that must be considered when evaluating a fuzzy OODBMS. Information granulation is viewed as a form of compression inspired by human perceptual processes (Zadeh, 1997). As such, the degree of granulation a given application tolerates affects the storage requirements and the domain of the types that hold the information, and it also constitutes a dimension in the assessment of database systems for which further research would be necessary. In addition, the adequacy of fuzzy databases can be approached from the perspective of the concept of epistemological adequacy, proposed by McCarthy (1981). Here the perspective is that of assessing the matching of the representational structures used with the actual forms of uncertainty or imprecision
inherent to the domain being modeled. Currently, this kind of assessment can only be carried out by contrasting taxonomies of information imperfection (Smets, 1997) with an explicit modeler’s concern for these kinds of imperfection in the domain.
Case Studies In this section, we illustrate some of the issues described in the previous sections through concrete technological artifacts. First, the extension of JDO database programming interfaces is discussed, and then performance issues regarding a small footprint persistence engine and a full-fledged database server are described.
Fuzzification of Standardized Interfaces: The Case of JDO The Java™ Data Objects (JDO) API¹ is a standard interface-based Java model abstraction of persistence, developed under the auspices of the Java Community Process, and somewhat continuing the efforts of the ODMG group. In essence, JDO provides a standard API for the storage of Java object models in any kind of supporting database technology, including relational, object-relational, or object databases. Consequently, it provides orthogonal persistence irrespective of the final physical storage. Persistence-capable instances in JDO must belong to a class that implements the PersistenceCapable interface. Classes may directly implement the interface, or it can be added by enhancer tools, which automatically modify the Java source code or bytecode. JDO provides navigational and declarative access to persistent instances by means of a query API and a query language called JDOQL. Navigational access can be carried out, for example, by calling the getExtent method of the persistence manager, which returns an Extent giving access to all the instances belonging to a given class. JDO provides a method makePersistent in PersistenceManager to make concrete instances persistent, and it also provides persistence by “reachability,” so that any instance linked (transitively) to a persistent one is also made persistent. Consequently, adding fuzzy classes to JDO requires two sets of extensions. On the one hand, the programming interfaces must be extended to include the option of explicitly handling membership grades (in a way consistent with existing programming practices). On the other hand, the query language must be extended into a flexible one that (ideally) handles the extension without obscuring the syntax or semantics of the original.
Extending navigational access is basically a matter of providing class extents that embody membership values for each instance. Providing such support without changing Java collection semantics can be done by means of the genericity of Java container classes, which is based on storing any reference type, i.e., any instance belonging to the Object class. This approach is similar to the one described in Sicilia, García, Díaz, and Aedo (2002) to extend RecordSets with membership grades. It becomes necessary to wrap an existing persistence manager with a new class FuzzyPersistentManager that provides the same interface but internally handles the processing of membership degrees:
    Extent e = null;
    try {
        pm.currentTransaction().begin();
        e = pm.getExtent(myclasses.X.class, true, "asc;min=0.2");
    } catch (javax.jdo.JDOException ex) {...}
The second parameter (e.g., true) passed to getExtent() indicates that instances of subclasses must also be retrieved. The third parameter ("asc;min=0.2") indicates properties of the fuzzy set being retrieved, which can be the ordering (ascending or descending by membership value) or cuts using thresholds or ranges of membership values. A typical example of extent iteration is sketched in what follows:
    FuzzyExtent fe = (FuzzyExtent) e;
    Iterator it = fe.fuzzyIterator();
    while (it.hasNext()) {
        FuzzyObject aux = (FuzzyObject) it.next();
        X anX = (X) aux.getObject();
        double mu = aux.getMembership();
    }
The iterator internally points to fuzzy objects that provide membership information. If the iterator() method is used instead of fuzzyIterator(), conventional (crisp) iteration semantics are provided. This retrieval and processing schema leaves the semantics of JDO interfaces unaffected, guaranteeing backward compatibility.
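How a wrapper might interpret a property string such as "asc;min=0.2" can be sketched as follows. This is not the authors' implementation: the FuzzyObject class below is a stand-in written for this example, FuzzyExtentFilter is a hypothetical name, and the grammar (semicolon-separated tokens asc, desc, and min=threshold) is an assumption inferred from the single example given in the text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-in for the chapter's FuzzyObject: pairs an object with its grade.
class FuzzyObject {
    private final Object object;
    private final double membership;

    FuzzyObject(Object object, double membership) {
        this.object = object;
        this.membership = membership;
    }

    Object getObject() { return object; }
    double getMembership() { return membership; }
}

// Hypothetical helper that applies the extent properties in memory.
class FuzzyExtentFilter {
    // Assumed grammar: ';'-separated tokens "asc", "desc", "min=<threshold>".
    static List<FuzzyObject> apply(List<FuzzyObject> extent, String properties) {
        double min = 0.0;
        boolean descending = false;
        for (String token : properties.split(";")) {
            String t = token.trim();
            if (t.equals("desc")) {
                descending = true;
            } else if (t.equals("asc")) {
                descending = false;
            } else if (t.startsWith("min=")) {
                min = Double.parseDouble(t.substring(4));
            }
        }
        // Keep only the objects at or above the threshold.
        List<FuzzyObject> result = new ArrayList<>();
        for (FuzzyObject fo : extent) {
            if (fo.getMembership() >= min) {
                result.add(fo);
            }
        }
        // Order by membership grade in the requested direction.
        Comparator<FuzzyObject> byGrade =
                Comparator.comparingDouble(FuzzyObject::getMembership);
        result.sort(descending ? byGrade.reversed() : byGrade);
        return result;
    }
}
```

A crisp iterator over the same result would simply return getObject() for each element and ignore the grade, which is how backward compatibility could be preserved.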
The query language JDOQL uses Java syntax for the specification of queries, which are essentially Boolean filters on instance collections. Because queries are specified as Strings, the approach, to provide maximum consistency and role-expressiveness, is that of leaving the syntax unaffected and simply handling fuzziness implicitly in operators. A typical extended query example is the following:
    String filter = "address.state == state && "
        + "salary >= sal && "
        + "department.name.startsWith(deptName) && "
        + "projects.contains(proj) && "
        + "proj.budget > 10000000";
    Extent extent = pm.getExtent(ProductiveEmployee.class, true, "asc;min=0.01");
    Query query = pm.newFuzzyQuery(extent, filter);
    ((FuzzyQuery) query).interpretAllFuzzy();
    query.declareImports("import Project");
    query.declareVariables("Project proj");
    query.declareParameters("String state, String deptName, int sal");
    Collection result = (Collection) query.execute("Georgia", "Network", new Integer(100000));
In the above example, ProductiveEmployee is a fuzzy subclass of the employees who performed properly in the last quarter, according to imprecise criteria. Its extent is filtered with a minimum degree of 0.01, and then a conventional JDOQL query is passed to a query object with fuzzy capabilities. The invocation of interpretAllFuzzy indicates to the query resolution process that all the operators in its filters are to be interpreted in fuzzy terms, and consequently, the logical and operator (&&) will also produce the combination of scores according to a T-norm. Alternatively, the interfaces of FuzzyQuery could be used to force the interpretation of fuzziness only in some of the filters that affect the query. This approach to extending JDOQL is similar to that used in fJDBC (Sicilia, García, Díaz, & Aedo, 2002), and makes fuzziness an optional feature, because subsequent iteration may choose to discard membership values.
It should also be noted that complex approaches to object comparison (Marín, Medina, Pons, Sánchez, & Vila, 2003) could be implemented without changing the JDOQL
syntax, thanks to the provision of abstract comparison methods in the Java language.
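The combination of scores by a T-norm, mentioned above for the && operator, can be illustrated with a short sketch. The text does not say which T-norm a fuzzy query would apply, so three standard ones (minimum, product, and Łukasiewicz) are shown; the class TNorms and its members are hypothetical names for this example only.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch: standard T-norms for conjunctive combination of degrees.
class TNorms {
    static final DoubleBinaryOperator MINIMUM = Math::min;
    static final DoubleBinaryOperator PRODUCT = (a, b) -> a * b;
    static final DoubleBinaryOperator LUKASIEWICZ =
            (a, b) -> Math.max(0.0, a + b - 1.0);

    // Folds the degrees of several conjuncts into one score.
    static double and(DoubleBinaryOperator tNorm, double... degrees) {
        double result = 1.0; // 1 is the neutral element of every T-norm
        for (double d : degrees) {
            result = tNorm.applyAsDouble(result, d);
        }
        return result;
    }
}
```

For example, conjuncts satisfied to degrees 0.8 and 0.5 combine to 0.5 under the minimum and to 0.4 under the product, so the choice of T-norm affects both the ranking of results and which objects survive a threshold.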
Fuzzification in Persistence Engines: The Case of db4o The db4o² object database is a lightweight OODB engine that provides a seamless Java language binding (it uses run-time reflection capabilities to avoid the need to modify existing classes to make their instances storable) and a novel query-by-example (QBE) interface based on the results of the SODA³ (Simple Object Database Access) initiative. In what follows, we will discuss a concrete representational structure for fuzzy items that acts as an indexed structure. Such physical representation issues are justified by the fact that fuzzy queries often retrieve many more objects than crisp ones, which has led to the investigation of concrete access mechanisms to improve performance, like the relational access structure described in Yazici and Cibiceli (1999). Here we will describe a concrete approach to fuzzy association design. Because it is common practice to develop object-oriented software from previously defined UML models, we can consider UML semantics as a model from which associations are implemented in specific object-oriented programming languages. This occurs through the process of association design, which essentially consists of the selection of the concrete data structure that best fits the requirements of the association (e.g., Rumbaugh et al., 1996). Therefore, the process of fuzzy association design will be an extension of conventional association design practices. A common representation for fuzzy relations is an n-dimensional array (Klir, 1988), but this representation does not fit well in the object paradigm, in which a particular object (an element of one of the domains in the relation) is aware only of the tuples to which it belongs (the links), and uses them to navigate to other instances. We extended the association concept to design fuzzy relations attached to classes in a programming language so that a particular instance has direct links to (i.e., “knows”) the instances associated with it.
Access to the entire relation (that is, the union of the individual links of all the instances in the association) is provided as a class responsibility, as will be described later. The membership values of the relation must be kept apart from the instances of the classes that participate in the association. A first approach could be that of building Proxies for the instances, which will hold a reference to the instance at the other side of the association and the membership grade, and storing them in a standard collection. The main benefit of this approach is simplicity, because only a class called, for example, FuzzyLink (FL from now on), solves the representation problem. That is enough for the case of association with cardinality. We used this first approach for comparison purposes with our final design.
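The proxy-based first approach can be sketched as follows. Only the name FuzzyLink comes from the text; its fields and methods, and the User and Subject classes, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the FuzzyLink (FL) proxy: a reference to the instance at the
// other side of the association plus the membership grade of the link.
class FuzzyLink<T> {
    private final T target;
    private final double membership;

    FuzzyLink(T target, double membership) {
        this.target = target;
        this.membership = membership;
    }

    T getTarget() { return target; }
    double getMembership() { return membership; }
}

// Illustrative domain classes (assumptions of this sketch).
class Subject {
    final String name;
    Subject(String name) { this.name = name; }
}

class User {
    // The association end is a standard collection of proxies.
    private final List<FuzzyLink<Subject>> interests = new ArrayList<>();

    void addInterest(Subject subject, double mu) {
        interests.add(new FuzzyLink<>(subject, mu));
    }

    // Traversal with a minimum grade has to inspect every link.
    List<Subject> interestsAtLeast(double alpha) {
        List<Subject> result = new ArrayList<>();
        for (FuzzyLink<Subject> link : interests) {
            if (link.getMembership() >= alpha) {
                result.add(link.getTarget());
            }
        }
        return result;
    }
}
```

Note that nothing in this sketch prevents the same Subject from being linked twice with different grades; preserving such relation properties is left to the domain class.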
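Figure 2. Unidirectional binary association design. [Diagram with panels (a) and (b), showing classes A and B, the «interface» FuzzyAssociationEnd with operation +put(), the class FuzzyUnorderedAssociationEnd, the role name -assoc, a «fuzzy» association, and multiplicities 1 and *.]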
A drawback of the FL approach for associations with multiple cardinalities is that the responsibility of preserving relation properties is left to the domain-class designer. This is one of the reasons that prompted us to develop a second approach in which the collection semantics, and not the element semantics, are extended. The base of our fuzzy collection framework is a FuzzyAssociationEnd (FAE) interface that defines common behavior for all fuzzy associations. Concrete classes implement that interface to provide different flavors of associations. In this work, we will restrict our discussion to a FuzzyUnorderedAssociationEnd (FUAE) class. The class diagram in Figure 2 shows how a unidirectional fuzzy association [Figure 2(b)] from class A to class B can be designed with our framework [Figure 2(a)]. It should be noted that the put method can be used both to add objects to the relation and to remove them. Removal is carried out by specifying a zero membership. (We considered in this implementation that zero membership is equivalent to the lack of a link.) Because many associations that store different information may exist between the same pair of classes, associations must be named. The FUAE class itself (as opposed to its instances) is responsible for maintaining a collection of the associations in which its instances take part (i.e., this behavior is modeled as a class responsibility). These different associations are represented by instances of a FuzzyUnorderedAssociation (FUA) class. Therefore, FUA instances represent entire generic associations and store the union of the links that belong to them. Using dictionaries with fixed-precision membership values as keys provides performance benefits in common operations on fuzzy sets, like α-cuts, outperforming common container classes (bags, sets, and lists). The rationale behind
this organization is that association traversal would often be done by specifying a minimum membership grade, that is, to obtain an element of the partition of the fuzzy relation. This way, we are representing the relation by its resolution form defined by Equation (3):

$$R = \bigcup_{\alpha \in \Lambda_R} \alpha R_\alpha \tag{3}$$

where $\Lambda_R$ is the level set of $R$, $R_\alpha$ denotes an α-cut of the fuzzy relation, and $\alpha R_\alpha$ is a fuzzy relation as defined in Equation (4):

$$\mu_{\alpha R_\alpha}(x, y) = \alpha \cdot \mu_{R_\alpha}(x, y) \tag{4}$$
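As a small worked illustration of Equations (3) and (4), consider a hypothetical relation with only three links. The pairs and grades below are chosen purely for illustration and are not taken from the chapter's data, and the union in Equation (3) is assumed to be the standard (maximum) fuzzy union:

```latex
\[
\mu_R(u_1, s) = 1, \qquad \mu_R(u_2, s) = 0.6, \qquad \mu_R(u_3, m) = 0.2,
\qquad \Lambda_R = \{0.2,\ 0.6,\ 1\}
\]
\[
R_{1} = \{(u_1, s)\}, \qquad
R_{0.6} = \{(u_1, s), (u_2, s)\}, \qquad
R_{0.2} = \{(u_1, s), (u_2, s), (u_3, m)\}
\]
\[
\mu_R(u_2, s)
  = \max\bigl(1 \cdot \mu_{R_{1}}(u_2, s),\ 0.6 \cdot \mu_{R_{0.6}}(u_2, s),\ 0.2 \cdot \mu_{R_{0.2}}(u_2, s)\bigr)
  = \max(0,\ 0.6,\ 0.2) = 0.6
\]
```

Each pair thus recovers its original grade from the largest α whose cut still contains it, which is why storing the links grouped by grade loses no information.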
The implementation is an extension of Java’s HashMap collection, which essentially substitutes the add behavior with that of a link operation sketched as follows:

    public Object link(Object key, Object value) {
        if (key instanceof Double) {
            double mu = ((Double) key).doubleValue();
            // truncates to current precision:
            mu = truncateTo(mu);
            // Get the set of elements with the given mu:
            HashSet elements = (HashSet) this.get(Double.valueOf(mu));
            if (elements == null) {
                // First link with this grade: create its set.
                HashSet aux = new HashSet();
                aux.add(value);
                super.put(Double.valueOf(mu), aux);
            } else {
                elements.add(value);
            }
        }
        // Inform the association that a new link has been added:
        if (association != null)
            association.put(key, this, value);
        return null;
    }

Figure 3. An example of the “is interested in” fuzzy relation (object diagram: the Subject instances sports and music are each connected through a FUAE instance to Set instances labeled with membership grades, with mu values of 1, 0.6, 0.2, 0.45, and 0.8 appearing in the figure, and these sets hold the User instances u1, u2, and u3)

Figure 3 illustrates our design by showing an “is interested in” relation between the set U of the users of a Web site and the set S of subjects of the pages it serves. Experimental studies pointed out the performance benefits of this approach (Sicilia, Gutiérrez, & García, 2002). Because the activationDepth parameter of db4o determines the number of reference traversals that are read in advance, it should be considered an important factor in achieving such results. It must be reduced from the default value of 5 to 2 or 1 to obtain a significant improvement, because with the default value, the entire object graph is always retrieved. The resolution form of a fuzzy relation is a convenient way to represent and subsequently store fuzzy associations in orthogonal persistence engines. Additional constraints on link insertion semantics can be added to obtain specialized relations like similarity relations, as described in Gutiérrez, Sicilia, and García (2002).
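The idea of grouping links by membership grade can be illustrated with a small self-contained sketch. It is not the chapter's implementation: the class name FuzzyLinkStore, the step parameter, and the use of a sorted TreeMap (instead of an extended HashMap) are assumptions made here so that an α-cut becomes a simple range view over the keys.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: the links of one association end, grouped by membership grade. */
class FuzzyLinkStore<T> {
    private final double step; // fixed precision of the grades, e.g. 0.01
    // Key = grade expressed in units of `step`; value = elements linked with that grade.
    private final TreeMap<Long, Set<T>> byGrade = new TreeMap<>();

    FuzzyLinkStore(double step) {
        this.step = step;
    }

    /** Adds or updates a link; a zero grade removes the link. */
    void put(T element, double mu) {
        if (mu < 0.0 || mu > 1.0) {
            throw new IllegalArgumentException("mu must lie in [0, 1]");
        }
        // An element has at most one grade, so drop any earlier link first.
        Iterator<Set<T>> it = byGrade.values().iterator();
        while (it.hasNext()) {
            Set<T> bucket = it.next();
            if (bucket.remove(element) && bucket.isEmpty()) {
                it.remove();
            }
        }
        // Truncate to the current precision (small epsilon guards against rounding noise).
        long key = (long) Math.floor(mu / step + 1e-9);
        if (key > 0) {
            byGrade.computeIfAbsent(key, k -> new HashSet<>()).add(element);
        }
    }

    /** Returns the alpha-cut: every element whose stored grade is at least alpha. */
    Set<T> alphaCut(double alpha) {
        long from = (long) Math.ceil(alpha / step - 1e-9);
        Set<T> result = new HashSet<>();
        for (Set<T> bucket : byGrade.tailMap(from, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

With this layout an α-cut touches only the buckets at or above the requested grade, rather than testing every link. Note that in this sketch a grade that truncates to zero at the chosen precision is treated like a removal.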
Interaction of Fuzziness with Physical Structures: Case of ObjectStore

The ObjectStore database system⁴ is one of the most mature and stable products in the OODB market, currently in its 6.1 version. It provides a high-performance architecture, originally called the Virtual Memory Mapping Architecture (VMMA), that enables programmers to design physical structures that minimize response time by clustering objects that are likely to be used together (Hansen, Adams, & Gracio, 1999). Essentially, this is a client–server architecture that can be tuned to minimize data transfer from the ODB to the client by carefully distributing objects in fixed-size containers called clusters, which reside in expandable storage containers called segments. The VMMA relies on a mechanism on the client (application) side that produces “page faults” in the virtual memory of the process each time a pointer or reference to a persistent object is dereferenced. If the object is in the client’s address space, it is directly mapped to the application address space, so that the page fault handler goes to secondary storage only when the object is not found in the client. Cache affinity is “the generic term that describes the degree to which data accessed within a program overlaps with data already retrieved on behalf of a previous request” (Visnick, 2003). Cache affinity is critical for the performance of applications, because it minimizes client–server data transfer, due to a larger number of hits that are resolved locally in the cache of the client. Data affinity depends on the set of database pages a client needs at a given time (the working set). Therefore, objects that are normally used together must be put together in physical storage, so that they are retrieved in the same data pages, thus minimizing the client’s requests to ObjectStore. Conversely, objects that are rarely used must be kept apart from those that are frequently used.
Clustering refers to the process of putting together the data that are frequently read or updated at the same time, and several design criteria are provided in documentation related to ObjectStore to guide physical design, including uses of indexes, selection of physical storage structures, and even refactoring of class design. When dealing with fuzzy classes, flexible queries often act as filters that use membership grades to select objects depending on a given α-cut. Because fuzzy querying is not a feature of ObjectStore, the provision of that filtering behavior would reside with the client, and hence, the full collection of membership degrees must be retrieved before resolving the query. If we use object clustering, membership grades would be represented as a field inside the physical structure of the object, so that each fuzzy query would require the transfer of the entire object structure, significantly slowing the performance of functionalities
requiring instance selection based on fuzzy degrees. This situation points out the necessity of separating the fuzzy mappings from the rest of the information on fuzzy objects. That separation of objects and their membership degrees is a concrete realization of the “Head–Body Split” technique described in Visnick (2003). As a general database design pattern, it can be synthesized in the following Java-like declarations using a simple delegation scheme:
    // Original class
    public class FuzzyClass {
        // field declarations:
        private X1 x1;
        private X2 x2;
        …
        private XN xN;
        // membership grade:
        private double mu;
        // methods
        …
    }

    // Result of the split
    public class FuzzyClass {
        // membership grade:
        private double mu;
        // helper instance:
        private FuzzyClass_Crisp _aux;
        // constructor
        public FuzzyClass(…) {
            _aux = new FuzzyClass_Crisp(..);
            …
        }
        // accessors for membership
        public double getMu() { return mu; }
        …
        // methods delegated to the FuzzyClass_Crisp
    }

    public class FuzzyClass_Crisp {
        // all the method and field declarations
        // except those related to fuzziness.
        …
    }
Once the split into two classes is done, the database designer must allocate the instances of FuzzyClass_Crisp in separate physical units, so that only the lighter instances of the fuzzy class are required to filter by membership, resulting in decreased data transfer loads. In the case of fuzzy associations, the collections that hold the mappings of pairs of instances should be isolated in independent clusters, so that clients are able to
first retrieve the entire fuzzy subset of the Cartesian product, select the fuzzy links that are interesting for the given functionality, and then retrieve the subset of pairs of instances that are relevant according to their degrees. The rationale for such a technique is analogous to the “Isolate Index” technique described in Visnick (2003). To summarize, cache-based object architectures require that computations with membership grades be handled on the client side, so that degrees of fuzzy classes or associations that are in a working set should be clustered together.
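The effect of the split on client-side filtering can be sketched in plain Java. This is only an illustration under stated assumptions: the names FuzzyItem and FuzzyItem_Crisp are hypothetical, and a Supplier stands in for a persistent reference whose resolution would trigger a page transfer in a real store; no ObjectStore API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Body: all crisp state. In a real store this would be allocated in its own cluster. */
class FuzzyItem_Crisp {
    final String name;
    final String description;

    FuzzyItem_Crisp(String name, String description) {
        this.name = name;
        this.description = description;
    }
}

/** Head: the membership grade plus a reference that is resolved only on demand. */
class FuzzyItem {
    private final double mu;
    private final Supplier<FuzzyItem_Crisp> loader; // stands in for a persistent reference
    private FuzzyItem_Crisp aux;                    // null until the body is fetched

    FuzzyItem(double mu, Supplier<FuzzyItem_Crisp> loader) {
        this.mu = mu;
        this.loader = loader;
    }

    double getMu() {
        return mu;
    }

    /** Delegated accessor: touching crisp state triggers the (simulated) fetch once. */
    String getName() {
        if (aux == null) {
            aux = loader.get();
        }
        return aux.name;
    }

    /** Client-side filter that reads heads only and never touches a body. */
    static List<FuzzyItem> alphaCut(List<FuzzyItem> heads, double alpha) {
        List<FuzzyItem> selected = new ArrayList<>();
        for (FuzzyItem head : heads) {
            if (head.getMu() >= alpha) {
                selected.add(head);
            }
        }
        return selected;
    }
}
```

Filtering by membership reads only the heads; bodies are fetched afterward, and only for the instances that passed the filter, which is the reduction in transfer that the split is meant to achieve.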
Future Trends

The eventual widespread adoption of fuzzy object-oriented technology will necessarily be accompanied by a generalized interest in fuzziness as a first-class citizen in conceptual models and programming technology. Fuzziness generalizes common crisp modeling constructs to a higher level of flexibility that is not always required, so that a careful and progressive selection of the fuzzy extensions that are introduced becomes crucial. A modular extension of the UML language for fuzziness, continuing previous work (Sicilia, García, & Gutiérrez, 2002) and leveraging existing research on fuzzy conceptual models (Chen, 1998), may represent an important step in that direction, especially now that its 2.0 major version provides improved extension mechanisms. Moreover, one of the major current drivers of database technology is the specificity of Web information, which benefits from the navigational structure of object stores. Recent advances in Web information storage and management (May & Lausen, 2004) go a step further in the integration of object models with the specifics of the hypermedia structure of the Web. In addition, provided that the vision of a Semantic Web (Berners-Lee, Hendler, & Lassila, 2001) eventually becomes a reality, the amount of metadata expressed in XML-based languages like RDF will call for new requirements on object models and databases, and also new query languages (Karvounarakis et al., 2003). Consequently, research on the integration of fuzziness in languages for the description of Web resources represents an important direction, one that a number of research works have begun to address with regard to fuzzy description logics (see, for example, Straccia, 2001) and their practical applications for Web management issues (Sicilia, 2003).
With respect to the design and implementation of ODB systems, aspect-oriented design (AOD) represents a promising new technology that may eventually be used to add fuzziness to object database models, isolating the storage and computation of membership degrees from the functionality that is not affected
by them, extending existing related work (Rashid & Sawyer, 2001). Consequently, fuzziness can be considered a cross-cutting concern in information systems, and its management can be modularized in aspects or other similar design-level constructs to clearly differentiate it (Sicilia & García, 2004). This would eventually result in aspect-enabled object data stores that enable the storage and handling of uncertainty and imprecision at the programming language level (e.g., using the popular AspectJ Java extension⁵), without changing the “crisp” classes. This would result in a cleaner separation of concerns than that achieved by approaches using conventional inheritance (Yazici, George, & Aksoy, 1998).
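The separation described above can be approximated, very roughly, in plain Java. The sketch below is not AspectJ and not the authors' design: MembershipRegistry and Customer are hypothetical names, and the registry simply keeps membership grades in a side table so that the crisp class carries no fuzzy state. An aspect-oriented version could attach such state to the class from the outside (AspectJ offers inter-type declarations for this kind of extension).

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** A crisp domain class that knows nothing about fuzziness. */
class Customer {
    final String name;

    Customer(String name) {
        this.name = name;
    }
}

/** Hypothetical side registry holding the grades of instances in one fuzzy class. */
class MembershipRegistry<T> {
    private final String fuzzyClassName; // e.g. "FrequentCustomer"
    // Identity-based map: grades belong to object identities, not to equal values.
    private final Map<T, Double> grades = new IdentityHashMap<>();

    MembershipRegistry(String fuzzyClassName) {
        this.fuzzyClassName = fuzzyClassName;
    }

    String fuzzyClassName() {
        return fuzzyClassName;
    }

    /** Stores the grade; a zero grade is treated as non-membership and not stored. */
    void setMembership(T instance, double mu) {
        if (mu < 0.0 || mu > 1.0) {
            throw new IllegalArgumentException("mu must lie in [0, 1]");
        }
        if (mu == 0.0) {
            grades.remove(instance);
        } else {
            grades.put(instance, mu);
        }
    }

    /** Returns the stored grade, or 0 for instances that were never registered. */
    double membershipOf(T instance) {
        Double mu = grades.get(instance);
        return mu == null ? 0.0 : mu;
    }
}
```

The design choice illustrated is only the direction of dependency: the fuzzy concern refers to the crisp class, never the other way around, so the crisp class can be persisted and evolved unchanged.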
Conclusions

The introduction of fuzziness in existing OODB models must be carried out by considering existing database design and programming practices, to make the extensions easier to understand and adopt by practitioners not knowledgeable in fuzzy set theory or related mathematical frameworks for uncertainty. This approach is proposed as a way to foster fuzzy technology adoption by the community of orthogonal-persistence developers. Using consistency and self- and domain closeness as general criteria, a restricted subset of the rich array of proposed fuzzy extensions is selected, comprising fuzzy classes and inheritance (respecting intensional definitions), fuzzy associations as specific fuzzy relations, and fuzziness at the attribute level implemented as class responsibilities. A number of issues regarding the physical storage and representation of such fuzzy extensions were described and illustrated through case studies. First, the integration of fuzziness with standard object database access interfaces was illustrated with the JDO API. Second, the importance of representing membership degrees in compact form was illustrated through a case study about the db4o database engine. This association design approach provides improved performance in operations that involve link retrieval by membership value, and adds no significant time overhead in common collection iteration processes. In addition, it was illustrated how cache-based architectures for ODBs, like that of ObjectStore, call for physical grouping techniques that must take into account the fact that computation with membership degrees occurs prior to actual data transfer processes.
References

Aksoy, D., & Yazici, A. (1993). Criteria for evaluating fuzzy object oriented database models. In E. Gelenbe (Ed.), Proceedings of the Eighth International Symposium on Computer and Information Sciences (pp. 136–143).
Aksoy, D., Yazici, A., & George, R. (1996). Extending similarity-based fuzzy object-oriented data model. In K. M. George, J. H. Carroll, D. Oppemheim, & J. Hightower (Eds.), Proceedings of the 1996 ACM Symposium on Applied Computing (pp. 542–546). New York: ACM Press.
Atkinson, C., & Kühne, T. (2000). Strict profiles: Why and how. In A. Evans, S. Kent, & B. Selic (Eds.), “UML” 2000 - The Unified Modeling Language, Third International Conference (Lecture Notes in Computer Science 1939, pp. 309–322). New York: Springer.
Atkinson, M. P., Daynes, L., Jordan, M. J., Printezis, T., & Spence, S. (1996). An orthogonally persistent Java. ACM Sigmod Record, 25(4), 68–75.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic Web. Scientific American, 284(5), 34–43.
Boss, B., & Helmer, S. (1999). Index structures for efficiently accessing fuzzy data including cost models and measurements. Fuzzy Sets and Systems, 108(1), 11–37.
Cao, T. H., & Rossiter, J. M. (2003). A deductive probabilistic and fuzzy OODB language. Fuzzy Sets and Systems, 140(1), 129–150.
Cattell, R., Barry, D., Berler, M., Eastman, J., Jordan, D., Russell, C., et al. (2000). The object data standard: ODMG 3.0. San Francisco, CA: Morgan Kaufmann Publishers.
Chen, G. (1998). Fuzzy logic in data modeling: Semantics, constraints, and database design. Norwell, MA: Kluwer.
Cross, V., & Firat, A. (2000). Fuzzy objects for geographical information systems. Fuzzy Sets and Systems, 113(1), 19–36.
Davis, R., Shrobe, H., & Szolovits, P. (1993). What is a knowledge representation? AI Magazine, 14(1), 17–33.
de Caluwe, R. (Ed.). (1998). Fuzzy and uncertain object-oriented databases: Concepts and models (Advances in Fuzzy Systems, Applications and Theory, Vol. 13). River Edge, NJ: World Scientific.
de Tré, G., & De Caluwe, R. (2003). Level-2 fuzzy sets and their usefulness in object-oriented database modeling. Fuzzy Sets and Systems, 140(1), 29–49.
Dubois, D., Prade, H., & Rossazza, J. P. (1991). Vagueness, typicality and uncertainty in class hierarchies. International Journal of Intelligent Systems, 6, 167–183.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design patterns: Elements of reusable object oriented design. Boston, MA: Addison Wesley.
Green, T. R. G. (2000). Instructions and descriptions: Some cognitive aspects of programming and similar activities. In V. Di Gesù, S. Levialdi, & L. Tarantino (Eds.), Proceedings of Working Conference on Advanced Visual Interfaces (pp. 21–28). New York: ACM Press.
Green, T. R. G., & Petre, M. (1996). Usability analysis of visual programming environments: A “cognitive dimensions” framework. Journal of Visual Languages and Computing, 7(2), 131–174.
Gutiérrez, J. A., Sicilia, M. A., & García, E. (2002). Integrating fuzzy associations and similarity relations in object oriented database systems. In Proceedings of the International Conference on Fuzzy Sets Theory and Its Applications (pp. 66–67).
Hansen, D., Adams, D., & Gracio, D. (1999). In the trenches with ObjectStore. Theory and Practice of Object Systems, 5(1), 201–207.
Hosking, A. (1995). Benchmarking persistent programming languages: Quantifying the language/database interface. In Proceedings of the OOPSLA’95 Workshop on Object Database Behavior, Benchmarks, and Performance.
Inoue, Y., Yamamoto, S., & Yasunobu, S. (1991). Fuzzy set object: Fuzzy set as first-class object. In Proceedings of IFSA 1991 (pp. 70–73).
Kao, D., & Archer, N. P. (1997). Abstraction in conceptual model design. International Journal of Human–Computer Studies, 46(1), 125–150.
Karvounarakis, G., Magkanaraki, A., Alexaki, S., Christophides, V., Plexousakis, D., Scholl, M., et al. (2003). Querying the semantic Web with RQL. Computer Networks, 42(5), 617–640.
Kim, W. (2003). A retrospection on niche database technologies. Journal of Object Technology, 2(2), 35–42.
Klir, G., & Wierman, M. (1998). Uncertainty-based information: Elements of generalized information theory (Studies in Fuzziness and Soft Computing, Vol. 15). New York: Springer-Verlag.
Koprulu, M., Cicekli, N. K., & Yazici, A. (2003). Spatio-temporal querying in video databases. Information Sciences (to appear).
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (2003). Extending object-oriented databases for fuzzy information modeling. Information Systems (in press).
Marín, N., Medina, J. M., Pons, O., Sánchez, D., & Vila, M. A. (2003). Complex object comparison in a fuzzy context. Information and Software Technology, 45(7), 431–444.
May, W., & Lausen, G. (2004). A uniform framework for integration of information from the Web. Information Systems, 29(1), 59–91.
McCarthy, J. L. (1981). Epistemological problems of artificial intelligence. In B. L. Webber, & N. J. Nilsson (Eds.), Readings in artificial intelligence (pp. 459–465). Los Altos, CA: Kaufmann.
Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED. A generalized model of fuzzy relational databases. Information Sciences, 76(1–2), 87–109.
Nepal, A., Ramakrishna, M. V., & Thom, J. A. (1999). A fuzzy object query language (FOQL) for image databases. In A. L. P. Chen, & F. H. Lochovsky (Eds.), Proceedings of the Sixth International Conference on Database Systems for Advanced Applications (pp. 117–127). Piscataway, NJ: IEEE Press.
Object Management Group: OMG Unified Modeling Language Specification, Version 1.3 (1999).
Rashid, A., & Sawyer, P. (2001). Aspect-orientation and database systems: An effective customisation approach. IEE Proceedings - Software, 148(5), 156–164.
Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., & Lorenson, W. (1996). Object oriented modeling and design. Upper Saddle River, NJ: Prentice Hall.
Russell, C. et al. (2001). Java Data Objects (JDO) Version 1.0 proposed final draft, Java Specification Request JSR000012.
Schenker, A., Last, M., & Kandel, A. (2001). Fuzzification of an object-oriented database system. International Journal of Fuzzy Systems, 3(2), 432–441.
Sicilia, M. A. (2003). The role of vague categories in semantic and adaptive Web interfaces. In R. Meersman, & Z. Tari (Eds.), Proceedings of the Workshop on Human Computer Interface for Semantic Web and Web Applications (Lecture Notes in Computer Science 2519, pp. 210–222). New York: Springer Verlag.
Sicilia, M. A., & García, E. (2004). On imperfection in information as an “early” crosscutting concern and its mapping to aspect-oriented design. In Proceedings of the Early Aspects Workshop: Aspect-Oriented Requirements Engineering and Architecture Design (to appear).
Sicilia, M. A., García, E., & Gutiérrez, J. A. (2002). Integrating fuzziness in object oriented modelling languages: Towards a fuzzy-UML. In Proceedings of the International Conference on Fuzzy Sets Theory and its Applications (pp. 66–67).
Sicilia, M. A., García, E., Aedo, I., & Díaz, P. (2003). Representation of concept specialization distance through resemblance relations. In J. M. Benitez, O. Cordon, F. Hoffmann, & R. Roy (Eds.), Advances in Soft Computing - Engineering, Design and Manufacturing (Springer Engineering series, pp. 173–182). New York: Springer Verlag.
Sicilia, M. A., García, E., Díaz, P., & Aedo, I. (2002). Extending relational data access programming libraries for fuzziness: The fJDBC framework. In T. Andreasen, A. Motro, H. Christiansen, & H. L. Larsen (Eds.), Proceedings of the Flexible Query Answering Systems International Conference (Lecture Notes in Artificial Intelligence 2522, pp. 314–328). New York: Springer.
Sicilia, M. A., García, E., Díaz, P., & Aedo, I. (2002b). Fuzziness in adaptive hypermedia models. In J. Keller, & O. Nasraoui (Eds.), Proceedings of the North American Fuzzy Information Processing Society Conference (pp. 268–273). Piscataway, NJ: IEEE Press.
Sicilia, M. A., Gutiérrez, J. A., & García, E. (2002). Designing fuzzy relations in orthogonal persistence object-oriented database engines. In F. J. Garijo, J. C. Riquelme, & M. Toro (Eds.), Advances in artificial intelligence (Lecture Notes in Computer Science 2527, pp. 243–253). New York: Springer.
Smets, P. (1997). Imperfect information: Imprecision-uncertainty. In A. Motro, & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions (pp. 225–254). Norwell, MA: Kluwer Academic Publishers.
Straccia, U. (2001). Reasoning within fuzzy description logics. International Journal of Artificial Intelligence Research, 14, 137–166.
Tarr, C. (1995). Identity indirection design pattern. In Proceedings of the OOPSLA ’95 workshop on design patterns for concurrent, parallel, and distributed object-oriented systems.
Visnick, L. (2003). Clustering techniques in ObjectStore. Technical white paper. Retrieved September 2003 from the World Wide Web: http://www.objectstore.net
Yazici, A., & Cibiceli, D. (1999). An access structure for similarity-based fuzzy databases. Information Sciences, 115(1–4), 137–163.
Yazici, A., & Koyuncu, M. (1997). Fuzzy object-oriented database modeling coupled with fuzzy logic. Fuzzy Sets and Systems, 89(1), 1–26.
Yazici, A., George, R., & Aksoy, D. (1998). Design and implementation issues in the fuzzy object-oriented data model. Information Sciences, 108(1–4), 241–260.
Zadeh, L. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2), 111–127.
Endnotes

1 http://java.sun.com/products/jdo/
2 http://www.db4o.com/
3 http://sodaquery.sourceforge.net/
4 http://www.objectstore.net/
5 http://eclipse.org/aspectj/
SECTION IV
Chapter IX
An Object-Oriented Approach to Managing Fuzziness in Spatially Explicit Ecological Models Coupled to a Geographic Database

Vincent B. Robinson, University of Toronto at Mississauga, Canada
Phil A. Graniero, University of Windsor, Canada
Abstract

This chapter uses a spatially explicit, individual-based ecological modeling problem to illustrate an approach to managing fuzziness in spatial databases that accommodates the use of nonfuzzy as well as fuzzy representations of geographic databases. The approach taken here uses the Extensible Component Objects for Constructing Observable Simulation Models (ECO-COSM) system loosely coupled with geographic information systems. ECO-COSM Probe objects flexibly express the contents of a spatial database within the context of an individualized fuzzy schema. It affords the ability to transform traditional nonfuzzy spatial data into fuzzy sets that capture the uncertainty inherent in the data and model’s semantic structure. The ecological modeling problem was used to illustrate how combining Probes and ProbeWrappers with Agent objects affords a flexible means of handling semantic variation and is an effective approach to utilizing heterogeneous sources of spatial data.
Introduction

Progress in global connectivity has led to a situation where we now need to deal with more heterogeneous information, consisting of a broad variety of digital spatial/geographical data, and to address operational sources, such as simulation models, which create new data and information. The scale of the problem has changed from just a few databases to thousands, perhaps millions, of geographical information resources. Such new resources are most often added independently to the accessible set of resources without regard to the myriad end-uses that may be applied to them (Mackay, 1999). Thus, spatially explicit information resources may be used in many different contexts without regard for the underlying uncertainties of the data, or their relationships to the semantics of the problem domain (Robinson & Frank, 1985; Burrough & Frank, 1996). Although such uncertainties in geographic databases have been recognized for decades, it would be extraordinary to have institutional databases contain anything as detailed as fuzzy membership values or other detailed measures of uncertainty attached to objects or tuples. Geographic databases with no explicitly recorded uncertainty measures are commonly used as the basis for computationally intensive investigations of complex ecological systems. One major approach that developed over the past few decades is individual-based modeling (IBM) (Grimm, 1999; Lomnicki, 1999; Bian, 2003). It is a computational approach to modeling a system through the interaction of atomic models of each individual inhabiting the system. Individual-based models provide several advances over traditional ecosystem models. Foremost among the advances is the fact that they discard the assumption that there is some average, or mean, individual that adequately represents every individual in a population. They also dispose of the assumption that significant interactions take place evenly across populations.
Such models are usually spatially explicit, allowing interaction between individuals to occur over a wide range of space. Importantly, they are able to represent the biological, physiological, and behavioral distinctions seen in individuals in the real world. Because the individual is the atomic unit, the simulation is able to take spatially explicit localized interactions into account. Thus, a model of higher-order entities (e.g., populations) emerges from the dynamics of individual interactions in much the same manner as the higher-order phenomena observed in the real world (Anderson, 2002).

One such problem domain concerns the simulation of dispersal behavior of animals across a landscape. Previous research suggested that errors in dispersal parameters, such as misclassification of habitat suitability or incorrect estimation of how far a disperser can travel, can have larger consequences for predicting dispersal success than do errors in landscape classification (Ruckelshaus et al., 1997). There are crucial parameters in models of movement, such as perceptual range, that cannot be precisely specified from field and experimental work (Mech & Zollner, 2002). However, classification errors can still have significant consequences. Ruckelshaus et al. (1997) showed that uncertainty in the model parameters and in the underlying data stored in a database is a significant problem to be addressed by ecological modeling efforts. This led to a detailed suggestion that these problems be investigated by integrating fuzzy information processing, computational simulation modeling, and spatial database issues with intelligent systems research while maintaining a direct interplay with real-world ecological research (Robinson, 2002).

The approach taken here uses the Extensible Component Objects for Constructing Observable Simulation Models (ECO-COSM) system loosely coupled with GRASS, an open-source GIS (Neteler & Mitasova, 2002), and ArcGIS© (McCoy & Johnston, 2001). ECO-COSM is a simulation modeling framework used to build spatially explicit ecological models (Graniero, 2001). Its component-based structure allows a model design to evolve by replacing or adding individual model components that change the overall behavior.
The simulation framework provides a library of modular software objects that manage the structure of space and time within a simulation model. It includes mechanisms to handle concurrent activity among objects within the simulation. Objects that have embedded assumptions about the spatial or temporal structure of the simulated world are packaged into replaceable modules. In this illustrative example, the goal is to simulate the detailed dispersal movements of a population of squirrels in a spatially explicit manner, using behavioral modules that fuzzify the spatial database in contrast to modules that do not.

This effort can be related to several themes in the fuzzy object-oriented database literature. Like several others, we emphasize the importance of incorporating some form of intelligence in the system (Bordogna & Chiesa, 2003; Koyuncu & Yazici, 2003; Petry et al., 2003). As noted by Cross and Firat (2000), one recognized stream of research in fuzzy databases focused on developing front-end fuzzy querying capabilities on top of conventional database systems. Sometimes those databases are object-oriented (Koyuncu & Yazici, 2003) and sometimes they are conventional (Petry et al., 2003).
A query-directed approach was taken when examining the incorporation of fuzziness in a system for managing spatially explicit ecohydrologic simulations, namely the Knowledge-Based Land Information Manager and Simulation (KBLIMS) system (Robinson, 2000). Like KBLIMS, we focus on a kind of ecological simulation. However, we have taken a different, albeit object-oriented, approach to handling fuzziness. Our use of agents with probes allows us to address issues of fuzziness for individual-based modeling that could not be adequately addressed by a system such as KBLIMS. The object database described by Robinson (2000) had to be constructed, originally, from non-object-based GIS database information. We address this practical issue by concentrating our object-oriented techniques within the ECO-COSM framework, thus allowing straightforward access to heterogeneous spatial data sources that are required for such simulations. By taking this loose-coupling approach, we differ from those such as Koyuncu and Yazici (2003) who take a tightly coupled approach to incorporating intelligence in a fuzzy object-oriented architecture. Like Mackay (1999), we recognize a distinction between the ontology upon which individual information sources are constructed and the ontology of an end-user of the information sources. In our case, the end-user is an individual agent that views its surrounding world in order to make a movement decision, and our system must manage the queries and actions for populations of agents so that the results of a simulation may be represented in a GIS database (e.g., for visualization). Fuzzy database models have been defined for dealing with imperfect information, either in the database (Robinson, 1988; Petry, 1996), in the queries (Koyuncu & Yazici, 2003), or in both data and queries (Bordogna & Chiesa, 2003; Morris, 2003). In a sense, we integrate all three approaches in the work presented here. Agents use Probe objects to query a database.
At this stage, the database is assumed to be a conventional, nonfuzzy GIS database. However, as we note later in the chapter, this approach is easily extensible to accommodate coupling with an object-oriented database, fuzzy or crisp. Therefore, imperfect information is dealt with at the query end by the Probe objects that in effect allow each Agent an object-oriented database of its own upon which the Agent poses queries to gather fuzzy information to support a decision to either move to a new location or remain in place. The use of Probes allows us to incorporate knowledge about not only the data and its fuzziness, but also about the problem domain that is a function of the Agent’s role within the simulation model. Thus, the combination of Probe and Agent allows semantics to be modeled within each Agent. In the framework presented below, each Agent class has an ontology of its own in which the semantics of its problem domain are defined. However, the framework we laid out is flexible enough to be able to incorporate additional object-oriented representation schemes. We show how the use of ECO-COSM’s Probe objects affords the ability to express the contents of a spatial database within the context of a particular,
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Managing Fuzziness in Spatially Explicit Ecological Models 273
individualized fuzzy schema. A traditional crisp spatial database can be easily transformed into fuzzy sets that capture the following:
1. Uncertainty inherent within the database’s contents
2. Uncertainty inherent within the model’s semantic structure
3. Ambiguity or vagueness in the meaning of the database’s contents that is generated by the different semantic requirements of different agents and the natural variability among individuals within an ecological population
The next section outlines the key concepts that link individual-based ecological models, agent-based modeling, object-oriented design, and GIS databases, and presents the primary challenges of representing fuzziness in such a complex application domain. Then we present a conceptual overview of the squirrel dispersal model we use as an illustrative example throughout this chapter. The architecture of the modeling framework that was used to implement the model is then described, and some of its key features that provide a solution to the challenges of this problem domain are explained. The section on fuzzy spatial relations and database query illustrates how context-specific fuzzy spatial relations can be created ad hoc to constrain database queries. Then we present an innovative way to add fuzzy information to a conventional, nonfuzzy GIS database not only within a model’s context, but also within the variable context of individual model objects. The next section demonstrates the utility of deriving fuzzy information from a nonfuzzy GIS database at the individual level by presenting differences in modeled squirrel dispersal according to individual variation in perception of the environment and variation in the decision-making process. We conclude the chapter with discussion of the strengths, limitations, and future possibilities of this approach.
Objects, Agents, Geographic Databases, and Ecological Models
Spatially explicit ecological models are used to study plausible connections between landscape patterns and species viability (Ruckelshaus et al., 1997). In an information-based approach to modeling the movement of animals, such models may link behavioral ecology with landscape-level ecological processes (Lima & Zollner, 1996). A computing environment that supports development of spatially explicit individual-based modeling should support, among other requirements, the following: mobility, evaluating and interacting with other individuals, and acquiring and maintaining knowledge about the surrounding landscape
(Westervelt, 2002). Because the behavior of an IBM emerges from individual behaviors, a more comprehensive, flexible, and accurate model is obtained by modeling the intelligence inherent in individual inhabitants via implementation as computational agents (Anderson, 2002; Bian, 2000; Rickel et al., 1998; Westervelt, 2002; Westervelt & Hopkins, 1999). However, the geographic databases in support of IBMs are usually not object-oriented databases but are repositories of data that are queried for information by objects in an object-oriented model environment. Once the data are served to the querying object, they are incorporated within the object-oriented environment of the model (Westervelt & Hopkins, 1999; Robinson, 2002). This hybrid object-oriented approach has allowed the combination of GIS and agent-based models in a variety of environmental and social contexts not limited to the modeling of animal movements (Gimblett, 2002; Westervelt, 2002; Harper et al., 2002; Petry et al., 2002; Leclercq et al., 1999; Graniero & Robinson, 2003). An agent is a program that perceives its environment and acts upon it (Anderson, 2002; Russell & Norvig, 1995). In this modeling domain, it is information drawn from a GIS database that will supply an agent with its “perception” of its environment. The concept of this relationship is illustrated in Figure 1. The implication of this relationship is that much of the general research related to geographic, or spatial, databases and geographic information systems (GISs) may provide relevant support for advancing the development of IBMs.
Figure 1. Conceptual illustration of major components of a spatially explicit ecological model that focuses on movement behavior of individual animals, e.g., natal dispersal (Note the loosely coupled relationship with the geographic information system.)
Agent-based approaches have begun to be applied to a number of problems using GIS databases. Various uses of agent-based applications at the systems level were reviewed by Li et al. (2001). These authors concluded with a suggestion of a GeoAgent; that is, a mobile Agent that can enhance its abilities by using Wrapper Agents in an assemble-on-demand fashion, and use geospatial knowledge to deal with geospatial problems. Wrapper Agents are agents in their own right, but they are also designed to provide additional layers of processing or decision-making support to “client” Agents. The GeoAgent is particularly well-suited to geospatial problems dealing with a WebGIS (Li et al., 2001). The distributed environment afforded by the concept of the WebGIS has led to other applications of agent-based techniques applied to GIS. To support GIS interoperability, a semantic mediation approach was presented that utilizes the object-oriented nature of agents and agent wrappers to resolve semantic differences among systems across a Web-based environment (Leclercq et al., 1999). Further research in the management of uncertainty in distributed spatial information systems demonstrated the potential utility of fuzzy sets in addressing issues of semantic heterogeneity. The suggested approach incorporates an object-oriented data model that supports the intelligent conflation of uncertain geographic features in response to a spatial query (Cobb et al., 2000). Exploiting advances in the representation and processing of fuzzy spatial relations, this approach was extended to develop a system that retrieves, filters, integrates/conflates, and validates geospatial data from multiple sources using intelligent agents.
It was argued that the use of intelligent agent technology in this context offers advantages over the standard client-server architecture (Petry et al., 2002), which is consistent with experiences in developing spatially explicit ecological models that depend on spatially explicit databases. Like other efforts (Anderson, 2002; Rickel et al., 1998; Westervelt & Hopkins, 1999; Harper et al., 2002), we approach the problem of building individual-based, spatially explicit simulation models from an object-oriented perspective utilizing spatial databases and mobile agents. It was suggested that the choice of this problem domain provides the ability to investigate issues that integrate computational simulation modeling and spatial database issues with intelligent systems research while maintaining a direct interplay with real-world ecological research (Robinson, 2002). Figure 1 shows the major components of a spatially explicit ecological model and the relationship between each of them. Of critical importance in all the models is some representation of the landscape. Such information is typically stored in a spatial database that “feeds” a simulation model. Here the landscape is treated as a spatial database from which the animal objects will receive information about their surroundings. A particular challenge in individual-based ecological modeling is that the ways in which landscape data are collected and stored in the spatial database are often different from the ways
in which the modeled agents should “perceive” the same landscape if they are to remain operationally consistent with the modeled domain. In this approach to modeling individual animals dispersing across a landscape, animal objects pose spatial queries to the landscape to acquire information. Like their counterparts in the real world, they are able to acquire information about the landscape only within a certain distance determined by the animals’ perceptual range (Mech & Zollner, 2002; Zollner, 2000) or finite range of vision (Fahse et al., 1998). That information is then processed to determine the specifics of which movement behavior to pursue. Ruckelshaus et al. (1997) suggested that errors in dispersal parameters have much larger consequences for predicting dispersal success than do errors in landscape classification. Their conclusions suggest that uncertainty surrounding dispersal parameters is a significant problem that ecological models and modelers must face. The role of fuzzy sets in the representation of objects in geographic databases for a variety of applications has received considerable attention. However, the usual approach is to address the representation of uncertainty directly, in some fashion, with the objects stored in a database (Cross & Firat, 2000; Yazici & Akkaya, 2000) or as part of the query subsystem (Yazici & Akkaya, 2000; Morris, 2003). Although appropriate in many applications, such approaches have limitations when using geographic databases in the context of information-based simulation modeling of complex environmental and ecological processes. The simulation models have their own semantics that may be distinct from or unknown to the database author, the user, or other models (or submodels). This is especially relevant when trying to reconcile the semantics of the original observations with the semantics of a simulation modeling domain. 
In addition, most complex environmental modeling domains contain many models and submodels that interact with one another, consequently generating semantic errors (see Mackay & Robinson, 2000; Mackay, 1999). Furthermore, Robinson (2000) showed that in an object-oriented database with a visual query system, environmental simulation models may be embedded in the query or in the query results. In this case, the user may have one set of semantics in mind that may, or may not, be consistent with the semantics of the simulation models being used to generate the answer to the query. In fact, there may be no reconciliation process. That led to research into methods for modeling semantic agreement and model self-evaluation (Mackay & Robinson, 2000; Mackay, 1999) and would seem to justify embedding more intelligence into such systems. Therefore, we use the concept of Probes in an object-oriented, agent-based system as a practical means of addressing issues of fuzziness in spatially explicit data, while at the same time maintaining the integrity of large, complex simulation projects. From a modeling perspective, this approach can substantially reduce artifacts caused by parameter uncertainty (Robinson & Graniero, in press).
Overview of the Dispersal Model
Now, we briefly present an overview of the natal dispersal model described in more detail elsewhere (Robinson & Graniero, in press). In this model, the dispersal movement process of each animal object consists of two major decisions: movement and residence. If the object is to move from its current location, then it must decide on a destination location. Once at the new location, it will need to assess its surroundings to gather information that is used to make a residence decision. In other words, has the animal object found a suitable location, or will it need to continue the dispersal movement? In the following sections we present a simple fuzzy decision-making process for each decision. The decision model used is one in which relevant goals and constraints are expressed in terms of fuzzy sets, and a decision is determined by an appropriate aggregation of the fuzzy sets (Bellman & Zadeh, 1970). Fundamental to either the movement or residence decision is information about the surrounding landscape and conspecifics (other animals of the same species already residing in nearby locations). This is usually confined to a perceptual range (Mech & Zollner, 2002) or finite range of vision (Fahse et al., 1998). Because an animal’s perceptual range represents its informational window onto the larger landscape, it determines how much of the area surrounding the individual it can perceive. In the spatially explicit simulation model outlined in Figure 1, this is tantamount to the perceptual range being a spatial constraint on a query to the GIS database. The basic decision model used here is one in which relevant goals ($G_M$) and constraints ($C_M$) are expressed in terms of fuzzy sets, and a decision is determined by an appropriate aggregation of the fuzzy sets (Bellman & Zadeh, 1970; Klir & Yuan, 1995). More detailed discussion of the goals and constraints is presented in Robinson and Graniero (in press).
In the movement decision model, the constraints consist of two major sets of locations. One set includes those locations that are within the visible perceptual range (ψ). The other constraint relates to distance from conspecifics. Some species are attracted to concentrations of conspecifics and others are not; locations under consideration must satisfy the individual’s tolerance of nearby conspecifics. The goal of an individual is to find a location as near the edge of the perceptual range as possible that is considered to be acceptable habitat and fits the set of constraints. Thus, the goal set ($G_M$) is a function of the spatial arrangement of habitat and what we call dispersal imperative, the details of which are presented in Robinson and Graniero (in press). On the first move, the degree to which each location within the perceptual range falls in the decision set ($D_M$) is defined by $D_M = C_M \cap G_M$. Movement is to the location with the highest value for $D_M$, i.e., $\kappa = \{x \in X \mid \mu_{D_M}(x) = \max D_M\}$. However,
given the nature of the problem, it is possible that more than one location will have the same maximum value. Should there be ties, the first one in the list is chosen (i.e., a “lazy” sufficing strategy). On moves beyond the first, there is the question of directional bias. Based on previous work reported in the ecological literature, a bias to move in the general direction of the last move is incorporated in the decision set. In that case, should there be ties, a random location among the candidate set ($D_M$) is chosen (i.e., an “exploratory” sufficing strategy). Once the animal object has moved to a location, it must then decide whether it is a location suitable for stopping its dispersal movement. Like the movement decision model, this is one in which relevant goals ($G_R$) and constraints ($C_R$) are expressed in terms of fuzzy sets, and a decision is determined by an appropriate aggregation of the fuzzy sets (Bellman & Zadeh, 1970; Klir & Yuan, 1995). In the residence decision model, the animal is constrained by whether or not its current location is sufficiently spatially separated from conspecifics that a home range can be established, while the goal is to have habitat of sufficient area. Finally, a decision rule is applied to the decision set that leads to the animal taking up residence at the location or attempting a move to another location. The details of this decision model are presented in Robinson and Graniero (in press). Because this work is focused on modeling natal dispersal, we use the residence decision primarily as a stopping rule. Future elaborations will incorporate exploratory movement so that the agent explores the vicinity around its destination and uses that information in a more sophisticated decision process than presented here, to choose whether to establish a home range or not. However, at present, we simplified the decision to address just a few key criteria that were suggested by the literature (Allen, 1987; Wolff, 1999).
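The movement decision described in this section can be sketched in a few lines of Python. This is an illustration only, not the ECO-COSM implementation: it assumes the standard minimum operator for the fuzzy intersection (the chapter states only that an appropriate aggregation is used), and the function names, dictionaries, and membership values are hypothetical. The directional bias applied on later moves is not modeled here.

```python
import random

def fuzzy_decision(constraint, goal):
    """Intersect two fuzzy sets (dicts of location -> membership) using min.

    Assumption: min is used as the intersection operator.
    """
    return {x: min(constraint[x], goal[x]) for x in constraint if x in goal}

def choose_location(decision, strategy="lazy", rng=None):
    """Return a location with maximal membership in the decision set.

    strategy="lazy" takes the first tied location in iteration order;
    strategy="exploratory" picks uniformly at random among the ties.
    """
    best = max(decision.values())
    ties = [x for x, mu in decision.items() if mu == best]
    if strategy == "lazy":
        return ties[0]
    rng = rng or random.Random()
    return rng.choice(ties)

# Hypothetical memberships for four candidate locations (row, col).
C_M = {(0, 0): 1.0, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.3}
G_M = {(0, 0): 0.2, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 1.0}

D_M = fuzzy_decision(C_M, G_M)
first_move = choose_location(D_M, "lazy")
later_move = choose_location(D_M, "exploratory", random.Random(0))
```

With these values two locations tie at the maximum membership, so the lazy strategy returns the first of them in iteration order, while the exploratory strategy may return either one.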
Architecture of the ECO-COSM System
The computational environment presented in this section meets the two requirements that allow functioning intelligent agents in a simulation model. One requirement is that a model of the agent’s behavior be constructed with a facility for implementing the agent’s decision-making abilities. The second requirement is that the simulated world functions as both an environment unto itself and a virtual reality to the agents inhabiting it (Anderson, 2002). Our approach uses the ECO-COSM system (Graniero, 2001) loosely coupled with the Geographic Resources Analysis Support System (GRASS), an open-source GIS (Neteler & Mitasova, 2002), and ArcGIS© (McCoy & Johnston, 2001). ECO-COSM is a simulation-modeling framework used to build spatially explicit ecological models.
It has a component-based structure that permits a model design to evolve by replacing and adding individual model components. The changed, or additional, components, in turn, change the overall system behavior. The simulation framework provides a library of modular software objects that manage the structure of space and time within a simulation model and mediate the behavior of model components within that structure. It includes mechanisms to handle concurrent activity among objects within the simulation. Objects that have embedded assumptions about the spatial or temporal structure of the simulated world are thereby packaged into replaceable modules. For the spatially explicit model builder, this feature provides superior control over simulation behavior. The framework of each simulation program consists of a Simulation object that contains the three interacting Scheduling, Modeling, and Instrumentation subsystems (Figure 2). The Simulation object is used to describe the overall structure and relationships among the components comprising the model. It also looks after the mechanics of receiving external parameters, executing the simulation, and managing the overhead required to acquire and release computing resources needed to run the program.
Scheduling Subsystem
Central to the operation of the system is the Scheduling subsystem. The Clock and Schedule objects are the primary component objects of the Scheduling subsystem. Each program is constrained to include only one instance of each.
Figure 2. Main subsystems that compose a Simulation object (Note that in the World object the Agents cannot know about Layers except through a Probe in the Instrumentation interface.)
Any object in the simulation program may access the Clock’s time or add actions to the Schedule. The Schedule object keeps track of all pending actions. It decides which action should occur next and triggers that event. Currently, scheduling is an event-driven structure, but discrete time step models may be constructed by adding regularly occurring “step” actions that reschedule themselves every time step.
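The following Python sketch illustrates how an event-driven schedule can host a discrete time step model through a self-rescheduling “step” action. Only the Clock and Schedule names come from the text; the method names, the priority-queue implementation, and the example actions are assumptions made for illustration and are not the ECO-COSM API.

```python
import heapq
import itertools

class Clock:
    """Holds the current simulation time."""
    def __init__(self):
        self.time = 0.0

class Schedule:
    """Event queue ordered by time; ties keep insertion order."""
    def __init__(self, clock):
        self.clock = clock
        self._queue = []
        self._counter = itertools.count()  # tie-breaker for equal times

    def add(self, time, action):
        heapq.heappush(self._queue, (time, next(self._counter), action))

    def finished(self):
        return not self._queue

    def trigger_next(self):
        # Pop the earliest pending action, advance the clock, run the action.
        time, _, action = heapq.heappop(self._queue)
        self.clock.time = time
        action(self)

def make_step_action(interval, end_time, log):
    """Build a 'step' action that reschedules itself every `interval`."""
    def step(schedule):
        log.append(schedule.clock.time)
        next_time = schedule.clock.time + interval
        if next_time <= end_time:
            schedule.add(next_time, step)
    return step

clock = Clock()
schedule = Schedule(clock)
ticks = []
schedule.add(0.0, make_step_action(1.0, 3.0, ticks))   # regular time steps
schedule.add(1.5, lambda s: ticks.append("event"))      # one irregular event
while not schedule.finished():
    schedule.trigger_next()
```

The irregular event at time 1.5 is interleaved with the regular steps, which shows that discrete time stepping is just a special case of the event-driven structure.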
Modeling Subsystem
The Modeling subsystem provides the main components for constructing a simulated world. The spatial and temporal structure of the world is defined by the specific choice of object modules. The primary high-level object is the World,
Figure 3. World, Layer, and Grid object classes, components, and relationships (Note that BoundaryTopology is an abstract class that defines how Locations outside the physical boundary of a Grid behave topologically by throwing an exception or logically remapping the Location into the physical extent of the Grid.)
(Adapted from Graniero, 2001)
which organizes the model components into collections of “landscape” objects and “individual agent” objects. The “landscape” collection is made up of Layers representing various attributes of the study area’s extent. Layers are typically represented using a Grid, though other spatial representations are possible. Figure 3 illustrates the relationship between the World, Layer, and Grid objects along with many of the methods attached to each object. Of particular relevance to this work are methods such as getProbe() attached to the Layer objects and getValueAt() attached to the Grid object. Grid is a specialized subclass of Layer, and a World object is composed of one or more Layer objects. Although Grids can be generated and their grid cell values populated entirely within the simulation, Grids can also reference an external, abstracted GridSource to set the grid geometry and populate the grid cell values. For example, an EsriAsciiGridSource would import data layers exported from the ArcGIS© GRID module (McCoy & Johnston, 2001), or other GridSource specializations might directly read and write native GIS formats. Each Layer can have a StepRule that, when triggered by the Schedule, can calculate a new state for each grid cell based on the current state of the cell and its neighbors, as well as the state in other Layers at the corresponding location. This allows the landscape to evolve following ecological processes operating in the simulated ecosystem. The “individual agent” collection is organized into one or more Populations, each of which contains zero or more Agents. A Population is used to group Agents that share common traits, with a separate Population for each type of Agent. Populations can also be used to organize Agents that are of similar type, but in different fundamental states. In addition, population-level monitoring is useful for controlling the simulation Schedule. 
For example, it may be used to add a TerminateAction when all Agents are in “dead” or “home” Populations, and there are no Agents left in the “active” Population. An Agent is a model component that operates autonomously, located on the landscape and obtaining information about other agents or the local landscape in order to make decisions about changes in its own state, movement on the landscape, or changes to the local state of one or more landscape Layers. Access to information about other model components is controlled by Probe objects described below. All Agent specializations share a similar data-access and processing structure but differ in the specific details of their information-processing and decision-making algorithms. Such differences are what can evoke important differences in behavior across Agent types. Each individual instance of a particular type of Agent shares the same decision-making algorithm. Variations in individual responses are easily achieved by using different values for fundamental parameters or by using different information-gathering “filters” that modify the individual’s perceptions of their surroundings.
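The Layer StepRule mechanism described earlier in this subsection can be illustrated with a small Python sketch. It is a stand-in rather than ECO-COSM code: the function names, the eight-cell neighborhood, the choice to ignore neighbors outside the grid (one possible boundary behavior), and the example rule are all hypothetical.

```python
def apply_step_rule(grid, rule):
    """Return a new grid in which each cell is rule(cell, neighbours).

    New states are computed from a snapshot of the current grid, so
    updates made within one step do not influence one another.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
            new_grid[r][c] = rule(grid[r][c], neighbours)
    return new_grid

def spread_rule(cell, neighbours):
    """Hypothetical rule: a bare cell (0) becomes forest (1) when at
    least two of its neighbours are forest; forest cells persist."""
    if cell == 0 and sum(neighbours) >= 2:
        return 1
    return cell

landscape = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
next_landscape = apply_step_rule(landscape, spread_rule)
```

A rule that also consults other Layers at the corresponding location could be written the same way by passing those grids to the rule function.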
Instrumentation Subsystem
The Instrumentation subsystem provides the information-access structures that allow model components to discover the state of other components in a controlled and safe fashion, ensuring the consistency and integrity of the source databases and the model’s overall operating state. The ability to collect data from the running model is made possible by the Probe/Probeable interface mechanism. Many of the objects in the Modeling subsystem implement the Probeable interface as well as fulfill their own modeling functions. Probes can only be created by Probeable objects; a request is made to the target Probeable object via its getProbe() method, specifying the desired type of Probe using a keyword. Each type of Probe is designed to query a specific aspect of the Probeable object’s state. Whenever the Probe’s probe() method is invoked (e.g., by a ProbeCommand on the Schedule, or by an Agent requiring current information about another object), the Probeable’s appropriate private data access method or database query is invoked. As an example, in order to access the data within a Grid (which is a Probeable object), the client object must call the Grid’s getProbe() method, and the Grid will return an appropriate Probe object. When that Probe’s probe() method is invoked, it will invoke its target Grid’s getValueAt() method using the Probe’s current Location as a parameter. The Grid returns its state at that Location to the Probe, which in turn passes the result to the object using the Probe. Using this structure, a Probeable object only exposes attributes that are deemed “public knowledge” to external objects. In order to keep other attributes inaccessible, it does not distribute Probe objects that expose those attributes.
At the same time, the Probeable object keeps the access mechanism for those attributes hidden from “public knowledge.” All Probes simply respond to a probe() method, and what happens within that method is kept opaque to the user. This allows database sources, implementations, or architectures to change with no effect on the other model components. All Probes create read-only mechanisms; external objects never have direct access to the Probeable object’s state, which means that they cannot accidentally change the object due to programming errors.
Figure 4. Structure of the Probe, Probeable, and ProbeWrapper relationship
(Adapted from Graniero, 2001)
ProbeWrappers extend the power of the Probe mechanism. A ProbeWrapper is a specialized Probe that has another Probe embedded within it (Figure 4). A ProbeWrapper is used to modify the “pure” result retrieved from a Probeable object in some way (Figure 5). For example, the land-cover type observed at a distance may be subject to random misclassification due to limits of perceptual range. Alternatively, the state’s description scheme may be modified to suit the purpose of the observer: the grid cell may be described as “mature oak” in the land-cover Layer, but the observing Agent may perceive it as “suitable location for inhabiting.” Because ProbeWrappers are also Probes, an object (such as an Agent) can use either “pure” Probes or Probes that are modified by ProbeWrappers transparently, with no knowledge of the difference. By wrapping Probes in slightly different ways for different individual Agents of a common type, it is possible for the modeler to introduce variation in an individual’s ability to perceive the world, while using the same basic decision-making process. ProbeWrappers may be nested as deeply as desired, so highly sophisticated perceptual “filters” may be constructed. In addition, some specialized ProbeWrapper objects can take the results of many nested Probes and combine their results in some fashion, for example, returning the land-cover class that appears in the majority of grid cells in a 5×5 window centered on the Probe’s Location. In this way, it is possible to create views of the modeled world and its components at different scales of observation, yet treat them all in the decision-making process as identical, localized observations.
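A minimal Python sketch of the Probe, Probeable, and ProbeWrapper relationship follows. The class roles and the getProbe()/probe() calls follow the description in the text (written here in Python naming style), but the internals, the "value" keyword, and the example transformations are assumptions made for illustration, not the ECO-COSM source.

```python
class Grid:
    """A Probeable grid layer; cell values are reachable only via Probes."""
    def __init__(self, cells):
        self._cells = cells

    def _get_value_at(self, location):
        row, col = location
        return self._cells[row][col]

    def get_probe(self, kind):
        # Only the aspects listed here are exposed to other objects.
        if kind == "value":
            return Probe(self._get_value_at)
        raise KeyError(f"no probe of type {kind!r}")

class Probe:
    """Read-only view onto one aspect of a Probeable object's state."""
    def __init__(self, reader):
        self._reader = reader
        self.location = (0, 0)

    def probe(self):
        return self._reader(self.location)

class ProbeWrapper(Probe):
    """A Probe that transforms the result of an embedded Probe."""
    def __init__(self, inner, transform):
        self._inner = inner
        self._transform = transform

    @property
    def location(self):
        return self._inner.location

    @location.setter
    def location(self, value):
        self._inner.location = value

    def probe(self):
        return self._transform(self._inner.probe())

land_cover = Grid([["mature oak", "pasture"], ["pasture", "mature oak"]])
raw = land_cover.get_probe("value")
# First wrapper: translate land cover into the Agent's own vocabulary.
habitat = ProbeWrapper(
    raw, lambda cover: "suitable" if cover == "mature oak" else "unsuitable"
)
# Second, nested wrapper: express the label as a membership grade.
fuzzy_habitat = ProbeWrapper(
    habitat, lambda label: 1.0 if label == "suitable" else 0.0
)

fuzzy_habitat.location = (0, 1)
at_pasture = fuzzy_habitat.probe()
fuzzy_habitat.location = (1, 1)
at_oak = fuzzy_habitat.probe()
```

Because a ProbeWrapper answers the same probe() call as a plain Probe, the client code is identical whether it holds raw, habitat, or fuzzy_habitat.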
The Instrumentation subsystem also allows the modeler to “instrument” the operating simulation model in order to monitor the model’s evolution and collect data for later analysis. A Sampler is made up of a set of one or more Probes
Figure 5. When the client object invokes the Probe’s (in this case a ProbeWrapper) probe() method, the call passes through to the embedded Probe. The Probeable object returns the state value x to the Probe, which passes the value on to the ProbeWrapper. The ProbeWrapper transforms the value by some function F(x), and returns the transformed value to the client.
that perform the actual queries about system state. The Sampler will typically take the Probe results and format them in an organized fashion for output to a file on disk, or for periodic output to the computer console to inform the user of progress. Data files produced by a Sampler may be used in other separate analysis programs to generate summary statistics from a large number of model runs.
The Simulation object acts as the core engine of the simulation model. It manages the interaction of the components in the three subsystems. The setup() method structures the simulation appropriately for the desired model, attaches any instrumentation desired, and acquires any necessary memory or file resources required for the model. The run() method is simple: until the Schedule is finished, it will trigger the next pending item on the Schedule. The teardown() method releases any memory or file resources and gets ready for program termination. The Simulation object may be instantiated and executed as an independent, stand-alone program. It can also act as a “pure” object that is contained in a larger program, such as a simulation experiment that executes many instances of the Simulation object, each of which has slight variations in its selection and configuration of model components.
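As a rough illustration of the Sampler idea, the Python sketch below collects readings from a set of named probes and writes them as comma-separated rows. Everything in it is hypothetical: the probes are plain callables standing in for Probe.probe(), and the class interface and column names are assumptions rather than the ECO-COSM design.

```python
import csv
import io

class Sampler:
    """Collect named probe readings and write them as CSV rows."""
    def __init__(self, probes, stream):
        self._probes = probes  # mapping: column name -> zero-argument callable
        self._writer = csv.writer(stream, lineterminator="\n")
        self._writer.writerow(["time", *probes])  # header row

    def sample(self, time):
        readings = [read() for read in self._probes.values()]
        self._writer.writerow([time, *readings])

# Hypothetical population counts monitored during a run.
state = {"active": 3, "home": 0}
out = io.StringIO()  # a file opened for writing would work the same way
sampler = Sampler(
    {"active": lambda: state["active"], "home": lambda: state["home"]},
    out,
)
sampler.sample(0)
state["active"], state["home"] = 1, 2
sampler.sample(1)
```

In a full model the sample() call would typically be placed on the Schedule so that readings are taken at regular simulated intervals.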
Fuzzy Spatial Relations and Database Query
A crucial concept implemented in many spatially explicit IBMs is the perceptual range of individuals. In our application domain, an Agent’s perceptual range represents its informational window on its surroundings. It determines how much of the surrounding area an individual can perceive in terms of habitat quality and presence of conspecifics. Thus, the perceptual range of an agent is equivalent to specifying a fuzzy spatial relation that constrains the Agent’s view of the data to a particular fuzzy region. Let $X = \{x\}$ be a finite set of locations bounded by the limits of the study area. Let $d_x^c$ be the Euclidean distance from the location of the dispersing animal object, $c$, to location $x$. $P(x)$, defined in Equation (1), is the fuzzy set defining the perceptual range for a single individual. Thus, the support of $P$ is $^{0+}P$ and can be used to limit the extent of data operated on or retrieved from the spatial database used to support the model.
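$$
P(x;\beta,\theta) = \mu_P(x) =
\begin{cases}
1 & \text{if } d_x^c \le \beta \\
\theta(\beta - d_x^c) + 1 & \text{if } \beta < d_x^c < \beta + 1/\theta \\
0 & \text{if } \beta + 1/\theta \le d_x^c
\end{cases}
\tag{1}
$$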
Using lyrPerceptualRange.setValueAt(), the Layer object lyrPerceptualRange is populated with membership values based on Equation (1). The support of P, $^{0+}P$, is represented by regPerceptualRange, the set of locations for the Agent where lyrPerceptualRange.valueAt() > 0. The statement regPerceptualRange = FuzzySpatialOp.support( lyrPerceptualRange ) creates a Region object named regPerceptualRange that contains only those locations where fuzzy membership in lyrPerceptualRange > 0.0. This regPerceptualRange is what the Agent references as its individual perceptual range at that particular location at that time step in the simulation. Using regPerceptualRange to specify the Region defined by $^{0+}P$, the support of the perceptual range P for a particular Agent, and then limiting all further processing and decision making to regPerceptualRange provides three benefits:

1. Semantics: We effectively shrink the simulated world to align with the Agent’s entire perceptual world for the duration of that Agent’s processing and decision making. Although the World’s extent may be larger than that of the defined Region, the Agent has no way to access it without changing its Location.

2. Performance: We ensure that we only iterate over layer locations that require processing. This avoids unnecessary processing in “zero” locations and hence boosts computational performance. In the illustrated case of a squirrel dispersing within a National Recreation Area, this can be a significant savings.

3. Object-oriented design integrity: By creating an object that defines the processing region and controls access to that region, we guarantee that other client objects do not accidentally process inappropriate locations. Control over the processing region is handled by one object (namely, the regPerceptualRange Region), whereas control over processing behavior is handled by another object (namely, the CompSquirrel Agent). This enforces clear lines of responsibility within the object model. Furthermore, the method of determining the processing region can be modified by changing the code that creates the region. This code is isolated from the processing steps, which means that we can reduce the likelihood of introducing erroneous programming artifacts (thereby increasing confidence in model results), and it becomes easy to make variants in perceptual definition for different agents. To do so, encapsulate the region definition code in an interchangeable object, and the rest of the model is left unchanged.
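As a rough illustration of Equation (1) and of the support operation, the following Java sketch fills a small grid with perceptual-range memberships and then collects the cells with nonzero membership. It does not use the ECO-COSM Layer, Region, or FuzzySpatialOp classes; the class PerceptualRangeSketch, the plain array standing in for a Layer, and the parameter values β = 2 and θ = 0.5 (in cell units) are assumptions chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of Equation (1) on a small grid; not the ECO-COSM API. */
final class PerceptualRangeSketch {

    /** Membership of a location at distance d from the animal, per Equation (1). */
    static double membership(double d, double beta, double theta) {
        if (d <= beta) {
            return 1.0;                        // fully within the perceptual range
        }
        if (d < beta + 1.0 / theta) {
            return theta * (beta - d) + 1.0;   // linear decline from 1 toward 0
        }
        return 0.0;                            // beyond the perceptual range
    }

    /** Fill a rows x cols layer with memberships relative to the cell (cr, cc). */
    static double[][] perceptualLayer(int rows, int cols, int cr, int cc,
                                      double beta, double theta) {
        double[][] layer = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double d = Math.hypot(r - cr, c - cc);   // Euclidean distance in cell units
                layer[r][c] = membership(d, beta, theta);
            }
        }
        return layer;
    }

    /** Support of the fuzzy set: the {row, col} pairs with membership > 0. */
    static List<int[]> support(double[][] layer) {
        List<int[]> cells = new ArrayList<>();
        for (int r = 0; r < layer.length; r++) {
            for (int c = 0; c < layer[r].length; c++) {
                if (layer[r][c] > 0.0) {
                    cells.add(new int[] {r, c});
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        double[][] layer = perceptualLayer(9, 9, 4, 4, 2.0, 0.5);
        System.out.println("cells in support: " + support(layer).size()
                + " of " + (9 * 9));
    }
}
```

Any later processing that iterates only over the returned list touches fewer cells than the full grid, which is the performance benefit noted in the second item above.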
Once the perceptual range over an assumed flat surface, i.e., P, is specified, the next step is to determine to what degree each location is within the visible perceptual range. In other words, the influence of local topography is taken into consideration. Let L: X → [0,1] be the fuzzy set describing the degree to which location x is visible from a particular squirrel. The membership function for L is defined by Equation (2) as a closed-form triangular function, where $los_x^c$ is the angle at which location x is visible from location c. It is based on the output style of GRASS GIS (Neteler & Mitasova, 2002), where 90° is looking straight ahead, below the line of sight is less than 90°, and above the line of sight is greater than 90°. If the local terrain creates a physical obstruction to visibility between c and x, then L = 0.

$$
L(x;\alpha,\beta,\gamma) = \mu_L(x) = \max\left(\min\left(\frac{los_x^c - \alpha}{\beta - \alpha},\ \frac{\gamma - los_x^c}{\gamma - \beta}\right),\ 0\right)
\tag{2}
$$
The degree to which a cell is both visible and falls within the perceptual range is defined by ψ = P∩L. This operation takes into account the level plain perceptual distance and the potential effect topography may have on the ability of an object to perceive a location. To make it an efficient process, we need only calculate the value of L for the locations that fall in $^{0+}P$; thus, $^{0+}P$ defines the spatial extent over which information from the spatial database is extracted and utilized by the individual agent. In the code for defining an Agent, the statement spots = regPerceptualRange.getAllLocations(); in effect limits the calculation of L to those locations (x), i.e., the spots, that fall within the set $^{0+}P$. Subsequently, the membership values in lyrPerceptualRange and lyrVisibility are combined using an aggregation operator to arrive at a spatial object, lyrVisiblePerceptual, which is referenced by an Agent as its individual visible perceptual range at that particular location at that time step in the simulation.
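A minimal Java sketch of Equation (2) and of the combination ψ = P∩L follows. Several choices here are assumptions rather than the authors' code: the standard minimum is used as the intersection (the chapter only says "an aggregation operator"), a negative angle is used as a made-up marker for terrain obstruction, the parameters α = 60, β = 90, and γ = 120 degrees are illustrative, and flat arrays stand in for the Layer objects.

```java
import java.util.Arrays;

/** Hypothetical sketch of Equation (2) and psi = P intersect L; not the ECO-COSM API. */
final class VisiblePerceptualSketch {

    /** Triangular membership of Equation (2); assumes alpha < beta < gamma. */
    static double visibility(double los, double alpha, double beta, double gamma) {
        double rising = (los - alpha) / (beta - alpha);
        double falling = (gamma - los) / (gamma - beta);
        return Math.max(Math.min(rising, falling), 0.0);
    }

    /**
     * Combine perceptual-range and visibility memberships with the minimum
     * intersection. A negative line-of-sight angle marks "blocked by terrain"
     * (an assumption of this sketch), which gives L = 0.
     */
    static double[] visiblePerceptual(double[] p, double[] losAngles,
                                      double alpha, double beta, double gamma) {
        double[] psi = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            if (p[i] <= 0.0) {
                continue;   // outside the support of P: L is never evaluated
            }
            double l = losAngles[i] < 0.0
                    ? 0.0
                    : visibility(losAngles[i], alpha, beta, gamma);
            psi[i] = Math.min(p[i], l);
        }
        return psi;
    }

    public static void main(String[] args) {
        double[] p = {1.0, 0.4, 0.0, 0.8};          // memberships in P (made up)
        double[] los = {90.0, 75.0, 90.0, -1.0};    // angles in degrees; -1 = obstructed
        System.out.println(Arrays.toString(
                visiblePerceptual(p, los, 60.0, 90.0, 120.0)));
    }
}
```

Skipping cells with zero membership in P mirrors the efficiency argument above: L is only evaluated inside the support of P.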
Representing and Processing Fuzzy Geographic Data

It is almost unheard of for spatially explicit ecological models to use GIS data that are represented as fuzzy data in a fuzzy object-oriented database. In fact, there are no practical cases known to us. Therefore, in our illustrative example, we will first show how a World object is derived from an open-source GIS in such a way that the Probes associated with an Agent are able to retrieve fuzzy information from nonfuzzy database representations. Then we will discuss how this approach can be extended to account for other fuzzy representations that may be relevant to this modeling domain.

The use of the Grid as a representation framework allows for a straightforward interface to most common raster-based GIS databases. Figure 6 shows how the specialized GridSource, called GrassAsciiGridSource, is used to create a World object from GIS data layers stored in a GRASS GIS. The question arises at this point whether we extract the raw, nonfuzzy GIS data from GRASS and manipulate it with ECO-COSM to fuzzify it in a manner meaningful to this particular problem, or we preprocess the raw, nonfuzzy GIS data to produce fuzzified GIS data that is then integrated as Layer objects into ECO-COSM. We will first present the latter, as it was already demonstrated (Robinson & Graniero, in press), and then we will discuss how the former may be implemented.

For illustrative purposes, let us consider a key component of the residence decision model goal set. In our formulation, the quality of the habitat (LC) at the Agent’s location and the area of the habitat patch (HA) are combined to define the goal set H = LC∩HA (Robinson & Graniero, in press). Typically, habitat quality is inferred from a layer where each grid cell is classified as a particular land-cover type. Taking from our previous study, the degree to which a land-cover type is considered quality habitat for a gray squirrel is summarized in Equation (3).
For simplicity, the layers of land-cover type in the GRASS GIS were processed so that each grid cell was coded with its membership value according to Equation (3) before the data are loaded into the model.

Figure 6. Code example extracted from an Agent constructor, showing how Probes are retrieved from the modeled World (The notation ‘spot’ indicates that the probe should access a single Location (i.e., grid cell), as opposed to a moving window or other spatial construct.)

prbLCHabitat = (SpatialProbe) world.getLayerProbe( "lchabitat", "spot" );
prbForestArea = (SpatialProbe) world.getLayerProbe( "forarea", "spot" );
$$
LC(\kappa) = \mu_{LC}(\kappa) =
\begin{cases}
1.0 & \text{if oak\_forest} \\
0.9 & \text{if oak/deciduous\_bottomland} \\
0.75 & \text{if deciduous\_forest} \\
0.0 & \text{if conifer\_forest} \\
0.0 & \text{if early\_successional\_deciduous\_forest} \\
0.0 & \text{if wetland, pasture, grassland, agriculture} \\
0.0 & \text{if water}
\end{cases}
\tag{3}
$$
When the Agent must assess the habitat quality, it does so by requesting the habitat quality membership value from its corresponding Probe, which acts as its sensory interaction with the surrounding environment. Operationally, the Probe queries the spatial database for the habitat quality membership value at the Agent’s current location and returns that membership value to the Agent.

In addition to land cover, we use the size of an oak/deciduous forest patch as an important factor in the residence decision. In Equation (4), we define a fuzzy set, HA, to express the degree to which a location falls within the class of minimum_habitat_area. The setting of the parameters αHA and βHA will vary depending on the species being modeled. The area measurement is based on the sizes of patches formed from contiguous cells that were classified as oak, deciduous, or oak/deciduous bottomland. Let farea(κ) be the area in hectares of the oak/deciduous forest patch within which location κ falls. Cognitively, the Agent is assessing the size of the oak/deciduous forest patch; operationally, it is calculating a new fuzzy membership based on forest patch sizes encoded in a raster, which resulted from a “clumping” operation on the same land-cover raster used for evaluating habitat quality. The minimum area Probe accesses the value of the forest patch grid cell corresponding to the Agent’s location and returns the value to the Agent, which then calculates the fuzzy membership according to Equation (4). Thus, each Agent has a number of SpatialProbes, that is, Probes that can each be directed to a specified Location on a target Layer in order to collect information from that specific Layer. Figure 6 shows how an Agent gets the SpatialProbe prbLCHabitat for the Probeable Layer lchabitat, which corresponds to LC above, and the SpatialProbe prbForestArea for the Probeable Layer forarea, which corresponds to farea(κ) in Equation (4).
$$
HA(\kappa) = \mu_{HA}(\kappa) = \max\left(0,\ \min\left(1,\ \frac{farea(\kappa) - \alpha_{HA}}{\beta_{HA} - \alpha_{HA}}\right)\right)
\tag{4}
$$
Figure 7. Code example of Probes being used in the formulation of the residence model goal set (The code operates within the Agent’s model logic, perceiving its environment via the probe() methods of its associated Probes.)

_HabitatArea = Math.max( 0, Math.min( 1,
    (((Number)(prbForestArea.probe())).doubleValue() - _HAalpha) / (_HAbeta - _HAalpha) ) );
HabitatGoal = FuzzyOp.compensatoryIntersection(
    ((Number)(prbLCHabitat.probe())).doubleValue(), _HabitatArea );
Recall from above that the land-cover type layer was preprocessed by the GIS; lchabitat contains the fuzzy membership value, not the actual land-cover type. This means that the SpatialProbe retrieves the fuzzy membership value and passes it to Agent without any intermediate processing. Notice that in the case of forarea, the Agent must do additional processing on the Probe’s result before forming the goal set, as shown in Figure 7. In contrast to the preprocessed fuzziness for LC, HA is fuzzified after crisp data are queried from the database. In the earlier description of how an Agent uses a Probe to assess the local habitat suitability, the entire land-cover raster was preprocessed according to Equation (3), and the Probe accessed the grid cell values in the transformed raster. This approach requires that each grid cell be converted only once rather than every time the grid cell is considered by an Agent, thus streamlining the computation. However, this restricts the flexibility for more sophisticated IBM models, because it presumes that all Agents in the system perceive the habitat quality of a particular land-cover type in the same way. Different animal species, and perhaps even different individuals of the same species, may map land-cover classes to slightly different membership values. This necessitates the calculation of separate rasters for each remap equation, which creates a much larger database. It also creates risk for database integrity should the original land-cover map change and the remapped rasters not be updated accordingly. Also, consider the case of a more “intelligent” agent that evolves its perception of habitat quality as it gains experience over its lifetime. Each change to the remap equation, i.e., each evolution in the Agent’s perception, would require a recalculation of its corresponding habitat quality raster, increasing the computational burden for the model.
The ProbeWrapper provides the key mechanism with which to avoid these problems. Recall that every instance of a ProbeWrapper implementation has another Probe (possibly a ProbeWrapper) embedded within it. When the ProbeWrapper’s probe() method is invoked, it, in turn, invokes the embedded Probe’s probe() method. When it receives the embedded Probe’s result, the ProbeWrapper may perform any kind of operation on it before passing it on as its own result. As such, the remap equation can be embedded within a habitat quality ProbeWrapper that contains the following:

1. A Probe to access the land-cover database
2. Program logic to transform the land-cover query result according to the remap table
3. An association with an Agent to receive and act on the result

Each Agent can be assigned a customized ProbeWrapper with a slightly different remap table. For the case of minimum habitat area, a minimum habitat area ProbeWrapper would contain the following:

1. A Probe to access the patch area database
2. Program logic that applies Equation (4) with the ProbeWrapper’s particular parameters
3. An association with an Agent to receive and act on the result
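The two wrappers just listed can be sketched as decorators around a Probe. The sketch below is not the ECO-COSM class hierarchy: the Probe interface with a no-argument probe() method returning Object is inferred from Figure 7, while the class names HabitatQualityWrapper and HabitatAreaWrapper, the string land-cover codes, and the parameter values 2 and 10 hectares are hypothetical. The association with an Agent is reduced to the Agent simply holding a reference to the wrapper.

```java
import java.util.Map;

/** Hypothetical sketch of the Probe / ProbeWrapper idea; not the ECO-COSM API. */
final class ProbeWrapperSketch {

    /** Anything that can be asked for a value at the Agent's current location. */
    interface Probe {
        Object probe();
    }

    /** Wraps a land-cover Probe and applies an Agent-specific remap table. */
    static final class HabitatQualityWrapper implements Probe {
        private final Probe inner;
        private final Map<String, Double> remap;

        HabitatQualityWrapper(Probe inner, Map<String, Double> remap) {
            this.inner = inner;
            this.remap = remap;
        }

        @Override
        public Object probe() {
            String landCover = (String) inner.probe();   // crisp class from the database
            return remap.getOrDefault(landCover, 0.0);   // this Agent's fuzzy membership
        }
    }

    /** Wraps a patch-area Probe and applies Equation (4) with its own parameters. */
    static final class HabitatAreaWrapper implements Probe {
        private final Probe inner;
        private final double alpha;
        private final double beta;

        HabitatAreaWrapper(Probe inner, double alpha, double beta) {
            this.inner = inner;
            this.alpha = alpha;
            this.beta = beta;
        }

        @Override
        public Object probe() {
            double area = ((Number) inner.probe()).doubleValue();
            return Math.max(0.0, Math.min(1.0, (area - alpha) / (beta - alpha)));
        }
    }

    public static void main(String[] args) {
        Probe landCover = () -> "deciduous_forest";   // stand-in for a database query
        Probe patchArea = () -> 6.0;                  // hectares, made-up value
        Probe quality = new HabitatQualityWrapper(landCover, Map.of(
                "oak_forest", 1.0,
                "oak/deciduous_bottomland", 0.9,
                "deciduous_forest", 0.75));
        Probe area = new HabitatAreaWrapper(patchArea, 2.0, 10.0);
        System.out.println(quality.probe() + " " + area.probe());
    }
}
```

Because each wrapper is itself a Probe, an Agent written against the Probe interface does not need to know whether it is reading a preprocessed fuzzy layer or a crisp layer that is fuzzified on demand.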
The transformation code shown in Figure 7 moves out of the Agent and into its minimum habitat area ProbeWrapper. By using the ProbeWrapper approach, the Agent directly “perceives” the habitat quality of its current position according to its own value scheme, and all model logic occurs within the universe of discourse defined in the fuzzy problem domain. The Probe handles the mechanics of accessing the spatial database, thereby insulating the Agent’s model logic from database-dependent programming issues. The ProbeWrapper takes the query result from the Probe and independently manages the transformation from the GIS’ relatively application-neutral, crisp land-cover scheme to the Agent’s application-specific, fuzzy perception of habitat quality. All Agents may access a single, shared land-cover raster, and they may modify their perceptions of habitat quality at any time, with no risk of compromising the database integrity or the behavioral integrity of other Agents. There are many other ways in which the ProbeWrapper structure may be used to control fuzzification of a spatial database. To illustrate, take an example based on an early work demonstrating the use of fuzzy sets in the query of land-cover databases (Robinson, 1988). Rather than simply retrieving a membership value
that was assigned to a land-cover class that represents the land-cover class’s degree of membership in the habitat set, let there be a similarity relation between the land-cover types. This similarity relation can be a function of the degree of confidence, or accuracy, felt to be likely at the location. If we assume that a location has deciduous forest, then we would expect the similarity to be greater with other forest types, especially oak. Consequently, the ProbeWrapper may take this knowledge into consideration by assigning a final membership in the set habitat based on a combination of the land-cover type at a location and its similarity relation with other land-cover types. A more realistic approach would be to take into consideration the surrounding cells as an additional information channel to inform the ProbeWrapper how similar the location is to surrounding locations. This can provide additional information to be used to estimate how well the location fits in habitat. For example, a deciduous forest cell surrounded by water, i.e., an island, would be poor habitat, whereas a deciduous forest cell surrounded by deciduous forest might be judged high-quality habitat. As another example, it is well known that no land-cover database is error free. One long-standing problem has been the mixed pixel problem, where one grid cell may have more than one land cover present but be forced by classification methods to be classified as being in a single type of land cover (Robinson & Thongs, 1986). The inherently fuzzy nature of land-cover classifications was discussed by many researchers (Robinson & Frank, 1985; Robinson, 2002; Matsakis et al., 2000; Cross & Firat, 2000; Hagen, 2003; Foody, 1996; Zhang & Stuart, 2001). 
In ECO-COSM, ProbeWrappers can be used to implement an information-processing function that applies a mixed pixel model to the underlying land-cover data, allowing the Agent to evaluate how closely its current location conforms to a particular land-cover type. Because land-cover classification is accomplished using remote sensing or other classification methods that can incorporate fuzziness, the process can be used to generate fuzzy geographical objects (Matsakis et al., 2000; Cross & Firat, 2000; Foody, 1996). In a simple case that is analogous to the Semantic Import model (Robinson, 1988), each cell would have a vector of membership values indicating the degree to which it belonged to each land-cover type. Thus, a ProbeWrapper can use a Probe to access that information and process it before passing it to the Agent. For example, a vector might look like {0.8, 0.75, 0.66, 0.3, 0.2, 0.2, 0.0}. Now, what information is passed to the Agent? Perhaps the whole vector is passed, which means that the Agent would need to have a method of combining it with the function that determines how well the location fits the set habitat. Notice in Equation (3) that each land-cover type is associated with a membership in LC, and that the vector {0.8, 0.75, 0.66, 0.3, 0.2, 0.2, 0.0}, associated with a single grid cell, provides information on the degree to which that grid cell belongs to each land-cover class. Let $\mu_k^{LC}(x)$ be the membership in LC of land-cover type k, and let $\mu_k^{GIS}(x)$ be the membership value of grid cell x in land cover k. Thus, we have two vectors, LC and GIS:
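$$
LC = \begin{bmatrix} \mu_1^{LC} \\ \mu_2^{LC} \\ \vdots \\ \mu_k^{LC} \end{bmatrix},
\qquad
GIS = \begin{bmatrix} \mu_1^{GIS} \\ \mu_2^{GIS} \\ \vdots \\ \mu_k^{GIS} \end{bmatrix}
$$

that can be used to arrive at:

$$
\min(\mu_k^{LC}, \mu_k^{GIS}) = \begin{bmatrix} \mu_1^{LC,GIS} \\ \mu_2^{LC,GIS} \\ \vdots \\ \mu_k^{LC,GIS} \end{bmatrix}.
$$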
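The element-wise minimum of the two vectors, followed by the maximum over its entries, can be sketched in Java as follows. The class and method names are hypothetical, and the example assumes that the entries of the sample vector {0.8, 0.75, 0.66, 0.3, 0.2, 0.2, 0.0} are listed in the same land-cover order as Equation (3), which the text does not state.

```java
/** Hypothetical sketch of the max-min combination of the LC and GIS vectors. */
final class MixedPixelSketch {

    /** Largest element-wise minimum of the two membership vectors. */
    static double habitatMembership(double[] lc, double[] gis) {
        if (lc.length != gis.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double best = 0.0;
        for (int k = 0; k < lc.length; k++) {
            best = Math.max(best, Math.min(lc[k], gis[k]));
        }
        return best;
    }

    public static void main(String[] args) {
        // Memberships in LC, taken from Equation (3) in the order listed there.
        double[] lc = {1.0, 0.9, 0.75, 0.0, 0.0, 0.0, 0.0};
        // Per-cell land-cover memberships; the ordering is an assumption.
        double[] gis = {0.8, 0.75, 0.66, 0.3, 0.2, 0.2, 0.0};
        System.out.println(habitatMembership(lc, gis));   // prints 0.8
    }
}
```

Under that ordering assumption the element-wise minima are {0.8, 0.75, 0.66, 0, 0, 0, 0}, so the cell's membership in habitat is 0.8.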
Then take the maximum value from this vector to represent the degree to which the grid cell falls in the set habitat. This formulation can be computed quickly by a ProbeWrapper, and no changes would be required in the Agent code. In this manner, the Agent only “sees” the information presented to it by the ProbeWrapper; it focuses strictly on the behavioral elements of the model and leaves the retrieval or derivation of the fuzzy value to the ProbeWrapper. Thus, with this simple example, we illustrated how fuzziness could be represented in a geographic database in two different ways and be used by a ProbeWrapper to deliver meaningful fuzzy information to an Agent, with no need to adjust the decision model of the Agent.

The other major informational component of the habitat portion of the residence decision model is membership in HA, the minimum habitat area. It is a function of the area of the forest patch. The forest patch is defined in a raster GIS as a collection of grid cells contiguous with one another and of the same type. In a vector representation, it would be a polygon. One approach is to represent a fuzzy region, A, as composed of three parts: the core, the indeterminate boundary, and the exterior. The indeterminate boundary can further be decomposed into the inside edge and the outside edge. If Z is a referential set of a finite number of attributes and region A is a fuzzy subset defined in a two-dimensional space $\mathbb{R}^2$ over Z, the membership function of A can be defined as µA: X × Y × Z → [0,1]. Each point is assigned a membership value for attribute z, where z ⊆ Z (Zhan, 1998). This suggests several possible approaches to representing forest area patches in this problem domain.

In the current illustrative example, the forest patches are determined according to a crisp membership rule of adjacency, and then the area is calculated, followed by calculation of HA for each grid cell. This means that the Layer forarea is composed of grid cells, each of which is coded with membership values that are a function of the area of the patch in which it belongs. However, if forest patches are fuzzy regions, then this simplistic approach would need to be changed. Because the grid cell is the atomic spatial element in our GIS database, the upshot would be that each cell (i.e., location) could be a member of more than one patch. In other words, a patch object may “share” a location (cell) with another patch object. This problem has received some attention in the fuzzy database community (Yazici & Akkaya, 2000; Cross & Firat, 2000; Cheng et al., 2002; Robinson, 2000; Bordogna & Chiesa, 2003). Of course, this implies that when estimating the area of a patch for habitat selection purposes, a location (cell) will contribute to the area of more than one patch. Hence, fuzzy set theory effectively expands the conventional assumptions regarding the total area extent of thematic map classes used in nonfuzzy geographic databases (Ricotta & Avena, 1999). Due to this characteristic of fuzzy regions, a number of approaches were suggested for estimating the area of a fuzzy region (Ricotta & Avena, 1999; Schneider, 2001; Yuan & Shen, 2001). There has been some work on modeling fuzzy regions that exploits the concept of the α-cut, some of which is explicitly linked to the query process (Morris, 2003; Zhan, 1998; Schneider, 2001; Schneider, 2000). Previous work suggests that the area of a fuzzy region might be computed as a weighted sum of the areas of all α-level regions (Schneider, 2001). Consider that if $\tilde{F}$ is a fuzzy region, i.e., a forest patch, and consists of a finite collection {Fα1, ..., Fαn} of crisp α-level regions, then the area of $\tilde{F}$ can be computed as in Equation (5):
$$
area(\tilde{F}) = \sum_{i=1}^{n} \iint_{(x,y) \in F_{\alpha_i}} \alpha_i \, dx \, dy = \sum_{i=1}^{n} \alpha_i \cdot area(F_{\alpha_i})
\tag{5}
$$
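Equation (5) amounts to a weighted sum, which the following Java sketch computes from paired arrays of α values and the areas of the corresponding α-level regions. The class FuzzyAreaSketch, its method name, and the sample values (in hectares) are hypothetical; how the α-level regions themselves are extracted from a raster is left out.

```java
/** Hypothetical sketch of Equation (5): area of a fuzzy region as a weighted sum. */
final class FuzzyAreaSketch {

    /** Sum of alpha_i * area(F_alpha_i) over all alpha-level regions. */
    static double fuzzyArea(double[] alphas, double[] levelAreas) {
        if (alphas.length != levelAreas.length) {
            throw new IllegalArgumentException("one area is needed per alpha level");
        }
        double total = 0.0;
        for (int i = 0; i < alphas.length; i++) {
            total += alphas[i] * levelAreas[i];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] alphas = {1.0, 0.5, 0.25};      // membership levels (made up)
        double[] levelAreas = {4.0, 2.0, 4.0};   // hectares per level region (made up)
        // 1.0 * 4.0 + 0.5 * 2.0 + 0.25 * 4.0 = 6.0
        System.out.println(fuzzyArea(alphas, levelAreas));
    }
}
```

The resulting number plays the role of farea(κ) in Equation (4), so a patch with a large fringe of weakly forested cells contributes less area than its crisp outline would suggest.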
In this case, $area(\tilde{F})$ is a real number that could be used in Equation (4), corresponding to farea(κ). There is a problem with this straightforward linkage, because it is entirely possible, given the nature of fuzzy region objects, that a single cell will be associated with more than one fuzzy region with a membership level greater than 0.0. In such a case, a simple rule can be used such that $area(\tilde{F})$ is calculated for the fuzzy region that bestows the highest membership value on cell κ. An Agent obtains information about the area of a forest patch through the Probe prbForestArea, which samples the Layer forarea that contains the value of farea(κ). Likewise, it is possible to construct a Layer forarea that would hold the value of $area(\tilde{F})$ for the region to which cell κ belongs to the greatest degree. Using Probes and ProbeWrappers, it would be possible to develop a multi-Layer, multi-Probe approach so that all degrees of membership could be “seen” by the Agent. This would necessitate the management of multiple Probes by a ProbeWrapper. It is possible that the ProbeWrapper might then combine that information before passing it to the Agent, which would still rely on something like Equation (4) in its decision-making model. Thus, the decision model would be kept essentially the same, but through the use of Probes and ProbeWrappers, the values of inputs used in the decision model would be changed.
Modeling Semantic Variation

One of the rationales for the use of IBMs in ecological modeling is the ability to explicitly model variations in individual behavior. However, this variation is typically induced by resorting to drawing choices from a random distribution rather than endeavoring to explicitly model variations in decision making among individuals. The combination of fuzzy sets and object-oriented modeling allows for the construction of variations in individual behaviors without resorting to random draws. Fuzzy sets research and related fields are exceptionally rich in methods for aggregation and combination. It was shown elsewhere that differing schemes of aggregation can be used to operationalize the movement and residence decision models. Compensatory, noncompensatory, Yager, and crisp versions of the decision models can be constructed (Robinson & Graniero, in press). Each describes a particular class of Agents. In our simulation modeling effort, the program SquirrelDispersal manages a simulation of squirrel dispersal. This program not only creates the World object from layers drawn from the GRASS GIS database but also activates different Agent classes. The classes CompSquirrel, NoncompSquirrel, YagerSquirrel, and CrispSquirrel correspond, respectively, to Agents using decision models based on compensatory, noncompensatory, Yager, and crisp aggregation methods. Thus, each class would view the landscape somewhat differently as a consequence of the methods underlying the decision models. It is important to note that all the Agent classes use Probes and ProbeWrappers to retrieve data from the same database of Layers held within the World object. Figure 8 illustrates one example of how the behaviors of four individual Agents varied according to the decision models used to model the dispersal process.
Thus, although the information contained in the World object is the same, the way it is viewed and processed by each Agent can lead to variations in spatial behavior.
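To make the idea of semantic variation concrete, the sketch below applies four aggregation rules to the same pair of membership values. It is not the code behind CompSquirrel, NoncompSquirrel, YagerSquirrel, or CrispSquirrel. The minimum and the Yager intersection are standard operators; the arithmetic mean used as the compensatory rule and the 0.5 cutoff used for the crisp rule are stand-in assumptions, since this section does not give the authors' definitions.

```java
/** Hypothetical sketch of four aggregation rules applied to the same inputs. */
final class AggregationSketch {

    /** Noncompensatory: the standard minimum intersection. */
    static double noncompensatory(double a, double b) {
        return Math.min(a, b);
    }

    /** Compensatory stand-in: arithmetic mean (an assumption of this sketch). */
    static double compensatory(double a, double b) {
        return (a + b) / 2.0;
    }

    /** Yager intersection with parameter w > 0. */
    static double yager(double a, double b, double w) {
        double s = Math.pow(Math.pow(1.0 - a, w) + Math.pow(1.0 - b, w), 1.0 / w);
        return 1.0 - Math.min(1.0, s);
    }

    /** Crisp: both memberships must reach a cutoff (0.5 here, made up). */
    static double crisp(double a, double b) {
        return (a >= 0.5 && b >= 0.5) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        double lc = 0.75;   // habitat quality membership (made up)
        double ha = 0.5;    // habitat area membership (made up)
        System.out.println("noncompensatory: " + noncompensatory(lc, ha));
        System.out.println("compensatory:    " + compensatory(lc, ha));
        System.out.println("yager (w = 2):   " + yager(lc, ha, 2.0));
        System.out.println("crisp:           " + crisp(lc, ha));
    }
}
```

The same two inputs yield four different goal-set memberships, which is how Agents reading identical Layers can end up making different residence decisions.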
Figure 8. Resulting dispersal behavior of Agents with the same starting location but using different fuzzy aggregation methods in the decision model
Concluding Discussion

We used a spatially explicit individual-based model of a small mammal species’ natal dispersal behavior across a real-world landscape to illustrate an object-oriented approach to creating and managing operational fuzzy information in a spatial database for use in a spatially explicit simulation model. Even though only a small subset of problems in spatially explicit ecological modeling was addressed in this chapter, it highlights the breadth and depth of the problems that can be usefully explored in this problem domain. The illustrative problems presented here have demonstrated that this is a database and modeling domain rich in fuzzy information-processing challenges. Hence, it is a scientific field of endeavor that can benefit greatly from advances in fuzzy database modeling and application. We would also argue that advances in the theoretical realm of fuzzy object-oriented databases could result from devoting attention to the needs of this problem domain. One of the major consequences of our use of the ECO-COSM modeling framework has been our demonstration of the utility of using Probe objects and
ProbeWrappers to manage the interface between individual objects and the spatial database. Combining the Probes and ProbeWrappers with Agent objects is a promising avenue of research for using fuzzy sets in an object-oriented environment to retrieve, manage, and process geographical data that may be represented in a nonfuzzy or a fuzzy scheme. The ability to handle semantic variations was demonstrated to be feasible using this approach, as each type of Agent “saw” the same data but “interpreted” it differently as a function of variations in the semantics of the underlying decision process of an Agent. Progress in global connectivity has led to a situation in which we now need to deal with more heterogeneous geographic information that may be spatially distributed in a Web-based environment. Because other research (Leclercq et al., 1999; Cobb et al., 2000; Petry et al., 2002) has found the object-oriented, agent-based approach to be effective, it is reasonable to suggest that the ECO-COSM approach can be expanded to address problems of combining heterogeneous geographic data from spatially distributed sources. The flexibility and power of the Probe/ProbeWrapper concept could be extended so that mobile agents are able to move about on a spatially distributed network to identify and assemble required information and computational resources to support a large-scale IBM effort.
Acknowledgments

Partial support in the form of operating research grants to each of the authors from the Natural Sciences and Engineering Research Council (NSERC) of Canada is gratefully acknowledged. We are especially grateful to Professor Haluk Cetin and the Mid-America Remote Sensing Center (MARC) for graciously providing the digital elevation and Kentucky GAP land-cover datasets. Comments by anonymous reviewers improved the quality of this chapter.
References

Allen, A. W. (1987). Habitat suitability index models: Gray squirrel, revised (United States Fish Wildlife Service Biological Report 82 10.135). Washington, D.C.: United States Department of the Interior.
Anderson, J. (2002). Providing a broad spectrum of agents in spatially explicit simulation models: The Gensim approach. In H. R. Gimblett (Ed.), Integrating geographic information systems and agent-based modeling techniques for simulating social and ecological processes (pp. 21–58). Oxford: Oxford University Press.
Bellman, R. E., & Zadeh, L. A. (1970). Decision-making in a fuzzy environment. Management Science, 17(4), 141–164.
Bian, L. (2000). Object-oriented representation for modelling mobile objects in an aquatic environment. International Journal of Geographical Information Science, 14(7), 603–623.
Bian, L. (2003). The representation of the environment in the context of individual-based modeling. Ecological Modelling, 159(2–3), 279–296.
Bordogna, G., & Chiesa, S. (2003). A fuzzy object-based data model for imperfect spatial information integrating exact objects and fields. International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, 11(1), 23–41.
Burrough, P. A., & Frank, A. U. (1996). Geographic objects with indeterminate boundaries. London: Taylor & Francis.
Cheng, T., Molenaar, M., & Lin, H. (2002). Formalizing fuzzy objects from uncertain classification results. International Journal of Geographical Information Science, 15(1), 27–42.
Cobb, M., Foley, H., Petry, F., & Shaw, K. (2000). Uncertainty in distributed and interoperable spatial information systems. In G. Bordogna, & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 85–108). Berlin: Springer-Verlag.
Cross, V., & Firat, A. (2000). Fuzzy objects for geographical information systems. Fuzzy Sets and Systems, 113(1), 19–36.
Fahse, L., Wissel, C., & Grimm, V. (1998). Reconciling classical and individual-based approaches in theoretical population ecology: A protocol for extracting population parameters from individual-based models. The American Naturalist, 162(6), 838–852.
Foody, G. M. (1996). Fuzzy modelling of vegetation from remotely sensed imagery. Ecological Modelling, 85, 3–12.
Gimblett, H. R. (2002). Integrating geographic information systems and agent-based modeling techniques for simulating social and ecological processes. New York: Oxford University Press.
Graniero, P. A. (2001). The effect of spatiotemporal sampling strategies and data acquisition accuracy on the characterization of dynamic ecological systems and their behaviours. Ph.D. dissertation, University of Toronto.
Graniero, P. A., & Robinson, V. B. (2003). A real-time adaptive sampling method for field mapping in patchy, heterogeneous environments. Transactions in GIS, 7(1), 31–54.
Grimm, V. (1999). Ten years of individual-based modelling in ecology: What have we learned and what could we learn in the future? Ecological Modelling, 115, 129–148.
Hagen, A. (2003). Fuzzy set approach to assessing similarity of categorical maps. International Journal of Geographical Information Science, 17(3), 235–249.
Harper, S. J., Westervelt, J. D., & Shapiro, A.-M. (2002). Modeling the movements of cowbirds: Application towards management at the landscape scale. Natural Resource Modeling, 15(1), 111–131.
Klir, G. J., & Yuan, B. (1995). Fuzzy sets and fuzzy logic: Theory and applications. Upper Saddle River, NJ: Prentice-Hall.
Koyuncu, M., & Yazici, A. (2003). IFOOD: An intelligent fuzzy object-oriented database architecture. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1137–1154.
Leclercq, E., Benslimane, D., & Yetongnon, K. (1999). ISIS: A semantic mediation model and an agent based architecture for GIS interoperability. In Database Engineering and Applications, IDEAS ’99 International Symposium Proceedings (pp. 87–91). Washington, D.C.: IEEE Press.
Li, Q., Huang, X., & Wu, S. (2001). Applications of agent techniques on GIS. In Proceedings International Conferences on Info-Tech and Info-Net (ICII) (pp. 238–243). Washington, D.C.: IEEE Press.
Lima, S. L., & Zollner, P. A. (1996). Towards a behavioral ecology of ecological landscapes. Trends in Ecology and Evolution, 11(3), 131–135.
Lomnicki, A. (1999). Individual-based models and individual-based approach to population ecology. Ecological Modelling, 115, 191–198.
Mackay, D. S. (1999). Semantic integration of environmental models for application to global information systems and decision-making. SIGMOD Record, 28(1), 13–19.
Mackay, D. S., & Robinson, V. B. (2000). A multiple criteria decision support system for testing integrated environmental models. Fuzzy Sets and Systems, 113, 53–67.
Matsakis, P., Andrefouet, S., & Capolsini, P. (2000). Evaluation of fuzzy partitions.
Remote Sensing of Environment, 74, 516–533. McCoy, J., & Johnston, K. (2001). Using ArcGIS spatial analyst: GIS by ESRI. Redlands, CA: Environmental Systems Research Institute. Mech, S. G., & Zollner, P. A. (2002). Using body size to predict perceptual range. Oikos, 98, 47–52. Morris, A. (2003). A framework for modeling uncertainty in spatial databases. Transactions in GIS, 7(1), 83–103. Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Managing Fuzziness in Spatially Explicit Ecological Models 299
Neteler, M., & Mitasova, H. (2002). Open source GIS: A GRASS GIS approach. Boston, MA: Kluwer Academic Publishers. Petry, F. E. (1996). Fuzzy databases, principles, and applications. Boston, MA: Kluwer Academic Publishers. Petry, F. E., Cobb, M. A., Ali, D., Angryk, R., Paprzycki, M., Rahimi, S., Wen, L., & Yang, H. (2002). Fuzzy spatial relationships and mobile agent technology in geospatial information systems. In P. Matsakis, & L. M. Sztandera (Eds.), Applying soft computing in defining spatial relations (pp. 121–155). Heidelberg: Physica-Verlag. Petry, F. E., Cobb, M. A., Wen, L., & Yang, H. (2003). Design of system for managing fuzzy relationships for integration of spatial data in querying. Fuzzy Sets and Systems, 140, 51–73. Rickel, B. W., Anderson, B., & Pope, R. (1998). Using fuzzy systems, objectoriented programming, and GIS to evaluate wildlife habitat. AI Applications, 12(1–3), 31–40. Ricotta, C., & Avena, G. C. (1999). The influence of fuzzy set theory on the areal extent of thematic map classes. International Journal of Remote Sensing, 20(1), 201–205. Robinson, V. B. (1988). Some implications of fuzzy set theory applied to geographic databases. Computers, Environment, and Urban Systems, 12(2), 89–97. Robinson, V. B. (2000). On fuzzy sets and the management of uncertainty in an intelligent geographic information system. In G. Bordogna, & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 109–127). Berlin: SpringerVerlag. Robinson, V. B. (2002). Using fuzzy spatial relations to control movement behavior of mobile objects in spatially explicit ecological models. In P. Matsakis, & L. M. Sztandera (Eds.), Applying soft computing in defining spatial relations (pp. 158–178). Heidelberg: Physica-Verlag. Robinson, V. B., & Frank, A. U. (1985). About different kinds of uncertainty in collections of spatial data. In Proceedings of Seventh International Symposium on Automated Cartography (Auto-Carto 7) (pp. 440–450). 
Bethesda, MD: American Society for Photogrammetry and Remote Sensing and American Congress on Surveying and Mapping. Robinson, V. B., & Graniero, P. A. (in press). Spatially explicit individual-based ecological modeling with mobile fuzzy agents . In M. A. Cobb, F. E. Petry, & V. B. Robinson (Eds.), Fuzzy modeling with spatial information for geographic problems. Heidelberg: Springer. Robinson, V. B., & Thongs, D. (1986). Fuzzy set theory applied to the mixed pixel problem of multispectral landcover databases. In B. K. Opitz (Ed.), Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
300
Robinson & Graniero
Geographic information systems in government (pp. 871–885). Hampton, VA: A. Deepak Publishing. Ruckelshaus, M., Hartway, C., & Kareiva, P. (1997). Assessing the data requirements of spatially explicit dispersal models. Conservation Biology, 11(6), 1298–1306. Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall. Schneider, M. (2000). Metric operations on fuzzy spatial objects in databases. In Proceedings of the Eighth ACM International Symposium on Advances in Geographic Information Systems (pp. 21–26). New York: ACM Press. Schneider, M. (2001) Fuzzy topological predicates, their properties, and their integration into query languages. In Proceedings of the Ninth ACM International Symposium on Advances in Geographic Information Systems (pp. 9–14). New York: ACM Press. Westervelt, J. D. (2002). Geographic information systems and agent-based modeling. In H. R. Gimblett (Ed.), Integrating geographic information systems and agent-based modeling techniques for simulating social and ecological processes (pp. 83–103). Oxford: Oxford University Press. Westervelt, J. D., & Hopkins, L. D. (1999). Modeling mobile individuals in dynamic landscapes. International Journal of Geographical Information Science, 13(3), 191–208. Wolff, J. O. (1999). Behavioral model systems. In G. W. Barrett, & J. D. Peles (Eds.), Landscape ecology of small mammals (pp. 11–26). New York: Springer. Yazici, A., & Akkaya, K. (2000). Conceptual modeling of geographic information system. In G. Bordogna, & G. Pasi (Eds.), Recent issues on fuzzy databases (pp. 129–151). Berlin: Springer-Verlag. Yuan, X., & Shen, Z. (2001). Notes on “Fuzzy plane geometry I, II”. Fuzzy Sets and Systems, 121, 545–547. Zhan, F. B. (1998). Approximate analysis of binary topological relations between geographic regions with indeterminate boundaries. Soft Computing, 2, 28– 34. Zhang, J., & Stuart, N. (2001). 
Fuzzy methods for categorical mapping with image-based land cover data. International Journal of Geographical Information Science, 15(2), 175–195. Zollner, P. A. (2000). Comparing the landscape level perceptual abilities of forest sciurids in fragmented agricultural landscapes. Ecology, 80(3), 1019–1030.
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Chapter X
Object-Oriented Publish/Subscribe for Modeling and Processing Imperfect Information

Haifeng Liu, University of Toronto, Canada
Hans Arno Jacobsen, University of Toronto, Canada
Abstract

In the publish/subscribe paradigm, information providers disseminate publications to all consumers who have expressed interest by registering subscriptions with the publish/subscribe system. This paradigm has found widespread application, ranging from selective information dissemination to network management. In all existing publish/subscribe systems, neither subscriptions nor publications can capture the uncertainty inherent in the information underlying the application domain. However, in many situations, the exact knowledge needed to specify subscriptions or publications is not available. To address this problem, this chapter proposes a new object-oriented
publish/subscribe model based on possibility theory and fuzzy set theory to process imperfect information for expressing subscriptions, publications, or both combined. Furthermore, the approximate publish/subscribe matching problem based on fuzzy measures is defined, and the real-world A-ToPSS™ system is described.
Introduction

A new data-processing paradigm, publish/subscribe, is becoming increasingly popular for information dissemination applications. Publish/subscribe systems anonymously interconnect information providers with information consumers in a distributed environment. Information providers publish information in the form of publications, and information consumers register their interests in the form of subscriptions. The publish/subscribe system performs the matching task and ensures the timely delivery of published events (a.k.a. notifications) to all interested subscribers.

Publish/subscribe has been well studied, and many systems have been developed supporting this paradigm. Existing research prototypes include, among others, Gryphon (Aguilera, 1999), LeSubscribe (Fabret, 2001), and ToPSS (Liu, 2002); industrial-strength systems include various implementations of JMS (Happner, 2002; Monson-Haefel, 2000), the CORBA® Notification Service (OMG, 2002), and TIB/RV. All of these systems are based on a crisp data model, which means that neither subscribers nor publishers can express imperfect information in subscriptions and publications, respectively. In this crisp model, a subscription evaluates to either true or false for a given publication. Moreover, most of these systems do not expose a well-structured subscription language model and publication data model.

However, in many situations, the knowledge needed to specify subscriptions or publications exactly is not available. In these cases, uncertainty about the state of the world has to be cast into a crisp data model that defines absolute limits. Moreover, for a user of the publish/subscribe system, it may be simpler to describe the state of the world with imperfect concepts, that is, in an approximate manner.
In a selective information dissemination context, for instance, users may want to submit subscriptions about an apartment with a constraint on rent that is “cheap.” On the other hand, information providers may not have exact information for all items published. In a secondhand market, a seller may not know the exact age of a vase, so the seller can describe it as an “old” vase but cannot describe it with an exact age. Temperature and humidity information collected by sensors is
often not precise but only correct within a certain error interval around the measured value. It would be more appropriate to publish such imperfect information than a wrong exact value, if the publish/subscribe system could handle it. Moreover, the underlying publish/subscribe system may need to store submitted publications for later processing (i.e., for subscriptions that are submitted to the system after the publication). For these reasons, it is advantageous to provide a publish/subscribe data model and a matching scheme that allow for the expression and processing of imperfect information in both subscriptions and publications.

In a publish/subscribe system, we are concerned with two major types of imperfect information as defined in Smets (1997): imprecision and uncertainty. Imprecision is related to the content of a statement. Publications and subscriptions are statements about events and users’ interests. These statements may be incomplete, ambiguous, or not well defined, but the imperfection lies in the content of the statements themselves. Thus, we refer to this type of imperfection in publications and subscriptions as imprecision.

Another type of imperfection exists in the matching between publications and subscriptions, which we refer to as uncertainty. Uncertainty concerns the state of knowledge about the relationship between the world and a statement about the world. All publish/subscribe systems developed to date are based on the assumption that a match between a subscription and a publication is either true or false. However, it is difficult to decide whether a publication matches a subscription when either of them involves imprecision. We call the imperfection inherent in the matching problem uncertainty.

To illustrate the difference between imprecision and uncertainty, consider these two examples:

(1) Charles is a tall guy, and I am sure of it.
(2) Charles is six feet tall, but I am not sure of it.
The height of Charles is imprecise in the former case, but it is certain. In the latter statement, the height is precise but uncertain. To support imperfect information in publish/subscribe, we extend current subscription and publication languages to incorporate the expression of imprecision at the language level and develop a matching mechanism to support processing of the extended language in publish/subscribe systems. To simplify the terminology, we use approximate as a general term for all types of imperfection involved. The extended subscriptions and publications supporting imprecision will be called approximate subscriptions and publications. The matching between approximate publications and approximate subscriptions is called approximate matching. And the systems (or models) that support approximate subscriptions/publications and implement approximate matching are called approximate publish/subscribe systems (or models). Crisp is used to refer to the traditional publish/subscribe systems.
There are five interesting cases according to the different combinations of subscriptions and publications with imprecision. These are as follows:

1. Crisp subscriptions and crisp publications (conventional publish/subscribe)
2. Approximate subscriptions and crisp publications
3. Crisp subscriptions and approximate publications
4. Approximate subscriptions and approximate publications
5. A combination of crisp and approximate constraints in subscriptions and publications
Models 2 to 5 constitute new publish/subscribe system models not previously investigated. With one exception, existing publish/subscribe systems are based on a crisp data model that cannot process imprecision in publications or subscriptions. The exception is A-ToPSS, the Approximate Matching-Based Toronto Publish/Subscribe System (Liu, 2002, 2003, 2004a, 2004b), which introduced a subscription language model and a publication data model that can express imprecise information, such as “cheap,” “large,” and “close to,” as constraints. In this chapter, we discuss how to efficiently support all the above cases with the A-ToPSS approach. This raises questions regarding matching between crisp subscriptions and approximate publications, as well as matching between approximate subscriptions and approximate publications.

We propose a novel object-oriented data model that can model all five cases described above. We also define a matching mechanism that applies to the cases involving uncertainties. Moreover, our approach follows an object-oriented design, treating subscriptions, publications, and notifications as objects. These entities are modeled by classes, thus supporting a well-structured design that can be cleanly integrated with other object-oriented technologies (object-oriented databases, distributed object systems, etc.).

From a database point of view, publications in the publish/subscribe system can be seen as data items (e.g., tuples, columns, or tables) in a database model, and subscriptions closely resemble database queries. Publish/subscribe systems solve a problem inverse to database query processing. Therefore, a well-structured, object-oriented subscription language model and publication data model will give rise to a clean integration of the publish/subscribe paradigm with (object-oriented) database technology, complementing database query evaluation techniques with publish/subscribe query indexing techniques.
Information Dissemination with Publish/Subscribe Systems

Publish/Subscribe Messaging Paradigm

The publish/subscribe paradigm is an interaction model that consists of information providers, who publish events to the system, and information consumers, who subscribe to specific interests in events within the system. The publish/subscribe system matches events with subscriptions and ensures the timely notification of subscribers upon event occurrence. Figure 1 shows the paradigm of publish/subscribe systems. Events are published in the form of publications, and users’ interests are registered in the form of subscriptions. A publication describes the attributes of a real-world artifact. A subscription defines a user’s interest through a list of predicates, where each predicate is a constraint on an attribute domain. The matching problem is to find all subscriptions whose constraints are satisfied by an incoming publication.
Overview of Publish/Subscribe Systems

Publish/subscribe has been well studied, and many systems were developed to support this paradigm. The current publish/subscribe models can be classified according to three main categories: information categorization, expressiveness of the system, and treatment of data persistence.
Information Categorization

Publications and subscriptions are the information in publish/subscribe systems. There are three common approaches to grouping this information to help querying and searching: channel-based, hierarchical, and type-based.

In the channel-based approach, information is grouped together under different channels. A channel is a medium that carries information of related meaning. Publishing a message to a channel implies that the message is broadcast to all subscribers of that channel; conversely, a subscriber receives every message published to the channels it has subscribed to. Newsgroups are an example of the use of a channel-based publish/subscribe system. The CORBA Event Service, the CORBA Notification Service, and the Java™ Message Service (JMS) also use the channel-based data model.
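The channel-based approach can be illustrated with a minimal sketch. The class name `ChannelBroker` and its methods are hypothetical names introduced here for illustration only; they are not the API of CORBA, JMS, or any system cited in this chapter.

```python
from collections import defaultdict


class ChannelBroker:
    """Toy channel-based broker (hypothetical): a message published to a
    channel is delivered to every subscriber of that channel."""

    def __init__(self):
        # Maps a channel name to the list of subscriber callbacks.
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        # The broker never inspects the message content; only the
        # channel name decides who receives it.
        delivered = 0
        for callback in self._subscribers.get(channel, []):
            callback(message)
            delivered += 1
        return delivered


broker = ChannelBroker()
inbox_a, inbox_b = [], []
broker.subscribe("apartments", inbox_a.append)
broker.subscribe("cars", inbox_b.append)
broker.publish("apartments", "2-bedroom near campus")
```

Because delivery depends only on the channel name, a subscriber cannot filter individual messages within a channel, which is the limitation the content-based models discussed in this chapter address.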
Figure 1. Publish/subscribe paradigm
[Diagram components: Subscribers, Publishers, Matching/Filtering, Notification Engine. Labeled flows: subscriptions, publications, matched subscriptions, notifications.]
The hierarchical approach uses a tree structure to classify information. From the expressiveness perspective, this approach is also referred to as topic-based. Each node of the tree is a topic. The matching between publications and subscriptions depends on the associated topic having the right content and the appropriate level of granularity. The subject-based addressing technology of TIBCO Rendezvous™ allows publications and subscriptions to be categorized in a hierarchical fashion.
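As a rough illustration of hierarchical matching, the sketch below treats a topic as a dot-separated path in the topic tree and lets a subscription to a node cover that node and its whole subtree. Both the dot-separated notation and the subtree rule are assumptions made for this example; they are not the subject syntax of TIBCO Rendezvous or of any other cited system.

```python
def topic_matches(subscribed, published):
    """Return True if `published` is the subscribed topic itself or lies
    in the subtree below it (assumed rule, dot-separated topic paths)."""
    sub_parts = subscribed.split(".")
    pub_parts = published.split(".")
    # The subscription path must be a prefix of the publication path,
    # compared node by node rather than character by character.
    return pub_parts[:len(sub_parts)] == sub_parts


def subscribers_for(published, subscriptions):
    """subscriptions maps a subscriber name to its subscribed topic."""
    return sorted(name for name, topic in subscriptions.items()
                  if topic_matches(topic, published))


subs = {"ann": "news", "bob": "news.sports", "eve": "news.weather"}
receivers = subscribers_for("news.sports.hockey", subs)
```

Comparing path components instead of raw string prefixes keeps a subscription to `news.sport` from accidentally matching `news.sports`.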
Expressiveness

Expressiveness refers to the ability of publishers and subscribers to express their events and interests in the form of publications and subscriptions. A higher level of expressiveness usually requires more computation power and a more advanced algorithm design. The content-based data model provides more expressive power than other models to filter publications and is more easily customized for individual subscribers. The match between subscriptions and publications depends only on the content of the information. JMS lets subscribers define message selectors, which are based on a subset of the SQL-92 conditional expression syntax used in the WHERE clauses of SQL statements. The CORBA Notification Service takes a similar approach. For the transmission of information between publishers and subscribers through a broker, content-based routing is also an interesting research topic that aims to improve the efficiency of information delivery.
Another publish/subscribe model, which concerns event correlation, uses a rule-based approach (Chakravarthy, 1994; Samani, 1997) in which subscriptions and publications are expressed as a composition of events. An event is a happening of interest: a state transition within the system, triggered internally or externally. Brokers that can process composite events make publisher and subscriber processes easier to implement, because the event correlation logic no longer needs to be handled programmatically.
Persistence

Persistence refers to the storage of data and states of publish/subscribe systems. The ability to persist data and state affects the behavior and efficiency of a system. Most publish/subscribe systems are designed as memory-less messaging systems that do not save the contents or states of publications. The limitation of a memory-less model can be overcome by an event history persistence model, where all messages received by the broker are persisted, forming an event history. It is common to use conventional relational databases as offline storage systems. However, traditional databases are not designed to process data streams (continuous sequences of messages entering the broker) efficiently. The STREAM project (Bahu, 2001), led by Stanford University, studies techniques for special storage management and query processing for data streams.

A state-persistent publish/subscribe system stores the states of publications and subscriptions. In such a system, a publication represents the state of some object of interest, and a subscription specifies a state that constitutes the interest of the subscriber. The broker should only send notifications of a publication to those subscribers whose subscriptions undergo state transitions in relation to the publication. In other words, the broker component only notifies subscribers of publications that enter the states specified by their subscriptions. Hubert and Jacobsen (2003) proposed a subject space model for state-persistent publish/subscribe systems. The objectives of this data model are the introduction of state persistence into publish/subscribe systems and the symmetrical treatment of data and queries.
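The notify-on-transition behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the subject space model of Hubert and Jacobsen: the class `StatePersistentBroker`, the use of plain Python predicates as subscriptions, and the rule that an unseen object counts as "not in the state" are all hypothetical choices.

```python
class StatePersistentBroker:
    """Toy broker (hypothetical) that remembers, per subscription and
    object, whether the last published state satisfied the subscription,
    and notifies only when an object enters the subscribed state."""

    def __init__(self):
        self._subscriptions = {}  # subscription id -> predicate on a state
        self._last_result = {}    # (subscription id, object id) -> bool

    def subscribe(self, sub_id, predicate):
        self._subscriptions[sub_id] = predicate

    def publish(self, object_id, state):
        notified = []
        for sub_id, predicate in self._subscriptions.items():
            now = predicate(state)
            # Assumption: an object never seen before was not in the state.
            before = self._last_result.get((sub_id, object_id), False)
            if now and not before:
                notified.append(sub_id)
            self._last_result[(sub_id, object_id)] = now
        return notified


broker = StatePersistentBroker()
broker.subscribe("hot", lambda temperature: temperature > 30)
trace = [broker.publish("sensor1", t) for t in (25, 32, 35, 20, 31)]
```

In this run, the subscriber is notified for the readings 32 and 31, where the sensor enters the "hot" state, but not for 35, where it merely stays in that state. A memory-less broker would notify for all three readings above 30.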
Type-Based Publish/Subscribe

The type-based publish/subscribe model was proposed as an alternative way to express publications, subscriptions, and their interactions. The type-based model uses features of high-level, strongly typed programming languages, such as strong typing, scoping, objects, classes, and inheritance, to define matching semantics between subscriptions and publications. In type-based publish/subscribe (Eugster, 2000), each topic is represented by a type definition. Subtopics can be formed by inheriting from other topic class definitions. Also, a publication can conform to more than one topic type by using multiple inheritance or implementing multiple interfaces. Publications are considered objects, which are strongly typed, as known from many object-oriented languages. A subscriber to publications of type T receives all publication objects that conform to T. This model is used, in part, in several standard implementations, such as the CORBA Event Service, the CORBA Notification Service, and JMS.
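The rule that a subscriber to type T receives every publication object conforming to T can be sketched with ordinary class inheritance. The event classes and `TypeBasedBroker` below are hypothetical names for illustration; Python's `isinstance` check stands in for the static type conformance that the type-based model assumes.

```python
class Event:
    """Root publication type (hypothetical)."""


class StockEvent(Event):
    def __init__(self, symbol):
        self.symbol = symbol


class StockQuote(StockEvent):
    """Subtopic of StockEvent, formed by inheritance."""

    def __init__(self, symbol, price):
        super().__init__(symbol)
        self.price = price


class TypeBasedBroker:
    def __init__(self):
        self._subscribers = []  # list of (event type, callback) pairs

    def subscribe(self, event_type, callback):
        self._subscribers.append((event_type, callback))

    def publish(self, event):
        # Deliver to every subscriber whose type the event conforms to,
        # including subscribers to any of the event's supertypes.
        for event_type, callback in self._subscribers:
            if isinstance(event, event_type):
                callback(event)


broker = TypeBasedBroker()
all_stock, quotes = [], []
broker.subscribe(StockEvent, all_stock.append)
broker.subscribe(StockQuote, quotes.append)
broker.publish(StockQuote("IBM", 90.0))
broker.publish(StockEvent("IBM"))
```

The subscriber to `StockEvent` receives both publications, while the subscriber to `StockQuote` receives only the quote.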
Application Domains

Publish/subscribe is a messaging paradigm and an information management methodology. It is desirable that the technologies developed for publish/subscribe systems be generic and applicable in many application domains. Most research studies on publish/subscribe systems use the stock-brokering application as an example and as the motivation for their algorithm designs. The stock-brokering application is a typical example, because the roles of publishers, subscribers, and brokers are well defined. However, there are many other application domains with information management characteristics that satisfy the definition of the publish/subscribe paradigm.

Selective information dissemination is the class of distributed applications that distribute information according to some restrictions or conditions. Conventional Internet search engines, such as Google™, can be modeled as publish/subscribe systems. The search engine indexes many Web pages, and users can execute search queries on the indexed pages.

A more general form of data subscription is exemplified by the emerging peer-to-peer file sharing and publishing systems, such as Napster™, Gnutella, Mojo Nation, Free Haven (Dingledine, 2000), and Freenet (Clarke, 2000). These systems are forms of publish/subscribe systems in which the broker component is physically distributed. They attempt to solve the problems of scalable distributed data storage and retrieval.

A geographic information system is an example where an application can possess the roles of multiple logical components of a publish/subscribe system. The location information of mobile users is used to provide users with relevant information based on their positions.
There are many other applications to which the publish/subscribe paradigm is applicable, such as workflow management (Cugola, 2001), intraenterprise process automation, supply chain management, enterprise application integration (Barrett, 1996), and network monitoring.
Subscription Language and Publication Data Model

Object-Oriented Publish/Subscribe Model

In this section, we show one possible object-oriented design for the public interfaces of a publish/subscribe system. Various designs can be found in the literature (OMG, 2001, 2002, 2004; Sun, 2002). Our design is simple and has proven itself in the design of the ToPSS system (Liu, 2004).

Our design is based on two class hierarchies: first, the User class hierarchy, which represents publishers, subscribers, and notifiers; second, the Information class hierarchy, which represents publications, subscriptions, and notifications. These class hierarchies are shown in Figure 2 and Figure 3, respectively.

The publisher class serves a publishing entity to submit information as publications to the system. The subscriber class serves a subscribing entity to submit interest specifications (i.e., subscriptions) to the system. The notifier class allows the programmer to design entities that can poll for notification information or can register callbacks for notifications. These notifier objects can be different from the actual subscriber objects. In this design, a notifier object is tied to a specific subscription by passing that subscription to the system through the method call. In our design, subscriptions are represented by their subscription objects; an alternative is to identify subscriptions, publications, and notifications with identifiers that are passed back upon successful submission of these objects. The ToPSS system uses that approach.

Figure 2. Definition of User class and its subclasses

User
    string Username
    …
    public void login(…)
    …

Subscriber (subclass of User)
    public int subscribe(subscription s)
    public int unSubscribe(subscription s)
    …

Publisher (subclass of User)
    public int publish(publication e)
    …

Notifier (subclass of User)
    public int register_callback(subscription s, cb_info i)
    public notification getNotification(subscription s)
    …

Figure 3. Definition of Information class

Information
    string Id
    char type
    …

Subscription (subclass of Information)
    int numOfPred
    predicate[] preds
    float threshold
    public Subscription()
    public addPred(predicate p)
    public float getThreshold()
    public char getSubType()
    public String getId()
    public String toString()
    …

Publication (subclass of Information)
    int numOfAttr
    attr_value[] av_pairs
    float threshold
    public Publication()
    public addAttrValue(attr_value av)
    public char getPubType()
    public String getId()
    public String toString()
    …

Notification (subclass of Information)
    Publication e
    Subscription s
    float matching_degree
    int notifyType
    public Notification()
    public sentNotification(subscription s)
    public getNotifyType()
    …

The Information class hierarchy in Figure 3 provides for subscriptions, publications, and notifications. Subscriptions define user interests through Boolean combinations of predicates. The subscription type is determined by the predicate types. We allow in our model the specification of crisp types, approximate types, and
mixed types. In most systems, a subscription is a conjunction of predicates, for which a simple list suffices to represent the subscription. More complex subscription formulae must be represented as tree-structured expressions. Publications are defined as sets of attribute-value pairs. Notifications are essentially publications. However, certain applications may forward only part of a publication to the interested subscribers, filtering out, combining, or suppressing other parts. To enable these semantics, we define an additional notification object.
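The difference between a flat predicate list and a tree-structured expression can be sketched as follows. The node classes `Pred`, `And`, and `Or` and the function `evaluate` are hypothetical and are not part of the class design in Figure 3; predicates are modeled here as plain Boolean tests on a single attribute.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Pred:
    """Leaf node: a constraint on one attribute."""
    attribute: str
    test: Callable[[object], bool]


@dataclass
class And:
    children: List["Node"]


@dataclass
class Or:
    children: List["Node"]


Node = Union[Pred, And, Or]


def evaluate(node, event):
    """Evaluate a subscription tree against an event given as a dict
    mapping attribute names to values."""
    if isinstance(node, Pred):
        # A predicate on an attribute missing from the event is not matched.
        return node.attribute in event and node.test(event[node.attribute])
    results = (evaluate(child, event) for child in node.children)
    return all(results) if isinstance(node, And) else any(results)


# car = Honda Accord AND (price <= 35000 OR age = new)
formula = And([
    Pred("car", lambda v: v == "Honda Accord"),
    Or([Pred("price", lambda v: v <= 35000),
        Pred("age", lambda v: v == "new")]),
])
event = {"car": "Honda Accord", "price": 40000, "age": "new"}
result = evaluate(formula, event)
```

A purely conjunctive subscription is the special case of a single `And` node whose children are all `Pred` leaves, which is why a simple list suffices for it.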
Traditional (Crisp) Publish/Subscribe Model

In the crisp publish/subscribe system, users’ intents have to be cast into a certain model with specific requirements. A subscription s is a Boolean formula (often simply a conjunction) over predicates, each of which is a triple consisting of an attribute, a value, and a relational operator (<, ≤, =, !=, ≥, >). A publication (a.k.a. event) is a set of attribute-value pairs, where each pair consists of an attribute and a value. No two pairs may have the same attribute. For example, {(car, Honda Accord), (price, $30,000), (age, new)} is an event.

An attribute-value pair (a’, v’) matches a subscription predicate (a, v, relop) if a = a’ and v’ relop v. For example, (price, $30,000) matches (price, $35,000, ≤) because they share the same attribute and $30,000 ≤ $35,000. An event e satisfies a subscription s if every predicate in s is matched by some pair in e. For example, the event {(car, Honda Accord), (price, $30,000), (age, new)} satisfies the subscription s = (car, Honda Accord, =) and (price, $35,000, ≤) and (price, $20,000, ≥).

The matching problem is as follows: Given an event e and a set of subscriptions S, find all subscriptions in S that are satisfied by e.
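The crisp matching definitions above translate directly into a naive matcher. The sketch below is an illustration only: it assumes conjunctive subscriptions, stores an event as a dictionary (which enforces one value per attribute), and scans all subscriptions linearly, whereas practical systems rely on indexing. The names `RELOPS`, `satisfies`, and `match` are hypothetical.

```python
import operator

# Relational operators of the crisp model, keyed by an ASCII spelling.
RELOPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}


def satisfies(event, subscription):
    """event: dict attribute -> value.
    subscription: list of (attribute, value, relop) predicates, read as a
    conjunction. A pair (a', v') matches (a, v, relop) if a = a' and
    v' relop v."""
    return all(
        attr in event and RELOPS[relop](event[attr], value)
        for attr, value, relop in subscription
    )


def match(event, subscriptions):
    """Return the ids of all subscriptions satisfied by the event."""
    return [sub_id for sub_id, sub in subscriptions.items()
            if satisfies(event, sub)]


event = {"car": "Honda Accord", "price": 30000, "age": "new"}
subscriptions = {
    "s": [("car", "Honda Accord", "="),
          ("price", 35000, "<="),
          ("price", 20000, ">=")],
    "t": [("price", 25000, "<")],
}
matched = match(event, subscriptions)
```

Here subscription `s` is the example from the text and is satisfied by the event, while the hypothetical subscription `t` is not, because 30,000 is not below 25,000.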
Publish/Subscribe Model Supporting Imperfect Information-Processing

Subscription Language Model

Subscriptions are Boolean formulae over predicates. Each predicate is a constraint over a domain of values. A predicate is represented as \((a_i, \mu_i)\), where \(a_i\) is the attribute of the predicate and \(\mu_i\) is a membership function (Zadeh, 1989) that represents a fuzzy constraint on the attribute. Using R to denote the Boolean relation among the predicates of one subscription (R can be intersection, union, or any other relation), a subscription is formalized as follows:

\[ s = R\big((a_1, \mu_1), (a_2, \mu_2), \ldots, (a_m, \mu_m)\big) \]
For example, a student is looking for an apartment with constraints on size, price, and age. Her subscription, expressed in natural language, is:

S: (size is medium) AND (price is no more than 1500) AND (age is not very old)

The first predicate approximates the constraint using the uncertain notion “medium.” A membership function is used to represent it:

\[ \mu_{medium}(x) = \begin{cases} 0 & \text{if } x \le 40 \\ \dfrac{x - 40}{10} & \text{if } 40 < x < 50 \\ 1 & \text{if } 50 \le x \le 70 \\ 1 - \dfrac{x - 70}{10} & \text{if } 70 < x < 80 \\ 0 & \text{if } x \ge 80 \end{cases} \]
The second predicate constrains the attribute price. It is defined in a crisp manner and can be represented by a characteristic function:

\[ \mu_{\le 1500}(x) = \begin{cases} 1 & \text{if } x \le 1500 \\ 0 & \text{if } x > 1500 \end{cases} \]
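The third predicate is another approximate predicate. We use the following membership function, which rises linearly from 0 at 40 to 1 at 80, to represent the concept of “old”:

\[ \mu_{old}(x) = \begin{cases} 0 & \text{if } x \le 40 \\ \dfrac{x - 40}{40} & \text{if } 40 < x < 80 \\ 1 & \text{if } x \ge 80 \end{cases} \]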
The three membership functions of this subscription are pictured in Figure 4. In this subscription, the three predicates are related conjunctively; that is, they are linked by intersection (denoted by ∧). The hedge “very” is modeled by squaring the membership degree and “not” by taking its complement, so the subscription is formalized as:

\[ S = (size, \mu_{medium}) \wedge (price, \mu_{\le 1500}) \wedge (age, 1 - \mu_{old}^{2}) \]
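As an illustration, the apartment subscription can be written down directly as functions. This is a minimal sketch, assuming that “old” rises linearly from 0 at 40 to 1 at 80, that “very” squares a degree, that “not” complements it, and that the conjunction is evaluated with min; the helper names (`not_very`, `degree`) are hypothetical.

```python
def mu_medium(x):
    """Trapezoidal membership for 'medium' size: support (40, 80), core [50, 70]."""
    if x <= 40 or x >= 80:
        return 0.0
    if x < 50:
        return (x - 40) / 10
    if x <= 70:
        return 1.0
    return 1 - (x - 70) / 10

def mu_le_1500(x):
    """Characteristic function of the crisp constraint 'price <= 1500'."""
    return 1.0 if x <= 1500 else 0.0

def mu_old(x):
    """Membership for 'old' (assumed linear ramp between 40 and 80)."""
    if x <= 40:
        return 0.0
    if x >= 80:
        return 1.0
    return (x - 40) / 40

def not_very(mu):
    """'not very C' is modeled as 1 - mu_C(x)^2."""
    return lambda x: 1 - mu(x) ** 2

# S = (size, mu_medium) AND (price, mu_<=1500) AND (age, 1 - mu_old^2)
subscription = [("size", mu_medium), ("price", mu_le_1500), ("age", not_very(mu_old))]

def degree(subscription, values):
    """Degree to which exact attribute values satisfy a conjunctive subscription (min)."""
    return min(mu(values[attribute]) for attribute, mu in subscription)

print(degree(subscription, {"size": 45, "price": 1400, "age": 60}))  # 0.5
```

Here the size predicate is the limiting one: a size of 45 is only “medium” to degree 0.5, while the price constraint is fully met and an age of 60 is “not very old” to degree 0.75.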
Figure 4. Membership function of predicates
Publication Data Model

Publications describe real-world artifacts or states of interest through a set of attribute-value pairs. For certain attributes, exact values may not be available. In these cases, we use a possibility distribution to express the possibility that the attribute takes a given value. A publication is thus defined as a set of attribute-function pairs:

\[ e = \{(a_1, \pi_1), (a_2, \pi_2), \ldots, (a_n, \pi_n)\} \]

For example, an apartment advertised for rent may be described as having a size of 60 m² and a cheap rent. The first attribute is crisp: it defines an exact value for the attribute size. The second attribute is approximate. It is qualified as cheap, which is represented by a possibility distribution \(\pi_{cheap}\) that defines, for each value in the domain of discourse (i.e., all admissible rent values), the possibility of its being “cheap.” The graphical representation of this event is shown in Figure 5. Formally, this publication can be represented by a set of attribute-function pairs as follows:

\[ P = \{(size, \pi_{60}), (rent, \pi_{cheap})\} \]
Figure 5. Possibility distributions for publication
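where

\[ \pi_{60}(x) = \begin{cases} 1 & \text{if } x = 60 \\ 0 & \text{if } (x < 60) \vee (x > 60) \end{cases} \]

and

\[ \pi_{cheap}(x) = \begin{cases} 1 & \text{if } x \le 1200 \\ 1 - \dfrac{x - 1200}{300} & \text{if } 1200 < x < 1500 \\ 0 & \text{if } x \ge 1500 \end{cases} \]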
Matching in Publish/Subscribe

In the general approximate model, the subscription, the publication, or both may refer to imperfect concepts. A truth value of true or false is no longer sufficient to represent the state of a match between a publication and a subscription. We need a value between 0 and 1 to represent the degree of match between a subscription and each publication processed by the system. An individual subscription can match a given publication to a greater or lesser extent, depending on this degree of match. Recall that subscriptions and publications are represented as follows:

\[ s = R\big((a_1, \mu_1), (a_2, \mu_2), \ldots, (a_m, \mu_m)\big) \]

\[ e = \{(a_1, \pi_1), (a_2, \pi_2), \ldots, (a_n, \pi_n)\} \]

The semantics of matching subscriptions with publications is to measure the possibility and necessity (Dubois, 1988) with which the publication satisfies the expectation expressed by a subscription. Based on possibility theory, we use a pair \((\Pi_i, N_i)\) to denote the possibility and necessity with which the publication satisfies predicate i (i.e., the match between \(\mu_i\) and \(\pi_i\)). This measure is obtained by computing the intersection between \(\mu_i\) and \(\pi_i\). In the following, we first discuss matching at the level of a single predicate and then introduce the matching problem for the whole subscription. The possibility and necessity of a match between two functions \(\mu_i\) and \(\pi_i\) are computed by
Figure 6. Cases of possibility measure
\[ \Pi_i = \sup_{x \in D} \min\big(\mu_i(x), \pi_i(x)\big) \]

\[ N_i = \inf_{x \in D} \max\big(\mu_i(x), 1 - \pi_i(x)\big) \]
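The two measures can be approximated numerically by replacing the domain D with a finite grid, so that sup and inf become max and min. The sketch below is illustrative only: the grids, the helper `mu_at_most`, and the choice of a 10-unit step for rents are assumptions, and a coarser or finer grid changes the approximation.

```python
def possibility(mu, pi, domain):
    """Pi = sup over x in D of min(mu(x), pi(x)), approximated on a finite grid."""
    return max(min(mu(x), pi(x)) for x in domain)

def necessity(mu, pi, domain):
    """N = inf over x in D of max(mu(x), 1 - pi(x)), approximated on a finite grid."""
    return min(max(mu(x), 1 - pi(x)) for x in domain)

def mu_medium(x):
    """Subscription predicate 'size is medium'."""
    if x <= 40 or x >= 80:
        return 0.0
    if x < 50:
        return (x - 40) / 10
    if x <= 70:
        return 1.0
    return 1 - (x - 70) / 10

def mu_at_most(limit):
    """Crisp subscription predicate 'value <= limit' (hypothetical helper)."""
    return lambda x: 1.0 if x <= limit else 0.0

def pi_60(x):
    """Crisp publication value: the size is exactly 60."""
    return 1.0 if x == 60 else 0.0

def pi_cheap(x):
    """Possibility distribution for a 'cheap' rent."""
    if x <= 1200:
        return 1.0
    if x >= 1500:
        return 0.0
    return 1 - (x - 1200) / 300

sizes = range(0, 121)        # assumed grid for the size domain
rents = range(0, 3001, 10)   # assumed grid for the rent domain

# A size of exactly 60 is fully and certainly 'medium'.
print(possibility(mu_medium, pi_60, sizes), necessity(mu_medium, pi_60, sizes))  # 1.0 1.0

# 'cheap' is possibly at most 1300, but not necessarily so.
print(possibility(mu_at_most(1300), pi_cheap, rents))          # 1.0
print(round(necessity(mu_at_most(1300), pi_cheap, rents), 3))  # 0.367
```

The last case shows why possibility alone is too coarse: some cheap rents exceed 1300, so the necessity stays well below 1 even though the possibility is 1.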
A degree of possibility can be viewed as an upper probability bound. Π alone is too coarse to define the matching degree between a publication and a subscription. We need its dual measure, necessity N, as a complement to possibility. Figure 6 shows several cases of the possibility measure, and Figure 7 shows cases of the necessity measure.

With the possibility and necessity degrees for each predicate, the overall matching degree for a subscription is evaluated using a t-norm or an s-norm, according to whether the relation of the predicates contained in the subscription is conjunctive or disjunctive. Usually we choose the minimum operation as the t-norm and the maximum as the s-norm. We generalize the computation of the matching between a subscription and a publication into a formula:
Figure 7. Cases of necessity measure
\[ S_i(x_1, x_2, \ldots, x_n) = R\big(\mu_{i1}(x_1), \mu_{i2}(x_2), \ldots, \mu_{in}(x_n)\big) \]

\[ e(x_1, x_2, \ldots, x_n) = \{\pi_1(x_1), \pi_2(x_2), \ldots, \pi_n(x_n)\} \]

\[ \Pi_{e \otimes S_i}(x_1, x_2, \ldots, x_n) = t\big(\sup \min(\mu_{i1}(x_1), \pi_1(x_1)), \ldots, \sup \min(\mu_{in}(x_n), \pi_n(x_n))\big) \]

\[ N_{e \otimes S_i}(x_1, x_2, \ldots, x_n) = t\big(\inf \max(\mu_{i1}(x_1), 1 - \pi_1(x_1)), \ldots, \inf \max(\mu_{in}(x_n), 1 - \pi_n(x_n))\big) \]
We take \(x_1, \ldots, x_n\) as the attributes referred to by subscriptions and publications; attribute names are therefore omitted in the representations. \(e \otimes S_i\) stands for “e matches \(S_i\).” The operator t handles the relation R in the overall evaluation. For example, if the relation R of the predicates is conjunctive and we choose min as the operation t, then the overall match degree of a subscription is the minimum of the degrees of the predicates it contains.

With this matching semantic, a much larger number of subscriptions will match than in the crisp model, as all matches with degrees greater than 0 are prospective matching candidates. Users’ perceptions of what constitutes a “good” match versus a “bad” match will certainly differ. Furthermore, a large number of slightly matching subscriptions (i.e., with a low degree of match) may not be useful, because users may be overwhelmed by the number of matches returned. For these reasons, the approximate matching model introduces two parameters that control the tolerance of a match on a per-predicate basis for each subscription. These parameters, \(\theta_\Pi\) and \(\theta_N\), define users’ satisfaction thresholds for the possibility and necessity with which their interests are matched. A user’s constraints are matched if both the possibility and necessity degrees are larger than the thresholds \(\theta_\Pi\) and \(\theta_N\). The general representation of a subscription is modified to:

\[ sub = R\big((a_1, \mu_1, \theta_{\Pi_1}, \theta_{N_1}), \ldots, (a_m, \mu_m, \theta_{\Pi_m}, \theta_{N_m})\big) \]
Now we give the definition of matching between subscriptions and publications. Given a set of subscriptions S and a publication p, the matching problem in the approximate publish/subscribe system is to identify all s ∈ S such that s and p match with degrees greater than the thresholds defined on s by its subscriber.
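The threshold-based matching problem can be sketched for a single conjunctive subscription as follows. Several choices here are assumptions made for illustration rather than details confirmed by the text: domains are finite grids, a degree equal to its threshold counts as a match (>=), a predicate whose attribute is absent from the publication cannot match, min is used as the operator t, and the subscription has at least one predicate. The helpers `at_most`, `exactly`, and `evaluate` are hypothetical names.

```python
def possibility(mu, pi, domain):
    return max(min(mu(x), pi(x)) for x in domain)

def necessity(mu, pi, domain):
    return min(max(mu(x), 1 - pi(x)) for x in domain)

def evaluate(subscription, publication, domains):
    """Return (matched, overall_possibility, overall_necessity).

    subscription: non-empty list of (attribute, mu, theta_pi, theta_n), read conjunctively.
    publication:  dict mapping attribute -> possibility distribution pi.
    """
    degrees = []
    for attribute, mu, theta_pi, theta_n in subscription:
        if attribute not in publication:   # assumption: a missing attribute cannot match
            return False, 0.0, 0.0
        pi = publication[attribute]
        p = possibility(mu, pi, domains[attribute])
        n = necessity(mu, pi, domains[attribute])
        degrees.append((p, n, p >= theta_pi and n >= theta_n))
    matched = all(ok for _, _, ok in degrees)
    # min plays the role of the operator t for a conjunctive relation R
    return matched, min(p for p, _, _ in degrees), min(n for _, n, _ in degrees)

def at_most(limit):
    return lambda x: 1.0 if x <= limit else 0.0

def exactly(value):
    return lambda x: 1.0 if x == value else 0.0

def pi_cheap(x):
    if x <= 1200:
        return 1.0
    if x >= 1500:
        return 0.0
    return 1 - (x - 1200) / 300

domains = {"size": range(0, 121), "price": range(0, 3001, 10)}
publication = {"size": exactly(60), "price": pi_cheap}

strict = [("size", at_most(70), 0.5, 0.5), ("price", at_most(1300), 0.5, 0.5)]
tolerant = [("size", at_most(70), 0.5, 0.5), ("price", at_most(1300), 0.5, 0.3)]

print(evaluate(strict, publication, domains)[0])    # False
print(evaluate(tolerant, publication, domains)[0])  # True
```

Both subscriptions see the same degrees; only the necessity threshold on the price predicate differs, which is exactly the per-predicate tolerance that \(\theta_\Pi\) and \(\theta_N\) are meant to provide.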
Core Engine Design

To demonstrate the viability of the approximate publish/subscribe model, the Approximate Toronto Publish/Subscribe System (A-ToPSS) was implemented. Next, we describe the overall system architecture of A-ToPSS and the features supported by its Web interface. We then explain the functions of a control panel to show how to adjust experimental values and monitor the behavior of the system.

System Architecture

The main challenge in applying publish/subscribe systems to real-world applications lies in the design of efficient, scalable matching algorithms. At Internet scale, such a system has to be able to process millions of subscriptions and react to thousands of publications. A-ToPSS is implemented with this consideration in mind. Figure 8 shows the architecture of A-ToPSS. Publishers and subscribers send requests to the system through a Web server (e.g., Apache). The requests include registering personal information, subscribing to interests, and publishing data. Subscriptions and publications are processed by a matching engine. At the same time, all user information passes through a script engine [e.g., PHP, JavaServer Pages™ (JSP), or Meta-HTML™] and is stored in a database. The matching engine matches publications against subscriptions and returns the matched subscriptions to a notification engine. The pervasive notification engine sends different types of notifications (e.g., e-mail, ICQ, TCP/UDP) to the subscribers according to their requests.
Figure 8. Overall architecture of publish/subscribe system
Web Interface

A-ToPSS provides a Web interface for users to interact with the system. The interactive user interface is implemented in the Meta-HTML Web programming language. Meta-HTML is a powerful, extensible server-side programming language specifically designed for the World Wide Web. It resembles a hybrid of HTML and Lisp and has a large existing function library, including support for sockets, image creation, Perl, gnuplot, etc. It is extensible in both Meta-HTML and other languages (e.g., C).

A-ToPSS offers four classes of normal operations: registration, subscribing, publishing, and notification. The first time a user visits the Web interface, registration is required to access the information resource. A user needs to create an ID and set a password. Personal information such as name and address is optional. However, the contact information relevant to notification must be provided in order to receive notifications. For example, an e-mail address must be provided if the user wants to receive notifications via e-mail. These are administrative operations, which are common to most Web applications. Next, we describe features specific to publish/subscribe systems. For simplicity, we explain the operations for subscribing as an illustration. Operations for publishing are similar, and we do not elaborate on them here.

There are two types of users in the system: administrators and regular users. Only administrators have the privilege of creating new subscription types, editing the existing ones, or deleting them. Subscription types are templates for subscriptions. These templates specify the number of predicates and whether an attribute accepts crisp or approximate values. Before a subscription type is modified or deleted, the system checks whether any subscription is defined under this type. Subscription types can only be edited when no subscriptions are defined under them.

The user-level operations on subscriptions are designed for typical users. Subscribers can add new subscriptions and edit or delete the subscriptions they previously defined. When adding a new subscription, the user first chooses a type, and the system then asks the user to input the information required by that subscription type. For crisp subscriptions, users need to provide attribute names, operators (e.g., >, <, =, and ≠), and values (e.g., integers, floats, strings). Approximate subscriptions are more complicated. In addition to attribute names, users need to provide the number of approximate constraints for each attribute. For example, the “price” attribute may have three approximate constraints: “expensive,” “reasonable,” and “cheap.” To represent each constraint, the Web interface provides a trapezoidal membership function whose default values reflect common sense. The Web interface also gives users the flexibility to adapt the membership function to their own specifications: a user chooses among a family of functions to represent the imperfect information and sets the parameters. Figure 9 shows a screen shot of the subscription entry panel of our system, where a user can view and adapt the membership function representing his or her predicate.

Figure 9. Power user’s interface for defining approximate subscriptions
After users submit subscriptions and publications, their information is stored in a database and, at the same time, transmitted to the matching engine for processing. After matching, the matched subscriptions are sent back to the Web interface and stored in the database. At present, A-ToPSS supports notification only through a pull model. When a user clicks the “notification” button, the publications that match the user’s subscriptions are displayed on the Web page. The user can browse the information through a link to the publication that matches his or her subscription. If a subscription or publication is deleted, any match related to it is discarded and is not sent back to the user.
Control and Monitoring Experiments

There are many variables, such as users’ satisfaction thresholds and publication rates, that may affect system behavior. To illustrate the effects of these parameters on the performance of the system, we developed a control panel for adjusting the values of system parameters and a monitoring panel for displaying system metrics and observing the system behavior in real time. Both panels are written as Java applets. To demonstrate the differences between the crisp and approximate publish/subscribe models, we deploy a control panel and a monitoring panel for each model. Figure 10 displays a screen shot of part of the control panel.

On the control panel, users can adjust the following parameters (for the crisp and approximate models): rate of subscription generation, rate of publication generation, rate of subscription deletion, rate of publication deletion, and thresholds of users’ satisfaction. Because the number of predicates and subscriptions in the system is large, it is difficult to control the thresholds for each predicate or subscription individually. In the control panel, we therefore use one pair of thresholds for all subscriptions to check their overall matching degrees. The representation of membership functions is controlled in the normal system operations part: users can choose a form from a function family and adjust the shape of the function according to their own specifications. The study of how the representation of functions affects the number of matched subscriptions is still in progress.

Figure 10. Control panel

On the monitoring panels, the following metrics are observed and displayed: subscription loading time, matching time, number of matched predicates, and number of matched subscriptions. These metrics are taken at monitoring and control points, as indicated in Figure 10. This part aims at experimenting with the matching model to demonstrate and explore its degrees of freedom. We can see that as the subscription thresholds increase, the number of matched subscriptions decreases, as expected. Figure 11 shows the monitoring panel.

Figure 11. Monitoring panel
Evaluation

The performance is evaluated with respect to time and memory to confirm the efficiency of the algorithms and to compare a crisp publish/subscribe model with an approximate model. Experiments are run under various subscription and publication workloads.
Figure 12. Performance evaluation
Performance Evaluation

To evaluate performance, the following metrics are considered: subscription loading time, overall system throughput, and memory used. Time measurements are taken in milliseconds and memory measurements in KB. In Figure 12, we can see that there is a trade-off between loading time and matching time. Spending more time to load subscriptions into a well-organized structure decreases the matching time when incoming events are evaluated. In real-world applications, most subscriptions are static (i.e., they are stored in the system for a long time), and therefore the matching time is more important than the loading time. Moreover, because the publication rate is usually high, it is more important to have a fast matching algorithm that responds in a very short time. In the memory comparison, the char-wise algorithm uses less memory than the float-wise algorithm because it stores 1-byte chars instead of 4-byte floats.
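The memory argument can be illustrated with a small sketch. The text does not spell out the two encodings at this point, so the following is an assumption: the char-wise variant is taken to store each membership degree quantized to one unsigned byte (0..255), and the float-wise variant to store it as a 4-byte single-precision float. The `quantize` helper is hypothetical.

```python
from array import array

def quantize(degree):
    """Map a degree in [0, 1] to an integer in 0..255 (assumed char-wise encoding)."""
    return round(degree * 255)

degrees = [0.0, 0.25, 0.5, 0.75, 1.0]

as_floats = array("f", degrees)                        # float-wise storage
as_chars = array("B", (quantize(d) for d in degrees))  # char-wise storage

# Bytes per stored degree in each representation.
print(as_floats.itemsize, as_chars.itemsize)

# Quantization trades precision for space: the round-trip error is at most 1/510.
worst = max(abs(c / 255 - d) for c, d in zip(as_chars, degrees))
print(worst <= 1 / 510)
```

Under this assumption the char-wise representation needs a quarter of the space per degree, at the cost of limiting degrees to 256 distinct levels.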
Comparison between the Crisp and Approximate Models

There are several properties unique to our publish/subscribe model with uncertainties, such as the expression of predicates, the truth value, and the possibilities. Here the differences between the crisp and approximate publish/subscribe models are compared in two scenarios. In one scenario, the type of publication is fixed, and we vary the types of subscriptions and thresholds to compare crisp matching and approximate matching. The other scenario is the opposite of the first: the subscription type is fixed, and the publication type varies.

Table 1 shows the numbers of matched subscriptions when a fixed publication is published to the system and matched against various types of subscriptions with different α-cuts (α is used as the threshold for both possibilities and necessities). For each subscription type, the number of matches decreased as the α-cut value increased, which shows the threshold effect of α. With the same α, the pessimistic case resulted in the largest number of matches and the optimistic case in the fewest, because the less restrictive the subscription, the higher the probability of being matched. The approximate case and the middle case had almost the same results.

Table 1. Comparison of the number of matched subscriptions for various subscription types (Publication type is approximate; the number of subscriptions is 70,000; and the number of publications is 10.)

Subscription type    α = 0    α = 0.5    α = 1
Approximate          4628     184        7
Pessimistic          4628     804        281
Middle               4438     184        39
Optimistic           3763     47         7

Table 2 shows the numbers of matched subscriptions for different types of publications when the subscription type is fixed. The graphical explanation is shown in Figure 13.

Table 2. Comparison of the number of matched subscriptions for various publication types (Subscription type is approximate; the number of subscriptions is 70,000; and the number of publications is 10.)

Publication type     α = 0    α = 0.5    α = 1
Approximate          4628     184        7
Interval             3720     474        170
Point                2960     1932       868

When α = 0, the approximate publication returned the largest number of matches, and the point type returned the smallest. This happened because the value of the approximate publication has a wider domain, and thus there is a higher possibility that subscriptions’ constraints are matched. However, with higher values of α, the results reversed: the approximate publication matched a very small number of subscriptions, while the point type matched a larger number of subscriptions. This phenomenon can be explained by the intuitive interpretation of the definitions of possibility and necessity. Compared to the point-type publication, the approximate publication has a wider domain of possible values for each attribute. Although there is a higher possibility that the publication satisfies the predicate constraint, it is also more likely that the publication intersects the complementary region of a subscription, in which case the necessity degree of the match will be 0. Therefore, the necessity threshold cannot be reached.

Figure 13. Number of matches for different publication types
Related Work

Industry Standards

There have been a number of standardization efforts on middleware architectures and distributed system interfaces to promote interoperability. The Common Object Request Broker Architecture (CORBA) is a middleware architecture standardized by the Object Management Group (OMG). The CORBA Event Service (OMG, 2001) and Notification Service specifications (OMG, 2002) augment the CORBA middleware platform with event-based messaging capabilities. The Java Message Service (JMS) is the standard Java API for message-oriented middleware, proposed by Sun Microsystems to add messaging integration capabilities to the J2EE platform.

The CORBA Event Service specification defines an indirect channel-based event transport for distributed object frameworks. An event channel decouples event suppliers and consumers. Suppliers generate events and place them onto a channel. Consumers obtain events from the channel. Two serious limitations of the Event Service specification are that it supports only limited event-filtering capabilities and that it cannot be configured to support different qualities of service. Most Event Service implementations deliver all events that are sent to a particular channel to all consumers connected to that channel on a best-effort basis.

A primary goal of the Notification Service is to enhance the Event Service by introducing the concepts of event filtering and quality of service specifications. Clients of the Notification Service can subscribe to events by associating filter objects with the proxies through which the clients communicate with event channels. These filter objects encapsulate specific constraints on the events to be delivered to the client. Furthermore, the Notification Service enables each channel, each connection, and each message to be configured to support the desired quality of service with respect to delivery guarantees, event aging characteristics, and event priorities.

JMS is an API for enterprise messaging created by Sun Microsystems. JMS is not a messaging system itself; it is an abstraction of the interfaces and classes needed by messaging clients when communicating with messaging systems. JMS provides publish/subscribe and point-to-point messaging models. Under the JMS publish/subscribe model, publishers can send messages to many consumers through a virtual channel called a “topic.” All messages addressed to a topic are delivered to all of the topic’s subscribers. Message delivery is push-based, and no polling is required. The point-to-point messaging model uses queues to store and forward messages from suppliers to consumers. A given queue may have multiple receivers, but only one receiver may consume each message. It is a one-to-one communication model.
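The difference between the two messaging models can be made concrete with a toy in-memory sketch. This is not the JMS API; the classes `Topic` and `Queue` and their methods are hypothetical stand-ins that only mimic the delivery semantics described above (fan-out with push delivery for topics, exactly one consumer per message for queues).

```python
from collections import deque

class Topic:
    """Publish/subscribe: every message is pushed to all current subscribers."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:   # push-based, no polling by consumers
            callback(message)

class Queue:
    """Point-to-point: messages are stored until exactly one receiver takes each."""
    def __init__(self):
        self.messages = deque()

    def send(self, message):
        self.messages.append(message)

    def receive(self):
        return self.messages.popleft() if self.messages else None

# Topic: both subscribers see the same message.
inbox_a, inbox_b = [], []
topic = Topic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price update")
print(inbox_a, inbox_b)  # ['price update'] ['price update']

# Queue: two receivers share the messages, and each message is consumed once.
queue = Queue()
queue.send("order 1")
queue.send("order 2")
print(queue.receive(), queue.receive(), queue.receive())  # order 1 order 2 None
```

Neither model filters on message content; that is the gap the Notification Service filter objects and content-based publish/subscribe systems address.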
Continuous Queries

Continuous queries are issued once and then run, logically, continuously over a database. They are sometimes referred to as queries for future data, because data included in the result set may not exist at the time the query is created but will be created in the future. Traditional “one-time queries,” in contrast, run only once to completion and return a result based on the current data sets. The notion of continuous queries is similar to that of subscriptions in publish/subscribe systems. A publish/subscribe system continuously evaluates a subscription against the incoming publication stream until the subscription is removed from the system.

Two research projects, OpenCQ (Liu, 1999) and NiagaraCQ (Chen, 2000), support continuous queries for monitoring persistent datasets spread over a wide-area network. OpenCQ uses a query processing algorithm based on incremental view maintenance. NiagaraCQ addresses scalability in the number of queries by proposing techniques for grouping continuous queries for efficient evaluation. STREAM (Stanford stream data management) is a research project at Stanford that focuses on the processing of continuous queries over data streams. It provides a general and flexible architecture for query processing in the presence of data streams.
Database Trigger Technology

Studies of active databases and database triggers are relevant to continuous queries. Triggers are event-condition-action rules that are used to monitor events and conditions in databases and to execute actions automatically when specific situations are detected. Wolski et al. (1998) proposed a fuzzy trigger to incorporate imprecise reasoning in active databases. The rules that control the event-condition-action sequence are modeled by fuzzy membership functions. This work proposes two trigger models: the C-fuzzy trigger and the CA-fuzzy trigger. The C-fuzzy trigger involves fuzzy inference only in the evaluation of the condition. If actions are also expressed in fuzzy terms and integrated with the condition part, the result is the CA-fuzzy trigger.
Other Publish/Subscribe Research Much work has been devoted to developing publish/subscribe systems and event notification services such as Gryphon (Aguilera, 1999), LeSubscribe (Fabret, 2001), and ToPSS (Liu, 2002). Industrial strength systems include various implementations of JMS, the CORBA Notification Service, and TIB/RV. Common to all current systems is the crisp matching semantic — neither subscriptions nor publications can express uncertain information, and a match is either established or not. These systems are different in the subscription language and publication data model they offer and algorithms performing the matching task. LeSubscribe aims at publish/subscribe support for Web-based applications. It focuses on the algorithmic efficiency in supporting millions of subscriptions and
Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Object-Oriented Publish/Subscribe for Modeling
327
high event-processing rates. The language and data models are based on an LDAP-like semistructured data model for expressing subscriptions and publications. In this system, a subscription is a conjunction of predicates, each of which is a triplet (attribute, operator, value). Supported relational operators include <, ≤, ≠, ≥, >. This system supports both push- and pull-based information dissemination. The matching engine of LeSubscribe falls within the class of two-step matching algorithms — a predicate matching step and a subscription evaluation step. In the first step, all predicates are matched against the publication. In the second step, subscriptions are evaluated based on the set of matched predicates. Instead of two-step matching algorithms, Gryphon uses a tree-based data structure to index subscriptions, which leads to another category of matching algorithms. In Gryphon, all subscriptions are preprocessed into a tree where each non-leaf node is a test for one attribute, and the edges derived from that node represent different results. During matching, the incoming publication goes down through the branch it matches until it arrives at the leaf nodes containing the matched subscriptions. Another approach using a tree-based algorithm is binary decision diagrams (BDDs) (Compailla, 2001). In this model, each subscription is a Boolean function represented by a BDD. This approach is distinguished in two aspects: one is that it can support any Boolean formula; the other is that overlapping subscription expressions are operated only once if the variable ordering was chosen properly. Elvin (Segall, 1997) is a content-based notification/messaging service that targets application integration environments and monitoring of distributed systems. Elvin supports a more expressive subscription language that is created as strings. Subscriptions contain powerful string-processing functions and operators on built-in data types covering integer, string, and Boolean relations. 
In addition to the traditional comparison operators such as <, ≤, =, ≠, >, ≥, Elvin supports operations such as matching strings against extended regular expressions.
SIENA (Scalable Internet Event Notification Architectures) (Carzaniga, 1998) is another example of a publish/subscribe event-notification service that presents a similar publication and subscription language model. This research project is based on a content-based networking service and focuses on the routing of subscriptions and publications in a distributed environment so that both services, notification selection (i.e., determining which publication matches which subscription) and notification delivery (i.e., distributing matching notifications from publishers to subscribers), are balanced. The advantage of this infrastructure is that it maximizes expressiveness in the selection mechanism without sacrificing scalability in the delivery mechanism.
The last research project we introduce here is READY (Gruber, 2000), led by the AT&T research lab. READY is an implementation of the CORBA Notification Service. The specific features of READY, which are not offered by existing commercial products, include information consumer specifications that can be matched over single and compound event patterns, and quality of service (QoS) that is managed by providing ordering properties for event delivery.
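The two-step class of matching algorithms described above for LeSubscribe can be sketched as follows. This is a minimal illustration only, not LeSubscribe's actual engine: the class name `TwoStepMatcher`, its methods, the dictionary representation of a publication, and the inclusion of `=` among the operators are assumptions made for the example.

```python
import operator
from collections import defaultdict

# Relational operators allowed in a predicate (attribute, operator, value).
# Including "=" is an assumption of this sketch.
OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}

class TwoStepMatcher:
    """Hypothetical two-step matcher: predicate matching, then subscription evaluation."""

    def __init__(self):
        self.predicate_ids = {}                    # predicate triple -> predicate id
        self.subs_by_predicate = defaultdict(set)  # predicate id -> subscription ids
        self.predicate_count = {}                  # subscription id -> number of distinct predicates

    def subscribe(self, sub_id, predicates):
        distinct = set(predicates)
        self.predicate_count[sub_id] = len(distinct)
        for pred in distinct:
            # Predicates shared by several subscriptions are stored only once.
            pid = self.predicate_ids.setdefault(pred, len(self.predicate_ids))
            self.subs_by_predicate[pid].add(sub_id)

    def match(self, publication):
        # Step 1: evaluate every distinct predicate once against the publication.
        satisfied = [
            pid for (attr, op, value), pid in self.predicate_ids.items()
            if attr in publication and OPS[op](publication[attr], value)
        ]
        # Step 2: a conjunctive subscription matches when all of its predicates were satisfied.
        hits = defaultdict(int)
        for pid in satisfied:
            for sub_id in self.subs_by_predicate[pid]:
                hits[sub_id] += 1
        return {s for s, n in hits.items() if n == self.predicate_count[s]}

matcher = TwoStepMatcher()
matcher.subscribe("s1", [("price", "<", 100), ("type", "=", "book")])
matcher.subscribe("s2", [("price", "<", 100)])
matcher.subscribe("s3", [("price", ">=", 100)])
print(sorted(matcher.match({"price": 80, "type": "book"})))  # -> ['s1', 's2']
```

Counting satisfied predicates per subscription is one simple way to carry out the second step; in this sketch a subscription with no predicates never matches.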
The Toronto Publish/Subscribe System Family
Recently, the publish/subscribe paradigm has gained widespread interest for modeling applications such as selective information dissemination services and location-based services. In this context, the Middleware Systems Research Group at the University of Toronto is working on the Toronto Publish/Subscribe System family of research projects, which includes A-ToPSS (Liu, 2002; 2003; 2004a; 2004b), S-ToPSS (Petrovic, 2003), L-ToPSS (Burcea, 2003; Xu, 2004), persistent-ToPSS (Leung, 2003), M-ToPSS, and P2P-ToPSS (Tam, 2003).
S-ToPSS (Semantic Toronto Publish/Subscribe System) is a semantic-aware system in which the matching between subscriptions and publications is based on the semantics of terms rather than on their syntax. For example, publications about “automobiles” may be sent to subscribers who are interested in “vehicles”. S-ToPSS uses three approaches to support semantic matching. The first is the use of synonyms. The second uses a concept hierarchy, which provides the relationships (specialization and generalization) between attributes and values. The third defines a set of mapping functions that allow arbitrary relationships between schemas and attribute values. The added semantic capability is realized by passing the incoming publications and subscriptions through three components that implement these approaches, respectively. A set of semantically equivalent publications and subscriptions is generated and then matched by the existing algorithm.
L-ToPSS (Location-aware ToPSS) uses the publish/subscribe paradigm to implement push-oriented location-based services. On top of the filtering engine, L-ToPSS adds a location staging component that periodically processes users’ location updates. A location matching engine is used to match the location constraints exposed by subscriptions and publications.
Considering the limited power and input capability of mobile devices, this prototype provides services in a push-oriented style, thus offering an efficient notification mechanism for mobile users.
Persistent ToPSS develops a new publish/subscribe model that accommodates subscription and publication state. Traditional publish/subscribe models are stateless; the state-persistent subject spaces model developed here, however, tracks publications and subscriptions throughout their lifetime.
M-ToPSS (Mobile ToPSS) develops efficient state transfer protocols to support disconnected operations in distributed publish/subscribe broker networks. A subscriber connected to one broker may travel to another part of the network and connect to a new broker. The publish/subscribe broker network has to store any matching publications for the subscriber, forward these publications to the new location, and change the routing information in the network to route future traffic directly to the new location of the subscriber.
P2P-ToPSS (peer-to-peer ToPSS) develops techniques to layer a content-based publish/subscribe protocol on top of a peer-to-peer substrate, thus leveraging the benefits of the p2p network (i.e., scalability, fault tolerance, and resource availability).
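The first two S-ToPSS approaches described above (synonyms and a concept hierarchy) can be illustrated with a small sketch. It is not the S-ToPSS implementation: the vocabularies `SYNONYMS` and `PARENT`, the function names, and the single-term-per-attribute representation are hypothetical, and the third approach (mapping functions) is left out. The sketch expands a publication into the set of terms it is semantically equivalent to, or a specialization of, and then applies plain exact matching to the expanded form.

```python
# Hypothetical vocabularies for illustration; not taken from S-ToPSS.
SYNONYMS = {"car": "automobile", "auto": "automobile"}  # term -> canonical term
PARENT = {  # term -> more general term; assumed to contain no cycles
    "sedan": "automobile",
    "automobile": "vehicle",
    "truck": "vehicle",
}

def canonical(term):
    """Replace a synonym by its canonical term (synonym stage)."""
    return SYNONYMS.get(term, term)

def expand(term):
    """Return the canonical term followed by each of its generalizations (hierarchy stage)."""
    terms = [canonical(term)]
    while terms[-1] in PARENT:
        terms.append(PARENT[terms[-1]])
    return terms

def expand_publication(publication):
    """Map each attribute to the set of terms that its value equals or specializes."""
    return {attr: set(expand(value)) for attr, value in publication.items()}

def matches(subscription, publication):
    """Exact matching applied to the semantically expanded publication."""
    expanded = expand_publication(publication)
    return all(
        canonical(term) in expanded.get(attr, set())
        for attr, term in subscription.items()
    )

print(matches({"item": "vehicle"}, {"item": "car"}))       # -> True
print(matches({"item": "automobile"}, {"item": "truck"}))  # -> False
```

Expanding the publication rather than changing the matcher mirrors the idea that the semantic stages sit in front of an existing matching algorithm.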
Summary
In this chapter, we presented the publish/subscribe paradigm and introduced a model that allows expression of imperfect information in both subscriptions and publications. Fuzzy set theory and possibility theory are used to represent notions of imprecision in predicates and publications. The most important property of this approximate publish/subscribe model is that the language model is flexible and powerful in that it allows subscriptions and publications to be either crisp or approximate. Furthermore, the possibility and necessity measures used to calculate the degree of match are expressive. The two measures can be used to model users with different preferences, such as optimistic and pessimistic.
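As a reminder of how the two measures behave, the sketch below computes them with the standard definitions from possibility theory over a discrete domain: possibility is the maximum over x of min(μ(x), π(x)), and necessity is the minimum over x of max(μ(x), 1 − π(x)), where μ is the membership function of a fuzzy predicate and π is the possibility distribution of an approximate publication. The membership functions `cheap`, `around_60`, and `exactly_60` and the integer price domain are hypothetical examples, and the sketch is not meant to reproduce the exact formulation used by A-ToPSS.

```python
def possibility(mu_sub, pi_pub, domain):
    """Optimistic degree of match: max over x of min(mu_sub(x), pi_pub(x))."""
    return max(min(mu_sub(x), pi_pub(x)) for x in domain)

def necessity(mu_sub, pi_pub, domain):
    """Pessimistic degree of match: min over x of max(mu_sub(x), 1 - pi_pub(x))."""
    return min(max(mu_sub(x), 1.0 - pi_pub(x)) for x in domain)

# Hypothetical fuzzy predicate "price is cheap": fully true up to 50, false from 100.
def cheap(x):
    if x <= 50:
        return 1.0
    if x >= 100:
        return 0.0
    return (100 - x) / 50.0

# Hypothetical approximate publication "price is around 60" (triangular on 40..80).
def around_60(x):
    return max(0.0, 1.0 - abs(x - 60) / 20.0)

# Crisp publication "price is exactly 60".
def exactly_60(x):
    return 1.0 if x == 60 else 0.0

PRICES = range(0, 201)  # assumed discrete price domain

print(possibility(cheap, around_60, PRICES), necessity(cheap, around_60, PRICES))
print(possibility(cheap, exactly_60, PRICES), necessity(cheap, exactly_60, PRICES))
```

For the approximate publication the necessity is lower than the possibility, which is what lets a pessimistic user demand a stricter match than an optimistic one; for the crisp publication both measures reduce to the membership degree of the published value.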
References
Aguilera, M. K., Strom, R. E., Sturman, D. C., Astley, M., & Chandra, T. D. (1999). Matching events in a content-based subscription system. Presented at the Symposium on Principles of Distributed Computing.
Babu, S., & Widom, J. (2001). Continuous queries over data streams. ACM Special Interest Group on Management of Data (SIGMOD) Record, 2001(3), 109–120.
Banavar, G., Chandra, T. D., Mukherjee, B., Nagarajarao, J., Strom, R. E., & Sturman, D. C. (1999). An efficient multicast protocol for content-based publish/subscribe systems. Presented at the International Conference on Distributed Computing Systems.
Barrett, D. J., Clarke, L. A., Tarr, P. L., & Wise, A. E. (1996). A framework for event-based software integration. ACM Transactions on Software Engineering and Methodology, 5(4), 378–421.
Burcea, I., & Jacobsen, H. A. (2003). L-ToPSS – Push-oriented location-based services. Presented at the Fourth VLDB Workshop on Technologies for E-Services (TES’03). Humboldt University, Berlin, Germany.
Burcea, I., Jacobsen, H. A., DeLara, E., Muthusamy, V., & Petrovic, M. (2004). Disconnected operations in publish/subscribe. In 2004 IEEE International Conference on Mobile Data Management (MDM).
Carzaniga, A., Rosenblum, D. S., & Wolf, A. L. (1998). Design of a scalable event notification service: Interface and architecture. Technical Report CU-CS-863-98, Department of Computer Science, University of Colorado.
Chakravarthy, S., & Mishra, D. (1994). Snoop: An expressive event specification language for active databases. Data and Knowledge Engineering, 14(1), 1–26.
Chen, J., DeWitt, D. J., Tian, F., & Wang, Y. (2000). NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the 2000 ACM Special Interest Group on Management of Data (SIGMOD) International Conference on Knowledge Discovery and Data Mining (pp. 9–17).
Clarke, I., Sandberg, O., Wiley, B., & Hong, T. W. (2000). Freenet: A distributed anonymous information storage and retrieval system. In Proceedings of ICSI Workshop on Design Issues in Anonymity and Unobservability, International Computer Science Institute.
Compailla, A., Chaki, S., Jha, S., & Veith, H. (2001). Efficient filtering in publish/subscribe system using binary decision diagrams. In Proceedings of the 23rd International Conference on Software Engineering (ICSE).
Cugola, G., Nitto, E. D., & Fuggetta, A. (2001). The JEDI event-based infrastructure and its application to the development of the OPSS WFMS. IEEE Transactions on Software Engineering, 27(9), 827–850.
Dingledine, R., Freedman, M. J., & Molnar, D. (2000). The Free Haven project: Distributed anonymous storage service. In Proceedings of Workshop on Design Issues in Anonymity and Unobservability.
Dubois, D., & Prade, H. (1988). Possibility theory: An approach to computerized processing of uncertainty. New York: Plenum Press.
Eugster, P. Th., Guerraoui, R., & Sventek, J. (2000). Distributed asynchronous collections: Abstractions for publish/subscribe interaction. In 14th AITO European Conference on Object-Oriented Programming (ECOOP 2000) (pp. 252–276).
Fabret, F., Jacobsen, H. A., Lirbat, F., Pereira, J., Ross, K. A., & Shasha, D. (2001). Filtering algorithm and implementation for fast publish/subscribe systems. Presented at the ACM Special Interest Group on Management of Data (SIGMOD) Conference, Santa Barbara, CA.
Gruber, R. E., Krishnamurthy, B., & Panagos, E. (2000). READY: A high performance event notification service. In Proceedings of the 16th International Conference on Data Engineering. San Diego, California, USA.
Hapner, M., et al. (2002). Java message service API tutorial and reference: Messaging for the J2EE platform. Addison-Wesley.
Leung, H., & Jacobsen, H. (2003). Efficient matching for state-persistent publish/subscribe systems. In Proceedings of the 2003 Conference of the Centre for Advanced Studies on Collaborative Research. Toronto, Canada.
Liu, H., & Jacobsen, H. A. (2002). A-ToPSS – A publish/subscribe system supporting approximate matching. Presented at the 28th International Conference on Very Large Data Bases, Hong Kong, China.
Liu, H., & Jacobsen, H. A. (2003). Approximate matching in publish/subscribe. In Proceedings of the Fifth IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA2003). Kobe, Japan.
Liu, H., & Jacobsen, H. A. (2004a). Modeling uncertainties in publish/subscribe system. In Proceedings of the 20th International Conference on Data Engineering, Boston, MA.
Liu, H., & Jacobsen, H. A. (2004b). A-ToPSS – A publish/subscribe system supporting imperfect information processing. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada.
Liu, L., Pu, C., & Tang, W. (1999). Continuous queries for internet scale event-driven information delivery. IEEE Transactions on Knowledge and Data Engineering, 11(4), 583–590.
Monson-Haefel, R., & Chappell, D. (2000). Java message service. O’Reilly.
Object Management Group. (2001). Event Service Specification, Version 1.1.
Object Management Group. (2002). Notification Service Specification, Version 1.0.1.
Object Management Group. (2004). Data Distribution Service Specification. Version 1.0 finalization underway.
Petrovic, M., Burcea, I., & Jacobsen, H. A. (2003). S-ToPSS: Semantic Toronto publish/subscribe system. In Proceedings of the 29th International Conference on Very Large Data Bases. Humboldt University, Berlin, Germany.
Samani, M. M., & Sloman, M. (1997). GEM – A generalized event monitoring language for distributed systems. In Joint International Conference on Open Distributed Processing (ICODP) and Distributed Platforms (ICDP) ’97, Toronto, Canada.
Segall, B., & Arnold, D. (1997). Elvin has left the building: A publish/subscribe notification service with quenching. In Proceedings of the Australian UNIX and Open Systems User Group Conference (AUUG’97). Brisbane, Australia.
Smets, P. (1997). Imperfect information: Imprecision-uncertainty. In Uncertainty management in information systems: From needs to solutions (pp. 225–254). Dordrecht: Kluwer Academic Publishers.
Sun Microsystems Inc. (2002). Java message service specification. Version 1.1.
Tam, D., Azimi, R., & Jacobsen, H. A. (2003). Building content-based publish/subscribe systems with distributed hash tables. Presented at the International Workshop on Databases, Information Systems and Peer-to-Peer Computing. Humboldt University, Berlin, Germany.
Wolski, A., & Bouaziz, T. (1998). Fuzzy triggers: Incorporating imprecise reasoning into active database. In Proceedings of the 14th International Conference on Data Engineering.
Xu, Z., & Jacobsen, H. A. (2004). Efficient constraint processing for highly personalized location based services. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada.
Zadeh, L. A. (1989). Knowledge representation in fuzzy logic. IEEE Transactions on Knowledge and Data Engineering, 1, 89–100.
About the Authors
Zongmin Ma received his Ph.D. from the City University of Hong Kong (2001). His current research interests include intelligent database systems, knowledge management, Web-based data management, e-learning systems, intelligent planning and scheduling, decision making, robot path/motion planning, engineering database modeling, and enterprise information systems. He has published many papers in journals, conferences, edited books, and encyclopedias in these areas. He is also currently authoring and editing several upcoming books to be published by Kluwer Academic Publishers and Idea Group Inc.
* * *
Rafal Angryk received a Ph.D. in computer science from Tulane University (USA) and also has an M.A. in business management and an M.Sc. in computer systems. He worked as a research assistant at the Center for Computational Sciences, a program organized in cooperation between Stennis Space Center (NASA) and University of Southern Mississippi. Previously, he was on the faculty at the Institute of Computer Science, University of Szczecin, Poland. His current research interests are large databases (data mining, spatial databases), mobile agents technology (distributed processing, Web-mining), and artificial intelligence (fuzzy modeling, neural networks), and he has over a dozen papers in these areas.
Fernando Berzal is an assistant professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, where he is a
member of the Intelligent Databases and Information Systems research group (IdBIS, for short). His current research interests include knowledge discovery in databases and data mining, OLAP and data warehousing, intelligent information systems, and almost anything related to software development, from model-driven development to design patterns and software engineering practices.
Tru Hoang Cao is currently vice dean of the Faculty of Information Technology, Ho Chi Minh City University of Technology. He received his B.Eng. (Gold Medal) in computer science and engineering from Ho Chi Minh City University of Technology (1990), M.Eng. (Tim Kendall Memorial Prize) in computer science from Asian Institute of Technology (1995), and a Ph.D. in computer science from University of Queensland (1999). He then spent more than two years doing postdoctoral research in the Artificial Intelligence Group, University of Bristol, and the Berkeley Initiative in Soft Computing, University of California at Berkeley. His research interests are uncertain and imprecise knowledge representation and reasoning, conceptual structures, nonclassical logics and their applications, object-oriented systems, and intelligent Internet. He is author and co-author of more than 30 research papers in international journals, edited books, and conference proceedings.
Rita de Caluwe studied mathematics at Ghent University, earning an M.Sc. in computer science (1965) from the “Université Scientifique et Médicale” of Grenoble (France) and graduating with a Ph.D. in 1973. She has been professionally active as an assistant at the Computing Centre of Ghent University. Her academic career as a professor in computer science at the same university started in 1974, and she leads a research group on fuzzy databases. She is (co)author of a number of publications in this field, has served as a reviewer of many conference and journal papers, and has participated actively in the elaboration of major conferences.
She organized a series of “Lectures on Fuzziness and Databases” at Ghent University (1992-1997). Furthermore, she has been involved in IFIP activities for more than 25 years, representing Belgium in the General Assembly (1998-2002).
Elena García-Barriocanal obtained a university degree in computer science from the Pontifical University of Salamanca in Madrid (1998) and a Ph.D. from the Computer Science Department of the University of Alcalá. In 1998, she joined the Computer Science Department of this university as assistant professor. Since 2000, she has been associate professor with the Computer Science Department of the University of Alcalá, and she is a member of the Knowledge and Soft Computing group of this university. Her research interests mainly focus on topics related to human-computer interactions and knowledge
representation; in particular, she works actively on ontological aspects of usability and accessibility.
Phil Graniero is an assistant professor in the Earth Sciences Department at the University of Windsor and a researcher at the Great Lakes Institute for Environmental Research, with more than 10 years of experience in GIS-related research and development in academia and industry. His research combines environmental science with computer science, emphasizing investigations in spatial sampling strategies, eco-hydrological modeling, and wetland dynamics. His primary research interest is the integration of GIS, artificial intelligence, data acquisition technologies, and ecosystem models into innovative tools that maximize spatial information effectiveness. He teaches undergraduate and graduate courses in GIS, spatial problem solving, and environmental modeling.
José A. Gutiérrez obtained university degrees in computer science (Polytechnic University of Madrid), mathematics (Complutense University), and library science (University of Alcalá), and a Ph.D. from the University of Alcalá. He has worked in several companies as project manager, and he has held the positions of head of information systems at the University of Alcalá and vice-dean of the Polytechnic School of the University of Alcalá. He currently works as a full professor at the Computer Science Department of the University of Alcalá and supervises several Ph.D. theses in the areas of fuzzy sets and software engineering.
Sven Helmer studied computer science (Informatik) at the University of Karlsruhe in Germany (1989-1995). Following that, he acquired a Ph.D. doing research in the area of database performance at the University of Mannheim, Germany (2000). Currently, he is working on his Habilitation (postdoctoral lecture qualification, roughly comparable to an assistant professorship) in the area of native XML database systems. He has published more than 25 papers in various journals, conference proceedings, and books.
Furthermore, he has served as a reviewer for different journals and as a member of several program committees.
Hans Arno Jacobsen holds a faculty position with the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Toronto (Canada), where he leads the Middleware Systems research group. His principal areas of research include middleware systems, distributed systems, and information systems. He received a Ph.D. from Humboldt University, Berlin (1999), and his M.A.Sc. from the University of
Karlsruhe (1994). From 1992-1998, he conducted research at various institutes around the globe, including LIFIA in Grenoble, France; ICSI in Berkeley; LBNL in Berkeley; and INRIA in Rocquencourt, France. He has served as a program committee member of numerous international workshops and conferences, including ICDCS, OOPSLA, Middleware, and VLDB. He is the program chair of the Fifth International Middleware Conference, in Toronto, Canada. For more information, please visit http://www.eecg.toronto.edu/~jacobsen.
Roy Ladner received an M.S. in computer science and a Ph.D. in engineering and applied science from the University of New Orleans. He works as a research scientist at the Naval Research Laboratory at Stennis Space Center, Mississippi (USA). His work emphasizes the investigation of spatiotemporal database issues and advanced methods to improve delivery of spatiotemporal data over the Internet. His research has been published in national and international conference proceedings and journals.
Haifeng Liu is a Ph.D. student in the Department of Computer Science at the University of Toronto (Canada). Her research areas include database technology, information systems, distributed systems, and Web information retrieval. Haifeng Liu received her master’s degree from the University of Toronto (2003) and her bachelor’s degree from the University of Science and Technology of China (2001). She interned as a visiting student at Microsoft Research Asia in Beijing (July-September 2003).
Nicolás Marín received a Ph.D. in computer science from the University of Granada, Spain (2001). He currently works as a full-time assistant professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, where he is a member of the Intelligent Databases and Information Systems Research Group of the Andalusian Government.
He is a member of the team of several financed projects, and his research interest is focused on the fields of fuzzy databases, knowledge discovery and data mining, fuzzy sets theory, soft computing, OLAP, and data warehousing.
Hoa Nguyen is a lecturer in the Faculty of Information Technology, Ho Chi Minh City Open University. He received his B.Sc. in mathematics from Vinh Pedagogical University (1982) and his M.Eng. in computer science and engineering from Ho Chi Minh City University of Technology (2003). His research interests are mathematical logic and its applications, fuzzy and probabilistic database modeling, and technologies for constructing intelligent systems.
Frederick E. Petry received a Ph.D. in computer science from Ohio State University, was on the faculty of the University of Alabama in Huntsville and Ohio State, and is currently a full professor in Electrical Engineering & Computer Science at Tulane University (USA). His recent research interests include representation of imprecision via fuzzy sets and rough sets in databases, GIS, and other information systems. Dr. Petry has more than 300 scientific publications, and his monograph on fuzzy databases was widely recognized as a definitive volume on this topic. He was selected an IEEE Fellow in 1996 for research on fuzzy sets for modeling imprecision in databases, and in 2003 he was made an IFSA Fellow.
Olga Pons received a Ph.D. in computer science from the University of Granada, where she currently works as an associate professor in the Department of Computer Science and Artificial Intelligence. She has participated in several financed research projects of the Spanish Ministry of Science and Technology. She has written several book chapters and more than 20 articles that appeared in international journals. She also participates in well-known conferences in the fields of soft computing and fuzzy sets (EUFIT, IPMU, FuzzyIEEE, IFSA, ISMIS, FQAS, etc.), where she has also chaired sessions and participated in the program committees. Her research interest is focused on the fields of fuzzy databases, knowledge discovery and data mining, fuzzy sets theory, and soft computing.
Vincent B. Robinson is an associate professor in the Department of Geography at the University of Toronto, Canada. He held the Alberta Forestry, Lands, and Wildlife Professorship in Digital Mapping and Spatial Data Management at The University of Calgary and came to the University of Toronto as director of the Institute for Land Information Management. He has published extensively on topics relating fuzzy information processing to problems of geographic information systems.
His current research is a strong interdisciplinary blend of geographical information science, intelligent systems, and landscape biogeography. He teaches undergraduate and graduate courses in geographic information processing and landscape biogeography.
Jonathan Michael Rossiter is a lecturer in artificial intelligence in the Department of Engineering Mathematics, University of Bristol, UK. He is currently a JSPS and Royal Society research fellow spending two years in the Biologically Integrative Sensory Systems Laboratory, Bio-mimetic Control Systems Laboratory, RIKEN (the Institute of Physical and Chemical Research), Japan. He received his B.Eng. in electronics (1992), his M.Sc. in computer
science (1996), and his Ph.D. in artificial intelligence (2000), all from the University of Bristol. His research interests include humanist computing, uncertain reasoning, uncertain conceptual structures, information fusion, image processing, and medical information processing. He is author and co-author of more than 20 research papers in international journals, edited books, and conference proceedings.
Miguel Á. Sicilia obtained a university degree in computer science from the Pontifical University of Salamanca in Madrid, Spain (1996) and a Ph.D. from Carlos III University in Madrid, Spain (2002). In 1997 he joined an object-technology consulting firm, after enjoying a research grant at the Instituto de Automática Industrial (Spanish Research Council). From 1997-1999, he worked as assistant professor at the Pontifical University, after which he joined the Computer Science Department of the Carlos III University in Madrid as a lecturer, working simultaneously as a software architect in e-commerce consulting firms and as a member of the development team of a personalization engine. From 2002 to October 2003, he worked as a full-time lecturer at Carlos III University, working actively in the area of adaptive hypermedia. Currently, he works as a full-time professor at the Computer Science Department, University of Alcalá (Madrid). His research interests are primarily adaptive hypermedia, learning technology, and human-computer interaction, with special focus on the role of uncertainty and imprecision handling techniques in those fields.
María-Amparo Vila received her M.S. in mathematics (1973) and her Ph.D. in mathematics (1978), both from the University of Granada. Since 1992, she has been a professor in the Department of Computer Science and Artificial Intelligence. Since 1997, she has also been head of the department and the IdBIS research group.
Her research activity is centered around the application of soft computing techniques to different areas of computer science and artificial intelligence, such as theoretical aspects of fuzzy sets; decision and optimization processes in fuzzy environments; fuzzy databases, including relational, logical, and object-oriented data models; and information retrieval. She has been responsible for 10 research projects and the advisor of seven Ph.D. theses. She has published more than 50 papers in prestigious international journals, more than 60 contributions to international conferences, and many book chapters.
Index
A A-ToPSS 317 access patterns 209 access support relations (ASRs) 227 access via type hierarchies 209 adjustment belief revision 138 agent 272 application programming interfaces (APIs) 246 approximate matching 303 approximate publish/subscribe systems 303 artificial intelligence 128 association rules 87 associations 158 atomic fuzzy selection expression 64 atomic type 54 attribute generalization 96 attribute generalization algorithm 85 attribute-oriented induction 86
B B-trees 215 basic type 11, 15 Bayesian network-based 129
body clauses 123 body phase 123
C cardinality ratio 184 Cartesian product 69 CG-trees 233 CH-index 233 class 115 class hierarchy 48, 193 class inspector 202 class recognition 119 closeness of mapping 245 clusters 259 collection type 11, 15 complex objects 185 concept hierarchy 85, 96 conceptual data model 153 conceptual data modeling 153 conditional probability 48 consistent fuzzy concept hierarchy 98 constraint 23 constraint system 23, 24 continuous queries 325 core engine design 317 crisp concept hierarchy 97
D data browser 130 data cube 88 data definition operators 32 data generalization 86 data graph 186 data manipulation operators 32 data mining 85 data warehouse 87 database management systems (DBMS) 207 database model 1, 31 database query 273, 284 database researchers 178 database scheme 29 database trigger technology 325 db4o 255 decision model 277 dependency 158 difference 69 disjunctive fuzzy set 183 dispersal model 277
E ECO-COSM 269 ecological models 273 ellipse problem 130 entity-relationship (ER) 154 enumeration type 12, 15 equality constraint 6 existing database system 177 expressiveness 306 extended possibilistic truth value (EPTV) 8 extendible hashing 216 extendible signature hashing index (ESH) 225 extent cardinality 251 external hashing 215
F face problem 130 FILUM 138 FIRMS model 5 flat hierarchy 105
flexible inheritance 247 food model 4 FOODBS architecture 198 FOODM model 5 FPOB instances 60 FRIL++ 113, 123 fuzzily described objects 185 fuzzy aggregation 164 fuzzy algebra 4 fuzzy association 166, 249 fuzzy association algebra 5 fuzzy association design 255 fuzzy atom 116 fuzzy attributes 210, 247 fuzzy class 159, 179, 247 fuzzy class extents 190 fuzzy class hierarchy 91 fuzzy class schema 93 fuzzy clustering algorithm 88 fuzzy collections 183 fuzzy concept 178 fuzzy concept hierarchy 97 fuzzy conceptual modeling 246 fuzzy constructor 192 fuzzy constructs 244 fuzzy data 159 fuzzy data mining 88 fuzzy database model 272 fuzzy databases 246 fuzzy dependency 169 fuzzy extensions 197, 244 fuzzy extents 190 fuzzy generalization 161 fuzzy generalization relation 163 fuzzy generalization-specialization 248 fuzzy graph constraint 7 fuzzy information modeling 153 fuzzy logic 47, 114 fuzzy measure 302 fuzzy modeling 178 fuzzy object oriented database model 1 fuzzy object-oriented capabilities 177 fuzzy object-oriented concepts 197 fuzzy object-oriented databases (FOODBSs) 206 fuzzy object-oriented model 47, 86, 90
fuzzy property 115 fuzzy region 284 fuzzy relation cardinality 251 fuzzy relations 248 fuzzy selection conditions 64 fuzzy selection expression 63 fuzzy set 88, 156, 210, 276 fuzzy set comparison 183 fuzzy set theory 48, 302 fuzzy spatial relation 273, 284 fuzzy superclass 162 fuzzy type 191 fuzzy values 247
G G-trees 221 general two-dimensional indexes 222 generalization 158 generalized constraint 6 generalized resemblance operator 184 geographic database 270 geographic information systems (GISs) 274 geographical data 270 global connectivity 270 global depth 215 grid files 216
H H-trees 232 hierarchical model 146 hierarchical signature organization 224
I imprecise value 181 imprecision 303 inclusion operator 184 index structures 214 individual-based modeling (IBM) 270 inducer 129 information categorization 305 information dissemination 305 inheritance 208 inheritance relationship 190 instrumentation subsystem 282
interpretation of path expressions 64 intersection 69 interval supports 140 iterated prisoner’s dilemma (IPD) 141
J Java™ data objects (JDO) 242, 252 join-compatible 74
K K-d trees 217 KBLIMS 272 knowledge discovery 86 knowledge representation 146
L label clauses 123 label phase 123 landscape 271 linguistic labels 55, 181 Lisp 125 location-aware ToPSS 328 logic programming 114
M machine learning 86, 128 mass assignment 49 membership degree 47, 163 membership function 285 meta-meta-model layer (M3) 246 modeling subsystem 280 modeling with words 143 monitoring experiments 320 multikey index 234 multivalued attributes 209 multivalued reference type 14
N navigational access 213 navigational access via paths 209 necessity measure 211 neural network-based 129 nonempty intersection query 219
O object bases 48 object constraint language (OCL) 248 object data management group (ODMG) 208, 242 object identifier (OID) 208 object persistence sources 242 object scheme 29 object-centered model 3 object-oriented data paradigm 178 object-oriented database 113, 274 object-oriented database management systems (OODBMS) 178 object-oriented database model 2 object-oriented databases (OODBs) 85, 242 object-oriented logic programs 123 object-oriented model 3, 47 object-relational database management systems 178 ObjectStore 259 orthogonal persistence interfaces 244 orthogonal persistence system 242
P partition tree 99 partitioned signature organization 225 path expression 63 pattern recognition 86 perceptual range 271 persistent object 21 physical storage models 250 polymorphy 209 possibilistic constraint 6 possibility distribution 156, 181 possibility measure 211 possibility theory 302 preferred default subset 118 probabilistic combination strategies 53 probabilistic constraint 7 probabilistic default reasoning 117 probabilistic extent 60 probabilistic interpretation 48 probabilistic object base 46 probabilistic tuple values 56
probability degree 47 probability distribution 48 probability theory 47 probability-value constraint 7 probe 272 programming language 113 projection 69 PROLOG 123 properties 115 property inheritance 119 prototype 201 publication data model 309, 312 publish/subscribe messaging paradigm 305 publish/subscribe paradigm 301 publish/subscribe systems 305
Q query language 241 query-directed approach 272 querying 211
R random set constraint 7 recursion 186 reference instance 21 reference type 14 reflection capability 193 relational interval trees 219 renaming 69 renaming expression 70 resemblance relationship 181 RI-trees 219 role-expressiveness 245 “rough” object-oriented database 5
S SC-trees 232 segments 259 selection expression 63 selection operation 63 semantic data model 179, 189 semantic representation 181 semantic structure 273 semantics of a constraint 23
sequential signature file (SSF) 224 set type 54 signature tree (ST) 224 signatures 222 similarity relations 101 similarity relationship 86 similarity-based model 4 simple user recognition 136 simulation models 270 single-valued attributes 209, 214 single-valued reference type 14 soft computing 178 software development 180 sophisticated access method 206 spatial data 270 specialization process 190 standard database architecture 198 standard index structure 206 state-persistent publish 307 storage hierarchy 207 structured type 12 subscription language 309 subscription language model 311 subset query 219 superclasses 59 superimposed coding technique 222 support pairs 116 syntax of a constraint 23 syntax rules 11
T top-level attributes 55 Toronto publish/subscribe system family 327 transient object 21 translation layer 179 tree hierarchy 145 tuple type 54 type 10 type hierarchies 213, 232 type system 10 type-based publish/subscribe 307
U UFO model 4 UML 153 uncertainty 188, 303 uncertainty degrees 232 unified modeling language (UML) 243 union 69 user model layer 246 user modeling 134 user object layer 246 user recognition 137 usuality constraint 7
V veristic constraint 7 virtual memory mapping architecture 259 void type 14 voting model 49