Zongmin Ma and Li Yan (Eds.) Soft Computing in XML Data Management
Studies in Fuzziness and Soft Computing, Volume 255
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute, Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw, Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 238. Atanu Sengupta, Tapan Kumar Pal, Fuzzy Preference Ordering of Interval Numbers in Decision Problems, 2009, ISBN 978-3-540-89914-3
Vol. 239. Baoding Liu, Theory and Practice of Uncertain Programming, 2009, ISBN 978-3-540-89483-4
Vol. 240. Asli Celikyilmaz, I. Burhan Türksen, Modeling Uncertainty with Fuzzy Logic, 2009, ISBN 978-3-540-89923-5
Vol. 241. Jacek Kluska, Analytical Methods in Fuzzy Modeling and Control, 2009, ISBN 978-3-540-89926-6
Vol. 242. Yaochu Jin, Lipo Wang, Fuzzy Systems in Bioinformatics and Computational Biology, 2009, ISBN 978-3-540-89967-9
Vol. 243. Rudolf Seising (Ed.), Views on Fuzzy Sets and Systems from Different Perspectives, 2009, ISBN 978-3-540-93801-9
Vol. 244. Xiaodong Liu and Witold Pedrycz, Axiomatic Fuzzy Set Theory and Its Applications, 2009, ISBN 978-3-642-00401-8
Vol. 245. Xuzhu Wang, Da Ruan, Etienne E. Kerre, Mathematics of Fuzziness – Basic Issues, 2009, ISBN 978-3-540-78310-7
Vol. 246. Piedad Brox, Iluminada Castillo, Santiago Sánchez Solano, Fuzzy Logic-Based Algorithms for Video De-Interlacing, 2010, ISBN 978-3-642-10694-1
Vol. 247. Michael Glykas, Fuzzy Cognitive Maps, 2010, ISBN 978-3-642-03219-6
Vol. 248. Bing-Yuan Cao, Optimal Models and Methods with Fuzzy Quantities, 2010, ISBN 978-3-642-10710-8
Vol. 249. Bernadette Bouchon-Meunier, Luis Magdalena, Manuel Ojeda-Aciego, José-Luis Verdegay, Ronald R. Yager (Eds.), Foundations of Reasoning under Uncertainty, 2010, ISBN 978-3-642-10726-9
Vol. 250. Xiaoxia Huang, Portfolio Analysis, 2010, ISBN 978-3-642-11213-3
Vol. 251. George A. Anastassiou, Fuzzy Mathematics: Approximation Theory, 2010, ISBN 978-3-642-11219-5
Vol. 252. Cengiz Kahraman, Mesut Yavuz (Eds.), Production Engineering and Management under Fuzziness, 2010, ISBN 978-3-642-12051-0
Vol. 253. Badredine Arfi, Linguistic Fuzzy Logic Methods in Social Sciences, 2010, ISBN 978-3-642-13342-8
Vol. 254. Weldon A. Lodwick, Janusz Kacprzyk (Eds.), Fuzzy Optimization, 2010, ISBN 978-3-642-13934-5
Vol. 255. Zongmin Ma, Li Yan (Eds.), Soft Computing in XML Data Management, 2010, ISBN 978-3-642-14009-9
Zongmin Ma and Li Yan (Eds.)
Soft Computing in XML Data Management Intelligent Systems from Decision Making to Data Mining, Web Intelligence and Computer Vision
Editors
Zongmin Ma, College of Information Science and Engineering, Northeastern University, 3-11 Wenhua Road, Shenyang, Liaoning 110819, China. E-mail: [email protected]
Li Yan, School of Software, Northeastern University, 3-11 Wenhua Road, Shenyang, Liaoning 110819, China
ISBN 978-3-642-14009-9
e-ISBN 978-3-642-14010-5
DOI 10.1007/978-3-642-14010-5 Studies in Fuzziness and Soft Computing
ISSN 1434-9922
Library of Congress Control Number: 2010929475
© 2010 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
springer.com
Preface
As the de facto standard for data representation and exchange over the Web, XML (Extensible Markup Language) allows the easy development of applications that exchange data over the Web. This creates a set of data management requirements involving XML. XML and related standards have been extensively applied in many business, service, and multimedia applications. As a result, a large volume of data is managed today directly in XML format. With the wide and in-depth utilization of XML in diverse application domains, some particularities of data management in concrete applications emerge, which challenge current XML technology. This is very similar to the situation in which special database models and database systems were developed so that databases could satisfy the need to manage diverse data well. In data- and knowledge-intensive application systems, one of the challenges can be generalized as the need to handle imprecise and uncertain information in XML data management by applying fuzzy logic, probability, and more generally soft computing. Currently, two kinds of situations are roughly identified in soft computing for XML data management: applying soft computing for the intelligent processing of classical XML data, and applying soft computing for the representation and processing of imprecise and uncertain XML data. For the former, soft computing can be used for flexible querying of XML documents as well as XML data mining, XML duplicate detection, and so on. Additionally, it is crucial for Web-based intelligent information systems to explicitly represent and process imprecise and uncertain XML data with soft computing. This is because XML has been extensively applied in many application domains which may involve a great deal of imprecision and vagueness.
Imprecise and uncertain data can be found, for example, in the integration of data sources and in data generation by nontraditional means (e.g., automatic information extraction and data acquisition by sensors and RFID). XML has also been an important component of the Semantic Web framework, and the Semantic Web provides Web data with well-defined meaning, enabling computers and people to better work in cooperation. Soft computing has been a crucial means of implementing machine intelligence. Therefore, soft computing cannot be ignored in bridging the gap between human-understandable soft logic and machine-readable hard logic. We believe that soft computing can play an important and positive role in XML data management. Research and development of soft computing in XML data management are currently attracting increasing attention.
This book covers in depth the fast-growing topic of techniques, tools and applications of soft computing in XML data management. It shows how XML data management (including modeling, querying, and integration) can be treated with a soft computing focus. This book aims to provide a single account of current studies in soft computing approaches to XML data management. The objective of the book is to provide state-of-the-art information to researchers, practitioners, and graduate students of Web intelligence, while also serving information technology professionals faced with non-traditional applications that make the application of conventional approaches difficult or impossible.

This book, which consists of twelve chapters, is organized into three major sections. The first section, containing the first four chapters, discusses the issues of uncertainty in XML. The next four chapters, covering the flexibility in XML data management supported by soft computing, comprise the second section. The third section focuses on the developments and applications of soft computing in XML data management in the final four chapters.

Chapter 1 proposes a general XML Schema definition for representing and managing fuzzy information in XML documents. Different aspects of fuzzy information are represented by starting from proposals coming from the classical database context. Their datatype classifications are extended and integrated in order to propose a complete and general approach for representing fuzzy information in XML documents by using XML Schema. In particular, a fuzzy XML Schema Definition is described, taking into account the fuzzy datatypes and elements needed to fully represent fuzzy information.

Chapter 2 aims to satisfy the need to model complex objects with imprecision and uncertainty in the fuzzy XML model and the fuzzy nested relational database model. After presenting the fuzzy DTD model and the fuzzy nested relational database model based on possibility distributions, a formal approach is developed to map a fuzzy DTD model to a fuzzy nested relational database schema.

Chapter 3 describes a fuzzy XML schema to represent an implementation of a fuzzy relational database that allows for similarity relations and fuzzy sets. A flat translation algorithm is provided to translate from the fuzzy database implementation to a fuzzy XML document that conforms to the suggested fuzzy XML schema. The proposed algorithm is implemented within VIREX. A demonstrating example is presented to illustrate the power of VIREX in converting fuzzy relational data into fuzzy XML.

Chapter 4 aims at automatically integrating data sources, using very simple knowledge rules to rule out most of the nonsense possibilities, combined with storing the remaining possibilities as uncertainty in the database and resolving these during querying by means of user feedback. For this purpose, the chapter introduces this “good is good-enough” integration approach and explains the uncertainty model that is used to capture the remaining integration possibilities. It is shown that, using this strategy, the time necessary to integrate documents decreases drastically, while the accuracy of the integrated document increases over time.
Chapter 5 focuses on the retrieval of XML data from multiple heterogeneous sources and proposes a new approach enabling the retrieval of meaningful answers from different sources by exploiting vague querying and approximate join techniques. It essentially consists of first applying transformations to the original query to obtain relaxed versions of it, each matching the schema adopted at a single source, then using the relaxed queries to retrieve partial answers from each source, and finally combining them using information about the retrieved objects. The approach is experimentally validated and has proved effective in a P2P setting.

Chapter 6 presents a fuzzy-set-based extension to XQuery which allows users to express preferences on XML documents and retrieves documents discriminated by their satisfaction degree. This extension consists of the new xs:truth built-in data type, intended to represent gradual truth degrees, as well as the xml:truth attribute to handle satisfaction degrees in nodes of fuzzy XQuery expressions. The XQuery language is extended to declare fuzzy terms and use them in query expressions. Additionally, several kinds of expressions, such as FLWOR, are fuzzified. An evaluation mechanism is presented in order to avoid superfluous calculation of truth degrees.

Chapter 7 describes the design and implementation of a fuzzy nested querying system for XML databases. The research involved is outlined and examined to decide on the most fitting solution that incorporates fuzziness into a user interface intended to be attractive to naive users. The findings are applied via the implementation of a prototype which covers the intended scope of a demonstration of fuzzy nested querying. This prototype is integrated into VIREX (a user-friendly system allowing users to view and use relational data as XML) and includes an easy-to-use graphical interface that allows the user to apply fuzziness in order to search XML documents more easily.

Chapter 8 focuses on fuzzy duplicate detection in XML data, a crucial task in many applications such as data cleaning and data integration. Four algorithms that have been proposed for XML fuzzy duplicate detection are described and compared along two main dimensions, namely the effectiveness and the efficiency of the methods. A comparative experimental evaluation performed on both artificial and real-world data is also presented. The comparison shows the performance of these four algorithms.

Chapter 9 proposes a machine-readable fuzzy-EPC representation in XML based on the EPC Markup Language (EPML) to conceptually represent fuzzy business process models. It reports on the design of the Fuzzy-EPC compliant schema and shows major syntactical extensions. A realistic example (sales order checks) is sketched, showing that Fuzzy-EPML is able to serve as an adequate interchange format for fuzzy business process models.

Chapter 10 aims to design and develop an XML-based framework to represent and merge the statistical information of clinical trials in XML documents. This framework considers any valid clinical trial, including trials with partial information, and merges statistical information automatically, with the potential to add a component to extract clinical trial information automatically. A method is developed to analyze inconsistencies among a collection of clinical trials and, if necessary, to exclude any trials that are deemed to be illegible. Moreover, two sets of clinical trials, trials on Type 2 diabetes and on neurocognitive outcomes after off-pump versus on-pump coronary revascularisation, are used to illustrate the framework.

Chapter 11 presents the main characteristics of a new fuzzy database, Aliança (Alliance). The system is the union of fuzzy logic techniques, a relational database management system and a fuzzy meta-knowledge base defined in XML. Aliança accepts a wide range of data types, including all information already treated by traditional databases, and incorporates different forms of representing fuzzy data. The system uses XML to represent meta-knowledge. The use of XML makes it easy to maintain and understand the structure of imprecise information. Aliança is also designed to allow easy upgrading of traditional database systems. The Aliança fuzzy database architecture brings interaction with databases closer to the usual way in which humans reason.

Chapter 12 presents SUNRISE (System for Unified Network Routing, Indexing and Semantic Exploration) for XML data sharing. Aiming at semantic interoperability in heterogeneous networks, SUNRISE is a PDMS (Peer Data Management System) infrastructure which leverages the semantic approximations originating from the heterogeneity of schemas for an effective and efficient organization and exploration of the network. SUNRISE implements soft computing techniques which cluster peers in Semantic Overlay Networks according to their own contents, and promote the routing of queries towards the semantically best directions in the network.
Acknowledgements
We wish to thank all of the authors for their insights and excellent contributions to this book and would like to acknowledge the help of all involved in the collation and review process of the book. Thanks go to all those who provided constructive and comprehensive reviews. Thanks go to Janusz Kacprzyk, the series editor of Studies in Fuzziness and Soft Computing, and Thomas Ditzinger, the senior editor of Applied Sciences and Engineering of Springer-Verlag, for their support in the preparation of this volume. The idea of editing this volume stems from our initial research work which is supported by the National Natural Science Foundation of China (60873010), the Fundamental Research Funds for the Central Universities (N090504005 & N090604012) and Program for New Century Excellent Talents in University (NCET-05-0288).
Northeastern University, China April 2010
Zongmin Ma Li Yan
Contents
Part I: Uncertainty in XML

An XML Schema for Managing Fuzzy Documents
Barbara Oliboni, Gabriele Pozzani

Formal Translation from Fuzzy XML to Fuzzy Nested Relational Database Schema
Li Yan, Jian Liu, Z.M. Ma

Human Centric Data Representation: From Fuzzy Relational Databases into Fuzzy XML
Keivan Kianmehr, Tansel Özyer, Anthony Lo, Jamal Jida, Alnaar Jiwani, Yasin Alimohamed, Krista Spence, Reda Alhajj

Data Integration Using Uncertain XML
Ander de Keijzer

Part II: Flexibility in XML Data Management

Exploiting Vague Queries to Collect Data from Heterogeneous XML Sources
Bettina Fazzinga

Fuzzy XQuery
Marlene Goncalves, Leonid Tineo

Attractive Interface for XML: Convincing Naive Users to Go Online
Keivan Kianmehr, Jamal Jida, Allan Chan, Nancy Situ, Kim Wong, Reda Alhajj, Jon Rokne, Ken Barker

An Overview of XML Duplicate Detection Algorithms
Pável Calado, Melanie Herschel, Luís Leitão

Part III: Developments and Applications

Fuzzy-EPC Markup Language: XML Based Interchange Formats for Fuzzy Process Models
Oliver Thomas, Thorsten Dollmann

An XML Based Framework for Merging Incomplete and Inconsistent Statistical Information from Clinical Trials
Jianbing Ma, Weiru Liu, Anthony Hunter, Weiya Zhang

Aliança: A Proposal for a Fuzzy Database Architecture Incorporating XML
Raquel D. Rodrigues, Adriano J. de O. Cruz, Rafael T. Cavalcanti

Leveraging Semantic Approximations in Heterogeneous XML Data Sharing Networks: The SUNRISE Approach
Federica Mandreoli, Riccardo Martoglia, Wilma Penzo, Simona Sassatelli, Giorgio Villani

Author Index
Part I: Uncertainty in XML
An XML Schema for Managing Fuzzy Documents Barbara Oliboni and Gabriele Pozzani
Abstract. Topics related to fuzzy data have been investigated in the classical database research field, and in recent years they have also attracted interest in the XML data context. In this work, we consider issues related to the representation and management of fuzzy data by using XML documents. We propose to represent different aspects of fuzzy information by starting from proposals coming from the classical database context. We extend and integrate their datatype classifications in order to propose a complete and general approach for representing fuzzy information in XML documents by using XML Schema. In particular, we describe a fuzzy XML Schema Definition taking into account the fuzzy datatypes and elements needed to fully represent fuzzy information.
Barbara Oliboni, Department of Computer Science, University of Verona, Italy, e-mail: [email protected]
Gabriele Pozzani, Department of Computer Science, University of Verona, Italy, e-mail: [email protected]

Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 3–34. © Springer-Verlag Berlin Heidelberg 2010. springerlink.com

1 Introduction
Issues related to the representation, processing, and management of information in a flexible way appear in several research areas (e.g., artificial intelligence, databases and information systems, data mining, and knowledge representation). Requirements related to fuzziness come from the observation that human reasoning is not as exact and precise as computation usually is in computers. Humans do not follow precise and always identical rules. Moreover, in some applications data come with errors or are inherently imprecise since their values are subjective (e.g., values representing customer satisfaction degrees). Thus, it has been natural for researchers to try to incorporate flexible features in software. Hence, several proposals deal with
problems related to the representation and processing of imprecise data. Many of them start from theories formulated by Zadeh [36]. Zadeh formalized notions related to fuzziness and uncertain data representation by presenting a theory of fuzzy sets, possibility theory, and similarity relations. These notions are the basic ones used in many proposals related to the representation of imprecise data in classical databases for making them more flexible [13, 21, 25, 24, 27, 28, 40, 41]. As an example, fuzzy databases allow one to represent the uncertainty of physical measures or subjective human preferences. On the other hand, fuzzy processing of data allows one to answer queries by returning not only exactly matching data but also data similar to the requested ones. In this way, the system is able to get around errors in query formulation coming from user misunderstanding or from incomplete information representation. Among all proposals about fuzzy databases, we consider the GEFRED one [21], which is based on generalized fuzzy domains and relations and allows one to represent possibility distributions, similarity relations, linguistic labels and all other fuzzy concepts and datatypes. The GEFRED model was extended by Galindo et al. [13] to define a complete database system capable of managing fuzzy information. To extend the GEFRED model they define a fuzzy ER conceptual model, a fuzzy relational database and an extended SQL language (FSQL) able to manage fuzzy data. In this work, we consider the model proposed by Galindo et al., and in particular their classification of fuzzy data types, as a starting point for classifying the data types needed to represent fuzzy information in XML documents. Since XML is establishing itself as a standard for representing and exchanging information on the net, topics related to the modeling of fuzzy data are also of considerable interest in the XML data context.
Few proposals in the literature deal with the representation of fuzzy information in XML documents [14, 19, 20, 26] by considering different aspects. In our proposal, we adopt the data types classification defined in [13] for the relational database context, and adapt it to the XML data context. In order to manage data types, differently from other related approaches, we choose to use XML Schema [32] instead of DTD [23]. DTD is included in the XML 1.0 standard [23], and thus it is widely used and supported in applications. However, DTD has some limitations: it does not support newer XML features (e.g., namespaces), it lacks some expressivity, and it uses a non-XML syntax to describe the grammar. All these limitations are overcome by XML Schema [32]. XML Schema can be used to express a set of rules to which an XML document must conform in order to be considered “valid” (with respect to that schema), and provides an object-oriented approach to the definition of XML elements and datatypes. Moreover, it is compatible with other XML technologies like Web services, XQuery (for XML document querying) and XSLT (for XML document presentation). Thus, we propose a general approach for representing fuzzy information in XML documents by using XML Schema. We describe a fuzzy XML Schema definition taking into account the fuzzy data types and elements needed to fully represent fuzzy information.
Our proposal for an XML Schema able to represent fuzzy data can be used by any organization or system managing uncertain data. These users may need to exchange fuzzy information between different subsystems, locally or over the net, and the use of fuzzy XML documents may represent a good solution. Moreover, fuzzy XML documents can be used by these systems as a storage method for collected fuzzy data. Since there are currently no DBMSs implementing fuzzy capabilities, and the development of a fuzzy extension for an existing DBMS may require too much effort, fuzzy XML documents can represent a simple way to store and manage fuzzy information, as already happens for classical data. Our proposal can help in organizing these data by providing a common and complete reference Schema for representing fuzzy data. This work is structured as follows: in Section 2 we present some background notions useful to better understand the context of this proposal. In Section 3 we present our proposal of an XML Schema definition introducing the new fuzzy datatypes and elements needed to represent fuzzy information in an XML document. In Section 4 we give an example of an XML document satisfying the proposed Schema, by considering information managed by a weather station. In Section 5 we further extend the proposed Schema, allowing the representation of some information useful during the fuzzy processing of an XML document. Some examples of this fuzzy processing information are illustrated in Section 6. In Section 7 we discuss how a classical XML document can be changed in order to comply with our fuzzy XML Schema proposal and be able to represent fuzzy data. In Section 8 we give a brief description of other approaches presented in the literature on the representation and querying of fuzzy XML documents. Finally, in Section 9 we sketch some conclusions and future research directions.
2 Background
In this section we briefly report some background notions on fuzziness, on relational databases dealing with fuzzy data, and on XML. Several proposals deal with the representation of uncertain data in databases. The relational approach [6, 7, 8] introduced the NULL value in order to represent unknown attribute values (i.e., no value is applicable or all values in the domain are possible). The NULL value introduces a three-valued logic. Later on, for example in the Umano-Fukami model [27, 28], the NULL value was further differentiated by introducing the fuzzy values UNKNOWN, UNDEFINED and NULL. UNKNOWN means that any value in the domain is possible, UNDEFINED means that none of the values in the domain is possible, and NULL (which is different from the null pointer) means that we do not know anything; in other words, the value may be either undefined or unknown. However, more systematic approaches to fuzzy databases started from the notion of fuzzy set and other related notions. The definition of fuzzy set, introduced by Zadeh in [36], is based on the classical notion of set and extends it to introduce flexibility. In the classical definition, a set S
on a domain D is defined as a Boolean function μ : D → {0, 1} that tells us whether an object in D belongs (1) or not (0) to S; μ is called the membership function of S. The membership function associated with a fuzzy set F is a function μ_F : D → [0, 1] taking values in the real unit interval [0, 1]. Thus, in a fuzzy set, each object in D belongs to the set to a certain degree; this means that each object is associated with a membership degree. In 1971 Zadeh introduced the notion of similarity relation [37]: given a set of objects, a similarity relation defines the similarity degree between any pair of objects, i.e., how similar two objects are to each other. By using similarity relations, users can retrieve not only a requested object but also similar ones, introducing fuzziness in queries. The use of similarity relations inside the relational model was introduced in the Buckles-Petry model [4] to give fuzzy capabilities to relational databases. Moreover, in [38], Zadeh extended fuzzy set theory by introducing possibility theory, an alternative to probability theory. This notion was further extended by Dubois and Prade in [11] and subsequent work. A possibility distribution is based on the relationship between the notions of linguistic variable and fuzzy set. A possibility distribution is determined by the question “Is x A?” where A is a fuzzy set on domain X and x is a variable on X. The use of possibility theory in the relational model was introduced in three main different models: the Prade-Testemale model [25, 24], the Umano-Fukami model [27, 28] and the Zemankova-Kandel model [40, 41]. All the above fuzzy approaches and models have been joined in the GEFRED model of Medina, Pons and Vila [21]. The GEFRED model is based on generalized fuzzy domains and relations which extend classical domains and relations, and allows one to represent possibility distributions, similarity relations, linguistic labels and other fuzzy concepts and datatypes.
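The difference between a classical membership function, a fuzzy membership function, and a similarity relation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trapezoidal shape, the temperature thresholds, the colour domain and all function names are hypothetical choices made here and are not taken from the models cited above.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def crisp_membership(members: set) -> Callable[[Hashable], int]:
    """Classical set: mu maps every object of the domain to 0 or 1."""
    return lambda x: 1 if x in members else 0


def trapezoid(a: float, b: float, c: float, d: float) -> Callable[[float], float]:
    """Fuzzy set on the reals (assumed shape): degree 1 on [b, c],
    rising linearly on (a, b), falling linearly on (c, d), 0 elsewhere."""
    def mu(x: float) -> float:
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


def make_similarity(pairs: Dict[Tuple[str, str], float]) -> Callable[[str, str], float]:
    """Similarity relation given as a table of symmetric degrees.
    Every object is fully similar to itself; unlisted pairs get degree 0."""
    def sim(x: str, y: str) -> float:
        if x == y:
            return 1.0
        return pairs.get((x, y), pairs.get((y, x), 0.0))
    return sim


def similar_to(target: str, domain: Iterable[str],
               sim: Callable[[str, str], float], threshold: float) -> List[str]:
    """Return the domain values whose similarity to target reaches the threshold."""
    return [v for v in domain if sim(target, v) >= threshold]


# Hypothetical fuzzy set "warm" over temperatures in degrees Celsius.
warm = trapezoid(15.0, 20.0, 25.0, 30.0)
# Hypothetical similarity degrees over a small, non-ordered colour domain.
colour_sim = make_similarity({("red", "orange"): 0.8, ("red", "blue"): 0.1})
```

With these definitions, `warm(17.5)` lies halfway up the rising edge, and a query for values similar to `"orange"` with threshold 0.5 would also return `"red"`, which is the kind of flexible answer that similarity-based querying is meant to allow.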
The GEFRED model was extended by Galindo et al. [13] by defining a complete database system able to manage fuzzy information. Extending the GEFRED model, they define a fuzzy ER conceptual model, a fuzzy relational database and an extended SQL language (FSQL) capable of managing fuzzy data. In particular, they define new fuzzy datatypes that allow one to store fuzzy values in database tables, and fuzzy degrees which allow one to incorporate other uncertainty information with several meanings. Moreover, they store some meta-data about fuzzy objects in auxiliary tables called the Fuzzy Metaknowledge Base (FMB). In this work, we will start from the GEFRED model to define a suitable approach for representing fuzzy information in XML documents. XML (eXtensible Markup Language) [23] is a markup language introduced as a simplified subset of SGML (Standard Generalized Markup Language) [16] by the World Wide Web Consortium (W3C) [29]. XML is the de facto standard for describing and exchanging data between different systems and applications over the Internet. XML is extensible because it supports user-defined elements and datatypes. The grammar for tags in an XML document is defined in a DTD (Document Type Definition) [23] to which the XML document must refer. The elements in an XML document related to a given DTD must conform to that DTD.
DTD is included in the XML 1.0 standard, and thus it is widely used and supported in applications. However, DTD has some limitations: it does not support newer XML features (e.g., namespaces), it lacks some expressivity, and it uses a non-XML syntax to describe the grammar. All these limitations are overcome by XML Schema [32] (also called XML Schema Definition, XSD). XML Schema can be used to express a set of rules to which an XML document must conform in order to be considered “valid” (with respect to that schema). XML Schema provides an object-oriented approach to the definition of XML elements and datatypes. Moreover, it is compatible with other XML technologies like Web services, XQuery (for XML document querying) [31] and XSLT (for XML document presentation) [33]. Our proposal deals with the representation of fuzzy data in XML documents, is based on the extended version of the GEFRED model proposed by Galindo et al. [13], and uses XML Schema.
3 XML Schemata for Fuzzy Information
In this section we propose a fuzzy XML Schema Definition containing the new fuzzy datatypes and elements needed to represent fuzzy information, according to the extended GEFRED relational data model [13]. In particular, we define appropriate XML schemata for fuzzy datatypes and degrees and for the related auxiliary information stored in the Fuzzy Metaknowledge Base (FMB). The definition of an XML Schema may be divided into several related schemata. Each Schema may refer to other schemata by introducing a different namespace for each of them. Namespaces allow one to refer to and use objects defined in different schemata by specifying their locations. Moreover, namespaces allow one to distinguish between different elements with the same name but with different definitions, locations, and semantics. Each namespace corresponds to a different XML Schema, so that the system can retrieve the correct definition for each element. Fig. 1 depicts the relationships among the XML schemata constituting the proposed overall schema. Each line represents a reference to a Schema inside another one. Note that the Schema base.xsd is defined only once but is referenced by all the other second-level schemata. In XML documents, data are represented in a structured way and their structure is defined by the related XML schemata. For example, if we consider an XML document obtained from a database, its XML Schema may define that tuples are represented in elements called record and that they are arranged in an element named after the table. In this work we focus only on the structure of fuzzy information, supposing the user already has a general XML Schema defining the structure of the other, crisp parts of the document. In the following sections we analyse all parts of the XML Schema we propose for managing fuzzy information.
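How namespaces keep apart two elements that share a local name can be sketched with Python's standard `xml.etree.ElementTree` module. The namespace URIs and the prefixes `ord` and `non` below are hypothetical placeholders, not the ones used by the proposed schemata; the element name `unknown` is used only because the chapter notes that some fuzzy datatypes are defined in several classes.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical document: two elements share the local name "unknown"
# but belong to two different (assumed) namespaces.
DOC = """<root xmlns:ord="http://example.org/FuzzyOrdType"
      xmlns:non="http://example.org/FuzzyNonOrdType">
  <ord:unknown/>
  <non:unknown/>
</root>"""


def split_tag(tag: str) -> Tuple[str, str]:
    """ElementTree stores qualified names as '{namespace-uri}local'.
    Return (namespace-uri, local-name); the URI is '' when absent."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def child_names(xml_text: str) -> List[Tuple[str, str]]:
    """List (namespace, local name) for each direct child of the root."""
    root = ET.fromstring(xml_text)
    return [split_tag(child.tag) for child in root]


names = child_names(DOC)
```

Both children have the local name `unknown`, yet their namespace parts differ, so a processor can look up the definition that belongs to each one in the corresponding schema.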
B. Oliboni and G. Pozzani
Fig. 1 Reference relations between proposed XML schemata: FleXchema.xsd references FuzzyOrdType.xsd, FuzzyNonOrdSimType.xsd, FuzzyNonOrdType.xsd, degrees.xsd, FMB.xsd, and processing.xsd, each of which references base.xsd
3.1 The Root Schema
FleXchema.xsd is the main file of the proposed schema. It defines the general structure of fuzzy datatypes, the FMB (see Section 3.7), and processing information (see Section 5), recalling definitions given in several different files that we will analyse in the following sections. First of all, we introduce the definitions of the four fuzzy datatypes that our XML Schema proposal allows one to represent:
1. classical crisp (non-fuzzy) data marked to be processed with fuzzy operations, represented by datatype classicType;
An XML Schema for Managing Fuzzy Documents
2. imprecise data over an ordered underlying domain, represented by datatype FuzzyOrdType (see Section 3.3);
3. imprecise data over a discrete nonordered domain and related by a similarity relation, represented by datatype FuzzyNonOrdSimType (see Section 3.4);
4. imprecise data over a discrete nonordered domain and not related by a similarity relation, represented by datatype FuzzyNonOrdType (see Section 3.5).
Each datatype is defined as an XML complexType with two required attributes. The first attribute (info) is an IDREF referring to the element in the FMB part of the document containing the meta-information (see Section 3.7) about the fuzzy object concerned. The second attribute (type) is a fixed string encoding the datatype of the considered element. The possible codings we define for the datatypes are:
• T1 for classicType datatype;
• T2 for FuzzyOrdType datatype;
• T3 for FuzzyNonOrdSimType datatype;
• T4 for FuzzyNonOrdType datatype.
This fixed attribute allows us to distinguish between the different fuzzy classes of datatypes. Some fuzzy datatypes (e.g., possdistr, null, unknown) are defined in several classes, and we may need a way to distinguish them in order to process them in different ways. Finally, each datatype contains a subelement representing the actual fuzzy data. These subelements are defined by using the any XML element, and each one allows one to insert an element selected from a different, referenced namespace. Each namespace is defined in another external XML Schema. In particular, the any subelement in classicType refers to the basic XML Schema provided by the W3C [29]. In this way, it is possible to specify any value of the classical crisp datatypes (e.g., strings, integers, timestamps). Subelements in the other three datatypes refer to namespaces defined in different XML schemata proposed by us and explained in the following sections. To better understand how these definitions may be used, let us consider the following example. It represents a classical crisp datum containing the name of a customer, where type=T1 means that the name is a crisp datum, and info="ABC" means that the related meta-information is contained in the FMB element with ID ABC.
John
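As an illustration of how the type and info attributes might be consumed, the following Python sketch parses a small fragment and resolves both attributes. Only the attribute names type and info, the codes T1 to T4, and the ID ABC come from the text; the element names (customer, name, string), the FMB layout, and the helper describe are assumptions made for this example, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names and FMB layout are assumptions;
# only the attributes `type`/`info` and the code T1 are taken from the text.
DOC = """
<customer>
  <name type="T1" info="ABC"><string>John</string></name>
  <FMB><fc id="ABC"><ftype>1</ftype></fc></FMB>
</customer>
"""

# Mapping between the fixed `type` codes and the four fuzzy datatypes.
TYPE_CODES = {
    "T1": "classicType",
    "T2": "FuzzyOrdType",
    "T3": "FuzzyNonOrdSimType",
    "T4": "FuzzyNonOrdType",
}


def describe(root, tag):
    """Return (datatype name, crisp value, whether the FMB entry exists)."""
    elem = root.find(tag)
    datatype = TYPE_CODES[elem.get("type")]
    info_id = elem.get("info")
    meta = root.find("FMB/fc[@id='%s']" % info_id)
    value = elem[0].text  # the single subelement carries the actual data
    return datatype, value, meta is not None


root = ET.fromstring(DOC)
result = describe(root, "name")
```

Here result pairs the datatype class with the stored value and tells whether the info reference can be resolved, which is the kind of check a processor would need before applying fuzzy operations.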
Up to now, we have defined datatypes able to represent the structure of fuzzy information. In addition, the main Schema introduces elements defining the structure of new particular parts of a fuzzy XML document. These elements delineate the structure of the FMB and of the processing information. The FMB is a sequence of (in some cases, optional) elements, each one describing a different kind of meta-information (see Section 3.7). Meta-information includes label definitions, default margins for approximate values, and similarity relations.
Finally, the root Schema file, FleXchema.xsd, defines the processInfo element. It is a sequence of (optional) qualifier and quantifier definitions. We will describe their definition and usage in Section 5. In particular we will see that they are useful during the fuzzy information processing.
3.2 Basic Datatypes
In the base namespace, four basic datatypes, needed in all other namespaces, are defined. The simpleType probType represents the type of a probabilistic datum; hence it is defined as a decimal value in the range [0, 1].
The datatype labelRefType represents a reference to the ID of a label definition contained in the FMB. It is essentially a renaming of the IDREF datatype (defined by W3C) given in order to clarify the meaning of some attributes used in other XML schemata.
The datatype ftype is the set of integer values in the range [1, 7]. It is used in the FMB definition in order to keep information about the fuzzy type of a fuzzy object (see Section 3.7).
Finally, datatype any defines a shorthand for the any element defined by the W3C, referring to any element and type already defined in the W3C namespace.
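The value constraints of probType and ftype are simple enough to be checked outside a schema validator. The Python sketch below mirrors them under the assumption that values arrive as text; the function names is_prob_type and is_ftype are hypothetical and not part of the proposed Schema.

```python
from decimal import Decimal, InvalidOperation


def is_prob_type(text):
    """True if text is a decimal value in the closed range [0, 1]."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    # Reject NaN and infinities before comparing against the bounds.
    return value.is_finite() and Decimal(0) <= value <= Decimal(1)


def is_ftype(text):
    """True if text is an integer in the range [1, 7]."""
    return text.isascii() and text.isdigit() and 1 <= int(text) <= 7
```

Using Decimal rather than float keeps the check close to the decimal datatype on which probType is based.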
3.3 Fuzzy Data over an Ordered Domain
The FuzzyOrdType.xsd file contains the definition of the fuzzy datatypes representing imprecise data over an ordered underlying domain. As happens in most systems allowing null values, the null value can be compared with any other type of data. The same happens in our proposal, where the values unknown, undefined, and null are defined both on ordered underlying domains and on non-ordered underlying domains. Hence, their definitions are present in this namespace, defining fuzzy ordered datatypes, and in the following ones, defining fuzzy non-ordered datatypes. The duplication of these definitions is needed because in some cases we have to process these special values differently on the basis of their datatype class (i.e., of the underlying domain).
For the same reason, in FuzzyOrdType we allow one to introduce also any crisp data (on an ordered domain).
The namespace with prefix xsb refers to the XML Schema base.xsd reported in the previous section. We define that fuzzy data over an ordered domain can include:
• Linguistic labels. A label is used through an IDREF to its definition. This definition, given by a name and possibly a trapezoidal form, is reported in the FMB part of the XML document (see Section 3.7). The choice to use IDREFs, storing label definitions in the FMB, reduces data redundancy in XML documents but, on the other hand, requires more complex data processing for querying XML data.
• Trapezoidal values. Trapezoidal values allow us to represent continuous possibility distributions defined by four decimal values [α, β, γ, δ] (see Fig. 2). Values between β and γ have possibility degrees equal to one, values less than or equal to α or greater than or equal to δ have possibility degrees equal to zero, and values in the ranges [α, β] and [γ, δ] have possibility degrees defined respectively by the lines connecting the two values. We will see that labels also have a trapezoidal definition; however, trapezoidal values allow us to define a trapezoidal distribution without having a label for it. Note that trapezoidal distributions are a general case of interval values and triangular distributions.
Fig. 2 Continuous possibility distributions on an ordered domain: (a) trapezoidal distribution with parameters α, β, γ, δ; (b) interval with bounds lb and ub; (c) triangular distribution with central value d and margin
• Intervals. Intervals are special cases of trapezoidal values where α = β and γ = δ . They are then defined by two decimal values, in the Schema named lb and ub, such that all values in the range [lb, ub] (see Fig. 2(b)) have possibility degree equal to one, while all other values have possibility degree equal to zero.
• Approximate values. Approximate values represent triangular possibility distributions. They are defined by a central value d and a margin value around the central value (see Fig. 2(c)). Hence, a triangular distribution is a special case of a trapezoidal one where β = γ and where α and δ are equidistant from the central value. Only the value d has possibility degree equal to one. All values outside the range [d − margin, d + margin] have possibility degree equal to zero. In an approximate value the margin can be omitted; in this case we use the default margin stored in the FMB (see Section 3.7).
• Possibility distributions. The XML element possdistr allows one to define a discrete possibility distribution, represented as a set (of finite but unbounded cardinality) of pairs (p, d), meaning that value d has possibility degree equal to p. We do not wrap each pair inside an ad hoc element because pairs can be recognized correctly by reading elements two by two. The value d may be of any datatype on an ordered domain; however, the system must check that all values inside the same possibility distribution have the same type. Possibility degrees p have type probType, defined in the base namespace, whose possible values, we recall, are in the range [0, 1].
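The three continuous distributions described in this section can be evaluated with a single function, since intervals and approximate values are special cases of the trapezoid. The following Python sketch follows the definitions given above; the function names are hypothetical, and treating the distribution as a Python closure is only one possible design.

```python
def trapezoid(alpha, beta, gamma, delta):
    """Return the possibility function of the trapezoid [alpha, beta, gamma, delta]."""
    def possibility(x):
        if beta <= x <= gamma:          # core: possibility degree one
            return 1.0
        if x <= alpha or x >= delta:    # outside the support: degree zero
            return 0.0
        if x < beta:                    # rising line between alpha and beta
            return (x - alpha) / (beta - alpha)
        return (delta - x) / (delta - gamma)  # falling line between gamma and delta
    return possibility


def interval(lb, ub):
    """Interval [lb, ub]: a trapezoid with alpha = beta and gamma = delta."""
    return trapezoid(lb, lb, ub, ub)


def approximate(d, margin):
    """Approximate value d +/- margin: a trapezoid with beta = gamma = d."""
    return trapezoid(d - margin, d, d, d + margin)
```

The core test comes first so that the bounds of an interval, where α = β and γ = δ, receive degree one, as required by the definition of intervals.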
3.4 Fuzzy Data over a Nonordered Domain with Similarity Relations
The datatype FuzzyNonOrdSimType defines the possible values of fuzzy objects over a nonordered domain. As we said in the previous section, possible values in this datatype include unknown, undefined, and null values, defined exactly as the ones on an ordered domain. This datatype allows one to define possibility distributions composed of pairs (p, d), where d is a label whose possibility degree is p. The d XML element is defined as a reference to a label whose definition is contained in the FMB. Note that, since the underlying domain is nonordered, these labels do not have a trapezoidal definition (this constraint must be checked by the system). Moreover, values (represented by labels) are related by a similarity relation. For this reason the XML element possdistr in this Schema also has a required IDREF attribute (simRel) referring to a similarity relation defined in the FMB.
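Since the pairs of a possibility distribution are not wrapped in an element of their own, a reader has to consume the children of possdistr two at a time. The Python sketch below shows one way to do this. The element names possdistr, p, and d and the attribute simRel come from the text, while the attribute label used to carry the label reference and the function read_pairs are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization; the `label` attribute name is an assumption.
POSSDISTR = """
<possdistr simRel="SR1">
  <p>0.8</p><d label="S"/>
  <p>0.3</p><d label="C"/>
</possdistr>
"""


def read_pairs(possdistr):
    """Read the children two by two and return a list of (p, label id) pairs."""
    children = list(possdistr)
    if len(children) % 2 != 0:
        raise ValueError("possdistr must contain complete (p, d) pairs")
    pairs = []
    for p, d in zip(children[0::2], children[1::2]):
        if (p.tag, d.tag) != ("p", "d"):
            raise ValueError("expected a p element followed by a d element")
        pairs.append((float(p.text), d.get("label")))
    return pairs


distribution = ET.fromstring(POSSDISTR)
pairs = read_pairs(distribution)
```

A fuller processor would also verify that every referenced label exists in the FMB and has no trapezoidal definition, as the text requires.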
3.5 Fuzzy Data over a Nonordered Domain without Similarity Relations
The datatype FuzzyNonOrdType is very similar to the previous one, FuzzyNonOrdSimType. It represents fuzzy values over a nonordered domain, including unknown, undefined, and null values, and possibility distributions. However, unlike the previous datatype, in this case values in a possibility distribution are not related by a similarity relation. For this reason, the element possdistr does not include an attribute referring to a similarity relation definition in the FMB. Hence, possibility distributions are defined just on labels without a trapezoidal definition. The use of these labels depends only on the application and its semantics.
3.6 Fuzzy Degrees
Another way to incorporate uncertainty in classical databases consists in the use of degrees. The most common use of a degree is the membership degree associated with each instance of a tuple. The membership degree says how much the instance belongs to the tuple. However, other kinds of degree have been proposed in the literature. For example, the tuple degree may represent the fulfillment degree of a condition [21], the importance degree [2], the possibility degree, or the uncertainty degree [28]. Each fuzzy data model makes a different choice in the interpretation of degrees. In [13], Galindo et al. classify the degrees with respect to their use instead of with respect to their meaning. A first classification distinguishes between associated and nonassociated degrees. The former apply their value to one or more attributes; the latter (FuzzyNonAssDegree) represent imprecise information without associating it with another attribute. Moreover, Galindo et al. classify the associated degrees
into degrees associated with one attribute, with a set of attributes, and with a whole tuple (FuzzyInstDegree). Since degrees associated with one attribute are a particular case of degrees associated with a set of attributes, where the set is a singleton, we chose to represent only the latter (FuzzyAttrDegree). Thus, our Schema allows the definition of three kinds of degrees: FuzzyAttrDegree, FuzzyInstDegree, and FuzzyNonAssDegree.
• FuzzyAttrDegree introduces fuzzy degrees associated with one or more attributes of an entity instance. They are defined as an extension of the probType introduced in the base namespace; hence, they include a possibility value (in the range [0, 1]). Moreover, in order to keep information about the attributes with which a degree is associated, it has an IDREFS attribute (refTo) that refers to the IDs of these elements. These ID references refer to the FMB definition of the elements (see Section 3.7). In order to retrieve the actual values with which the degree is associated, we must find the sibling elements of the degree in this tuple that have the same IDREF. Note that this query is supported by XPath [30]. Finally, each associated degree includes an info IDREF attribute referring to the meta-information in the FMB about its definition.
• FuzzyInstDegree represents degrees associated to the whole instance of an entity, thus they do not need to refer to something, and are just reported as child of the instance with which they are associated. Their definition is equal to the one for degrees associated to attributes but without the attribute refTo.
• FuzzyNonAssDegree represents degrees that are associated neither with attributes nor with an instance. They are reported inside instances of an entity, but their meaning is not fixed in advance: it can be specified by the user in the string attribute meaning. Like the other kinds of degrees, non-associated degrees include a possibility value F and an IDREF attribute needed to retrieve the meta-information about degrees in the FMB. The choice to include the meaning inside degrees, instead of inside their meta-information, allows the user to retrieve the meaning of degrees more easily, reducing the data processing complexity.

Table 1 ftype encoding
ftype   fuzzy object
1       classical crisp data
2       fuzzy data over an ordered domain
3       fuzzy data over a nonordered domain with similarity relation
4       fuzzy data over a nonordered domain without similarity relation
5       degree associated to attributes
6       instance degree
7       non associated degree
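The lookup of the values with which a FuzzyAttrDegree is associated can be sketched as follows: the IDs listed in refTo are compared with the info attribute of the sibling elements in the same record. The attribute names refTo and info and the ID Te1 come from the text and the example of Section 4; the element names, the IDs We1 and Ac1, and the helper associated_elements are assumptions of this Python sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element names and the IDs We1 and Ac1 are assumptions.
RECORD = """
<record>
  <temperature info="Te1" type="T2"><unknown/></temperature>
  <weather info="We1" type="T3"><undefined/></weather>
  <accuracy refTo="Te1" info="Ac1">1</accuracy>
</record>
"""


def associated_elements(record, degree):
    """Return the siblings of `degree` whose info ID is listed in its refTo."""
    targets = set(degree.get("refTo", "").split())  # IDREFS: space-separated IDs
    return [child for child in record
            if child is not degree and child.get("info") in targets]


record = ET.fromstring(RECORD)
degree = record.find("accuracy")
tags = [elem.tag for elem in associated_elements(record, degree)]
```

The same selection could be written as an XPath expression over the siblings of the degree; the explicit loop is used here only to keep the matching rule visible.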
3.7 The Fuzzy Metaknowledge Base
The Fuzzy Metaknowledge Base (FMB) of an XML document contains the meta-information about all fuzzy objects defined and used in the document. The main FMB information is contained in the fcl (fuzzy column list) element, which reports basic and common information about all elements that may contain fuzzy data. Information about each fuzzy object is contained in an fc (fuzzy column) element inside fcl. Among this information we note:
• len reports the maximum length of possibility distributions in the element (it is valid only for elements whose type includes possibility distributions);
• ftype reports the type (from 1 to 7) of the fuzzy object (see Table 1);
• com is a user comment;
• um specifies the unit of measure;
• sym specifies, for FuzzyNonOrdSimType data, whether they use a symmetric or an asymmetric similarity relation.
Elements com, um, and sym are optional.
Since these are the main elements, they have an ID that identifies the fuzzy object. As we explained in the previous sections, any fuzzy element has an IDREF to the ID associated to its auxiliary information. These IDs are also used in other auxiliary elements to give further type-specific information. For example the user may specify the default margin for approximate values. The margins are stored in elements of type fam (fuzzy approximate much) together with the value much that defines the minimum distance needed to consider two values to be very different.
The FMB also contains the definitions of the similarity relations used in the XML document. Definitions of all similarity relations are wrapped in the simRelDefs element. Inside it, each similarity relation is contained in one simRel element having an id attribute, which uniquely identifies the relation inside the document, and a name. A similarity relation is defined by a set of triples (sim), each one composed of two IDREFs (fid1 and fid2) referring to the two related labels and a value (degree), in the range [0, 1], that specifies the similarity degree between them. Obviously, labels may appear in several similarity relations, and two labels may be related with different degrees in different similarity relations.
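To make the role of the sim triples concrete, the Python sketch below looks up the similarity degree between two labels in a given relation. The names simRelDefs, simRel, sim, fid1, fid2, and degree come from the text; serializing fid1, fid2, and degree as subelements, the relation name weatherSimilarity, and the decision to return 0 for unlisted pairs are assumptions of this example. The symmetric flag stands in for the sym information stored in the fc element.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization of a similarity relation in the FMB.
SIM_DEFS = """
<simRelDefs>
  <simRel id="SR1" name="weatherSimilarity">
    <sim><fid1>S</fid1><fid2>C</fid2><degree>0.3</degree></sim>
  </simRel>
</simRelDefs>
"""


def similarity(defs, rel_id, a, b, symmetric=True):
    """Return the similarity degree between labels a and b in relation rel_id."""
    rel = defs.find("simRel[@id='%s']" % rel_id)
    if rel is None:
        raise KeyError(rel_id)
    for sim in rel.findall("sim"):
        pair = (sim.findtext("fid1"), sim.findtext("fid2"))
        if pair == (a, b) or (symmetric and pair == (b, a)):
            return float(sim.findtext("degree"))
    return 0.0  # assumption: pairs that are not listed are treated as dissimilar


defs = ET.fromstring(SIM_DEFS)
```

With an asymmetric relation, both directions would have to be listed explicitly as separate sim triples.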
Finally, labelDefs stores label definitions, each one inside a labelinfo element. Each label has an ID, used to refer to this label, a name (required) and a trapezoidal definition made up of four decimal subelements.
Note that the trapezoidal distribution is required only for labels defined over ordered domains. However, this constraint (as any other one) must be checked by the system since it cannot be expressed directly in the XML Schema.
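Because this constraint cannot be stated in the Schema itself, an application has to verify it. The following Python sketch checks whether a label definition fits the kind of domain in which it is used. The names labelDefs and labelinfo and the label values come from the text and the example of Section 4; the id attribute, the name subelement, the subelement names alpha, beta, gamma, and delta (borrowed from the quantifier definition), and the two helper functions are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical label definitions; subelement names are assumptions.
LABEL_DEFS = """
<labelDefs>
  <labelinfo id="S"><name>sunny</name></labelinfo>
  <labelinfo id="k4"><name>temperature4</name>
    <alpha>27.5</alpha><beta>29</beta><gamma>30</gamma><delta>30.5</delta>
  </labelinfo>
</labelDefs>
"""

CORNERS = ("alpha", "beta", "gamma", "delta")


def has_trapezoid(label):
    """True if the label carries all four trapezoid subelements."""
    return all(label.find(tag) is not None for tag in CORNERS)


def label_fits_domain(label, ordered):
    """Labels on ordered domains need a trapezoid; the others must not have one."""
    return has_trapezoid(label) == ordered


labels = {label.get("id"): label for label in ET.fromstring(LABEL_DEFS)}
```

A stricter check could also verify that the four values are in nondecreasing order, which the text implies but does not state as a rule.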
4 Example
In this section we give a simple example of an XML document satisfying the proposed XML Schema, by considering information managed by a weather station. The document represents tomorrow's forecast and, in particular, the temperature and the weather at different times of the day. Each forecast is contained in a record element. The time referred to in a record is classical information, but it is represented by using a fuzzy element, marking it to be processed by fuzzy querying. The temperature is a numerical datum represented with a FuzzyOrdType element (because it is based on an ordered domain), while possible weather conditions are represented by a FuzzyNonOrdSimType element because they are based on a nonordered domain. We associate a degree (accuracy) with the temperature to represent the accuracy of the forecasted temperature. Moreover, at each time several forecasts are calculated by using different meteorological models (e.g., LAM and GCM [22]). Thus, in each record a degree (precision) represents the precision of the forecast calculated by the model at the considered time. In this work, we focus only on the description of the new elements enabling the representation of fuzzy information in XML documents. However, each document also has other classical elements, and it must have its own schema. The XML Schema for the considered example has to define elements tomorrowForecast (containing all records), record, and so on, possibly by referring to the proposed fuzzy elements. The following listing reports the definition of the record element in the Schema associated with the document for the weather station. We see that fuzzy objects have types referring to the proposed ones.
The following document portion reports a record about the 5 o’clock forecast calculated by the LAM model. Temperature is unknown, i.e., every value is possible, (hence, its accuracy is one), while the weather is undefined. The precision element has value zero, due to the lack of information in temperature and weather. Note that, since this degree is not associated to any attribute or instance (i.e., it has type FuzzyNonAssDegree), it contains also its own meaning.
LAM 05:00:00 1 0 model forecast precision
At the same time, the GCM model may report the temperature by a trapezoidal distribution [24, 25, 26, 27] with an accuracy of 0.9, while the possible weather is represented by a possibility distribution based on a similarity relation SR1. In the example, with a possibility of 80%, tomorrow the weather will be sunny (referred to by the label “S”), while with a possibility of 30% it will be cloudy (referred to by the label “C”). Recall that label and similarity relation definitions are contained in the FMB.
GCM 05:00:00 24 25 26 27 0.9 1 0.86 model forecast precision
In other cases, temperature may also be represented by an approximate value 28 ± 0.5.
28 0.5
Otherwise, temperature may be represented by a label with a trapezoidal definition (contained in the FMB).
The FMB portion of the XML document reports auxiliary information about fuzzy elements. As said in Section 3.7, the fc element contains main basic information about them. For example the fc element for the temperature may be the following one: temp 2 the expected temperature Celsius degrees
where Te1 is the unique ID identifying the temp fuzzy object. Hence, it is used inside the document to link data with auxiliary information and vice versa. In the FMB, we may also retrieve the definitions of the labels with ID S (representing sunny weather), C (representing cloudy weather), and k4 (representing a possible value for the temperature).
sunny cloudy temperature4 27.5 29 30 30.5
Labels representing sunny and cloudy weather are defined over a nonordered domain; thus they are pure linguistic labels and do not have a trapezoidal definition. The label used to represent a temperature is also defined by a trapezoidal distribution. Labels S and C are also related by a similarity relation defined inside a simRel element. This similarity relation is identified by the ID SR1 and it also has a name. Inside each sim element we may retrieve a pair of objects and their similarity degree. In the reported example, sunny and cloudy are similar with a degree of 0.3.
S C 0.3
Finally, the FMB contains information about default margin for approximate values representing temperatures. Moreover, the threshold necessary to consider two temperatures very different is defined. In the example these two parameters have value 1 and 5, respectively. 1 5
5 Fuzzy Information for Processing Documents
As discussed in Section 8, some approaches to fuzzy databases (including the ones in the XML context) extend query languages by introducing fuzzy features into them. A possible way to incorporate fuzziness in queries is to define quantifiers and qualifiers. In this section we present our proposal for representing them in an XML document. Moreover, we continue the example from the previous section, presenting definitions of some quantifiers and qualifiers about weather information.
5.1 Representing Fuzzy Quantifiers and Qualifiers
A qualifier is a fuzzy constant in the context of a particular attribute or degree. It is similar to a linguistic label, but it is used in queries in order to set linguistic thresholds and to make queries more understandable. Moreover, qualifiers allow one to tune queries simply by modifying their definitions. Qualifier definitions are wrapped all together in the qualifiers element. Inside it, a single qualifier definition is reported in a qualDef element. Each qualifier
has: an id attribute that identifies it, a name that represents the qualifier in queries, and a value in the range [0, 1].
Fuzzy quantifiers [17, 18, 34, 39] are linguistic labels that allow us to represent uncertain quantities. They may be used in queries in order to provide the approximate number of elements fulfilling a given condition. Quantifiers may be absolute or relative. Absolute quantifiers express quantities with respect to the total number of objects in a set (e.g., “approximately between 25 and 35”, “close to 0”). Hence, absolute quantifiers range in R. Relative quantifiers represent the proportion between the total number of objects in a set and the number of objects in this set that comply with the stated condition. In other words, relative quantifiers measure the fulfillment quantity of a certain condition (e.g., “the majority”, “about half of”). For this reason relative quantifiers are valued in the range [0, 1]. Absolute and relative quantifiers may be represented in the same form by using a trapezoidal representation [α, β, γ, δ] and keeping information about their type. Another classification of quantifiers divides them into those based on product and those based on sum. Moreover, they may have zero, one, or two arguments. A general definition of fuzzy quantifiers with respect to their arguments and operations is the following one:
• quantifiers without arguments are defined simply by their trapezoidal distribution [α, β, γ, δ];
• quantifiers with one argument x:
  – based on product: [x · α, x · β, x · γ, x · δ];
  – based on sum: [x + α, x + β, x + γ, x + δ];
• quantifiers with two arguments x and y:
  – based on product: [x · α, x · β, y · γ, y · δ];
  – based on sum: [x + α, x + β, y + γ, y + δ].
Note that, in some cases, a relative quantifier may not lie inside the range [0, 1]. This problem can be addressed by considering only the intersection of the trapezoidal distribution associated with the quantifier with the interval [0, 1]. In our Schema proposal, all this information about a quantifier definition is contained in a quantDef element. Each quantifier is internally identified by a unique id, while it is used by referring to its name. Moreover, a quantifier definition has the following subelements:
• args ∈ {0, 1, 2} specifies the number of arguments;
• AR specifies whether the quantifier is absolute (A) or relative (R);
• SP specifies whether the quantifier is based on sum (S) or product (P). When the quantifier has no arguments, a ‘-’ is provided.
Finally, all kinds of quantifiers have a trapezoidal definition provided by four elements alpha, beta, gamma, delta.
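The way arguments instantiate a quantifier, and the restriction of relative quantifiers to [0, 1], can be sketched in Python as follows. The formulas are the ones listed above and the values of Hot are those used in Section 6; the function names, the tuple representation, and the relative flag (standing in for the AR subelement) are assumptions of this example.

```python
def apply_arguments(base, sp="-", args=()):
    """Instantiate the trapezoid [alpha, beta, gamma, delta] with 0, 1, or 2 arguments."""
    alpha, beta, gamma, delta = base
    if len(args) == 0:
        return (alpha, beta, gamma, delta)
    if len(args) > 2:
        raise ValueError("a quantifier takes at most two arguments")
    x = args[0]
    y = args[1] if len(args) == 2 else x  # with one argument, x is used throughout
    if sp == "P":
        return (x * alpha, x * beta, y * gamma, y * delta)
    if sp == "S":
        return (x + alpha, x + beta, y + gamma, y + delta)
    raise ValueError("SP must be 'S' or 'P' when arguments are given")


def possibility(trapezoid, v, relative=False):
    """Possibility degree of v; relative quantifiers are restricted to [0, 1]."""
    alpha, beta, gamma, delta = trapezoid
    if relative and not 0 <= v <= 1:
        return 0.0  # intersection with the interval [0, 1]
    if beta <= v <= gamma:
        return 1.0
    if v <= alpha or v >= delta:
        return 0.0
    if v < beta:
        return (v - alpha) / (beta - alpha)
    return (delta - v) / (delta - gamma)


hot = apply_arguments((30, 35, 72, 72))  # absolute quantifier without arguments
```

For instance, a temperature of 32.5 lies halfway along the rising edge of hot, so its possibility degree is 0.5.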
Although quantifiers and qualifiers are information used during the processing phase of XML documents and they are not really data, it may be useful to represent them inside documents. In fact, the processing phase is a very important issue for fuzzy databases and information. Consider cases in which XML documents are exchanged between several users. In these cases, it may also be interesting to exchange processing information in order to share not only data but also semantics and processing operators. In such a way, different users can query a document obtaining the same results. However, a user remains free to use their own qualifier and quantifier definitions instead of the ones in the document.
6 An Example of Quantifiers and Qualifiers
Continuing the example about information managed by a weather station presented in Section 4, we may define some quantifiers and qualifiers. Their definitions are reported in the last part of an XML document. The forecasted temperature is a fuzzy datum and it may be processed by fuzzy queries. Hence, an absolute quantifier Hot, without arguments and defined by the distribution [30, 35, 72, 72] (expressed in degrees Celsius), may be used in queries to classify temperatures overlapping it as “hot”.
Hot 0 A - 30 35 72 72
On the other hand, we may define a qualifier High, with value 0.8, that may be used as a threshold in queries about temperature. It may be used to constrain query results to comply with the query condition with a fulfillment degree greater than 80%.
High 0.8
Note that, in fuzzy queries, quantifiers and qualifiers may be used together in order to constrain results. Considering, for example, the queries about temperature cited above, we may retrieve records whose temperature is Hot with a High fulfillment
degree (i.e., temperature overlaps for at least 80% the trapezoidal distribution defining the quantifier Hot).
7 Incorporating Fuzziness in Classical XML Documents
In this section we show how classical XML documents, and their schemata, can be modified to integrate our fuzzy XML Schema. This modification allows one to represent uncertain data as well, in addition to the classical data already represented. The first step of this integration consists in modifying the Schema of the original document by using the fuzzy datatypes defined by us. In particular, the Schema of the original document must declare new namespaces importing our proposed Schema. The namespace declarations can be made with definitions similar to the following ones:
xmlns:fuzzy="first-2" xmlns:degree="degrees" ...
Then, the designer must decide which data must be represented with a fuzzy datatype and over which kind of domain, ordered or nonordered, the data in question are defined. Once the domains have been decided, each original element must be redefined, changing its type to one of the proposed fuzzy types. Data over an ordered domain must be declared with type FuzzyOrdType; data over a nonordered domain with an associated similarity relation must be declared with type FuzzyNonOrdSimType; and, finally, data over a nonordered domain without an associated similarity relation must be declared with type FuzzyNonOrdType. For instance, let us consider an XML element age representing the age of a person. The original definition of this element may be something like:
On the other hand, one possible fuzzy definition may be:
After this change the age can be represented by using any kind of element defined for datatype FuzzyOrdType, e.g., interval, trapezoidal distribution, approximate value, and so on (see Section 3.3). Similar considerations and changes must be made for all other elements that the designer wants to be able to represent fuzzy information. Changes to different elements differ only in the fuzzy datatype the designer needs to use to represent them: classicType, FuzzyOrdType, FuzzyNonOrdSimType, or FuzzyNonOrdType. The second step of the translation of a classical XML document into its fuzzy version consists in the modification of the document itself. Of course, the usage of elements whose definition has been changed must be updated according to their new definition.
Continuing the example here introduced, the usage of the age element changes from: 32
to something like: 31 34
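The second step of the translation can also be automated. As a rough Python sketch, the function below rewrites a crisp age element into a fuzzy interval from 31 to 34. The names lb and ub, the code T2, and the attributes type and info come from Section 3; the element name interval, the FMB ID Age1, and the function to_fuzzy_interval are assumptions, and namespace prefixes are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def to_fuzzy_interval(elem, lb, ub, info):
    """Replace the crisp text of elem with an interval [lb, ub] over an ordered domain."""
    elem.text = None
    elem.set("type", "T2")   # T2 encodes FuzzyOrdType
    elem.set("info", info)   # reference to the meta-information in the FMB
    interval = ET.SubElement(elem, "interval")
    ET.SubElement(interval, "lb").text = str(lb)
    ET.SubElement(interval, "ub").text = str(ub)
    return elem


age = ET.fromstring("<age>32</age>")
fuzzy_age = to_fuzzy_interval(age, 31, 34, info="Age1")  # Age1 is a hypothetical FMB ID
xml_text = ET.tostring(fuzzy_age, encoding="unicode")
```

A complete conversion would also add the corresponding fc entry to the FMB so that the info reference can be resolved.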
Note that the transition, from a classical XML document to a fuzzy one based on our Schema, allows one not only to change the definition of the elements to a fuzzy compliant version but also to enrich the XML document by using degrees, quantifiers, and qualifiers.
8 Related Work and Discussion
In this section we briefly describe other proposals presented in the literature and related to the representation and querying of fuzzy information in XML documents. Fuzzy features may be incorporated into databases and XML data in two main ways. The former allows the representation of fuzzy information directly in data, e.g., by extending the data model with fuzzy datatypes. The latter obtains fuzzy information by processing crisp data with query languages extended with uncertain operators. However, considering general and complete systems for fuzzy data management, these two ideas are orthogonal and can be combined, obtaining three approaches to fuzzy databases:
1. crisp querying of fuzzy information;
2. fuzzy querying of crisp information;
3. fuzzy querying of fuzzy information.
The proposal by Galindo et al. is based on the last approach. They define new datatypes in order to represent fuzzy information and, at the same time, they extend the SQL [12] query language with fuzzy operators and capabilities. In the following sections we introduce related work on the representation of XML fuzzy data (Section 8.1) and related work on fuzzy querying of XML documents (Section 8.2).
8.1 Representing XML Fuzzy Data
In [14], Gaurav et al. incorporated fuzzy data in XML documents by extending the XML schemata associated with these documents. They observed that fuzziness may be incorporated in values and structures of XML elements. Hence, they extended
the definition of values and elements introducing special elements representing possibility distributions and similarity relations. Possibility distributions may be introduced through the two elements and . The first one allows the specification of the possibility degree associated to a classical value, while the second one allows the specification of the possibility with which a sub-element belongs to its parent element. The Schema proposed by Gaurav et al. permits to introduce similarity relations by using the new element that defines pairs composing the similarity relation. The attribute may be used to refer to an already defined similarity relation. Differently from our proposal, they do not allow the use of linguistic labels and generic degrees, thus the example described in Section 4 cannot be fully implemented by using the approach proposed in [14]. The impossibility to define linguistic labels does not allow to Gaurav et al. to define trapezoidal distributions (note that trapezoidal distributions can represent also triangular distributions and intervals) with a unique name that can be referred in several point of a document. Thus, when a trapezoidal distribution is used more times inside a document, Gaurav at al. proposal must specify more times the distribution itself. Conversely, our solution permits to associate a name (i.e., a linguistic label) to a trapezoidal distribution in order to refer it by using that name instead of by specifying distribution values. This approach allows us to reuse distribution definitions, reducing documents size. Gaurav et al. do not allow to represent fuzzy degrees too. Thus, they cannot associate fuzzy information to classical data. For instance, they cannot represent fuzzy information similar to the accuracy of a forecasted temperature or the precision of a whole forecast, as we reported in the example in Section 4. We note that all fuzzy constructs proposed by Gaurav et al. 
have a corresponding representation in our Schema as well. A similarity relation, defined by Gaurav et al. through the element, is defined in our proposal in the FMB simRel element and is referenced by specifying its IDREF inside the element possdistr of datatype FuzzyNonOrdSimType. Elements and defined by Gaurav et al. represent possibility distributions and tuple degrees, respectively. Possibility distributions can be represented in our proposal by defining a possibility distribution possdistr as specified in the FuzzyOrdType, FuzzyNonOrdSimType, and FuzzyNonOrdType datatypes. Tuple degrees are represented in our proposal through degrees associated with a whole tuple, by using FuzzyInstDegree.
In [20, 19], Ma et al. defined a model for representing fuzzy information by modifying the DTD associated with an XML document. In particular, they modified the DTD by wrapping the original element definitions inside the new element which associates with the current element its possibility degree. The new element , composed of one or more elements, allows one to define a possibility distribution in an XML document. Moreover, Ma et al. defined two types of distribution: disjunctive and conjunctive. The former represents a set of possible values of which only one is actually true at any moment; the latter represents a set of fuzzy values, each of which is true with a different degree at any moment.
B. Oliboni and G. Pozzani
In [35], Ma et al. extend their previous work to incorporate fuzziness in XML documents by using XML Schema. Hence, they define and elements also using XML Schema, and then they explain how classical schemata can be modified to incorporate their new fuzzy objects. However, Ma et al. introduce neither similarity relations, nor linguistic labels, nor other fuzzy datatypes. Regarding the inability of Ma et al. to use linguistic labels, remarks similar to those made about [14] apply. Moreover, since Ma et al.’s proposal cannot represent similarity relations, it cannot represent data such as the weather situation we reported in Section 4. We note that the similarity of two values cannot be inferred if these values are not numerical; thus our proposal can actually represent more information than [35]. The constructs introduced by Ma et al., and , correspond to, and can be represented by, possibility distributions on ordered or non-ordered domains, as defined in our proposal.
An approach similar to those reported in [20, 19], based on an extension of DTD, is used in [26]. Turowski et al. introduced new DTDs defining the elements that allow one to represent discrete fuzzy sets (which can represent possibility distributions), continuous fuzzy sets, and linguistic variables that can be associated with fuzzy sets. However, they do not allow the use of similarity relations or degrees. On the other hand, using fuzzy sets and variables, they also define the DTDs needed to implement a fuzzy inference system able to infer the truths implied by some given facts by using user-defined rules. Since Turowski et al.’s proposal cannot represent similarity relations, it suffers from limitations similar to those reported for the previously discussed approaches. On the other hand, we note that they can also represent continuous generic fuzzy sets, which cannot be represented by our proposal.
Special cases of continuous fuzzy sets are trapezoidal and triangular distributions, and intervals. Our proposal can represent these distributions, while it cannot explicitly represent distributions with a generic trend. However, these distributions can be interpolated from discrete ones, as Turowski et al. do. In our proposal the distinction between discrete and continuous distributions is implicitly defined by the semantics of the data, while in [26] it is explicitly specified.
In our proposal we allow the user to represent all the aspects related to fuzzy information. In particular, we define all fuzzy datatypes (e.g., possibility distributions, approximate values, intervals), fuzzy degrees (with several meanings), and labels already proposed separately in several proposals in the literature. Moreover, we define XML schemata instead of DTDs to overcome the limitations due to the use of DTDs.
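The reuse of a named trapezoidal distribution discussed in this section can be illustrated with a minimal Python sketch. It is not part of the proposed Schema: the class name Trapezoid, the label "mild", the registry LABELS, and the numeric bounds are all hypothetical and chosen only for illustration. The sketch shows that a trapezoid also covers triangles (b = c) and intervals (a = b and c = d), and that a label lets several values refer to a single definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal possibility distribution with support [a, d] and core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def possibility(self, x: float) -> float:
        if self.b <= x <= self.c:        # core: fully possible
            return 1.0
        if x <= self.a or x >= self.d:   # outside the support
            return 0.0
        if x < self.b:                   # rising edge between a and b
            return (x - self.a) / (self.b - self.a)
        return (self.d - x) / (self.d - self.c)  # falling edge between c and d


# Hypothetical linguistic label: defined once, referenced by name everywhere else.
LABELS = {"mild": Trapezoid(10.0, 15.0, 20.0, 25.0)}


def possibility_of(label: str, temperature: float) -> float:
    """Possibility that `temperature` matches the distribution named by `label`."""
    return LABELS[label].possibility(temperature)
```

With these assumed bounds, possibility_of("mild", 12.5) lies halfway up the rising edge, and an interval such as Trapezoid(5, 5, 8, 8) yields 1 inside [5, 8] and 0 elsewhere.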
8.2 Fuzzy Querying of XML Documents
Several proposals in the literature deal with fuzzy querying of XML documents. In [3, 5, 9, 10], Campi et al. propose an extension of the XPath query language [30] that adds new constructs in order to introduce fuzzy querying capabilities. The XPath language is based on path expressions that state the structure and the value of the elements required by the user. With respect to path expressions,
Campi et al. take into account two kinds of fuzziness: fuzziness on structure and fuzziness on values. With respect to the first one, users can submit queries without specifying precisely the structure of the XML document and of the required elements, while, with respect to values, queries look not only for exact value matches but also for similar values. These features are introduced by defining new fuzzy path predicates (e.g., NEAR, ABOUT, and BESIDES). Fuzzy predicates allow one to search for elements, attributes, and values similar to those actually required. For example, the expression /proceedings/article[@year NEAR 2009] retrieves article elements, children of an element proceedings, whose attribute year has a value close to 2009. On the other hand, the user may retrieve article elements that are close descendants of proceedings by using the expression /proceedings{/NEAR}/article. Fuzzy predicates can be partially satisfied by XML elements, with several degrees. Hence, unlike classical XPath queries, fuzzy queries return a ranked set of nodes. The ranks associated with elements represent the similarity of the returned elements to the ones required by the query. Moreover, Campi et al. define a method that lets one choose how the ranks for a query are calculated. Users may associate with each part of a query a variable whose value represents the degree of satisfaction of the conditions. Users may then define how the ranks must be calculated by combining the values bound to the variables. Finally, Campi et al.’s proposal allows users to use fuzzy quantifiers (e.g., tall) and qualifiers (e.g., very) inside predicates (e.g., height = very tall). A very similar approach to fuzzy querying is proposed by Goncalves and Tineo [15]. Using a different approach, Amer-Yahia et al. [1] do not extend XPath expressions with new predicates and operators, but introduce fuzziness through query relaxations.
They define four operations (e.g., axis generalization and leaf deletion) on the structure of queries that, given a query, produce a relaxed version of it (i.e., a query containing the original one). Relaxations broaden the scope of the path expressions provided in the original query. A ranking strategy associates a penalty with each modification applied to a query through a relaxation operation. Penalties are then used to calculate how well the retrieved elements satisfy the original query. Note that, in all proposals about fuzzy querying in the literature, query results are sets of ranked elements, where ranks represent the fulfillment degrees of the retrieved elements with respect to the query conditions.
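The ranked result of a value predicate such as /proceedings/article[@year NEAR 2009] can be imitated with a short Python sketch. It does not reproduce the semantics defined by Campi et al.: the function near, its linear shape, and its tolerance of five years are assumptions made only for illustration, as are the sample document and the helper ranked_articles.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; titles and years are illustrative only.
DOC = """<proceedings>
  <article year="2009"><title>A</title></article>
  <article year="2007"><title>B</title></article>
  <article year="2001"><title>C</title></article>
</proceedings>"""


def near(value: float, target: float, tolerance: float = 5.0) -> float:
    """Assumed NEAR degree: 1 at the target, decreasing linearly to 0 at +/- tolerance."""
    return max(0.0, 1.0 - abs(value - target) / tolerance)


def ranked_articles(xml_text: str, target_year: float):
    """Return (degree, title) pairs for article children, best match first."""
    root = ET.fromstring(xml_text)  # root is the <proceedings> element
    scored = []
    for article in root.findall("article"):
        degree = near(float(article.get("year")), target_year)
        if degree > 0.0:  # elements that do not satisfy the predicate at all are dropped
            scored.append((degree, article.findtext("title")))
    return sorted(scored, reverse=True)
```

Calling ranked_articles(DOC, 2009) returns the 2009 article first, then the 2007 article with a lower degree, and omits the 2001 article, which mirrors the idea that fuzzy queries return ranked sets of nodes rather than exact matches.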
9 Conclusion
In this work, we proposed a general XML Schema definition for representing fuzzy information in XML documents. In our proposal, we represent different aspects of fuzzy information by adapting a data type classification already proposed for the relational database context, and by integrating different kinds of fuzzy information to compose a complete definition. For future work we plan to start from documents valid with respect to the XML Schema proposed in this paper and to study topics related to querying and
retrieval of fuzzy information. As we explained in Section 8, fuzzy information can be queried by using fuzzy or crisp query languages. We note that the starting point of our future research will be different from the one assumed by the previous works presented in the literature (see Section 8.2). Unlike other approaches, our work will be based on fuzzy information rather than crisp information. This difference will, in our opinion, require fewer modifications of existing query languages. As a matter of fact, since fuzzy capabilities are already incorporated in the document schema, query languages can exploit the structure of documents without the need for ad hoc syntax constructs and features. In this case, we need to enrich not the query language but the query engine (i.e., the part of the system responsible for interpreting and executing queries). On the other hand, some features (e.g., qualifier and quantifier usage) will require some small modifications to the query language. Thus, future work in this direction must first determine which features must be incorporated in a query language (e.g., XPath) for fuzzy XML documents and which others need only a particular interpretation by the query engine. After that, an extended query language with the desired fuzzy capabilities will be designed.
Another possible research direction concerns how fuzzy XML documents may be used for XML Schema versioning. Considering XML documents that are instances of different versions of a given XML Schema, fuzzy XML may be used to represent the uncertainty associated with the information contained in the documents. Moreover, considering different versions of an XML Schema, our proposal may be used to represent the uncertainty associated with the elements and attributes used in the versions. Finally, fuzzy XML may represent the uncertainty associated with operations and with sequences of operations that can be used to obtain a new version of an XML Schema from other ones.
References
1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: flexible structure and full-text querying for XML. In: ACM (ed.) Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data 2004, Paris, France, June 13–18, pp. 83–94. ACM Press, New York (2004)
2. Bosc, D., Pivert, P.: Flexible queries in relational databases – the example of the division operator. TCS: Theoretical Computer Science 171 (1997)
3. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.: FXPath: Flexible querying of XML documents. In: Proceedings of EuroFuse 2002 (2002)
4. Buckles, B.P., Petry, F.E.: A fuzzy representation of data for relational databases. Fuzzy Sets and Systems 7(3), 213–226 (1982)
5. Campi, A., Guinea, S., Spoletini, P.: A fuzzy extension for the XPath query language. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T., Christiansen, H. (eds.) FQAS 2006. LNCS (LNAI), vol. 4027, pp. 210–221. Springer, Heidelberg (2006)
6. Codd, E.F.: A relational model of data for large shared data banks. CACM: Communications of the ACM 13 (1970)
7. Codd, E.F.: Extending the database relational model to capture more meaning. ACM Transactions on Database Systems 4(4), 397–434 (1979)
8. Codd, E.F.: The relational model for database management. Addison-Wesley Longman Publishing Co. Inc., Boston (1990)
9. Damiani, E., Marrara, S., Pasi, G.: FuzzyXPath: Using fuzzy logic an IR features to approximately query XML documents. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529, pp. 199–208. Springer, Heidelberg (2007)
10. Damiani, E., Marrara, S., Pasi, G.: A flexible extension of xpath to improve XML querying. In: Myaeng, S.H., Oard, D.W., Sebastiani, F., Chua, T.S., Leong, M.K. (eds.) Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, pp. 849–850. ACM, New York (2008)
11. Dubois, D., Prade, H.: Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York (1988)
12. Elmasri, R.A., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley Longman Publishing Co. Inc., Boston (1999)
13. Galindo, J., Urrutia, A., Piattini, M.: Fuzzy Databases: Modeling, Design, and Implementation. IGI Publishing (2006)
14. Gaurav, A., Alhajj, R.: Incorporating fuzziness in XML and mapping fuzzy relational data into fuzzy XML. In: Haddad, H. (ed.) Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 456–460. ACM, New York (2006)
15. Goncalves, M., Tineo, L.: A new step towards flexible XQuery. Avances en sistemas e Informática 4, 27–34 (2007)
16. ISO: ISO 8879:1986: Information processing — Text and office systems — Standard Generalized Markup Language, SGML (1986), http://www.iso.ch/cate/d16387.html
17. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers. (I). interpretations. Fuzzy Sets Syst. 95(1), 1–21 (1998)
18. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers (II). reasoning and applications. Fuzzy Sets Syst. 95(2), 135–146 (1998)
19. Ma, Z.: Fuzzy Database Modeling with XML (The Kluwer International Series on Advances in Database Systems). Springer-Verlag New York, Inc. (2005)
20. Ma, Z.M., Yan, L.: Fuzzy XML data modeling with the UML and relational data models. DKE 63(3), 972–996 (2007)
21. Medina, J.M., Pons, O., Vila, M.A.: GEFRED: A generalized model of fuzzy relational databases. Information Sciences 76(1-2), 87–109 (1994)
22. Nebeker, F.: Calculating the Weather: Meteorology in the 20th Century. International Geophysics Series, vol. 60. Academic Press, London (1995)
23. Paoli, J., Bray, T., Sperberg-McQueen, C.M., Yergeau, F., Maler, E.: Extensible markup language (XML) 1.0 (fourth edition). W3C recommendation, W3C (2006), http://www.w3.org/TR/2006/REC-xml-20060816
24. Prade, H.: Lipski’s approach to incomplete information databases restated and generalized in the setting of Zadeh’s possibility theory. Information Systems 9(1), 27–42 (1984)
25. Prade, H., Testemale, C.: Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences 34, 115–143 (1984)
26. Turowski, K., Weng, U.: Representing and processing fuzzy information - an XML-based approach. Knowl.-Based Syst. 15(1-2), 67–75 (2002)
27. Umano, M.: FREEDOM-O: A fuzzy database system. In: Gupta, M.M., Sanchez, E. (eds.) Fuzzy Information and Decision Processes, pp. 339–349. North-Holland, Amsterdam (1982)
28. Umano, M., Fukami, S.: Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. J. Intell. Inf. Syst. 3(1), 7–27 (1994)
29. W3C: World-Wide Web Consortium (1994), http://www.w3.org/
30. XML Path Language (XPath) Version 1.0, W3C Recommendation (1999), http://www.w3c.org/TR/xpath
31. XQuery 1.0: An XML Query Language, W3C Recommendation (2007), http://www.w3.org/TR/xquery/
32. XSD: XML Schema Definition (2004), http://www.w3.org/XML/Schema
33. XSL Transformations (XSLT), W3C Recommendation (1999), http://www.w3.org/TR/xslt
34. Yager, R.R.: Quantified propositions in a linguistic logic. International Journal of Man-Machine Studies 19(2), 195–227 (1983)
35. Yan, L., Ma, Z.M., Liu, J.: Fuzzy data modeling based on XML schema. In: Proceedings of the 2009 ACM Symposium on Applied Computing (SAC), Honolulu, Hawaii, USA, March 9-12, pp. 1563–1567. ACM, New York (2009)
36. Zadeh, L.A.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)
37. Zadeh, L.A.: Similarity relations and fuzzy orderings. Information Sciences 3, 177–200 (1971)
38. Zadeh, L.A.: Fuzzy sets as a basis for possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
39. Zadeh, L.A.: A computational approach to fuzzy quantifiers in natural language. Computers and Mathematics with Applications 9(1), 149–184 (1983)
40. Zemankova, M., Kandel, A.: Fuzzy Relational Databases — A Key to Expert Systems. Verlag TUV Rheinland (1984)
41. Zemankova, M., Kandel, A.: Implementing imprecision in information systems. Information Sciences 37(1-3), 107–141 (1985)
Formal Translation from Fuzzy XML to Fuzzy Nested Relational Database Schema
Li Yan, Jian Liu, and Z.M. Ma
Abstract. XML has become the de facto standard for information representation and exchange over the Web. In addition, imprecise and uncertain data are inherent in the real world. Although fuzzy information has been extensively investigated in the context of the relational model, the classical relational database model and its fuzzy extensions to date do not satisfy the need to model complex objects with imprecision and uncertainty, especially when fuzzy relational databases are created by mapping from fuzzy conceptual data models and the fuzzy XML data model. Based on possibility distributions, this chapter concentrates on fuzzy information modeling in the fuzzy XML model and the fuzzy nested relational database model. In particular, a formal approach to mapping a fuzzy DTD model to a fuzzy nested relational database (FNRDB) schema is developed.
Li Yan: School of Software, Northeastern University, Shenyang, 110819, China
Jian Liu: School of Information Science & Engineering, Northeastern University, Shenyang, 110819, China
Z.M. Ma: School of Information Science & Engineering, Northeastern University, Shenyang, 110819, China, e-mail: [email protected]
Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 35–54. springerlink.com © Springer-Verlag Berlin Heidelberg 2010
1 Introduction
With the rapid development of the Internet, the need to manage Web-based information has attracted much attention from both academia and industry. XML is widely regarded as the next step in the evolution of the World Wide Web and has become the de facto standard. It aims at enhancing content on the World Wide Web. XML and related standards are flexible, which allows the easy development of applications that exchange data over the Web, such as e-commerce (EC) and supply chain management (SCM). However, this flexibility makes it challenging to develop an XML management system. To manage XML data, it is necessary to integrate XML and databases [3]. Various databases, including relational, object-oriented, and object-relational databases, have been used for mapping to and from XML documents.
At the same time, some data are inherently imprecise and uncertain, since their values are subjective in real-world applications. For example, for values representing the degree of satisfaction with a film, different persons may report different degrees. Information fuzziness has also been investigated in the context of EC and SCM [25, 30, 31]. It is shown that fuzzy set theory is very useful in Web-based business intelligence.
Fuzzy information has been extensively investigated in the context of the relational model [6, 24, 26, 28]. However, the classical relational database model and its fuzzy extension do not satisfy the need to model complex objects with imprecision and uncertainty. The requirements of modeling complex objects and information imprecision and uncertainty can be found in many application domains (e.g., multimedia applications) and have challenged current database technology [2, 7]. In order to model uncertain data and complex-valued attributes as well as complex relationships among objects, current efforts have concentrated on the conceptual data models [15, 16, 21, 33], the fuzzy nested relational data model (also known as an NF2 data model) [34], and fuzzy object-oriented databases [4, 10, 12, 13, 20]. There are also efforts to design fuzzy databases conceptually using the fuzzy conceptual data models [15, 16, 21, 33]. More recently, fuzzy object-relational databases have been proposed [9], which combine the characteristics of fuzzy relational databases and fuzzy object-oriented databases. Readers can refer to [17, 18] for recent surveys of these fuzzy data models.
Although fuzzy values have been employed to model and handle imprecise information in databases since Zadeh introduced the theory of fuzzy sets [35], relatively little work has been carried out on extending XML towards the representation of imprecise and uncertain concepts. Abiteboul et al. [1] provide a model for XML documents and DTDs and a representation system for XML with incomplete information. Representations of probabilistic data in XML are proposed in other previous research papers, such as [14, 22, 27, 29]. In [11], data fuzziness in XML documents is discussed directly in terms of fuzzy relational databases, without presenting an XML representation model, and simple mappings from fuzzy relational databases to fuzzy XML documents are also provided. Oliboni and Pozzani [23] propose an XML Schema definition for representing fuzzy information. They adopt the data type classification for the XML data context. A fuzzy XML data model based on XML DTD is proposed in [19], in which the mappings of the fuzzy XML DTD (Document Type Definition) from the fuzzy UML data model and to the fuzzy relational database schema are discussed, respectively. In [32], a fuzzy XML data model based on XML Schema is developed.
The classical relational database model and its fuzzy extension do not satisfy the need to model complex objects with imprecision and uncertainty. This also holds when fuzzy relational databases are created by mapping from fuzzy conceptual data models and the fuzzy XML data model. As an extension of the relational data model, the NF2 database model is able to handle complex-valued attributes and may be better
suited to some complex applications such as office automation systems, information retrieval systems and expert database systems [34]. In [8], the fuzzy NF2 database model is proposed for managing uncertainties in images. This chapter, based on possibility distributions, concentrates on fuzzy information modeling in the fuzzy XML model and the fuzzy nested relational database model. In particular, the formal approach to mapping a fuzzy DTD model to a fuzzy nested relational database (FNRDB) schema is developed. The remainder of this chapter is organized as follows. Section 2 discusses fuzzy sets and possibility distributions. The fuzzy XML data model and fuzzy nested relational databases are introduced in Section 3. In Section 4, the approaches to mapping the fuzzy XML model to the fuzzy nested relational schema are developed. Section 5 concludes this chapter.
2 Fuzzy Sets and Possibility Distributions
Different models have been proposed to handle different categories of data quality (or lack thereof). Five basic kinds of imperfection have been identified in [5]: inconsistency, imprecision, vagueness, uncertainty, and ambiguity. Instead of giving formal definitions of imperfect information, we explain their meanings here.
Inconsistency is a kind of semantic conflict, meaning that the same aspect of the real world is irreconcilably represented more than once in a database or in several different databases. For example, the age of George is stored as 34 and 37 simultaneously. Information inconsistency usually comes from information integration.
Intuitively, imprecision and vagueness are relevant to the content of an attribute value, which means that a choice must be made from a given range (interval or set) of values without knowing which one to choose. In general, vague information is represented by linguistic values. Assume, for example, that we do not know exactly the ages of two persons named Michael and John, and only know that the age of Michael may be 18, 19, 20, or 21, and that the age of John is old. Then the information about Michael’s age is imprecise, denoted by a set of values {18, 19, 20, 21}. The information about John’s age is vague, denoted by a linguistic value, "old".
Uncertainty is related to the degree of truth of an attribute value. With uncertainty, we can apportion some, but not all, of our belief to a given value or a group of values. For example, the possibility that the age of Chris is 35 right now might be 98%. Random uncertainty, described using probability theory, is not considered in this chapter.
Ambiguity means that some elements of the model lack complete semantics, leading to several possible interpretations.
Generally, several different kinds of imperfection can co-exist with respect to the same piece of information.
For example, the age of Michael is a set of values {18, 19, 20, 21} and their possibilities are 70%, 95%, 98%, and 85%, respectively. Imprecision, uncertainty, and vagueness are three major types of imperfect information and can be modeled with fuzzy sets [35] and possibility theory [36].
Many of the existing approaches dealing with imprecision and uncertainty are based on the theory of fuzzy sets. The concept of fuzzy sets was originally introduced by Zadeh [35]. Let U be a universe of discourse and F be a fuzzy set in U. A membership function μF: U → [0, 1] is defined for F, where μF (u), for each u ∈ U, denotes the membership degree of u in the fuzzy set F. Thus, the fuzzy set F is described as follows:
F = {μF (u1)/u1, μF (u2)/u2, ..., μF (un)/un}
Like a conventional set, the fuzzy set F consists of elements. Unlike in a conventional set, however, each element belongs to F only to a certain degree, and this membership degree must be explicitly indicated. So in F, an element (say ui) is associated with its membership degree (say μF (ui)), and they occur together in the form μF (ui)/ui. When the membership degrees of all elements in F are exactly 1, the fuzzy set F reduces to a conventional one.
When the membership degree μF (u) above is interpreted as a measure of the possibility that a variable X has the value u, where X takes values in U, a fuzzy value is described by a possibility distribution πX (Zadeh, 1978):
πX = {πX (u1)/u1, πX (u2)/u2, ..., πX (un)/un}
Here, πX (ui), ui ∈ U, denotes the possibility that ui is true. Let πX be the possibility distribution representing the fuzzy value of a variable X. It means that the value of X is fuzzy: X may take one of the possible values u1, u2, ..., un, and each candidate value (say ui) is associated with its possibility degree (say πX (ui)).
Definition: A fuzzy set F of the universe of discourse U is convex if and only if for all u1, u2 in U, μF (λu1 + (1 − λ) u2) ≥ min (μF (u1), μF (u2)), where λ ∈ [0, 1].
Definition: A fuzzy set F of the universe of discourse U is called a normal fuzzy set if ∃ u ∈ U, μF (u) = 1.
Definition: A fuzzy number is a fuzzy subset in the universe of discourse U that is both convex and normal.
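The definitions above can be checked on a small example with the following Python sketch. It stores a discrete fuzzy set as a dictionary from elements to membership degrees and uses the possibility degrees given earlier for Michael’s age. The function names are hypothetical, and the convexity test is an assumed discrete reading of the definition (no element of an ordered domain may have a lower degree than both an earlier and a later element), since the definition itself is stated over combinations λu1 + (1 − λ)u2.

```python
from itertools import combinations


def is_normal(fuzzy_set: dict) -> bool:
    """Normal: at least one element has membership degree exactly 1."""
    return any(degree == 1.0 for degree in fuzzy_set.values())


def is_convex(fuzzy_set: dict) -> bool:
    """Discrete convexity (assumption): mu(u_j) >= min(mu(u_i), mu(u_k)) for u_i < u_j < u_k."""
    degrees = [fuzzy_set[u] for u in sorted(fuzzy_set)]
    return all(
        degrees[j] >= min(degrees[i], degrees[k])
        for i, j, k in combinations(range(len(degrees)), 3)
    )


def is_crisp(fuzzy_set: dict) -> bool:
    """A fuzzy set whose degrees are all 1 reduces to a conventional set."""
    return all(degree == 1.0 for degree in fuzzy_set.values())


# Possibility distribution for Michael's age, written as {u: pi(u)}.
michael_age = {18: 0.70, 19: 0.95, 20: 0.98, 21: 0.85}
```

Under this reading, michael_age is convex (its degrees rise and then fall) but not normal, because no age reaches degree 1; a set such as {1: 1.0, 2: 0.2, 3: 0.9} is normal but not convex.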
3 Representation of Fuzzy Data in XML and Nested Relational Databases
This section focuses on fuzzy data modeling in the XML data model and the nested relational model. First we introduce some notions and notations of the fuzzy XML model proposed in [19], and then we present an extension of the extended possibility-based fuzzy nested relational databases.
3.1 Fuzzy XML Model
There are two kinds of fuzziness in XML documents: the first is the fuzziness in elements (we use membership degrees associated with such elements); the second is the fuzziness in attribute values of elements (we use possibility distributions to represent such values). Note that, for the latter, there exist two types of possibility distribution (i.e., disjunctive and conjunctive possibility distributions), and they may occur in child elements with or without further child elements in the ancestor-descendant chain. Fig. 1 gives a fragment of an XML document with fuzzy information, which appeared in [19].
Fig. 1 A Fragment of an XML Document with Fuzzy Data (52-line listing; only its text content is recoverable: two alternative employee records for Frank Yager, one as Associate Professor and one as Professor, each with office B1024 and course Advances in Database Systems; and a student record for Tom Smith with the candidate ages 23, 25, 27, 29, 30, 31, 33, 35, 37, the value Male, and five e-mail addresses)
The example above describes the universities in an area of a given city, say, Detroit, Michigan, in the USA. Wayne State University is located in downtown Detroit, and the possibility that it is included in the universities in Detroit is 1. Oakland University, however, is located in a nearby county of Michigan, named Oakland. Whether Oakland University is included in the universities in Detroit depends on how the area of Detroit is defined: the Greater Detroit Area or only the city of Detroit. Assume that this is unknown and that the possibility that Oakland University is included in the universities in Detroit is assigned 0.8. Also suppose that an employee, Frank Yager, at Oakland University is being considered for promotion. The possibility that he is an associate professor, teaches a course called Advances in Database Systems, and occupies the office called B1024 is 0.8. The possibility that he is a professor, teaches a course called Advances in Database Systems, and occupies the office called B1024 is 0.6. A student, Tom Smith, has fuzzy values in the attributes age and email, which are represented by a disjunctive possibility distribution and a conjunctive possibility distribution, respectively. The basic data structure of the fuzzy XML data model is the data tree. In the following, we introduce some important concepts used in our proposed fuzzy XML model.
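As a rough illustration of how a fuzzy attribute value such as Tom Smith’s age could be serialized, the following Python sketch builds a small fragment with the standard xml.etree.ElementTree module. The element names student, sname, age, Dist, and Val follow the names used in this chapter, but the attribute names Poss and type, the helper functions, and the possibility degrees are assumptions introduced only for this sketch and are not taken from Fig. 1.

```python
import xml.etree.ElementTree as ET


def dist(kind: str, values) -> ET.Element:
    """Build a <Dist> element holding one <Val> per (value, possibility) pair."""
    node = ET.Element("Dist", {"type": kind})  # "type" is an assumed attribute name
    for value, poss in values:
        val = ET.SubElement(node, "Val", {"Poss": str(poss)})  # "Poss" is assumed too
        val.text = str(value)
    return node


def most_possible(dist_node: ET.Element) -> str:
    """Return the text of the <Val> child with the highest possibility degree."""
    best = max(dist_node.findall("Val"), key=lambda v: float(v.get("Poss")))
    return best.text


student = ET.Element("student")
ET.SubElement(student, "sname").text = "Tom Smith"      # crisp leaf element
age = ET.SubElement(student, "age")                     # fuzzy leaf element
# The degrees below are made up for illustration; only the ages come from the example.
age.append(dist("disjunctive", [(23, 0.4), (25, 0.8), (27, 1.0)]))

xml_text = ET.tostring(student, encoding="unicode")
```

Here xml_text holds the serialized student element, and most_possible(student.find("age/Dist")) picks the candidate age with the highest degree. A conjunctive distribution, such as the one used for email, would be built the same way with a different kind argument.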
Definition: Let V be a finite set (of vertices), E ⊆ V × V be a set (of edges), and l : E → Γ be a mapping from edges to a set Γ of strings called labels. The triple G = (V, E, l) is an edge-labeled directed graph.
Based on the data tree, we introduce the definition of the fuzzy XML data tree.
Definition: A fuzzy XML data tree F is a 6-tuple, F = (V, ψ, l, τ, κ, δ), where
• V = {V1, …, Vn} is a finite set of vertices.
• ψ ⊂ {(Vi, Vj) | Vi, Vj ∈ V}, and (V, ψ) is a directed tree.
• l : V → (L ∪ {null}), where L is a set of labels. For each object v ∈ V and each label ∇ ∈ L, l (v, ∇) specifies the set of objects that may be children of v with label ∇.
• τ → T, where T is a set of types.
• κ is a mapping that constrains the number of children with a given label. κ associates with each object v ∈ V and each label ∇ ∈ L an integer-valued interval function, κ (v, ∇) = [min, max], where min ≥ 0 and max ≥ min. We use κ to represent the lower and upper bounds.
• δ is a mapping from the set of objects v ∈ V to local possibility functions. It defines the possibility that a set of children of an object exists, given that the parent object exists.
Definition: Suppose F = (V, ψ, l, τ, κ, δ) and f’ = (V’, ψ’, l’, τ’, κ’, δ’) are two fuzzy data trees. f’ is a sub-tree of F, written f’ ∝ F, when
• V’ ⊆ V and ψ’ = ψ ∩ V’ × V’.
• If i ∈ V’ and (j, i) ∈ ψ, then j ∈ V’.
• l’ and τ’ are the restrictions of l and τ to the nodes in V’, respectively.
• κ’ ∈ κ.
Definition: Let the fuzzy data trees f1 = (V1, ψ1, l1, τ1, κ1, δ1) and f2 = (V2, ψ2, l2, τ2, κ2, δ2) be sub-trees of F = (V, ψ, l, τ, κ, δ). f1 and f2 are isomorphic (written f1 ≌ f2) when
• V1 ∪ V2 ⊆ V, ψ1 ∪ ψ2 ⊆ ψ, and τ1 ∪ τ2 ⊆ τ.
• There is a one-to-one mapping ξl : l1 → l2 such that ∀ ξl (l1) = l2.
Theorem: Fuzzy data tree F and its sub-tree f’ are isomorphic.
The above theorem follows from the analysis of Definition 3 and Definition 4. It is quite straightforward.
Several fuzzy constructs have been introduced for fuzzy XML data modeling.
In order to accommodate these fuzzy constructs, it is clear that the DTD of the source XML document should be correspondingly modified. Next, we focus on DTD modification for fuzzy XML data modeling. First we define Val element as follows:
Then we define Dist element as follows:
L. Yan, J. Liu, and Z.M. Ma
Now we modify the element definition in the classical DTD so that all of the elements can use possibility distributions (Dist). For a leaf element which only contains text or #PCDATA, say, leafElement, its definition in the DTD is changed from to .
That is, leaf element leafElement may be a crisp one (e.g., sname of student in Fig.1), and then could be defined as .
Also, it is possible that leaf element leafElement may be a fuzzy one, taking a value represented by a possibility distribution (e.g., age of student in Fig.1). Then it may be defined as .
Furthermore, we have the following definition.
For the non-leaf element, say nonleafElement, first we should change the element definition from to and then add
That is, the non-leaf element nonleafElement may be crisp (e.g., student in Fig.1) and then may be defined as
When the non-leaf element nonleafElement is a fuzzy one, we differentiate two situations: the element takes a value connected with a possibility degree (e.g., university in Fig.1), and, second, the element takes a set of values and each value is connected with a possibility degree (e.g., employee in Fig.1). The former element is defined as follows.
Formal Translation from Fuzzy XML to Fuzzy Nested Relational Database Schema
The latter element is defined as
Then the DTD of the XML document in Fig.1 is shown in Fig.2. Fig. 2 The DTD of the Fuzzy XML Document in Fig.1
3.2 Fuzzy Nested Relational Model

A fuzzy NF2 relational schema is a set of attributes (A1, A2, ..., An, pM) whose domains are D1, D2, ..., Dn, D0, respectively, where Di (1 ≤ i ≤ n) can be one of the following:
(1) The set of atomic values. Each element ai ∈ Di is a typical simple crisp attribute value.
(2) The set of null values, denoted ndom, where null values may be unk, inap, nin, and onul.
(3) The set of fuzzy subsets. The corresponding attribute value is an extended possibility-based fuzzy datum.
(4) The power set of the set in (1). The corresponding attribute value, say ai, is a multivalued one of the form {ai1, ai2, ..., aik}.
(5) The set of relation values. The corresponding attribute value, say ai, is a tuple of the form <ai1, ai2, ..., aim>, which is an element of Di1 × Di2 × ... × Dim (m > 1 and 1 ≤ i ≤ n), where each Dij (1 ≤ j ≤ m) may be a domain in (1), (2), (3), or (4), or even the set of relation values.
The domain D0 is a set of atomic values, each of which is a crisp value from the range [0, 1] representing the possibility degree that the corresponding tuple is true in the NF2 relation. We assume in this chapter that the possibility of every tuple is exactly one. Then for an attribute Ai ∈ R (1 ≤ i ≤ n), its attribute domain is formally represented as follows:
τi = dom | ndom | fdom | sdom | <B1 : τi1, B2 : τi2, ..., Bm : τim>
where B1, B2, …, Bm are attributes. A relational instance r over the fuzzy NF2 schema (A1 : τ1, A2 : τ2, ..., An : τn) is a subset of the Cartesian product τ1 × τ2 × ... × τn. A tuple in r of the form <a1, a2, ..., an> consists of n components. Each component ai (1 ≤ i ≤ n) may be an atomic value, a null value, a set value, a fuzzy value, or another tuple. An example of a fuzzy NF2 relation is shown in Table 1. It can be seen that Tank_Id and Start_Date are crisp atomic-valued attributes, Tank_body is a relation-valued attribute, and Responsibility is a set-valued attribute. In the attribute Tank_body, the two component attributes Volume and Capacity are fuzzy ones.

Table 1 Pressured air tank relation

Tank_Id | Tank_body                                          | Start_Date | Responsibility
        | Body_Id | Material | Volume        | Capacity      |            |
TA1     | BO01    | Alloy    | about 2.5e+03 | about 1.0e+06 | 01/12/99   | John
TA2     | BO02    | Steel    | about 2.5e+04 | about 1.0e+07 | 28/03/00   | {Tom, Mary}
In the following, we focus on the fuzzy nested relational algebraic operations. We start by introducing some important concepts used in our operations [21].
Definition. Let U = {u1, u2, …, un} be a universe of discourse. Let πA and πB be two fuzzy data on U based on possibility distributions. The semantic inclusion degree of πA and πB, SID (πA, πB), which denotes the degree to which πA semantically includes πB, is then defined as follows:
\[ \mathrm{SID}(\pi_A, \pi_B) = \frac{\sum_{i=1}^{n} \min_{u_i \in U} \bigl(\pi_B(u_i), \pi_A(u_i)\bigr)}{\sum_{i=1}^{n} \pi_B(u_i)} \]
Definition. Let πA and πB be two fuzzy data and SID (πA, πB) be the degree to which πA semantically includes πB. The semantic equivalence degree of πA and πB, SE (πA, πB), denoting the degree to which πA and πB are equivalent to each other, is defined as follows.
SE (πA, πB) = min (SID (πA, πB), SID (πB, πA))
Two fuzzy data πA and πB are considered β-redundant if and only if SE (πA, πB) ≥ β. For two crisp data, atomic or set-valued, their equivalence degree is one if they are equal to each other, where identical set-valued data are considered equal. Consequently, the notion of equivalence degree of structured attribute values can be extended to the tuples in fuzzy nested relations in order to assess tuple redundancy. Informally, any two tuples in a nested relation are redundant if, for each pair of corresponding attribute values, the equivalence degree is greater than or equal to the threshold value. If the pair of the corresponding attribute values is simple, the equivalence degree is one for two values. For two values of structured attributes, however, the equivalence degree is one for structured attributes. Two redundant tuples t and t’ are written t ≡ t’.
Union and Difference. Let r and s be two union-compatible fuzzy nested relations. Then
r ∪ s = min ({t | t ∈ r ∨ t ∈ s}) and r − s = {t | t ∈ r ∧ (∀v ∈ s) (t ≢ v)}
Here, the operation min () removes the fuzzy redundant tuples in r and s. Of course, a threshold value must be provided for this purpose.
Cartesian Product. Let r and s be two fuzzy nested relations on schemas R and S, respectively. Then r × s is a fuzzy nested relation with the schema R ∪ S. The formal definition of the Cartesian product operation is as follows:
r × s = {t | t (R) ∈ r ∧ t (S) ∈ s}
Projection. Let r be a fuzzy nested relation on the schema R and S ⊂ R. Then the projection of r on the schema S is formally defined as follows:
ΠS (r) = min ({t | (∀ v ∈ r) (t = v (S))})
Here, an attribute in S may be of the form B.C, in which B is a structured attribute and C is its component attribute. As with the union operation, the projection operation also needs to remove fuzzy redundant tuples from the result relation.
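The two degrees SID and SE, and the redundancy test that min () relies on, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: possibility distributions are represented as dictionaries, the names sid, se and remove_redundant are hypothetical, and the sketch simply keeps the first of two β-redundant values, since the text does not spell out how redundant values are merged.

```python
from typing import Dict, List

# A possibility distribution maps each element u of the universe U to a degree in [0, 1].
Possibility = Dict[str, float]


def sid(pi_a: Possibility, pi_b: Possibility) -> float:
    """SID(pi_a, pi_b): degree to which pi_a semantically includes pi_b."""
    universe = set(pi_a) | set(pi_b)
    denominator = sum(pi_b.get(u, 0.0) for u in universe)
    if denominator == 0.0:
        return 0.0  # assumption: an all-zero pi_b is treated as not included
    numerator = sum(min(pi_b.get(u, 0.0), pi_a.get(u, 0.0)) for u in universe)
    return numerator / denominator


def se(pi_a: Possibility, pi_b: Possibility) -> float:
    """SE(pi_a, pi_b) = min(SID(pi_a, pi_b), SID(pi_b, pi_a))."""
    return min(sid(pi_a, pi_b), sid(pi_b, pi_a))


def remove_redundant(values: List[Possibility], beta: float) -> List[Possibility]:
    """Keep a value only if it is not beta-redundant with a value already kept.

    Assumption: the first of two redundant values survives; the source does
    not say how redundant values are merged.
    """
    kept: List[Possibility] = []
    for value in values:
        if all(se(value, other) < beta for other in kept):
            kept.append(value)
    return kept


# Hypothetical fuzzy data over a small universe of ages.
about_25 = {"24": 0.5, "25": 1.0, "26": 0.5}
around_25 = {"24": 0.4, "25": 1.0, "26": 0.6}
about_40 = {"39": 0.5, "40": 1.0, "41": 0.5}

print(se(about_25, around_25))
print(len(remove_redundant([about_25, around_25, about_40], beta=0.9)))
```

With the threshold β = 0.9, the two distributions centred on 25 are treated as redundant, while the one centred on 40 is kept.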
Selection. In classical relational databases, the selection condition is of the form X θ Y, where X is an attribute, Y is an attribute or a constant value, and θ ∈ {=, ≠, >, ≥, min (Supp (Y)). (4) X f Y iff X ≈ Y or X f Y. (5) X p Y iff X ≈/ Y and min (Supp (X)) < min (Supp (Y)). (6) X p Y iff X ≈ Y or X p Y. Depending on Y, the following situations can be identified for the selection condition X θ Y. Let X be the attribute Ai: τi in a fuzzy nested relation. (1) Ai θ c, where c is a crisp constant. According to τi, the definition of Ai θ c is as follows: if τi is dom, Ai θ c is a traditional comparison and θ ∈ {=, ≠, >, Caracas Paris AL02 468 16:00 06:00 1700 Caracas New York AL03 751 08:00 13:00 1200 New York Beijing AL04 958 19:00 19:00 1300 Frankfurt Beijing AL06 601 20:00 10:00 1400 And, consider the filter expression: $flight[cheap(price)]
Fuzzy XQuery
The result of this expression is a sequence of nodes annotated with their truth degree according to the effective truth value of the predicate, as follows:
Caracas New York AL03 751 08:00 13:00 1200
New York Beijing AL04 958 19:00 19:00 1300
Frankfurt Beijing AL06 601 20:00 10:00 1400
A predicate with the keyword threshold has the effect of rejecting those nodes whose xml:truth attribute is under the specified decimal value. For example:
$flight[cheap(price)][threshold 0.6]
With the previous data and definition, this filter expression would not include the following node in the resulting sequence:
Frankfurt Beijing AL06 601 20:00 10:00 1400
6.2 Comparison Expressions

XQuery provides three kinds of comparison expressions:
⎯ Value comparisons, which correspond to the usual comparison of values in traditional programming languages;
⎯ General comparisons, which generalize value comparisons and are intended to compare sequences with a comparison operator under the scope of an implicit existential quantifier; and
M. Goncalves and L. Tineo
⎯ Node comparisons, which allow comparing nodes by their identity or their position.
We propose to extend only the value comparisons by allowing user-defined fuzzy comparators. Thus, an identifier (QName) that the user has defined to be a fuzzy comparator may be used in place of the traditional ones: eq, ne, lt, le, gt, and ge.
ComparisonExpr ::= RangeExpr ( CompOper RangeExpr )?
CompOper       ::= ( ValueComp | GeneralComp | NodeComp | FuzzyComp )
ValueComp      ::= "eq" | "ne" | "lt" | "le" | "gt" | "ge"
GeneralComp    ::= "=" | "!=" | "<" | "<=" | ">" | ">="
NodeComp       ::= "is" | "<<" | ">>"
FuzzyComp      ::= QName
Fuzzy comparison evaluation is similar to that of traditional value comparisons. First, the operands are evaluated and checked for type compatibility, as XQuery does. If the operand types are not a valid combination for the given operator according to the user’s definition, a type error is raised [err:XPTY0004]. Otherwise, the operator is applied to the operands. The difference is that fuzzy comparison operators are user-defined while traditional ones are built-in. We remark that the result of evaluating a fuzzy comparison is of the xs:truth datatype instead of just xs:boolean. For example, suppose the fuzzy comparator similar is declared as:
declare fuzzy comparator similar ( $x as xs:string, $y as xs:string) ($x/$y) similarity ( .75, "red", "orange", .50, "yellow", "orange", .50, "yellow", "green", .75, "blue", "green" )
And consider the following fuzzy comparison expression:
$car/color similar "green"
The evaluation of this expression proceeds as follows: the node(s) returned by the expression $car/color are atomized. If the result of atomization is an empty sequence, the comparison result is an empty sequence. If the atomization result is a sequence containing more than one value, a type error is raised [err:XPTY0004]. If the atomization result is any atomic value other than "green", "blue" and "yellow", the expression returns false, while for "green" it returns true, for "blue" it returns 0.75, and for "yellow" it returns 0.50.
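The evaluation steps for the similar comparator can be mimicked with a small Python sketch. It assumes, consistently with the results stated above, that a similarity comparator is reflexive (equal values compare as true) and symmetric, and that unlisted pairs compare as false. The function names and the use of None for the empty sequence are illustrative choices, not part of the proposed language.

```python
from typing import Optional, Sequence

# Similarity degrees from the `similar` declaration; each pair is unordered.
SIMILARITY = {
    frozenset(("red", "orange")): 0.75,
    frozenset(("yellow", "orange")): 0.50,
    frozenset(("yellow", "green")): 0.50,
    frozenset(("blue", "green")): 0.75,
}


def similar(x: str, y: str) -> float:
    """Truth degree of `x similar y` (1.0 stands for true, 0.0 for false)."""
    if x == y:
        return 1.0  # assumption: the comparator is reflexive
    return SIMILARITY.get(frozenset((x, y)), 0.0)


def compare_atomized(values: Sequence[str], target: str) -> Optional[float]:
    """Apply the comparison to the atomized left operand."""
    if len(values) == 0:
        return None  # empty sequence in, empty sequence out
    if len(values) > 1:
        raise TypeError("err:XPTY0004: more than one atomic value")
    return similar(values[0], target)


for color in ("green", "blue", "yellow", "red"):
    print(color, compare_atomized([color], "green"))
```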
Now, suppose the fuzzy comparator defined as: declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) Although the operand expressions have different identities and/or names as nodes in the following comparison expressions, these expressions are valid because the two constructed nodes have values compatible with xs:int after atomization. 500 mol 800 500 mol 1000 1500 mol 1000 The results of these expressions will be 0.25, false and 0.5, respectively.
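The three results can be checked numerically. The sketch below assumes the full declaration of mol that appears later in this section, namely ($x/$y) trapezium (0.5, 1, 1, 2), and reads trapezium (a, b, c, d) as the usual trapezoidal membership function that rises on (a, b), equals 1 on [b, c] and falls on (c, d). The Python names are illustrative.

```python
import math
from typing import Callable


def trapezium(a: float, b: float, c: float, d: float) -> Callable[[float], float]:
    """Trapezoidal membership function; infinite a or d gives an open shoulder."""
    def mu(x: float) -> float:
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return 1.0 if math.isinf(a) else (x - a) / (b - a)
        if c < x < d:
            return 1.0 if math.isinf(d) else (d - x) / (d - c)
        return 0.0
    return mu


_mol_shape = trapezium(0.5, 1, 1, 2)


def mol(x: int, y: int) -> float:
    """'More or less' comparator: membership of the ratio x / y."""
    return _mol_shape(x / y)


print(mol(500, 800), mol(500, 1000), mol(1500, 1000))
```

The ratio 500/1000 = 0.5 lies exactly on the left foot of the trapezium, which is why the second comparison yields false (degree 0).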
6.3 Logical Expressions

In XQuery, a logical expression is either an and-expression or an or-expression whose data type is xs:boolean. If it does not raise an error, it always gives the value true or false. We extend this kind of expression by allowing user-defined fuzzy connectives and by giving a fuzzy logic semantics to the built-in and and or connectives.
FuzzyExpr ::= OrExpr ( QName OrExpr )*
OrExpr    ::= AndExpr ( "or" AndExpr )*
AndExpr   ::= FuzzyLiteral ( "and" FuzzyLiteral )*
The first step in evaluating a logical expression is to find the effective truth value of each of its operands. A logical expression raises an error if the evaluation of at least one operand raises an error. Nevertheless, some operands might not be evaluated if a short-cut evaluation strategy is implemented, and in consequence the evaluation of some erroneous operands might not be performed. If no error occurs, the truth value of the logical expression is calculated according to the definition of its connectives. The built-in or and and connectives are interpreted as the maximum and the minimum, respectively, of the effective truth values of their operands. The QName that connects two OrExpr in FuzzyExpr must be the name of a user-defined fuzzy connective, whose semantics is specified by the user in the corresponding declaration. Logical expressions might use comparison expressions and/or fuzzy predicate expressions FuzzyPred.
FuzzyLiteral ::= ( ComparisonExpr | FuzzyPred )
FuzzyPred    ::= ("not" | "ant" | QName)* QName "(" ExprSingle ")"
In the FuzzyPred syntax, the rightmost QName corresponds to a user-defined fuzzy predicate. Its application is similar to a function call. The other QNames correspond to user-defined fuzzy modifiers, while the keywords not and ant are built-in fuzzy modifiers. Consecutive modifiers are applied from right to left: the result of a modifier is the predicate to be modified by the modifier adjacent on its left.
Consider the following declarations:
declare fuzzy predicate young ( $age as xs:int) trapezium (-INF,0,25,65)
declare fuzzy predicate high ( $salary as xs:decimal) trapezium (2000,4000,xs:double(INF), xs:double(INF))
The truth degree of the following expression will be 0.25:
young(45) and high(2500)
Suppose the following user-defined predicates:
declare fuzzy predicate lucky ( $num as xs:int) extension ( (true,13), (.75,7), (1,49), (false,18), (0.33,33), (.8,40) )
declare fuzzy predicate preferred ( $color as xs:string) extension ( .33, "orange", .66, "yellow", .66, "blue", 1.00, "green" )
The corresponding truth degrees of the following expressions will be 0.66 and 0.8:
preferred("blue") or lucky(666)
preferred("red") or young(20) and lucky(40)
Now, consider also the following declarations:
declare fuzzy modifier really ( $dummy ) translation (+10)
declare fuzzy modifier very ( $dummy) power (+2.0)
The truth degrees of the following expressions will be 0.0625 and 0.5625:
very really young(45)
very not high(2500)
Finally, consider the declarations of these fuzzy terms:
declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) ($x/$y) trapezium (0.5, 1, 1, 2)
declare fuzzy connective imp ( $x, $y) { (not($x) or $y) }
declare fuzzy connective por ( $x, $y) { ($x + $y) - ($x * $y) }
The results of the following expressions will be 0.75 and 0.625, respectively:
500 mol 800 imp 500 mol 1000
500 mol 800 por 1500 mol 1000
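The truth degrees quoted in these examples can be reproduced with a short Python sketch. It rests on a few readings that are assumptions rather than statements of the source: translation (+10) is taken to shift the argument of the modified predicate by 10, power (+2.0) to square its degree, and values absent from an extension to have degree 0. These readings are chosen because they reproduce the degrees given above. All Python names are illustrative.

```python
import math


def trapezium(a, b, c, d):
    """Trapezoidal membership function; infinite a or d gives an open shoulder."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return 1.0 if math.isinf(a) else (x - a) / (b - a)
        if c < x < d:
            return 1.0 if math.isinf(d) else (d - x) / (d - c)
        return 0.0
    return mu


young = trapezium(-math.inf, 0, 25, 65)
high = trapezium(2000, 4000, math.inf, math.inf)
LUCKY = {13: 1.0, 7: 0.75, 49: 1.0, 18: 0.0, 33: 0.33, 40: 0.8}
PREFERRED = {"orange": 0.33, "yellow": 0.66, "blue": 0.66, "green": 1.0}


def lucky(num):
    return LUCKY.get(num, 0.0)  # assumption: unlisted values have degree 0


def preferred(color):
    return PREFERRED.get(color, 0.0)


# Built-in connectives: minimum for "and", maximum for "or".
def f_and(*degrees):
    return min(degrees)


def f_or(*degrees):
    return max(degrees)


# Modifiers take a predicate and return the modified predicate.
def negate(pred):
    return lambda x: 1.0 - pred(x)


def really(pred):
    return lambda x: pred(x + 10)  # translation (+10), assumed argument shift


def very(pred):
    return lambda x: pred(x) ** 2.0  # power (+2.0)


def mol(x, y):
    return trapezium(0.5, 1, 1, 2)(x / y)


# User-defined connectives from the declarations above.
def imp(x, y):
    return f_or(1.0 - x, y)


def por(x, y):
    return (x + y) - (x * y)


print(f_and(young(45), high(2500)))
print(f_or(preferred("blue"), lucky(666)))
print(f_or(preferred("red"), f_and(young(20), lucky(40))))
print(very(really(young))(45), very(negate(high))(2500))
print(imp(mol(500, 800), mol(500, 1000)), por(mol(500, 800), mol(1500, 1000)))
```

Modifiers are modelled as functions on predicates, so that very really young is very(really(young)), which matches the right-to-left application order described above.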
6.4 Quantified Expressions

In XQuery, quantified expressions support existential and universal quantification. We extend them in order to allow user-defined fuzzy quantifiers as follows:
QuantifiedExpr  ::= Quantifier VarBind ("," VarBind)* "satisfies" ExprSingle
Quantifier      ::= ("some" | "every" | QName)
VarBind         ::= "$" VarName TypeDeclaration? "in" ExprSingle
TypeDeclaration ::= "as" SequenceType
A quantified expression begins with a quantifier, which is either the keyword some or every, or a QName identifying a user-defined fuzzy quantifier. It is followed by one or more in-clauses that are used to bind variables, the keyword “satisfies” and a test expression. Each in-clause associates a variable with an expression that returns a sequence of items, called the binding sequence for that variable. The in-clauses generate tuples of variable bindings, including a tuple for each combination of items in the binding sequences of the respective variables. Conceptually, the test expression is evaluated for each tuple of variable bindings. The result depends on the effective truth values of the test expression. To define the semantics of a quantified expression, we must distinguish two main cases. The first case is when the generated tuples of variable bindings are crisp, i.e., they carry no xml:truth degree. The second case is when the generated tuples of variable bindings are fuzzy, i.e., they carry an xml:truth degree. The value of the quantified expression is defined by the following rules:
First main case (crisp tuples)
Let n be the number of generated tuples of variable bindings. Then (ρ0,…,ρn) is as follows: for a user-defined fuzzy quantifier, it is the sequence of membership degrees, in the fuzzy set that defines the quantifier, of the quantities 0,…,n if the quantifier is absolute, or of the quantities 0, 1/n, …, n/n if the quantifier is proportional. For the built-in some quantifier, ρ0=0, ρ1=1, ρ2=1, …, ρn=1. For the built-in every quantifier, ρ0=0, …, ρn-2=0, ρn-1=0, ρn=1.
Let μ0 be the truth value 1.0, μn+1 be the truth value 0.0, and (μ1,…,μn) be the sequence of effective truth values obtained for the test expression, given in decreasing order (μ1 ≥ μ2 ≥ … ≥ μn). Then the truth value of the quantified expression will be:
• \( \max_{i \in \{0,\dots,n\}} \min(\rho_i, \mu_i) \) for an increasing quantifier
• \( \max_{i \in \{0,\dots,n\}} \min(\rho_i, 1 - \mu_{i+1}) \) for a decreasing quantifier
• \( \min\bigl( \max_{i \in \{0,\dots,n\}} \min(\rho_i, \mu_i),\ \max_{i \in \{0,\dots,n\}} \min(\rho_i, 1 - \mu_{i+1}) \bigr) \) for a unimodal quantifier
Second main case (fuzzy tuples)
Let n be the number of generated tuples of variable bindings. For i ∈ {1,…,n}, let (ρ0,i,…,ρi,i) be as follows: for a user-defined fuzzy quantifier, it is the sequence of membership degrees, in the fuzzy set that defines the quantifier, of the quantities 0,…,i if the quantifier is absolute, or of the quantities 0, 1/i, …, i/i if the quantifier is proportional. For the built-in some quantifier, ρ0,i=0, ρ1,i=1, ρ2,i=1, …, ρi,i=1. For the built-in every quantifier, ρ0,i=0, …, ρi-2,i=0, ρi-1,i=0, ρi,i=1. Let ρ0,0 be the truth value 1.0 when the quantifier is decreasing; otherwise let ρ0,0 be the truth value 0.0. Let τ0 be the truth value 1.0, τn+1 be the truth value 0.0, and (τ1,…,τn) be the sequence, in decreasing order, of the truth degrees of the generated tuples of variable bindings. Let υ0 be the truth value 1.0, υn+1 be the truth value 0.0, and (υ1,…,υn) be the sequence, in decreasing order, of the degrees obtained as the minimum between the truth degree of each generated tuple and the respective effective truth value of the test expression. Then the truth value of the quantified expression will be:
• \( \max_{i \in \{0,\dots,n\}} \min\bigl( \tau_i,\ 1 - \tau_{i+1},\ \max_{j \in \{0,\dots,i\}} \min(\rho_{j,i}, \upsilon_j) \bigr) \) for an increasing quantifier
• \( \max_{i \in \{0,\dots,n\}} \min\bigl( \tau_i,\ 1 - \tau_{i+1},\ \max_{j \in \{0,\dots,i\}} \min(\rho_{j,i}, 1 - \upsilon_{j+1}) \bigr) \) for a decreasing quantifier
• \( \max_{i \in \{0,\dots,n\}} \min\bigl( \tau_i,\ 1 - \tau_{i+1},\ \max_{j \in \{0,\dots,i\}} \min(\rho_{j,i}, \upsilon_j),\ \max_{j \in \{0,\dots,i\}} \min(\rho_{j,i}, 1 - \upsilon_{j+1}) \bigr) \) for a unimodal quantifier
With the defined semantics, the effective truth value of the following quantified expression would be true as we expect with traditional XQuery semantics. some $x in (1, 2, 3), $y in (4, 3, 2) satisfies $x + $y = 5
Also the following expression shows that semantic of traditional quantifiers is preserved. In this case, the expression gives the truth value false. every $x in (1, 2, 3), $y in (4, 3, 2) satisfies $x + $y = 5
Consider the user defined fuzzy predicate high as follows: declare fuzzy predicate high ( $salary as xs:decimal) trapezium (2000,4000,xs:double(INF), xs:double(INF))
The following quantified expression results as 0.50. some $salary in 2500 to 3000 satisfies high($salary)
The effective truth value of this other expression will be 0.25.
every $salary in 2500 to 3000 satisfies high($salary)
Suppose atLeast30 is an increasing absolute fuzzy quantifier defined by: declare fuzzy quantifier atLeast30 absolute trapezium (25,30,INF,INF)
And the young predicate is defined as: declare fuzzy predicate young ( $age as xs:int) trapezium (-INF,0,25,65)
Then the following quantified expression has satisfaction degree 0.90: atLeast30 $age in 0 to 120 satisfies young($age)
If we define a unimodal behavior fuzzy quantifier of absolute nature around20 as: declare fuzzy quantifier around20 absolute trapezium (10,17,25,50)
And mol as a fuzzy comparator (more or less) defined as: declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) ($x/$y) trapezium (0.5, 1, 1, 2)
Then, the truth degree of the following expression will be 0.60: around20 $x in 1 to 7, $y in 1 to 7 satisfies $x mol $y
If you suppose the decreasing proportional fuzzy quantifier fewOf: declare fuzzy quantifier fewOf proportional trapezium (-INF,0,.25,.50)
And the fuzzy predicate preferred: declare fuzzy predicate preferred ( $color as xs:string) extension ( .33, "orange", .66, "yellow", .66, "blue", 1.00, "green" )
Then 0.34 will be the satisfaction degree of the following quantified expression: fewOf $c in ("red", "orange", "yellow", "green", "blue") satisfies preferred($c)
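The rules of the first main case (crisp tuples) can be turned into a short Python sketch that reproduces the degrees stated in the examples of this section. The function and variable names are illustrative, trapezium is read as the usual trapezoidal membership function, and the proportional helper assumes at least one binding tuple.

```python
import math
from itertools import product


def trapezium(a, b, c, d):
    """Trapezoidal membership function; infinite a or d gives an open shoulder."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return 1.0 if math.isinf(a) else (x - a) / (b - a)
        if c < x < d:
            return 1.0 if math.isinf(d) else (d - x) / (d - c)
        return 0.0
    return mu


def quantify(rho, degrees, kind):
    """rho holds rho_0..rho_n; degrees are the n test-expression truth values."""
    n = len(degrees)
    mu = [1.0] + sorted(degrees, reverse=True) + [0.0]  # mu_0 .. mu_(n+1)
    inc = max(min(rho[i], mu[i]) for i in range(n + 1))
    dec = max(min(rho[i], 1.0 - mu[i + 1]) for i in range(n + 1))
    return {"increasing": inc, "decreasing": dec, "unimodal": min(inc, dec)}[kind]


def rho_absolute(q, n):
    return [q(i) for i in range(n + 1)]


def rho_proportional(q, n):
    return [q(i / n) for i in range(n + 1)]  # assumes n > 0


def rho_some(n):
    return [0.0] + [1.0] * n


def rho_every(n):
    return [0.0] * n + [1.0]


high = trapezium(2000, 4000, math.inf, math.inf)
young = trapezium(-math.inf, 0, 25, 65)
mol = lambda x, y: trapezium(0.5, 1, 1, 2)(x / y)
preferred = {"orange": 0.33, "yellow": 0.66, "blue": 0.66, "green": 1.0}

salaries = [high(s) for s in range(2500, 3001)]
some_high = quantify(rho_some(len(salaries)), salaries, "increasing")
every_high = quantify(rho_every(len(salaries)), salaries, "increasing")

ages = [young(a) for a in range(0, 121)]
at_least_30 = quantify(rho_absolute(trapezium(25, 30, math.inf, math.inf), len(ages)),
                       ages, "increasing")

pairs = [mol(x, y) for x, y in product(range(1, 8), repeat=2)]
around_20 = quantify(rho_absolute(trapezium(10, 17, 25, 50), len(pairs)),
                     pairs, "unimodal")

colors = [preferred.get(c, 0.0) for c in ("red", "orange", "yellow", "green", "blue")]
few_of = quantify(rho_proportional(trapezium(-math.inf, 0, 0.25, 0.50), len(colors)),
                  colors, "decreasing")

print(some_high, every_high, at_least_30, around_20, few_of)
```

Both built-in quantifiers are evaluated here with the increasing rule, since their ρ sequences are non-decreasing.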
Finally, assume the declaration of a proportional increasing quantifier: declare fuzzy quantifier mostOf proportional trapezium (.50,.75,1,+INF)
In the following quantified expression, the generated binding tuples carry xml:truth degrees because of the filter expression in the in-clause. Assuming the predicates young and high as above, this querying expression gives the truth degree 0.50. mostOf $emp in ( 25 2500 35 3000 45 3500 55 4000 65 2500 25 3000 35 3500 45 4000 ) [high(salary)] satisfies not young($emp/age)
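The second main case (fuzzy tuples) can be sketched in the same way. The sketch below assumes that the eight items of the binding sequence are (age, salary) pairs, which is how the query uses them, and that each tuple therefore carries the degree high(salary) while the test expression yields 1 − young(age). The function name quantify_fuzzy and its parameters are illustrative.

```python
import math


def trapezium(a, b, c, d):
    """Trapezoidal membership function; infinite a or d gives an open shoulder."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return 1.0 if math.isinf(a) else (x - a) / (b - a)
        if c < x < d:
            return 1.0 if math.isinf(d) else (d - x) / (d - c)
        return 0.0
    return mu


def quantify_fuzzy(q, tuples, kind, proportional=True):
    """tuples is a list of (tuple truth degree, test-expression truth degree)."""
    n = len(tuples)
    tau = [1.0] + sorted((t for t, _ in tuples), reverse=True) + [0.0]
    ups = [1.0] + sorted((min(t, m) for t, m in tuples), reverse=True) + [0.0]

    def rho(j, i):
        if i == 0:  # rho_(0,0) is fixed by the kind of quantifier
            return 1.0 if kind == "decreasing" else 0.0
        return q(j / i) if proportional else q(j)

    best = 0.0
    for i in range(n + 1):
        inc = max(min(rho(j, i), ups[j]) for j in range(i + 1))
        dec = max(min(rho(j, i), 1.0 - ups[j + 1]) for j in range(i + 1))
        inner = {"increasing": inc, "decreasing": dec, "unimodal": min(inc, dec)}[kind]
        best = max(best, min(tau[i], 1.0 - tau[i + 1], inner))
    return best


young = trapezium(-math.inf, 0, 25, 65)
high = trapezium(2000, 4000, math.inf, math.inf)
most_of = trapezium(0.50, 0.75, 1, math.inf)

# Assumed reading of the binding sequence: (age, salary) pairs.
employees = [(25, 2500), (35, 3000), (45, 3500), (55, 4000),
             (65, 2500), (25, 3000), (35, 3500), (45, 4000)]
bindings = [(high(salary), 1.0 - young(age)) for age, salary in employees]

result = quantify_fuzzy(most_of, bindings, "increasing")
print(result)
```

When every tuple degree is 1, only the term for i = n survives in the outer maximum, so the result coincides with the crisp-case formula.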
6.5 Conditional Expressions

XQuery supports conditional expressions based on the keywords if, then, and else:
IfExpr    ::= "if" "(" Expr [TholdExpr] ")" "then" ExprSingle "else" ExprSingle
TholdExpr ::= "threshold" DecimalLiteral
In our extension, the expression that follows the if keyword, called the test expression, might give a satisfaction degree other than the usual true and false. The first step in processing a conditional expression is to find the effective truth value of the test expression. The value of a conditional expression is then defined as follows: if the effective truth value of the test expression is false, the value of the else-expression (the expression in the else clause) is returned; otherwise, the value of the then-expression (the expression in the then clause) is returned. For example, the following conditional expression returns the value of the then-expression when the effective truth value of the test expression is over the threshold 0.5.
if (young($emp/age) threshold 0.50) then $emp/age +" is young." else ""
Another example of a conditional expression is: if $color is "orange", "yellow", "blue" or "green", return "It might please me!"; otherwise, return "I don’t like it at all!".
if (preferred($color)) then "It might please me!" else "I don’t like it at all!"
6.6 FLWOR Expressions

In order to build complex queries involving multiple document sources, XQuery provides a query structure named FLWOR expressions; FLWOR corresponds to the initials of the keywords identifying the clauses of this kind of expression: For, Let, Where, Order by and Return. Its fuzzy extension involves no change in the syntactic rules of FLWOR expressions, and therefore we do not present its syntax schema here. Our focus is on how the fuzziness of other expressions and data structures may affect the result of a FLWOR expression. In the following, we discuss this for each FLWOR clause.
XQuery allows iteration over sequence-contained data by means of the for clause in the FLWOR expression. Each variable in a for clause is associated with a sequence obtained from another expression. The iteration is done over all possible combinations of values for the variables according to the corresponding sequences. When the expression related to a variable gives a sequence of nodes, it may provide a truth degree for each node by means of the xml:truth attribute. In this case, we compute a global truth degree for each combination of variable values. This degree logically corresponds to the conjunction of the conditions that originally gave rise to those degrees; therefore, it is computed as the minimum.
The let clause in the FLWOR expression assigns the result of another expression to a variable. If the expression gives a node with an xml:truth attribute, the value of this attribute is also combined with the other truth values using the minimum operator. When each node of a node sequence has an xml:truth attribute, these degrees are not automatically combined, because they must first be aggregated. This aggregation could be done in the where clause using a quantified expression.
The where clause establishes a filtering criterion.
In our extension, it could be any expression giving a truth degree, i.e., expressions with effective truth values such as those presented in previous sections. The effective truth value of the where clause condition is combined with the truth degree obtained from the for and let clauses. The combination is done, as always, with the minimum operator. When the for and let clauses are crisp, their truth degrees are 1, the neutral element of the minimum.
The fuzzy extension presented here does not directly affect the order by clause. However, it is possible to write a FLWOR expression with a for clause over a sequence of nodes carrying the xml:truth attribute and to order by that attribute.
The return clause specifies the result that is produced by each iteration. The final result of the FLWOR expression is a sequence containing all of them. In case of fuzziness, a truth degree is calculated in each iteration according to the for, let and where clauses. When the return clause builds a new node, the computed truth degree is added as an xml:truth attribute of the new node.
Let us illustrate an extended FLWOR expression. Suppose a variable $flights that holds a document comprising flights between cities, as in previous examples. Another variable $opinions contains customer scores of airports in these cities. Someone searches for a cheap trip from Caracas to Beijing with just one connection in an intermediate city, using some of his/her preferred airlines. This person also wants to consider intermediate cities where most of the middle-aged customers’ opinions about the airport give high scores. Finally, the result must be given in decreasing order based on the user’s criteria. This query would be expressed as follows:
for $c in (
  for $f1 in $flights//flight [origin=Caracas][good(airline)],
      $f2 in $flights//flight [destination=Beijing][good(airline)]
  let $v := $opinions//opinion [airport=$f2/origin][age=middle]
  where $f1/destination = $f2/origin
    and mostOf($x in $v) satisfies score=high
    and cheap($f1/price+$f2/price)
  return $f1 $f2 )
order by $c.xml:truth descending
return $c
Using the variable $c , the outer for clause would iterate over a sequence of nodes with label obtained from the inner FLWOR expression. Each one of these nodes would be provided with an xml:truth attribute that is computed while processing the inner FLWOR expression. Iteration in the outer FLWOR expression would be done in decreasing order of the $c.xml:truth attribute, thus giving the desired output. The inner for clause iterates over all possible pairs of values for the variables $f1 and $f2, obtained from the sequences of labeled nodes with origin Caracas and with destination Beijing, respectively. Each one of these nodes has an xml:truth attribute resulting from the filtering expression with predicate [good(airline)] . The combination of a pair of nodes for $f1 and $f2 would produce the truth degree minimum ($f1.xml:truth ,$f2.xml:truth). For each pair of values $f1, $f2, the let clause would instantiate the variable $v to a sequence of nodes with label and with an xml:truth attribute resulting from the filtering expression with predicate [age=middle] over the opinions for the airport of the origin city of the $f2 flight. These truth degrees are not immediately combined; they remain in the corresponding nodes. According to the quantified expression semantics, the xml:truth attribute values of the nodes in $v are used for computing the effective truth value of the expression mostOf($x in $v) satisfies score=high. The effective truth value of the expression cheap($f1/price+$f2/price) is also computed. The condition in the where clause is a conjunction; therefore, its effective truth value is obtained as the minimum of the three combined test expressions. Thus the where clause would reject those pairs $f1 $f2 that do not coincide at the intermediate city, because the effective truth value would be false. The inner return clause would build a new node with label in each iteration. The xml:truth attribute of each node would be computed as the minimum between the effective truth value of the condition in the where clause and the truth degree produced from the combination of a pair of nodes for $f1 and $f2.
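How the truth degrees of the for, where, return and order by clauses combine can be sketched in Python on a simplified version of this query. The sketch is illustrative only: it leaves out the let clause and the mostOf condition on opinions, it uses the flight data and the degrees of the good predicate given in Section 7 (reading the duplicated 'AL05' entry as AL06 with degree 0.3), and the cheap predicate is a hypothetical trapezium (-INF, 0, 2500, 3500), since no definition of cheap is available in this section.

```python
import math
from itertools import product


def trapezium(a, b, c, d):
    """Trapezoidal membership function; infinite a or d gives an open shoulder."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return 1.0 if math.isinf(a) else (x - a) / (b - a)
        if c < x < d:
            return 1.0 if math.isinf(d) else (d - x) / (d - c)
        return 0.0
    return mu


# Degrees of good(airline); AL06 = 0.3 is an assumed reading of the declaration.
GOOD = {"AL01": 0.5, "AL02": 1.0, "AL03": 0.8, "AL04": 0.4, "AL05": 0.7, "AL06": 0.3}
cheap = trapezium(-math.inf, 0, 2500, 3500)  # hypothetical predicate

# (origin, destination, airline, price)
FLIGHTS = [
    ("Caracas", "New York", "AL01", 1200), ("Caracas", "Paris", "AL02", 1700),
    ("Caracas", "Los Angeles", "AL03", 1200), ("Caracas", "London", "AL05", 1300),
    ("Caracas", "Frankfurt", "AL06", 1300), ("New York", "Beijing", "AL04", 1300),
    ("Paris", "Beijing", "AL02", 1400), ("Los Angeles", "Beijing", "AL03", 1400),
    ("London", "Beijing", "AL05", 1400), ("Frankfurt", "Beijing", "AL06", 1400),
]


def trips():
    # Filter expressions annotate each flight with the degree of good(airline).
    firsts = [(f, GOOD.get(f[2], 0.0)) for f in FLIGHTS if f[0] == "Caracas"]
    seconds = [(f, GOOD.get(f[2], 0.0)) for f in FLIGHTS if f[1] == "Beijing"]
    results = []
    for (f1, t1), (f2, t2) in product(firsts, seconds):
        for_degree = min(t1, t2)                    # for clause: minimum
        same_city = 1.0 if f1[1] == f2[0] else 0.0  # crisp join condition
        where_degree = min(same_city, cheap(f1[3] + f2[3]))
        truth = min(for_degree, where_degree)       # degree of the returned node
        if truth > 0.0:                             # false results are rejected
            results.append((f1[1], truth))
    # order by xml:truth descending
    return sorted(results, key=lambda item: item[1], reverse=True)


for city, truth in trips():
    print(city, truth)
```

Under these assumptions the connection through Los Angeles ranks first, because both of its airlines are good to degree 0.8 and the total price is close to the cheap range.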
7 Query Processing

Beyond the definition of the Fuzzy XQuery language, an important issue concerns the evaluation of fuzzy XQuery queries. We propose a mechanism based on the Derivation Principle [4] to evaluate fuzzy XQuery queries. First, a regular XQuery query is derived from the fuzzy one in terms of an α-cut; second, the derived XQuery query retrieves the data whose degrees are greater than or equal to α; and finally, the data are sorted by degree value. Thus, a fuzzy query may be evaluated while avoiding an exhaustive scan of the whole input XML data. We illustrate the evaluation of fuzzy XQuery queries using the Derivation Principle through the following example. Consider the document "flights.xml" whose content is:
Caracas New York AL01 357 07:00 12:00 1200
Caracas Paris AL02 468 16:00 06:00 1700
Caracas Los Angeles AL03 751 08:00 13:00 1200
Caracas London AL05 545 17:00 06:00 1300
Caracas Frankfurt AL06 632 17:00 08:00 1300
New York Beijing AL04 958 19:00 19:00 1300
Paris Beijing AL02 888 7:00 21:00 1400
Los Angeles Beijing AL03 975 16:00 12:00 1400
London Beijing AL05 577 20:00 11:00 1400
Frankfurt Beijing AL06 601 20:00 10:00 1400
Suppose that the user wants to retrieve information about flights from Caracas that are served by good airlines. For this purpose, the user defines a fuzzy predicate good: declare fuzzy predicate good( $airline as xs:string) extension( .5,'AL01', 1.0,'AL02', .8,'AL03', .4,'AL04', .7,'AL05', .3,'AL06')
The user also wishes to restrict the results to those flights whose truth degree for the query reaches the threshold 0.6. This requirement may be specified in fuzzy XQuery as the expression: doc("flights.xml")/flights/flight [origin='Caracas' and good(airline)][threshold 0.6]
Applying the concept of α-cut, we can derive the classic XQuery expression: doc("flights.xml")/flights/flight [origin='Caracas' and airline = ('AL02', 'AL03', 'AL05')]
This classic filter expression would produce the result:
Caracas Paris AL02 468 16:00 06:00 1700
Caracas Los Angeles AL03 751 08:00 13:00 1200
Caracas London AL05 545 17:00 06:00 1300
The evaluation mechanism based on the Derivation Principle would evaluate fuzzy conditions just for these selected elements. In this way, the superfluous computation of truth values for seven nodes is avoided. The final result of the query expression would be as follows. Notice that the flight nodes are annotated with the xml:truth attribute.
Caracas Paris AL02 468 16:00 06:00 1700
Caracas Los Angeles AL03 751 08:00 13:00 1200
Caracas London AL05 545 17:00 06:00 1300
Since our evaluation mechanism is based on the Derivation Principle, membership degrees are calculated only for these three answers. On the other hand, if the Derivation Principle is not applied, a naïve evaluation strategy must scan the XML document completely, calculate the membership degree of each element and, finally, discard the irrelevant answers. The previous example shows intuitively the efficiency of the strategy based on the Derivation Principle. This strategy has been successfully used for fuzzy SQL queries [6][13][16].
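The three steps of this evaluation strategy (derive a crisp condition from the α-cut, filter with it, then compute and sort degrees only for the selected elements) can be sketched in Python on the example data. The names are illustrative, flights are reduced to (origin, destination, airline) triples, and the degree 0.3 is assumed to belong to AL06, which is consistent with the derived expression above.

```python
from typing import Dict, List, Set, Tuple

# Extension of the fuzzy predicate good; AL06 = 0.3 is an assumed reading.
GOOD: Dict[str, float] = {"AL01": 0.5, "AL02": 1.0, "AL03": 0.8,
                          "AL04": 0.4, "AL05": 0.7, "AL06": 0.3}

Flight = Tuple[str, str, str]  # (origin, destination, airline)

FLIGHTS: List[Flight] = [
    ("Caracas", "New York", "AL01"), ("Caracas", "Paris", "AL02"),
    ("Caracas", "Los Angeles", "AL03"), ("Caracas", "London", "AL05"),
    ("Caracas", "Frankfurt", "AL06"), ("New York", "Beijing", "AL04"),
    ("Paris", "Beijing", "AL02"), ("Los Angeles", "Beijing", "AL03"),
    ("London", "Beijing", "AL05"), ("Frankfurt", "Beijing", "AL06"),
]


def alpha_cut(extension: Dict[str, float], alpha: float) -> Set[str]:
    """Crisp set of values whose membership degree is at least alpha."""
    return {value for value, degree in extension.items() if degree >= alpha}


def evaluate(flights: List[Flight], alpha: float) -> List[Tuple[Flight, float]]:
    # Step 1: derive the crisp condition from the fuzzy one.
    airlines = alpha_cut(GOOD, alpha)
    # Step 2: the derived (classic) filter selects the candidate flights.
    selected = [f for f in flights if f[0] == "Caracas" and f[2] in airlines]
    # Step 3: degrees are computed only for the selected flights, then sorted.
    annotated = [(f, GOOD[f[2]]) for f in selected]
    return sorted(annotated, key=lambda item: item[1], reverse=True)


for flight, truth in evaluate(FLIGHTS, alpha=0.6):
    print(flight, truth)
```

In a real system the second step would be delegated to the regular XQuery engine, so that its indexes and optimizations apply before any fuzzy computation takes place.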
8 Conclusion and Future Works

We have presented here a fuzzy set based extension to XQuery. This extension allows the user to specify preferences in XML queries and to retrieve answers discriminated by those preferences. The extension comprises the new xs:truth built-in data type, intended to represent gradual truth degrees. This datatype is defined as derived from xs:decimal and restricted to the interval [0,1]; at the same time, xs:boolean was redefined as derived from xs:truth. The concept of effective Boolean value has been replaced by that of effective truth value with this new type. The standard xml:truth attribute of type xs:truth was introduced in order to handle satisfaction degrees in nodes that result from fuzzy XQuery expressions and that are possibly stored in XML documents. The language is extended to declare fuzzy terms: predicates, modifiers, comparators, connectives and quantifiers. These terms are treated as user-defined operators that are placed in the corresponding work spaces. We have extended FLWOR expressions as well as all other XQuery expressions to work with fuzzy terms and produce gradual answers. Also, an evaluation mechanism based on the Derivation Principle is presented in order to avoid the superfluous computation of truth degrees. It would be interesting to incorporate into XQuery other user preference handling operators such as skyline and top-k.
Acknowledgments. We give thanks to Venezuela’s FONACIT project G-200500278 and France’s IRISA/ENSSAT project Pilgrim for supporting this research work. We express a great acknowledgement to Jesus Christ, source of force and inspiration: “I will lift up mine eyes unto the hills, from whence cometh my help. My help cometh from the LORD, which made heaven and earth. He will not suffer thy foot to be moved: he that keepeth thee will not slumber. Behold, he that keepeth Israel shall neither slumber nor sleep. The LORD is thy keeper: the LORD is thy shade upon thy right hand. The sun shall not smite thee by day, nor the moon by night. The LORD shall preserve thee from all evil: he shall preserve thy soul. The LORD shall preserve thy going out and thy coming in from this time forth, and even for evermore.” (Psalm 121)
References
1. Barranco, C.D., Campaña, J.R., Medina, J.M.: Towards a XML Fuzzy Structured Query Language. In: Proceedings of the Joint 4th Conference of the European Society for Fuzzy Logic and Technology and the 11th Rencontres Francophones sur la Logique Floue et ses Applications, pp. 1188–1193 (2005)
2. Bordogna, G., Psaila, G.: Customizable Flexible Querying Classic Relational Databases. In: Galindo, J. (ed.) Handbook of Research on Fuzzy Information Processing in Databases, Hershey, PA, USA. Information Science, vol. VIII, pp. 189–215 (2008)
3. Bosc, P., Pivert, O.: SQLf: A Relational Database Language for Fuzzy Querying. IEEE Transactions on Fuzzy Systems 3(1), 1–17 (1995)
4. Bosc, P., Pivert, O.: SQLf Query Functionality on Top of a Regular Relational Database Management System. In: Pons, O., Vila, M., Kacprzyk, J. (eds.) Knowledge Management in Fuzzy Databases, pp. 171–190. Physica-Verlag (2000)
5. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.L.: FXPath: Flexible Querying of XML Documents. In: Proceedings of EuroFuse (2002)
6. Curiel, M., González, C., Tineo, L., Urrutia, A.: On the Performance of Fuzzy Data Querying. In: Greco, S., Lukasiewicz, T. (eds.) SUM 2008. LNCS (LNAI), vol. 5291, pp. 134–145. Springer, Heidelberg (2008)
7. Damiani, E., Marrara, S., Pasi, G.: A flexible extension of XPath to improve XML Querying. In: Proceedings of the 31st annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 849–850 (2008)
8. Eisenberg, K., et al.: SQL:2003 Has Been Published. ACM SIGMOD 33(1), 119–126 (2004)
9. Fazzinga, B., Flesca, S., Pugliese, A.: Top-k Answers to fuzzy XPath Queries. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) Database and Expert Systems Applications. LNCS, vol. 5690, pp. 822–829. Springer, Heidelberg (2009)
10. Galindo, J.: New Characteristics in FSQL, a Fuzzy SQL for Fuzzy Databases. WSEAS Transactions on Information Science and Applications 2(2), 161–169 (2005)
11. Galindo, J., Urrutia, A., Piattini, M.: Fuzzy Database Modeling, Design and Implementation. Idea Group Publishing (2006)
12. Goncalves, M., Tineo, L.: A New Step Towards Flexible XQuery. Revista Avances en Sistemas e Informática 4(3), 27–34 (2007)
13. López, Y., Tineo, L.: About the Performance of SQLf Evaluation Mechanisms. CLEI Electronic Journal 9(2) (2006); Paper 8. Rueda, C., et al. (eds.)
14. Ma, Z.M., Yan, L.: Generalization of Strategies for Fuzzy Query Translation in Classical Relational Databases. Information and Software Technology 49(2), 172–180 (2007)
15. Thomson, E., Fredrick, J., Radhamani, G.: Fuzzy Logic Based XQuery operations for Native XML Database Systems. International Journal of Database Theory and Application 2(3), 13–20 (2009)
16. Tineo, L.: SQLf Horizontal Fuzzy Quantified Query Processing. In: Proceedings of the XXXI Conferencia Latinoamericana de Informática (2005)
17. W3C: XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft 3 (2005), http://www.w3.org/TR/xquery-full-text
18. W3C: XML Path Language, XPath (2007), http://www.w3.org/TR/xpath20
19. W3C: XQuery 1.0: An XML Query Language (2007), http://www.w3.org/TR/xquery/
20. W3C: Extensible Markup Language (XML) 1.0, 5th edn. (2008), http://www.w3.org/TR/REC-xml/
21. Zadeh, L.A.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)
22. Zadeh, L.A.: Computational Approach to Fuzzy Quantifiers in Natural Languages. Computer Mathematics with Applications 9, 149–183 (1983)
Attractive Interface for XML: Convincing Naive Users to Go Online
Keivan Kianmehr, Jamal Jida, Allan Chan, Nancy Situ, Kim Wong, Reda Alhajj, Jon Rokne, and Ken Barker
Abstract. Traditionally, searching in general, and querying in particular, required an exact match on values to return results. As technology improves in the information sector, the complexity of these systems also increases. This is fairly common, especially in the area of databases, as new models like XML emerge. Searching for information is becoming more challenging for most users, as the user population is growing rapidly to include more less-skilled (naive) users. This is especially true when web-based search is considered. Most users are not familiar with structured languages like SQL and XQuery. Using relative linguistic terms for querying seems to be the most reasonable and logical approach to making any composite resource a more searchable database of information, while implementing fuzziness in XML accounts for the lack of structure that results from pooling databases together. This seems to be the natural evolution of the technology as it moves away from complex and confusing interfaces to more user-friendly, user-centric and intuitive ones. To address these concerns, this chapter describes the design and implementation of a fuzzy nested querying system for XML databases. The research involved is outlined and examined to decide on the most fitting solution that incorporates fuzziness into a user interface intended to be attractive to naive users. After researching the task, we applied our findings by implementing a prototype that covers the intended scope of a demonstration of fuzzy nested querying. This prototype has been integrated into VIREX (a user-friendly system that allows users to view and use relational data as XML); the developed prototype includes an easy-to-use graphical interface that allows the user to apply fuzziness in order to search XML documents more easily. The goal is to provide insight into creating more intuitive ways of searching and using XML databases, thus increasing the size of the population using and addressing XML data. We intend to expand into relational and object-oriented databases.
Keywords: Fuzzy logic, XML schema, XML documents, linguistic terms, query interface, searching, nested database elements.

Keivan Kianmehr, Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Jamal Jida, Department of Informatics, Faculty of Sciences III, Lebanese University, Tripoli, Lebanon
Allan Chan, Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Nancy Situ, Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Kim Wong, Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Reda Alhajj, Computer Science Department, University of Calgary, Calgary, Alberta, Canada; Department of Computer Science, Global University, Beirut, Lebanon, e-mail: [email protected]
Jon Rokne, Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Ken Barker, Computer Science Department, University of Calgary, Calgary, Alberta, Canada

Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 165–191. © Springer-Verlag Berlin Heidelberg 2010 springerlink.com
1 Introduction
As XML (eXtensible Markup Language) is becoming the standard for Internet databases and large-scale data transfer [29], it is becoming ever more useful to everyone, from seasoned computer users to typical users who see computers as black boxes. Unfortunately, large pools of resources do not have centralized controllers. This lack of structure brings the further disadvantage that such resources are extremely difficult to query with imprecise, quantifying search terms. Natural language terms are vital to bridging the gap between casual users and professionals. A typical commercial office serves as a concrete example. Assume that the office employs a wide variety of personnel whose time sheets are stored in a relational database. An IT support staff member who is asked to find the employees who have taken a large number of sick days can do so quickly with a simple SQL query, whereas a sales manager is not expected to have expertise in coding queries and would generally go through the listings by hand. The end result is that the IT employee saves a lot of time by running a simple query, whereas the manager wastes valuable hours that could have been directed towards more important matters. Clearly, the IT employee is capable of using the database more efficiently and effectively than a typical user with little
technical experience, such as the manager. This result occurs for a variety of reasons, but can be summarized by the fact that the IT employee has greater technical expertise and a deeper understanding of databases and their query languages than the manager. This raises the question of how to address this disadvantage in order to give the manager equal footing for a task of this kind. Another disadvantage is the lack of uniform structure when several resources are combined to provide users with a large data set or pool of resources. A simple example of this problem is a large online bookstore that offers books published by several international publishing houses. In order to keep its commercial website up to date, the bookstore must display which books are available for ordering, which publishing house offers each book, the price of each book, and its print status. Each publishing house has its own internal databases that store various data regarding its own products, yet the bookstore must be able to keep up with all of them, despite their differently structured databases, which may not have the same fields, or even the same number of fields for each record. For example, one publishing house may list all previous copyrighted versions of a textbook, whereas another will list only its most recent copyright date. We address these problems by focusing on the two main concerns: the lack of uniformly structured information and, primarily, the inability to translate from human natural language to a database query. These problems can be handled by first introducing XML as the umbrella for integrating the different structures, and then providing a fuzzy querying facility that encourages naive users to search structured databases.
1.1 Overview of XML
XML is a general purpose format that is structured similarly to HTML. XML simply describes data and, aside from this, adds no additional functionality. A document is structured as a tree of elements, with no limit on the number of children of any one element node, which allows data elements to be nested within one another. Elements may also carry attributes that describe the data further. Figure 1 depicts an illustrative example of a simple XML file containing a list of undergraduate students. Note that there is only one root element, . The elements , , , , and are children elements of , and no elements are strictly required to be present, with the exception again of the root element. Year is an attribute that describes what year of the program the student is in. Note that an “attribute” in XML does not mean the same thing as in traditional databases. Whereas an attribute in a relational database is a column of a table, an attribute in an XML file is a name-value pair attached to an element, as illustrated in Figure 1.
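To make the distinction between elements and attributes concrete, the following Python sketch parses a small document with the standard library. The document is hypothetical and is not the file shown in Figure 1: the element names `students`, `student`, `name`, `major` and `minor` are assumptions chosen for illustration, and only the `year` attribute is taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are illustrative assumptions.
XML_TEXT = """
<students>
  <student year="2">
    <name>Ada Example</name>
    <major>Computer Science</major>
  </student>
  <student year="4">
    <name>Ben Sample</name>
    <major>Mathematics</major>
    <minor>Physics</minor>
  </student>
</students>
"""

root = ET.fromstring(XML_TEXT)   # the single root element


def describe(student):
    """Collect the child elements of one student plus its 'year' attribute."""
    record = {child.tag: child.text for child in student}   # child elements
    record["year"] = int(student.get("year"))               # attribute value
    return record


records = [describe(s) for s in root.findall("student")]
for record in records:
    print(record)
```

The second record carries a `minor` child that the first lacks, which illustrates that child elements need not all be present in every record.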
Fig. 1 Example XML file
Fig. 2 Example XML Structure
Since the element tags are defined purely by the generator of the data file, the result is usually a file that is human-readable and self-describing. Most people, even those unfamiliar with XML, would be able to deduce the meaning of the data in Figure 1 just from context. Nested XML is similar to regular XML, differing in that it contains additional sub-trees on the children. For example, by looking at Figure 2, we can see that has the sub tree of and . This adds additional ease of readability and usability to XML. XML is also platform-independent, since the format itself is an open standard. Applications are free to use any parser they wish, as long as they conform to the W3C guidelines. Thus no single XML API holds a monopoly that would restrict development involving XML. XQuery is the standardized accompanying query language for XML. It is semantically similar to SQL, but instead of Select-From-Where statements, XQuery is structured in For-Let-Where-Order-Return (FLWOR) statements, which are more suitable for searching through tree-structured elements [29].
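The For-Let-Where-Order-Return pattern can be mimicked in Python over a nested document, which may help readers who have not seen XQuery. This is only an analogy under stated assumptions: the document, its element names and the function `best_grades` are hypothetical, and the XQuery text in the comment is an illustrative equivalent rather than a query taken from the chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested document: each student has a <courses> sub-tree.
DOC = ET.fromstring("""
<students>
  <student><name>Ada</name>
    <courses><course grade="91">DB</course><course grade="78">OS</course></courses>
  </student>
  <student><name>Ben</name>
    <courses><course grade="64">DB</course></courses>
  </student>
  <student><name>Cy</name>
    <courses><course grade="85">AI</course><course grade="95">DB</course></courses>
  </student>
</students>
""")

# Illustrative XQuery counterpart (an assumption, shown for comparison only):
#   for $s in /students/student
#   let $best := max($s/courses/course/@grade)
#   where $best >= 80
#   order by $best descending
#   return $s/name


def best_grades(doc, minimum):
    rows = []
    for student in doc.findall("student"):                    # for
        grades = [int(c.get("grade"))
                  for c in student.findall("courses/course")]  # let
        if not grades:
            continue
        best = max(grades)
        if best >= minimum:                                   # where
            rows.append((student.findtext("name"), best))
    rows.sort(key=lambda row: row[1], reverse=True)           # order by
    return rows                                               # return


print(best_grades(DOC, 80))
```

The path `courses/course` walks into the nested sub-tree of each student, which is the kind of traversal that FLWOR expressions are designed for.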
1.2 Overview of Fuzzy Theory
Fuzzy set theory was pioneered in the mid-1960s by Zadeh [9]. His theory outlined a method for defining boundaries on “humanistic” math problems [25]. The
current standard would be bivalent set theory, which in comparison is very strict. Many databases currently use bivalent theory, or in other words a crisp set, when returning results. However, as technology improves, the inherent flaws of crisp databases are becoming more apparent, especially where non-technical users are concerned. Take, for example, a simple problem such as finding the amount of fish in stock at a store. We want to find the kinds of fish that we have “lots” of. Using the crisp set method, we simply define the range for “lots” and return all the fish that fit into it. This approach is very simple but also flawed. Using fuzzy sets for the same problem would return the same fish; however, the solution would also take into consideration that some kinds of fish that fall into the “lots” range are closer to the ideal “lots” amount than others in the same set. In such a search, it would be unreasonable to discard all other results that do not fit into a set solely because they differ by a relatively small amount. Fuzzy set theory is important in this application because it allows results to be imprecise - much like human language. A major advantage of fuzzy sets over crisp sets stems from the fact that human language is vague. Specifically, the same word may not mean the same thing to each person. Also, quantifying linguistic terms do not have set boundaries but, instead, vague limits. Individual fuzzy membership definitions take into account that what one user considers “lots” may not be what other users deem “lots.” By ranking results, and by returning results that do not match exactly but are still applicable, fuzziness adapts to human language much better. To accompany this informal discussion and example of fuzzy sets and bivalent theory, we can look at the details of how they operate and where they differ. In Figure 3, we have a chart displaying crisp set results.
The horizontal axis shows some relative amount of fish, while the vertical axis shows the measure by which a result belongs in the set, or in other words its accuracy. Using this, we assume that the fish the store sells will fall into a certain set, based on the amount of that fish in stock. For instance, in our previous example we would return all fish that fall into the “lots” category. While not incorrect, this result set may not be as accurate as possible because the function only returns true or false as to whether or not some type of fish is in a set (1 or 0 on the vertical axis). This can lead to a few problems of non-specificity. For example, fish that are nearing the “some” category are returned with no indication that they could potentially fall out of the “lots” category relatively soon (that is, some fish are sold after the query takes place). Also, like the previous example, an element either belongs to a set or it does not. We cannot return elements that might be worth including that are outside of the discrete boundaries of a set. Figure 4 is the same chart of the fish stocking problem, but using fuzzy sets instead. By using triangles instead of rectangles to represent the function, we can now measure a proper degree of membership. This means that if a
Fig. 3 Crisp set results
Fig. 4 Fuzzy set representation
type of fish is at the ideal “lots” amount it will return a full 1.0, and as we get further away from the ideal amount, the degree of membership decreases, signifying a less-than-ideal result. This also allows the triangle to cover a larger range of results with no loss of accuracy, as the user can easily discard results that return a small degree of membership. In addition to triangular, there are several other ways to represent the gradual membership change in fuzzy sets; trapezoidal is another commonly used representation. With a good conceptual model of how fuzzy sets operate in place, we can now explore some of the more technical aspects, such as the mathematical processes involved. More specifically, we will examine how adding a membership function to normal set theory allows fuzzy theory to operate.

µA(x) : D → [0, 1]    (1)
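The difference between the crisp rectangles of Figure 3 and the triangles of Figure 4 can be sketched as three small Python functions, each mapping a value of the domain to [0, 1] in the sense of Equation 1. The breakpoints used for “lots” (60, 100, 140) and the crisp range (70 to 130) are invented for illustration and are not taken from the figures.

```python
def crisp(x, low, high):
    """Bivalent membership: 1 inside [low, high], 0 outside."""
    return 1.0 if low <= x <= high else 0.0


def triangular(x, a, peak, d):
    """Triangular membership: 0 at a and d, 1 at peak (requires a < peak < d)."""
    if x <= a or x >= d:
        return 0.0
    if x <= peak:
        return (x - a) / (peak - a)
    return (d - x) / (d - peak)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: 1 on the plateau [b, c], linear edges outside."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Assumed definitions of "lots" for the fish-stock example.
for stock in (68, 80, 100, 135):
    print(stock,
          crisp(stock, 70, 130),
          triangular(stock, 60, 100, 140))
```

A stock of 68 is rejected outright by the crisp range but still receives a small degree under the triangular definition, which is the behaviour the fish example argues for.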
The main difference between a range search on a set of values and a search over a fuzzy membership set is that, in the fuzzy case, each value carries a degree indicating how well it fits the membership. Mathematically speaking, every value within a defined fuzzy membership is assigned a degree in the range [0, 1] that reflects how well the value matches the membership. Using this fact, we can use Equation 1 to determine the membership coefficient that a value has within a membership set. In Equation 1, µA is the membership function defined by the user for the fuzzy set A, x is a value from the domain D, and the resulting number between 0 and 1 is the membership coefficient of x. All of the material discussed above has been integrated into VIREX, a system under development by our research group that allows for the representation and querying of relational data as XML. VIREX has a user-friendly interface that allows the user to specify queries with minimal keyboard input. It thus addresses the concerns of naive users: there is no need to learn any particular query language; the user runs VIREX and gets on the screen a diagram summarizing all the database content and links. This way, the user is able to code queries without needing to know the database details. Whatever a professional is expected to know before coding queries is displayed by VIREX, putting all users at the same level. Queries are then coded as a sequence of mouse clicks that specify the items to show in the result. Once a query is coded, VIREX displays the result as XML schema and documents.
Our extension, as described in this chapter, empowers VIREX with more capabilities to the benefit of naive users. Instead of specifying conditions in the traditional way using a drop-down table, users are allowed to use fuzzy terms in their queries. It is the responsibility of the VIREX engine to translate the fuzzy terms into XQuery to be executed at the backend, and the result is returned to the user in fuzzy terms. The user is given the opportunity to display the XQuery produced by the VIREX engine; it is also possible to display the corresponding SQL statement. We demonstrate the effectiveness of the proposed approach by running a user study on computer science students who were enrolled in a database course without any prior background in database design or query coding. They may be considered naive users with almost the same potential to learn, as they all passed the prerequisites and are fourth-year students.
2 Related Work and Current Solutions
A variety of solutions and their respective implementations currently exist for the problem at hand. However, each implementation uses different methods to achieve results that are correct, but may not be replicable by non-technical users. By researching other currently implemented solutions, we hoped to gain insight on how to construct a new, different, and improved solution.
While our focus is on XML, by viewing other fuzzy database implementations we can identify potential pitfalls for our own design. Kacprzyk [10] discussed how to make relational databases, such as SQL or Access [10, 11], fuzzy by inserting a fuzzy attribute into the database for every numerical attribute. This is similar to the method already developed by our group and outlined in [1]. The additional fuzzy attribute represents the application of fuzzy set theory to existing data; a query would therefore return all relevant records according to that fuzzy numerical attribute. While this solution addresses the core concepts of fuzzy querying, our team believes it to be very ineffective, for the same reasons that apply to the fuzzy XML implementation discussed in Section 3.
2.1 Techniques in Relational Databases to Represent Fuzzy Data
Existing literature discusses many different techniques for representing fuzziness within relational databases. In general, it seems that the following ideas are agreed upon: a fuzzy relational database (FRDB) either allows for queries that let preferences be expressed instead of exact Boolean conditions, or allows for the storage and querying of a new type of data that directly stores fuzzy sets. In other terms, a FRDB can accommodate two types of imprecision - impreciseness in the association among data values or impreciseness in the data values themselves [19]. The two most common techniques used for working with imprecision are similarity relations or possibility distributions, or a combination of the two techniques. These are discussed in the next subsections.

Table 1 An instance of a Student relation

FName    LName    Avg Marks  Attitude
Jeremy   Scott    A          Unhappy
Jenny    Wong     A          Negative
George   Yuzwak   C          Positive
Jose     Sanchez  B          Cheerful

Table 2 Similarity Relation for the ‘Attitude’ attribute of the Student relation (Table 1)

          Unhappy  Negative  Positive  Cheerful
Unhappy   1        0.8       0.2       0
Negative  0.8      1         0         0
Positive  0.2      0         1         0.95
Cheerful  0        0         0.95      1
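A similarity relation such as the one in Table 2 replaces the test for equality with a graded match, so that a query keeps every value whose similarity to the requested value reaches a chosen threshold. The Python sketch below illustrates this with the data of Tables 1 and 2. It assumes that the Attitude values of Table 1 are listed in row order, and the function name `similar_students` is hypothetical.

```python
# Similarity relation for 'Attitude', transcribed from Table 2.
SIMILARITY = {
    "Unhappy":  {"Unhappy": 1.0, "Negative": 0.8, "Positive": 0.2, "Cheerful": 0.0},
    "Negative": {"Unhappy": 0.8, "Negative": 1.0, "Positive": 0.0, "Cheerful": 0.0},
    "Positive": {"Unhappy": 0.2, "Negative": 0.0, "Positive": 1.0, "Cheerful": 0.95},
    "Cheerful": {"Unhappy": 0.0, "Negative": 0.0, "Positive": 0.95, "Cheerful": 1.0},
}

# Student relation from Table 1 (first name, attitude); row order is assumed.
STUDENTS = [
    ("Jeremy", "Unhappy"),
    ("Jenny", "Negative"),
    ("George", "Positive"),
    ("Jose", "Cheerful"),
]


def similar_students(wanted, threshold):
    """Return (name, attitude, similarity) rows meeting the threshold, best first."""
    rows = [(name, attitude, SIMILARITY[wanted][attitude])
            for name, attitude in STUDENTS
            if SIMILARITY[wanted][attitude] >= threshold]
    return sorted(rows, key=lambda row: row[2], reverse=True)


print(similar_students("Positive", 0.8))
```

With a threshold of 1.0 the function degenerates to an ordinary equality test, which corresponds to reducing the similarity matrix to the identity matrix.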
2.1.1 Similarity-Based Techniques
Buckles and Petry were the first to introduce the similarity-based relational model [5]. The basis of this model is the replacement of equality with a similarity relation. A similarity relation s(x, y) is a mapping of every pair of elements within the Universe of Discourse (domain of an attribute) to the unit interval [0,1] [5]. This is best visualized in the form of a matrix. An example of this, based on the Attitude attribute of the Student relation described in Table 1, is given in Table 2. The matrix illustrates that the similarity relation is reflexive and symmetric. In this model of FRDB, a similarity relation is defined over the elements in each attribute, in each relation [23]. Where a crisp definition of equality is still desired, the matrix representation of the similarity relation is reduced to the identity matrix. When queries are written for a similarity-based FRDB, a minimal similarity threshold value must be given for any attribute in the relation that is to be matched based on similarity rather than equality. If no threshold value is specified, it is assumed that the standard definition of equality applies [5]. Using the similarity relation defined in Table 2, one could construct a query on the Student relation requesting all students with ‘Positive’ attitude with threshold of 0.8. This would then include ‘Cheerful’ students as well as ‘Positive’ students. Another feature of the similarity-based FRDB is that it allows for non-atomic domain values. In their model, Buckles and Petry [5] define that any member of the power set of the domain may be a domain value except the null set. This feature allows uncertainty of data values to be expressed, but is not in first normal form and suffers the associated implementation problems [6]. Similarity relations are best used on finite and discrete domains of linguistic sets [4]. The structure does not lend itself to infinite domains.
2.1.2 Possibility-Based Techniques
Instead of understanding a membership function µF(x) as the grade of membership of x in F, possibility-based FRDBs interpret it as a measure of the possibility that a variable Y has a value x [19]. Such fuzzy sets are referred to as possibility distributions and are represented by the symbol . These possibility distributions can be used to indicate the possibility that a tuple has a particular value for an attribute. For example, if a tuple in a Person table has the value ‘Young’ for the attribute ‘Age’, a possibility distribution describes the likelihood that such a person has a particular value for the age: young = {1.0/22, 1.0/23, 0.8/24, 0.6/25, ...} [4]. So the likelihood that the Young person is 24 years old is 0.8. This allows the linguistic identifier to be used as a value in the domain, while the actual possibility distribution is given elsewhere in the database in the form of a relation having the name of the linguistic identifier [4].
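The separation between a linguistic identifier stored as an attribute value and its possibility distribution stored elsewhere can be sketched as follows. Only the four degrees listed for ‘young’ above are used; the rest of the distribution is elided in the text, so treating unlisted ages as having possibility 0 is an assumption of this sketch, as are the sample rows and the function name `possibility_of_age`.

```python
# Possibility distributions keyed by linguistic identifier. Only the degrees
# listed in the text are included; the remainder of 'young' is elided there.
POSSIBILITY = {
    "young": {22: 1.0, 23: 1.0, 24: 0.8, 25: 0.6},
}

# Hypothetical Person rows: Age holds either a crisp number or an identifier.
PERSONS = [("Ana", "young"), ("Raj", 31)]


def possibility_of_age(stored_value, age):
    """Possibility that a stored Age value corresponds to the given age."""
    if isinstance(stored_value, str):
        # Assumption: ages not listed in the distribution get possibility 0.
        return POSSIBILITY[stored_value].get(age, 0.0)
    return 1.0 if stored_value == age else 0.0   # crisp value: all or nothing


for name, stored in PERSONS:
    print(name, possibility_of_age(stored, 24))
```

The lookup table plays the role of the separate relation named after the linguistic identifier, so the Person rows themselves stay compact.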
Raju et al. [19] describe two different ways of implementing a possibility-based FRDB. Each represents a fuzzy relation r by a table with an additional column for µr(t), showing the membership of tuple t in r. The first (Type-1) stipulates that the domain of each attribute is a fuzzy set (recall that a classical set is a special case of a fuzzy set). Given crisp values in a relation, there exist membership functions that map the values to linguistic terms with associated possibilities. The second implementation (Type-2) described by Raju et al. [19] permits more uncertainty in the data values. It allows ranges or possibility distributions to be the actual values of attributes. This cannot be implemented with current commercial frameworks for relational databases, since it allows for different data types in the same column and/or multiple values. Representation would require a new abstract data type to handle the new possibilities for attribute values. Finally, possibility distributions work well to provide information about objects that ‘may be’ a valid response to a query [4]. This model works well to represent imprecise data values.
2.1.3 Hybrid Techniques
Other techniques for representing fuzziness in relational databases have been proposed to include characteristics of both the similarity-based and possibility-based models. This allows them to work with more than one area of imprecision. An example is GEFRED - a Generalized Model of Fuzzy Relational Databases [18]. In this model, each attribute in a relation has an underlying domain that can be represented in one of many ways. Values can contain possibility distributions, ranges of values, approximate values or linguistic terms, each denoted by a syntactic identifier. Linguistic terms are linked to possibility distributions stored in external relations. These possibility distributions generally take the form of trapezoidal functions. The model also allows for linguistic terms in a column to be related via a ‘proximity relation’, which is identical to the similarity relation described by Buckles and Petry [5]. If no proximity relation exists for the attribute, it is assumed that the classical definition of equality applies for values in this domain [18]. GEFRED would require some sort of middleware product or specialized query language to interpret its data values, but it is a good example of how the discussed techniques can be employed to represent imprecision of data values and imprecision in the relationships between data values.
3 XML Schema and Fuzzy Data in XML
The W3C XML Schema has recently become a standard [13, 20]. An XML schema defines the structure of an XML document instance. XML schemas allow for strong data typing, modularization, and reuse. The XML schema specification allows a developer to define new data types (using the <complexType> tag), and also use built-in data types provided by the specification. The developer can also define the structure of an XML document instance
and constrain its content. As well, the XML schema language supports inheritance, so that developers do not have to start from scratch when defining a new schema. These features of the W3C XML schema specification allow for schemas that are effective in defining and constraining attributes and element values in XML documents [13]. Some research has already been completed on representing fuzzy data in XML. The fuzzy object-oriented modeling technique (FOOM) proposed by Lee et al. [13] is one such approach. This method builds upon object-oriented modeling (OOM) to also capture requirements that are imprecise in nature and therefore ‘fuzzy’. The FOOM schema defines a class of XML document that can describe fuzzy sets, fuzzy attributes, fuzzy rules, and fuzzy associations. This method would be useful in representing data contained in object-oriented databases. However, it is too specific in terms of its object-oriented nature to be applied directly to relational databases. Another, more general approach is proposed by Turowski et al. [21]. The method described is aimed at creating a common interchange format for fuzzy information using XML, to reduce integration problems between collaborating fuzzy applications. XML tags with a standardized meaning are used to encapsulate fuzzy information. A formal syntax for important fuzzy data types is also introduced. This technique of using XML to represent fuzzy information is general enough to be built upon and applied to relational databases. However, it uses DTDs, rather than the currently accepted method of XML schemas, to define and constrain the information held in an XML document. It would be beneficial to extend this approach so that the XML document class for holding data from fuzzy relational databases is defined with an XML schema rather than a DTD. However, in this chapter we follow a different trend by specifying/deriving membership functions for the attributes intended to be queried using fuzzy terms.
The latter attributes are expected to have numeric domains; alternatively categorical values are discretized. The best starting point for our research was to examine other solutions that tied directly to the problem, thus fuzzy XML implementations were the first topic to be researched. A previous implementation by our group outlined a method of adding fuzziness to an XML database by mapping a new subelement to an existing element that would store its fuzziness value [1]. While this method is straightforward and easy to implement, it relies on changing the database, which poses a problem to us as the problem is directed towards users who might not have the technical expertise to do so. Also, ownership boundaries pose a problem if the database does not belong to the user, and thus, cannot be modified in this way. The main problem here is that the database cannot be changed and thus this mapping must be done via some other way. This solution also requires an initial calculation over the entire database, which may prove to be extremely expensive, depending on its current size. As the initial phase requires inserting new fuzzy attributes into the database for each numerical value, the overall volume of the database may increase
considerably. This problem may propagate when the database contains data that changes over time. In the example given above, each time the bookstore’s repository is updated with changes from a publisher’s database, recalculation of fuzzy values for each updated attribute must be performed, thus increasing overhead even more. In addition, this approach also requires that fuzzy linguistic terms be pre-defined, which disallows the user from customizing their own terms. As each user’s definition of quantifiers and qualifiers are usually somewhat deviant from another’s, this approach fails to provide users with personal flexibility. However, the logic behind the previous effort by our group was sound and is useful to us as the ideas of applying fuzziness to XML can be used in our own implementation.
4 The Proposed Solution
This chapter proposes to solve the problems with two conceptual decisions. The first idea deals with the inability to handle several differently structured databases when combined together. This concern is especially prominent when attempting to combine several relational databases, which have rigid schema, together. The second is concerned with making the data more accessible to users who may lack in-depth technical knowledge about database querying. Combining the two solutions creates one that handles data under variable schema, and makes it more searchable. We developed a stand alone prototype that incorporates all of these ideas to act as a proof of concept. Then we integrated these ideas into VIREX in order to empower VIREX with more sophisticated capabilities.
4.1 Variable Schema Compromise
This chapter proposes to solve the structure variability issue by using XML to eliminate the barrier of rigid database schemas. By nature, XML is a markup language that consists of user-defined nested elements, as discussed previously. This removes the need for a strictly defined schema of the kind relational databases require, especially when several databases are combined to form a large repository. It is also an ideal choice for cross-database querying, since similar databases may be combined into an XML file for querying. Because our solution is designed for XML data files, the database may be read and used by a variety of other applications on different systems, which makes this type of database a preferable choice for platform independence. This involves working from a database in the form of an XML data file which consists of regular data and nested data. This chapter does not cover the methods by which other databases may be converted into XML data, as that is outside its scope and has already been handled by our group and others [7, 12, 15, 16, 22]. We assume that we begin with an existing XML data set and its corresponding schema file. This assumption is reasonable because such conversion has already been done by our group, and its output is used as the testbed for our research.
4.2 Fuzzy Querying
As previously discussed, fuzzy set theory plays a major role in our approach. It is integrated into data querying to provide imprecise searching to non-technical users. Specifically, we use fuzzy set theory to develop the ability to perform fuzzy searching on all numerical attributes in an XML database. We also focus on extending this functionality to nested XML elements. User-defined fuzzy relations allow users to search for sets of data by abstracting the strict data into broader linguistic terms, so that they can query the database in a way that is more understandable and intuitive, regardless of their level of technical knowledge. We also perform this fuzzy calculation completely dynamically, as opposed to changing the data layer of a system as the previously referenced solutions suggest. We believe it is a faster, more efficient, and more intuitive solution for non-technical users.
4.3 Combining Two Solutions into One
The use of an XML database and fuzzy logic, combined in an intuitive graphical user interface (GUI), allows users to create search queries with fuzzy, linguistically based terms. The interface allows the user to freely associate imprecise linguistic terms with fuzzy ranges for any numerical element. These terms can then be used to construct an executable search query for the XML database. The end result is an application that is capable of parsing an XML database and allows users to utilize fuzzy natural language terms to search for desired information.
Fig. 5 Basic view of our application
5 Implementation
5.1 Application Design Concepts
After the initial research phase, we found it necessary to plan and assess the scope of the prototype being developed to satisfy the requirements of the problem. This involved settling on design decisions that would be adhered to during the implementation of the prototype. It was decided that three key aspects needed to be present throughout the system. The first was that the prototype would need to have a logical flow for a typical user. The user should not have any difficulty in discerning the
proper order of use when presented with the application. Details such as the ordering of tabs at the top of the window were designed to present a logical sequence to the user. This is depicted in Figure 5, which shows the tabs along the top of our application. The tabs were placed in logical order, with focus given to the first tab upon start-up, to communicate the proper flow of use clearly. This also helps a user make fewer input mistakes, and thus leads to fewer errors within the prototype itself. Another key design idea was that the prototype should be easy to use, as it was geared towards non-technical users. Again, this was reflected in the placement of graphical components on the interface and in the clear labeling of each field and button. Each tab was designed with only minimal components to keep the task simple for the user. Lastly, the prototype would have to balance ease of use with functionality, as it should not sacrifice flexibility for esthetics. Although the main focus is to provide users with a simple, intuitive interface, we wanted to keep a large degree of flexibility in the provided functionality. An additional tab was added so that users with some familiarity with XQuery can execute their own complex queries on non-numerical data, as well as for general testing purposes. After coming to these conclusions, we prioritized the necessary development stages based on their importance to these design concerns and to the overall scope. Baseline functionality, like querying and parsing an XML file, took precedence over fleshing out more advanced functions such as additional membership function shapes.
5.2 Graphical User Interface
The first task to be completed was the creation of the GUI, constructed using Java’s Swing. Our team decided that each step of user querying would be given its own tab, so that screen real estate would not be monopolized by the prototype. Thus, the interface consists of five tabs, each serving a necessary function. The tabs are sequenced in the order that the user would follow for each step of querying. The “Browse” tab is placed first, giving access to browse functionality, as shown in Figure 6. Simply put, it enables the user to open an XML file and browse its contents. More importantly, when an XML file is loaded, the related schema file is also opened and parsed automatically (the purpose of this is discussed in Section 5.3). When the user has loaded their desired data file, they may proceed to the next tab. The “Schema File” tab appears next. It displays the loaded file’s associated schema file, allowing the user to determine the level at which any desired fuzzy terms occur. The schema file is loaded automatically, requiring no intervention on the user’s part, and is displayed to help the user work out the nesting level of the terms they need.
Fig. 6 XML data file
Fig. 7 XML Schema file
Fig. 8 Specifying membership functions
Fig. 9 Fuzzy query
The “Membership Functions” tab comes next, giving the user the ability to define membership functions on numerical attributes from the data file. As Figure 8 shows, a list of existing fuzzy membership definitions appears in the table at the top, according to which numerical element the user has selected in a drop-down menu. This drop-down component is automatically populated with attribute names after the XML data file is loaded. Our team chose this form of automation to reduce possible user errors. Populating these values into drop-down GUI components restricts the user’s input just enough to ensure that they do not enter erroneous attribute names, eliminating this as a source of frustration. A change in element selection also dynamically updates a label to the right, which displays the minimum and maximum values of the selected element in the XML database. This acts as a guideline for the user when defining ranges for a fuzzy term. For instance, if the user is aware that the lowest amount of fish in stock at the moment is 3, they can base the definition of their “few” range around that value. Text fields below the drop-down box allow the user to enter a fuzzy term for their range (such as “few”). This fuzzy term must be formatted like “few,x”, where x is the level of nesting of that term. The user can determine the level of nesting by looking at the schema file through the Schema File tab. While we feel this is not a good solution, the constraints placed on us by Swing and by time made it difficult to implement a more effective query builder that would determine the level of nesting on its own. The user can then define their range via the Min and Max text fields and by clicking the “Add a New Fuzzy Term” button. This adds the new term to the list for that numerical element. Here, the user may define as many fuzzy terms as they wish, on as many numerical attributes as they wish.
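The minimum/maximum label and the “few,x” input format can be illustrated with a short sketch. It is written in Python rather than the prototype's Java, and it makes two assumptions that are not from the text: numeric elements are detected from the data itself (the prototype reads them from the schema file), and the sample document and the names numeric_ranges and parse_term are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely following the fish-stock example in the text.
SAMPLE = """<store>
  <fish><name>trout</name><stock>3</stock></fish>
  <fish><name>salmon</name><stock>12</stock></fish>
  <fish><name>cod</name><stock>40</stock></fish>
</store>"""


def numeric_ranges(xml_text):
    """Map each element name whose text is always numeric to (min, max)."""
    values = {}
    non_numeric = set()
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if not text:
            continue  # container elements carry no value of their own
        try:
            values.setdefault(elem.tag, []).append(float(text))
        except ValueError:
            non_numeric.add(elem.tag)
    return {tag: (min(v), max(v))
            for tag, v in values.items() if tag not in non_numeric}


def parse_term(raw):
    """Split the 'term,level' input format into (term, nesting level)."""
    term, level = raw.rsplit(",", 1)
    return term.strip(), int(level)
```

With the sample above, the only numeric element is stock, whose range runs from 3 to 40; a user could then anchor a “few” range near the lower end.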
Once the user is finished defining their fuzzy membership functions, the “Fuzzy Queries” tab is next, as shown in Figure 9. The user may now perform queries on the XML file using the fuzzy terms they’ve created. Here, the
user is presented with two drop-down boxes; the one on the right changes depending on the selection on the left. The drop-down box on the left is populated with the numerical elements for which the user has defined fuzzy terms, whereas the drop-down box on the right dynamically changes to list all fuzzy terms that have been defined for the selected element. The user can select an element and a fuzzy term to form a query condition, which is added to the current query, and displayed in the query list, by clicking the “Add to Query” button. This list represents the entire query that the user has constructed so far. They can easily remove any condition they no longer want by selecting it and clicking the “Remove from Query” button. In effect, the user visually builds a query using the interface, rather than typing out a confusing and syntax-specific XQuery query. Clicking the “Run Query” button in the upper right executes the query and displays the returned results in the text area at the bottom. The user has the option of saving these results to an XML file by clicking the “Save Results” button. Lastly, the fifth tab, “XQuery”, allows the user to enter FLWOR queries that the prototype runs against the database. This was added to allow more technical users to write queries in a form they are comfortable with. It also allows users to query non-numerical data. Note that these results show up in the previous tab, as the XQuery tab was implemented mainly for testing purposes.
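The condition list built on the Fuzzy Queries tab can be pictured as plain data that is mechanically turned into a query string. The following is a minimal sketch in Python, not the prototype's Java/NUX code. It assumes each condition is an (element, fuzzy term) pair and each term maps to a user-defined (min, max) range; the function name build_xquery, the record path //fish, and the file name store.xml are hypothetical.

```python
def build_xquery(doc_name, record_path, conditions, ranges):
    """Combine (element, fuzzy term) conditions into one FLWOR query string.

    ranges maps (element, term) to the (min, max) range the user defined.
    """
    clauses = []
    for element, term in conditions:
        low, high = ranges[(element, term)]
        # each fuzzy term becomes a plain range condition on its element
        clauses.append(f"$r/{element} >= {low} and $r/{element} <= {high}")
    query = f'for $r in doc("{doc_name}"){record_path}'
    if clauses:
        query += " where " + " and ".join(clauses)
    return query + " return $r"


# Hypothetical usage: one condition, "stock is few"
ranges = {("stock", "few"): (0, 10), ("price", "cheap"): (1, 5)}
query = build_xquery("store.xml", "//fish", [("stock", "few")], ranges)
```

Removing a condition from the list simply drops its clause the next time the string is built, which mirrors the “Remove from Query” behavior described above.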
5.3 Functionality Implementation
Our underlying implementation was constructed with Java, in conjunction with NUX, an open-source XML API. NUX is an invaluable library that provides the ability to parse and query XML from a data file. Unfortunately, it has several drawbacks: it is not schema-aware, it is oversimplified, and it was not designed for this specific task. NUX was originally designed for messaging software, not for large-scale XQuery over data. Despite these drawbacks, the team integrated NUX because it provides a simple and effective API that hastened the implementation process. NUX assists us by effectively being the wrapper API of several different components [28]. These range from XOM trees for storage to XSLT for output. This cuts down on the time needed to learn each individual component and provides an accessible method for using each. Though problems have arisen from the aforementioned drawbacks, they have been worked around by coding tweaks and design choices. As discussed in the previous section, we automatically load the names of the numerical elements into a drop-down box for the user to eliminate confusion and error-prone typing. While this takes a little more time in the initial file-loading phase, it benefits the user by doing some of the tedious work for them: finding all numerical elements and displaying their minimum and maximum values. This is accomplished entirely without any intervention from the user, except that they must provide the corresponding
schema file for the data file they wish to query. The prototype finds the schema file in the same directory as the data file, identifying it by its identical filename with the “xml” extension replaced with “xmls”. This is necessary in order to poll the schema file for all numerical attributes, and it also allows the application to determine the minimum and maximum values of each one. Performing these functions in the backend allows us to display these values to the user, to aid them in building their own membership functions later on. All data pertaining to membership functions is stored in an internal hash map for easy reference. Using the aforementioned Fuzzy Queries tab, users build their queries by defining conditions, which are stored in the visible table component. Clicking the “Run Query” button translates each of the conditions into a query fragment in XQuery. The fragments are combined into a full XQuery statement, which is then executed on the loaded data file. Our query works by placing the whole document in a variable, say $a. With this variable in place, we put the individual fragments into other variables, say $b, $c, and so on. We then run a comparison between $a and $b and, where they match, return the results. We have found this works well, especially for nested queries. In our implementation, each fuzzy membership is defined by a triangle that spans the defined range on a graph and determines how closely a value is related to the fuzzy term. To do this, we assumed the following:
1. All fuzzy memberships have a minimum and maximum range;
2. The highest membership for each fuzzy membership area is at the average of the min and max, where the fuzzy coefficient is 1.
With these assumptions, we can determine the fuzzy coefficient by using the mathematical equation for a line, y = mx + b.
Here, m is the slope of the line, x and y are the coordinates of a point on the line, and b is the constant (the y-intercept) that fixes where the line sits on the graph. In our application, we first determine the slope m. It can be computed from two points, (x1, y1) and (x2, y2), with the equation m = (y1 − y2)/(x1 − x2). The two points used are the point where the membership function starts or ends and the middle point. At first we know only the x values of these points, since the user defines only the beginning and end of each range. However, the middle (average) point gives the membership coefficient its maximum value of 1, which fixes the coordinates of the first point. The second point follows from the minimum or maximum value, which borders the fuzzy membership function and therefore has a coefficient of 0. Once we have determined the slope of the line, we need to find its position on the graph, which is effectively the value b. By substituting in a known coordinate on the line, we will be able to determine the value of b
through algebra. Once we have found the values of m and b, our equation is complete. We can then substitute the returned fuzzy value into x, and solve the equation, giving us the value of y, which is the membership coefficient for that value within the fuzzy membership function. We return this along with the results. The prototype satisfies the problem by providing an easy to use interface that allows a user to perform fuzzy searches with no additions to existing databases or changes in querying language.
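The line-equation procedure described in this section can be sketched in a few lines. This is an illustrative Python version under the two stated assumptions (a min/max range and a peak of 1 at their average), not the prototype's Java code; the function name membership is hypothetical.

```python
def membership(value, low, high):
    """Degree of membership of value in a symmetric triangular fuzzy set.

    The triangle is 0 at low and high and peaks at 1 at their average.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    if value <= low or value >= high:
        return 0.0
    peak = (low + high) / 2
    # choose the side of the triangle the value falls on: (edge, 0) -> (peak, 1)
    edge = low if value <= peak else high
    m = (1 - 0) / (peak - edge)  # slope from the two known points
    b = 0 - m * edge             # substitute (edge, 0) into y = m*x + b
    return m * value + b         # membership coefficient for this value
```

For a range from 0 to 10, the peak lies at 5; values of 2.5 and 7.5 sit halfway up the rising and falling edges, so both receive a coefficient of about 0.5.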
6 Integrating the Proposed Approach into VIREX
After we developed the prototype and thoroughly tested its functionality from a software engineering perspective, we moved to the next step, integrating it with VIREX. The integration has been very successful and has given VIREX more expressiveness and functionality.
6.1 VIREX System Overview
VIREX is a powerful tool for querying relational databases to produce XML documents and corresponding XML schemas. It is also capable of creating views to transform part of a relational database into XML. VIREX has different modules that interact with the database to achieve this. From the user’s perspective, VIREX takes a relational database as input, extracts its schema, and generates an interactive diagram, similar to the extended entity-relationship (EER) diagram, on which queries can be constructed and views can be defined. Queries and views can be constructed visually using VRXQuery. Resulting XML documents and schemas are generated accordingly. In the front end of VIREX, there are four modules with which a user interacts to code queries and to define views; all of this is done using the mouse and with minimum keyboard input. After a visual query is constructed and submitted for execution, several modules within VIREX get involved in the generation of the target XML document and schema. VIREX processes a query in a systematic sequence of steps. The process starts with schema conversion. Based on the query constructed on the interactive diagram and the database schema generated earlier, the schema conversion module produces a schema object, which is provided to the XML generation module. This module has two submodules: query generation and data processing. The former creates the SQL queries to be executed based on the specified visual query. The SQL queries are executed against the underlying relational database by the data processing submodule. After the XML document object is created, both the XML document object and the schema object are passed on to the result generation module, which produces an expandable Java tree representation of the XML document and schema; it generates colored XML documents and schemas as tree
structures. These documents are returned to the front end and displayed to the end-user. In addition to the process described above, extra steps are taken when a user decides to store the result of a query (as a materialized view). After the XML document and schema objects are generated, they are passed on to the view maintenance module before they are displayed. The view maintenance module is responsible for materializing and updating views. In this case, the XML document and schema objects are analyzed by the view generating submodule to find an appropriate mapping of the XML view into the relational database. The created database for the XML view is then populated based on data in the XML document object. A materialized view must be kept consistent with its source database. In the approach adopted by VIREX, the update of a materialized view is deferred until the next time it is accessed, when extra steps are executed before the XML document is generated by the XML transformation module. To update a materialized view, the view maintenance module consults the internal representation model to obtain information on modifications made to the database since the last update of the view, which is then processed. The visual query used to generate the view is considered when the view is updated.
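The deferred update policy described above can be illustrated with a small sketch. It is a simplified Python model, not VIREX code: the class name DeferredView, the list-of-dicts source, and the predicate standing in for the visual query are all hypothetical, and only insertions are logged.

```python
class DeferredView:
    """Materialized view that is refreshed only when it is read."""

    def __init__(self, source, query):
        self.source = source   # hypothetical source table: a list of dict rows
        self.query = query     # predicate standing in for the visual query
        self.pending = []      # modifications logged since the last refresh
        self.rows = [row for row in source if query(row)]

    def record_insert(self, row):
        """Log a source modification without touching the view."""
        self.source.append(row)
        self.pending.append(row)

    def read(self):
        """Apply the logged modifications, then return the current view."""
        for row in self.pending:
            if self.query(row):  # the defining query is re-applied on update
                self.rows.append(row)
        self.pending.clear()
        return list(self.rows)


# Hypothetical usage: a view of rows with age below 30
view = DeferredView([{"age": 20}, {"age": 50}], lambda r: r["age"] < 30)
view.record_insert({"age": 25})
stale_size = len(view.rows)  # the view is not refreshed until it is read
fresh = view.read()
```

The design trades a slower first read after a change for cheaper writes, since no view is touched when the source is modified.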
Fig. 10 Query result: colored document on the left; schema on the right
In parallel with the deferred update on materialized views, corresponding XML views are also updated. The view maintenance module checks the internal representation model and then updates the corresponding XML document object directly before the actual XML document is displayed by the
result generation module. VIREX has a visual query module named VRXQuery, which allows interactive querying of relational and XML data and facilitates specifying results in an arbitrary XML document structure. A corresponding XML schema that describes the result XML document is also generated. VRXQuery is simple, user-oriented, efficient and effective. There are no textual query or transformation languages to learn. Finally, a sample query result in XML format is displayed in Figure 10, where part of the XML document and the XML schema are shown; a sample fuzzy query in VIREX is shown in Figure 11.
Fig. 11 Sample VRXQuery query with a fuzzy term on the AGE attribute and the corresponding SQL and XQuery queries
6.2 The Operations Supported by VRXQuery
VRXQuery is intended to be expressive, user-friendly and closed. To achieve expressiveness, we first analyzed the problem and identified the basic operations required to manipulate a given database in order to produce the target XML structure. Projection and selection are necessary for reducing the information that appears in the output. Join is necessary to combine information from different sources. However, nesting is more powerful and expressive: join produces flat structures, while nesting produces a nested structure. Union and difference are also necessary. We added order-by to give the user the opportunity to sort the information in the result. We also support group-by, which has already been investigated by
other researchers extending XQuery and XPath, e.g., [2]. Finally, the renaming of relations and attributes has been added as the basic schema evolution function, which makes the integration of databases easier and more straightforward once the names are unified. VRXQuery is user-friendly because all queries can be specified directly on the visual diagram as a sequence of mouse clicks and with minimum keyboard input; this is illustrated in Figure 11, where a user specifies the query on the displayed diagram and the condition is specified in the corresponding interactive table. The user is not expected to be an expert in relational or XML technology. Queries may be coded by trial and error; this makes VRXQuery an attractive learning tool for people interested in XML technology, who may specify simple queries and visualize the derived XML schema and document. VIREX provides online help for different aspects of the process to guide users whenever they get stuck. Finally, closure enriches expressiveness; it is also necessary to incorporate fuzziness into VRXQuery to serve a wider user community. The latter property is described in the next section, where membership functions are automatically determined and optimized.
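Among the operations listed in this section, join and nesting differ in the shape of their results. A minimal Python sketch of that difference, assuming hypothetical employee and project rows linked by an mngr field (the names join and nest and the sample rows are illustrative, not VIREX code):

```python
def join(employees, projects):
    """Flat result: one row per matching employee/project pair."""
    return [{**e, **p} for e in employees for p in projects
            if p["mngr"] == e["ssn"]]


def nest(employees, projects):
    """Nested result: each employee carries a list of managed projects."""
    return [{**e, "projects": [p for p in projects if p["mngr"] == e["ssn"]]}
            for e in employees]


# Hypothetical sample rows
employees = [{"ssn": 1, "name": "Ann"}, {"ssn": 2, "name": "Bob"}]
projects = [{"pname": "P1", "mngr": 1}, {"pname": "P2", "mngr": 1}]

flat = join(employees, projects)    # Ann appears twice, Bob not at all
nested = nest(employees, projects)  # Ann once with two projects, Bob with none
```

The nested form maps directly onto an XML element with child elements, which is why nesting is the more natural operation when the target is an XML document.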
6.3 From Fuzzy VRXQuery to SQL and XQuery
After specifying the elements/attributes to be queried in fuzzy terms, it becomes possible to code queries using fuzzy terms for the specified elements/attributes. However, it is still possible to query these elements/attributes without fuzziness, simply because the actual values in the database are not fuzzy. Fuzziness reflects only the perspective of one user group; it is not binding on all user groups accessing a common data source. This introduces more flexibility by allowing users to query the database from different perspectives. In this study, we used membership functions of triangular shape; this shape is appropriate for our purpose and is widely used in fuzzy systems. As a result, a table with the following structure is derived to hold the summarized fuzzy information: FuzzyAttributes(Attribute, fuzzy term, left x, middle x, right x), where the triangular shape intersects the x-axis at left x and right x, i.e., these are the extremes of the fuzzy triangle, and between them middle x is the point having membership degree one. After the fuzzy sets are decided, the user is expected to specify a fuzzy term for each fuzzy set. These fuzzy terms are stored in the table FuzzyAttributes to be used for processing and transforming the fuzzy terms appearing in user queries. For each attribute to be queried in fuzzy terms, the table FuzzyAttributes includes one row per fuzzy term to specify the boundaries and the middle point of the corresponding triangular shape. Included in Table 3 are four fuzzy terms that classify the ‘AGE’ attribute into four groups, namely kid, young, adult and senior. This structure can be smoothly adjusted to accommodate other forms of membership functions, such as trapezoidal.
Table 3 Test results reported by the user study to check the effectiveness of VIREX as learning tool: testing VIREX based learning versus classical learning

Attribute   Fuzzy Term   Left x   Right x   Middle x
AGE         kid          0        25        0
AGE         young        15       40        27
AGE         adult        30       60        49
AGE         senior       55       100       90
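The rows of Table 3 can be read as triangular membership functions keyed by attribute and fuzzy term, and a fuzzy term in a query condition can be resolved against them. The Python sketch below is illustrative only: the dictionary layout and the names to_condition and degree are hypothetical, and treating both range bounds as inclusive is an assumption.

```python
# Rows of FuzzyAttributes as listed in Table 3:
# (attribute, fuzzy term) -> (left x, middle x, right x)
FUZZY_ATTRIBUTES = {
    ("AGE", "kid"): (0, 0, 25),
    ("AGE", "young"): (15, 27, 40),
    ("AGE", "adult"): (30, 49, 60),
    ("AGE", "senior"): (55, 90, 100),
}


def to_condition(var, attribute, term):
    """Replace a fuzzy term by a range condition (inclusive bounds assumed)."""
    left, _, right = FUZZY_ATTRIBUTES[(attribute, term.lower())]
    return f"{var}/{attribute}>={left} and {var}/{attribute}<={right}"


def degree(attribute, term, value):
    """Membership degree of value in the (possibly asymmetric) triangle."""
    left, middle, right = FUZZY_ATTRIBUTES[(attribute, term.lower())]
    if value < left or value > right:
        return 0.0
    if value == middle:
        return 1.0
    if value < middle:
        return (value - left) / (middle - left)    # rising edge
    return (right - value) / (right - middle)      # falling edge
```

Because the middle point is stored explicitly, the triangle need not be symmetric; for “kid” the peak coincides with the left boundary, so the function only falls.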
Fuzziness is incorporated in the condition (where-clause) of a VRXQuery query as demonstrated in Figure 11; notice how ‘AGE’ has been specified as ‘YOUNG’ in the condition part. Before the actual query is executed, fuzziness is resolved by considering the membership function(s) that correspond to the fuzzy term(s) appearing in the query. Each fuzzy term specified in the query is replaced by a condition that returns all the values in the range covered by the fuzzy term. The query is then transformed into an equivalent query expressed either in SQL, to retrieve information from the underlying relational database, or in XQuery, to retrieve information from the data stored in XML format (both are shown in Figure 11). The process is illustrated in Figure 11 and by the fuzzy VRXQuery described in Example 1.
Example 1. Consider an XML schema to describe citizens, and assume that the ‘AGE’ attribute is intended to be expressed in the queries in fuzzy terms. Further, assume that for the ‘AGE’ attribute the system decided on the four membership functions listed in Table 3. A fuzzy VRXQuery to “find young employees who are managing projects located in Calgary” could be expressed as follows.
FOR $e IN distinct(document("company.xml")//EMPLOYEE)
LET $p := document("company.xml")//PROJECT[mngr=$e/SSN]
WHERE $p/city = 'Calgary' and $e/AGE is 'YOUNG'
RETURN $e/SSN
This query is transformed into the following XQuery:
FOR $e IN distinct(document("company.xml")//EMPLOYEE)
LET $p := document("company.xml")//PROJECT[mngr=$e/SSN]
WHERE $p/city = 'Calgary' and $e/AGE>=15 and $e/AGE

[...]

0.05). All medications significantly reduced the mean IOP from baseline (P < 0.0001). IOP reduction obtained with travoprost (7.3 ± 3.8 mmHg) was significantly higher than that obtained with latanoprost (4.7 ± 4.2 mmHg) (P = 0.01). A statistically significant reduction in mean CCT (0.6 ± 1.3%) from baseline was observed when patients instilled bimatoprost (P = 0.01). CONCLUSIONS: Latanoprost, travoprost, and bimatoprost had no statistically significant effect on the blood-aqueous barrier of phakic patients with POAG or OHT. Bimatoprost may be associated with a clinically irrelevant reduction in mean CCT.
The corresponding full XML document is extracted as follows.
<Trial>
  <Source>
    <URL>http://www.ncbi.nlm.nih.gov/pubmed/16936646?dopt=Abstract</URL>
    <Title>The effects of prostaglandin analogues on the blood aqueous barrier and corneal thickness of phakic patients with primary open-angle glaucoma and ocular hypertension</Title>
    <Author>Arcieri ES, Pierre Filho PT, Wakamatsu TH, Costa VP</Author>
  </Source>
  <Objective>
    <Drug id="drug-a">
      <Name>latanoprost 0.005%</Name>
    </Drug>
    <Drug id="drug-b">
      <Name>travoprost 0.004%</Name>
    </Drug>
    <Drug id="drug-c">
      <Name>bimatoprost 0.03%</Name>
    </Drug>
    <Aim>Blood-aqueous barrier and central corneal thickness</Aim>
    <PatientsType>primary open-angle glaucoma (POAG) and ocular hypertension (OHT)</PatientsType>
  </Objective>
  <MainOutcome>
    <Result>
      <Drug ref="drug-a"/>
      <Duration>1 month</Duration>
      <Unit>mmHg</Unit>
      <PatientsNum>34</PatientsNum>
      <pValue>0.01</pValue>
      <SampleDist value="IOP Reduction">
        <Mean>4.7</Mean>
        <SEM>4.2</SEM>
      </SampleDist>
    </Result>
    <Result>
      <Drug ref="drug-b"/>
      <Duration>1 month</Duration>
      <Unit>mmHg</Unit>
      <PatientsNum>34</PatientsNum>
      <pValue>0.0001</pValue>
      <SampleDist value="IOP Reduction">
        <Mean>7.3</Mean>
        <SEM>3.8</SEM>
      </SampleDist>
    </Result>
  </MainOutcome>
  <SideEffect>
    <Report>
      <AdverseEvent>irrelevant reduction in mean CCT</AdverseEvent>
      <Drug ref="drug-c"/>
      <Degree>may</Degree>
    </Report>
  </SideEffect>
  <Conclusion>
    <Efficacy>
      <Drug ref="drug-a"/>
      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <Efficacy>
      <Drug ref="drug-b"/>
      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <Efficacy>
      <Drug ref="drug-c"/>
      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <CompareEfficacy>
      <Drug ref="drug-a"/>
      <Degree>no significant difference</Degree>
      <Drug ref="drug-b"/>
      <Duration>1 month</Duration>
    </CompareEfficacy>
  </Conclusion>
</Trial>
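A document of this shape can be read back programmatically, for example to collect the reported IOP reductions per drug. The following is a minimal Python sketch using the standard library; the embedded snippet is an abridged copy of the trial document above, and the function name iop_reductions is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the trial document: two drugs and their main outcomes.
TRIAL = """<Trial>
  <Objective>
    <Drug id="drug-a"><Name>latanoprost 0.005%</Name></Drug>
    <Drug id="drug-b"><Name>travoprost 0.004%</Name></Drug>
  </Objective>
  <MainOutcome>
    <Result>
      <Drug ref="drug-a"/>
      <SampleDist value="IOP Reduction"><Mean>4.7</Mean><SEM>4.2</SEM></SampleDist>
    </Result>
    <Result>
      <Drug ref="drug-b"/>
      <SampleDist value="IOP Reduction"><Mean>7.3</Mean><SEM>3.8</SEM></SampleDist>
    </Result>
  </MainOutcome>
</Trial>"""


def iop_reductions(xml_text):
    """Map each drug name to the (Mean, SEM) pair reported in its Result."""
    root = ET.fromstring(xml_text)
    # Drug elements with an id define a drug; those with a ref point back to one
    names = {d.get("id"): d.findtext("Name")
             for d in root.iter("Drug") if d.get("id")}
    out = {}
    for result in root.iter("Result"):
        ref = result.find("Drug").get("ref")
        dist = result.find("SampleDist")
        out[names[ref]] = (float(dist.findtext("Mean")),
                           float(dist.findtext("SEM")))
    return out
```

Resolving each ref attribute against the id declared under Objective keeps the drug name in one place, which is the role the id/ref pairs play in the document.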
References
1. Abiteboul, S., Segoufin, L., Vianu, V.: Representing and querying XML with incomplete information. ACM Trans. Database Syst. 31(1), 208–254 (2006)
2. Barbara, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Trans. on Knowledge and Data Engineering 4(5), 487–502 (1992)
3. Bolen, S., Wilson, L., Vassy, J., Feldman, L., Yeh, J., Marinopoulos, S., Wilson, R., Cheng, D., Wiley, C., Selvin, E., Malaka, D., Akpala, C., Brancati, F., Bass, E.: Comparative effectiveness and safety of oral diabetes medications for adults with type 2 diabetes. Comparative effectiveness review (8) (2007)
4. Chiselita, D., Antohi, I., Medvichi, R., Danielescu, C.: Comparative analysis of the efficacy and safety of latanoprost, travoprost and the fixed combination timolol-dorzolamide; a prospective, randomized, masked, cross-over design study. Oftalmologia 49(3), 39–45 (2005)
5. Crangle, C.E., Cherry, J.M., Hong, E.L., Zbyslaw, A.: Mining experimental evidence of molecular function claims from the literature. Bioinformatics 23, 3232–3240 (2007)
6. Copas, J.B., Eguchi, S.: Local model uncertainty and incomplete-data bias. J. R. Statist. Soc. B 67(4), 459–513 (2005)
7. Cantor, L.B., Hoop, J., Morgan, L., Wudunn, D., Catoira, Y.: Bimatoprost-Travoprost Study Group, Intraocular pressure-lowering efficacy of bimatoprost 0.03% and travoprost 0.004% in patients with glaucoma or ocular hypertension. Br. J. Ophthalmol. 90(11), 1370–1373 (2006)
8. Cowie, J., Lehnert, W.: Information extraction. Communications of ACM 39, 81–91 (1996)
9. Charbonnel, B.H., Matthews, D.R., Schernthaner, G., Hanefeld, M., Brunetti, P.: for the QUARTET Study Group. A long-term comparison of pioglitazone and gliclazide in patients with Type 2 diabetes mellitus: a randomized, double-blind, parallel-group comparison trial. Diabetic Medicine 22, 399–405 (2004)
10. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: Gate: A framework and graphical development environment for robust nlp tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL 2002 (2002)
11. Combi, C., Oliboni, B., Rossato, R.: Merging multimedia presentations and semistructured temporal data: a graph-based model and its application to clinical information. Artificial Intelligence in Medicine (2005)
12. Cavallo, R., Pittarelli, M.: The theory of probabilistic databases. In: Proc. of VLDB 1987, pp. 71–81 (1987)
13. Clegg, A., Shepherd, A.: Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics 8, 24 (2007)
14. Ernest, C.S., Worcester, M.U., Tatoulis, J., Elliott, P.C., Murphy, B.M., Higgins, R.O., LeGrande, M.R., Goble, A.J.: Neurocognitive outcomes in off-pump versus on-pump bypass surgery: a randomized controlled trial. Ann. Thorac. Surg. 81(6), 2105–2114 (2006)
15. Gracia-Feijo, J., Martinez-de-la-Casa, J.M., Castillo, A., Mendez, C., Fernandez-Vidal, A., Garcia-Sanchez, J.: Circadian IOP-lowering efficacy of travoprost 0.004% ophthalmic solution compared to latanoprost 0.005%. Curr. Med. Res. Opin. 22(9), 1689–1697 (2006)
16. Greenhalgh, T.: How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Press (1997)
17. Hunter, A., Liu, W.: Fusion rules for merging uncertain information. Information Fusion 7, 97–114 (2006)
18. Hunter, A., Liu, W.: Merging uncertain information with semantic heterogeneity in XML. Knowledge and Information Systems 9(2), 230–258 (2006)
19. Hunter, A., Liu, W.: A logical reasoning framework for modelling and merging uncertain semi-structured information. In: Bouchon-Meunier, B., Coletti, G., Yager, R.R. (eds.) Modern Information Processing: From Theory to Applications, pp. 345–356. Elsevier, Amsterdam (2006)
20. Hunter, L., Lu, Z., Firby, J., Baumgartner Jr., W.A., Johnson, H.L., Ogren, P.V., Cohen, K.B.: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics 31 9(1), 78 (2008)
21. Howard, S., Silvia, O.N., Brian, E., John, S., Sushanta, M., Theresa, A., Michael, V.: The Safety and Efficacy of Travoprost 0.004%/Timolol 0.5% Fixed Combination Ophthalmic Solution. Ame. J. Ophthalmology 140(1), 1–8 (2005)
22. Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1), S11 (2005)
23. van Keulen, M., de Keijzer, A., Alink, W.: A probabilistic XML approach to data integration. In: Proceedings of ICDE 2005, pp. 459–470 (2005)
An XML Based Framework for Merging Incomplete
289
24. Lu, G., Copas, J.B.: Missing at Random, Likelihood Ignorability and Model Completeness. The Annals of Statistics 32(2), 754–765 (2004) 25. Lee, J.D., Lee, S.J., Tsushima, W.T., Yamauchi, H., Lau, W.T., Popper, J., Stein, A., Johnson, D., Lee, D., Petrovitch, H., Dang, C.R.: Benefits of off-pump bypass on neurologic and clinical morbidity: a prospective randomized trial. Ann. Thorac. Surg. 76(1), 18–25 (2003) 26. Lund, C., Sundet, K., Tennoe, B., Hol, P.K., Rein, K.A., Fosse, E., Russell, D.: Cerebralischemic injury and cognitive impairment after off-pump and on-pump coronary artery bypass grafting surgery. Ann. Thorac. Surg. 80, 2126–2131 (2005) 27. Lawrence, J., Reid, J., Taylor, G., Stirling, C., Reckless, J.: Favorable Effects of Pioglitazone and Metformin Compared With Gliclazide on Lipoprotein Subfractions in Overweight Patients With Early Type 2 Diabetes. Diabetes care 27(1), 41–46 (2004) 28. Ma, J., Liu, W., Hunter, A., Zhang, W.: Performing meta-analysis with incomplete statistical information in clinical trials. BMC Informatics 8(1), 56 (2008) 29. Matthews, D.R., Charbonnel, B.H., Hanefeld, M., Brunetti, P., Schernthaner, G.: Longterm therapy with addition of pioglitazone to metformin compared with the addition of gliclazide to metformin in patients with type 2 diabetes: a randomized, comparative study. Diabetes Metab. Res. Rev. 21, 167–174 (2005) 30. Michael, T., David, W., Alan, L.: Projected impact of travoprost versus timolol and latanoprost on visual field deficit progression and costs among black glaucoma subjects. Trans. Am. Ophthalmol. Soc. 100, 109–118 (2002) 31. Marasco, S.F., Sharwood, L.N., Abramson, M.J.: No improvement in neurocognitive outcomes after off-pump versus on-pump coronary revascularisation: a meta-analysis. European Journal of Cardio-thoracic Surgery 33, 961–970 (2008) 32. Noecker, R.J., Earl, M.L., Mundorf, T.K., Silvestein, S.M., Phillips, M.: Comparing bimatoprost and travoprost in black Americans. Curr. Med. Res. Opin. 
22(11), 2175–2180 (2006) 33. Nierman, A., Jagadish, H.: ProTDB: Probabilistic data in XML. In: Proc. of VLDB 2002. LNCS, vol. 2590, pp. 646–657. Springer, Heidelberg (2002) 34. Nicola, C., Michele, V., Tiziana, T., Francesco, C., Carlo, S.: Effects of Travoprost Eye Drops on Intraocular Pressure and Pulsatile Ocular Blood Flow: A 180-Day, Randomized, Double-Masked Comparison with Latanoprost Eye Drops in Patients with OpenAngle Glaucoma. Curr. Ther. Res. 64(7), 389–400 (2003) 35. Pf¨uzner, A., Marx, N., L¨uben, G., Langenfeld, M., Walcher, D., Konrad, T., Forst, T.: Improvement of Cardiovascular Risk Markers by Pioglitazone Is Independent From Glycemic Control Results From the Pioneer Study. Journal of the American College of Cardiology 45(12), 1925–1931 (2005) 36. http://protege.stanford.edu/ 37. Parmarksiz, S., Yuksel, N., Karabas, V.L., Ozkan, B., Demirci, G., Caglar, Y.: A comparison of travoprost, latanoprost and the fixed combination of dorzolamide and timolol in patients with pseudoexfoliation glaucoma. Eur. J. Ophthalmol. 16(1), 73–80 (2006) 38. Qi, G., Hunter, A.: Measuring incoherence in description logic-based ontologies. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudr´e-Mauroux, P. (eds.) ISWC 2007. LNCS, vol. 4825, pp. 381–394. Springer, Heidelberg (2007) 39. Radev, D., Fan, W., Qi, H., Wu, H., Grewal, A.: Probabilistic question answering on the Web. In: Proc. of WWW 2002, pp. 408–419 (2002) 40. Stefan, C., Nenciu, A., Malcea, C., Tebeanu, E.: Axial length of the ocular globe and hypotensive effect in glaucoma therapy with prostaglandin analogs. Oftalmologia 49(4), 47–50 (2005)
290
J. Ma et al.
41. Tan, M.H., Johns, D., Strand, J., Halse, J., Madsbad, S., Eriksson, J.W., Clausen, J., Konkoy, C.S., Herz, M., For the GLAC Study Group.: Sustained effects of pioglitazone vs. glibenclamide on insulin sensitivity, glycaemic control, and lipid profiles in patients with Type 2 diabetes. Diabetic Medicine 21, 859–866 (2004) 42. Wang, Y., Liu, W., Bell, D.A.: Combining uncertain outputs from multiple ontology matchers. In: Prade, H., Subrahmanian, V.S. (eds.) SUM 2007. LNCS (LNAI), vol. 4772, pp. 201–214. Springer, Heidelberg (2007) 43. van Dijk, D., Jansen, E.W.L., Hijman, R., Nierich, A.P., Diephuis, J.C., Moons, K.G.M., Lahpor, J.R., Borst, C., Keizer, A.M.A., Grobbee, D.E., de Jaegere, P.P., Kalkman, C.J.: Cognitive outcome after off-pump and on-pump coronary artery bypass graft surgery: a randomized trial. JAMA 287, 1405–1412 (2002) 44. White, I.: Missing data and departures from randomised treatment in pragmatic trials, http://www.mrc-bsu.cam.ac.uk/BSUsite/Research/ Section11.shtml 45. Zupan, B., Demsar, J., Katten, M., Ohori, M., Graefen, M., Bojanec, M., Beck, R.: Orange and decisions-at-hand: bridging predictive data mining and decision support. In: Proc. of ECML/PKDD 2001 workshop on Integrating Aspects of Data Mining Decision Support and Meta-Learning, September 2001, pp. 151–162 (2001) 46. http://en.wikipedia.org/wiki/Sampling_distribution
Aliança: A Proposal for a Fuzzy Database Architecture Incorporating XML
Raquel D. Rodrigues, Adriano J. de O. Cruz, and Rafael T. Cavalcanti
Abstract. This chapter presents and discusses the main characteristics of the new fuzzy database Aliança (Alliance)¹. The name reflects the fact that the system is the union of fuzzy logic techniques, a relational database management system and a fuzzy meta-knowledge base defined in XML. Aliança accepts a wide range of data types, including all information already treated by traditional databases, and incorporates different forms of representing fuzzy data. Even so, the system remains simple because it uses XML to represent meta-knowledge. An additional advantage of using XML is that it makes the structure of imprecise information easy to maintain and understand. Aliança was designed to allow easy upgrading of traditional database systems. The fuzzy database architecture Aliança brings interaction with databases closer to the usual way in which humans reason.
Raquel D. Rodrigues
Universidade Federal do Rio de Janeiro, IM and NCE, CCMN, Cx Postal: 2324, CEP: 20010-974, Cidade Universitária, Rio de Janeiro, Brazil

Adriano J. de O. Cruz
Universidade Federal do Rio de Janeiro, IM and NCE

Rafael T. Cavalcanti
Universidade Federal do Rio de Janeiro, IM and NCE

¹ A shorter version of this chapter appeared in Fuzzy Sets and Systems 160 (2009) 269–279.

Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 291–313. © Springer-Verlag Berlin Heidelberg 2010, springerlink.com

1 Introduction
Human beings are immersed in a sea of information. Our senses are continuously absorbing and processing external data. Most of this information is intrinsically vague or imprecise. We are able to process such imprecise data and, based on them, take actions that guide our daily activities and interactions with other
human beings. Most pieces of information that the brain receives through our senses, such as images, sounds and tastes, are of an imprecise nature, and despite that fact we reason, plan, solve complex problems, think abstractly, communicate, learn from experience and so on. In short, we hold what can be understood as an important aspect of intelligence, which is evidently more than a large capacity for memorization or for precisely solving complex mathematical operations.
Everything in the Universe is in constant flow, and the borders between one state and another are fluid and vary continuously. However, science, and particularly computing science, is based on a logic system that deals only with two states: true or false [12]. There is no place for degrees of truth or imprecision. Computers process strings of ones and zeroes and are entirely based on a logic system in which every statement is true or false.
At the beginning of the twentieth century, vagueness and imprecision slowly started to emerge as important points to be considered in a wide range of scientific problems. The Heisenberg uncertainty principle, which states that certain pairs of physical properties, such as position and momentum, cannot both be known to arbitrary precision, came as a shock regarding the limits of physical knowledge. The logician Bertrand Russell identified vagueness at the level of symbolic logic. The concept of fuzziness comes from the multivalued logic studied at the beginning of the twentieth century [13]. In the 1920s the Polish mathematician Jan Łukasiewicz developed fundamental concepts of a multivalued logic. Finally, in 1965, Lotfi Zadeh, of the University of California at Berkeley, published the founding article “Fuzzy Sets”, where for the first time the word “fuzzy” was used in place of “vague”.
Fuzzy logic has been successfully applied to control and decision-making systems, and many examples are available in the literature.
Industries worldwide are embedding fuzzy logic in all sorts of products and services. For example, fuzzy logic has been used in the control of cement manufacture, water purification processes and management information systems. One of its most famous applications is the fuzzy control of the subway system in the Japanese city of Sendai, which opened in 1987 and was developed by Hitachi [21]. Consumer goods include television sets that adjust volume and contrast depending on noise level and lighting conditions, and fuzzy washing machines that select the optimal washing cycle on the basis of the quantity and type of dirt and the load size. Photo and video cameras use fuzzy logic to map image data to lens settings. Most car manufacturers use fuzzy logic in some of their components, such as anti-skid braking systems and fuel injection. In Japan, the term “fuzzy” was presented as a synonym for “efficient operation requiring minimal human intervention.”
Since fuzzy logic expands the domain of information that computers are able to process, it is only natural that researchers seek ways to incorporate it into management systems and databases. The original database systems were designed for efficient treatment of large quantities of precisely defined data. These databases try to model real-world data using precise structures, but unfortunately we are surrounded by uncertain and imprecise information. Human beings manipulate imprecise information very well and, in fact, we act based on these pieces of information. We decide what to buy based on information such as “The interest
rate is too high” or use the car brakes because “The car is going too fast”. These facts may be only partially true (or false), and computers are not prepared to process them. Fuzzy logic extends the computer processing domain, providing it with tools to operate on facts that are not right or wrong but lie in the gray area.
Some of the most important and common operations in databases are queries for information. For example, one might ask, “Which male suspects are between 55 and 65 years old?”. On many occasions it would be more important for the solution of a problem, or more reasonable given the available information, if we could ask the same question as a human would: “Which male suspects are about 60 years old?”. The idea of “about” hints that a suspect who is 54 years old should still be included in the list of persons to be considered. In a traditional database the label “about” would not even be considered, because only precise numerical data is stored in the attribute age. The conclusion is that vital information is lost due to the lack of tools to treat fuzzy information.
Imprecise information is important in many contexts and provides solutions that would not be obtained if only precise data were considered. Considering the importance of treating this kind of information efficiently, fuzzy databases were developed. This new type of database can store, handle and respond to queries about vague and imprecise information in a very flexible way. However, there are not many examples of large fuzzy databases in production. Some of the reasons for this can be traced back to the cost of replacing or modifying expensive legacy systems and the lack of efficient ways to incorporate fuzzy knowledge into traditional relational databases.
This chapter presents a fuzzy database architecture called Aliança (Alliance).
The name reflects the fact that the system is the union of fuzzy logic techniques, a relational database management system and a fuzzy meta-knowledge base (FMB) defined in XML, brought together in order to handle and represent vague information [17]. Aliança was designed to incorporate fuzzy knowledge easily and also to allow easy upgrading of traditional database systems. The Aliança architecture can be used by old and new database systems, expanding their applications and the domain of the processed information.

2 Architecture Overview
In this section, we present the architecture of Aliança and discuss its basic elements and their relationships. The aim of this proposal is to provide a system that stores and handles imprecise information efficiently and that offers a simple path to upgrade traditional databases. The system uses a traditional database system to store information and a modified SQL language. The main goal of Aliança is to provide an efficient way to create and modify traditional databases so that they can incorporate fuzzy information. Previous proposals, such as the GEFRED model, created by [14] and improved by [7], which was a fusion of all previous proposals to represent fuzzy information,
presented problems with the way such information was described and incorporated into the database. Fuzzy databases need to incorporate extra information in order to process fuzzy information. For instance, when storing the information that someone is “young”, the semantics of this label must be described so that it can be compared with other fuzzy attributes and also with the usual numeric information about age. In Aliança, this aggregate of information used by the database system to handle fuzzy data is called the Fuzzy Meta-Knowledge Base (FMB).
Previous proposals kept the FMB as an extension of the system catalogue. There, the FMB is organized as sets of tables or relations in the same way as precise information is. These tables are complicated, and their creation and maintenance require considerable effort from database administrators and users. When the user issues a query that needs to retrieve fuzzy information from a database, the database management system needs to access these extra tables, possibly reducing performance. As will be described in Section 4.1, in Aliança these extra tables are not necessary, because the information is stored in XML files that are easier to maintain and support. Aliança does not require the addition of extra tables, only the addition of extra columns to the original tables.
Figure 1 shows the general architecture of the fuzzy database Aliança.
Fig. 1 General Fuzzy Database Architecture Aliança
The main modules of the system are:
• RDBMS (Relational Database Management System): This is a traditional relational database manager. Therefore, all fuzzy operations, or operations that involve fuzzy data, must be translated by the FSQL Server module into classical SQL operations before being sent to the RDBMS. Unlike previous proposals, for example FIRST (in GEFRED) [7], in Aliança the RDBMS has no direct relationship with the fuzzy meta-knowledge base and is unaware of its existence. From the RDBMS point of view, the only changes to the database are the extra columns added to the tables that store fuzzy data. Only the FSQL Server module knows how to put together the pieces that are needed to process a fuzzy query.
• DB (Database): The database, like all traditional relational databases, is a collection of tables and the relationships among their data. However, Aliança expands the capabilities of traditional databases by allowing the storage of fuzzy information in its tables. This new kind of information is stored using a set of hidden attributes that, together with the FMB, define all relevant characteristics of the fuzzy data. It is important to observe that these hidden attributes are stored in the same tables, side by side with the traditional data formats, and that they are transparent to the end user. Therefore, no extra tables are required, which simplifies the incorporation of fuzzy information. This gives rise to one of the main advantages of Aliança: the ease of upgrading from a traditional database.
• FMB (Fuzzy Meta-Knowledge Base): The information necessary to define and describe data of a fuzzy nature is stored in the FMB. This information is organized in XML format and only the FSQL Server can access it. The FMB does not store data, but information about the structure of the data stored in the tables of the database. For instance, the system must retrieve from the XML text files information such as the labels of fuzzy attributes and the parameters that define their semantics. A special kind of data is also stored in the FMB: the degree of similarity between concepts.
• FSQL Server: This is the main part of the system, because it handles the relationship between the FMB and the RDBMS. One of its objectives is to transform FSQL information into traditional SQL information so that the database manager can process it and return an answer. This server was developed in Java. The FSQL Server receives queries, identifies the fuzzy parts and searches the FMB for the meta-information describing them. Based on the results of this search, the FSQL Server is able to construct a classical SQL query that is submitted to the RDBMS.
• User’s Interface: The User’s Interface is a program that allows communication between the users and the FSQL Server. Users can submit queries and receive answers through the interface, which also checks for syntax errors.
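The query flow described for the FSQL Server (receive a query, identify the fuzzy parts, look them up in the FMB, emit classical SQL) can be illustrated with a minimal sketch. It is written in Python only for brevity, whereas the chapter's server was developed in Java. Everything specific in it is an assumption made for illustration and not the actual FSQL syntax or translation rule: the `~=` operator standing in for a fuzzy comparison, the in-memory `FMB` dictionary standing in for the XML files, the table name `Apartments`, and the choice of rewriting a label as a crisp range over the support of its trapezoid. The trapezoid parameters for “Small” are the ones drawn in Fig. 4.

```python
import re

# Hypothetical in-memory stand-in for the FMB; the real FMB is a set of XML files.
# (a, b, c, d) for "Small" are the trapezoid parameters shown in Fig. 4.
FMB = {("Apartments", "Size"): {"Small": (20, 30, 40, 60)}}

# Hypothetical fuzzy condition syntax: <attribute> ~= $<label>
FUZZY_CONDITION = re.compile(r"(\w+)\s*~=\s*\$(\w+)")


def translate(fuzzy_query: str, table: str) -> str:
    """Rewrite every fuzzy condition as a crisp range over the label's support."""

    def to_crisp(match: re.Match) -> str:
        attribute, label = match.group(1), match.group(2)
        # Look up the meta-knowledge that gives the label its meaning.
        a, _b, _c, d = FMB[(table, attribute)][label]
        return f"{attribute} BETWEEN {a} AND {d}"

    return FUZZY_CONDITION.sub(to_crisp, fuzzy_query)


query = "SELECT Id FROM Apartments WHERE Size ~= $Small AND Bedrooms = 1"
print(translate(query, "Apartments"))
```

The sketch only selects candidate rows whose crisp value falls inside the support of the label. A full translation would also have to take into account rows whose value is itself fuzzy, which Aliança describes through hidden attributes.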
It is important to note that the proposed architecture is not restricted to the relational model and can easily be extended to an object-oriented database. This is mainly because the Fuzzy Meta-Knowledge Base is not stored in tables but in a text-based XML document that is external to the relational database. Comparing the two ways used to implement databases, relational and object-oriented, we observe that while the relational model is based on tuples, the OO model deals directly with objects and their persistence. Therefore, it would be simple to give objects a treatment similar to the one presented here. In addition to the fuzzy attribute of the object, it would be necessary to add two other attributes, in much the same way as will be discussed for relational databases in Section 4.1.
In addition, the structure of the Object Query Language (OQL) is similar to the structure of SQL and, therefore, the transformation of a fuzzy query into an object query would follow the same methodology presented in Section 6.
3 Representation of Vague Knowledge in Aliança
In this section we discuss the different types of information that can be stored in the database and their forms of representation. Aliança accepts 8 different data types, each one receiving a numeric label from 0 to 7. This is a very rich set that includes all information already stored in traditional databases and incorporates different forms of representing fuzzy data.
• Crisp Data: Crisp data is the usual precise data handled by traditional databases, and from the RDBMS point of view it receives the same treatment as imprecise data. This kind of data is classified as type 0 and does not need any additional information in the Fuzzy Meta-Knowledge Base (FMB). Strings, real and natural numbers and dates are examples of usual crisp data formats. Aliança acts as a traditional database when processing information of this type.
• Unknown (but applicable): An attribute gets the value Unknown when it may receive any value from its domain, but it is impossible to define exactly what its value is. This kind of data is classified as type 1. The type Unknown is represented by the possibility distribution $\{1/u, \forall u \in U\}$, where $U$ is the domain. Generally speaking, a possibility distribution $D$ can be defined through enumeration using the expression
\[ D = \sum_{i} \mu_D(x_i)/x_i \tag{1} \]
where the summation and addition signs stand for the union of the pairs $(x_i, \mu_D(x_i))$ and “/” is only a separator. Figure 2 shows this distribution.
• Undefined (not applicable): An attribute gets the value Undefined when none of the values from the domain is applicable. This kind of data is classified as type 2. The type Undefined is represented by the possibility distribution $\{0/u, \forall u \in U\}$, where $U$ is the domain. Figure 3 shows this distribution.
• Null (absolute ignorance): An attribute gets the value Null when no information about it is available, either because we do not know it (Unknown) or because it is not applicable (Undefined). This kind of data is of type 3.
• Linguistic Label with a Possibility Distribution: When an attribute is associated with a vague value, it receives a linguistic label with a possibility distribution. This kind of data is of type 4 and has an associated trapezoidal possibility distribution whose definition is stored in the Fuzzy Meta-Knowledge Base. Figure 4 shows an example of a linguistic label. Trapezoids are frequently used in fuzzy systems to represent vague values. Other functions, such as triangles and Gaussians, may be used; however, in Aliança it was decided to use trapezoids, which represent the semantics of vague concepts satisfactorily and are simple to manipulate.
• Possibility Interval [m, n]: This kind of data is of type 5 and is associated with an interval possibility distribution. It is used to represent the fact that all that is known about some piece of information is that it lies within an interval, with equal possibility throughout. Figure 5 shows an example of a possibility interval. This kind of data also needs additional data stored in the FMB.
• Approximate Value (approximately d): If the value d is in the domain, the vague concept approximately d is defined by a triangular possibility distribution centred on d with a margin a, as shown in Figure 6. The margin indicates the degree of certainty available about the value of the attribute. This is type 6, and it also needs additional data stored in the FMB to define the margin used.
• Linguistic Label with Similarity: This kind of data is defined on a non-ordered domain. In this domain a similarity relation is defined between the linguistic labels. The relation is represented by a table showing the strength of the relation between all pairs of values belonging to the domain. This is type 7, and it needs additional data stored in the FMB. Table 3 shows an example of a similarity relation.
Fig. 2 Possibility distribution for a type Unknown (axes: x versus μ(x))
Fig. 3 Possibility distribution for a type Undefined (axes: x versus μ(x))
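The possibility distributions behind types 1, 2, 4, 5 and 6 can all be written as special cases of a trapezoid. The following Python sketch is only an illustration of that observation, not code from Aliança: the function names are hypothetical, the numeric parameters are the ones shown in Figs. 4, 5 and 6, and type 3 (Null) is deliberately left out because the text does not give it a distribution of its own.

```python
def trapezoid(a: float, b: float, c: float, d: float):
    """Return mu(x) for a trapezoid with support [a, d] and core [b, c]."""

    def mu(x: float) -> float:
        if b <= x <= c:
            return 1.0                    # inside the core
        if a < x < b:
            return (x - a) / (b - a)      # rising edge
        if c < x < d:
            return (d - x) / (d - c)      # falling edge
        return 0.0                        # outside the support

    return mu


def unknown(x: float) -> float:
    """Type 1: every value of the domain is fully possible."""
    return 1.0


def undefined(x: float) -> float:
    """Type 2: no value of the domain is possible."""
    return 0.0


def interval(m: float, n: float):
    """Type 5: equal possibility inside [m, n], none outside."""
    return trapezoid(m, m, n, n)


def approximately(d: float, margin: float):
    """Type 6: triangle centred on d with the given margin."""
    return trapezoid(d - margin, d, d, d + margin)


small = trapezoid(20, 30, 40, 60)   # type 4, label "Small" of Fig. 4
between_10_30 = interval(10, 30)    # Fig. 5
about_20 = approximately(20, 10)    # Fig. 6

print(small(33), small(50), between_10_30(25), about_20(15))
```

With these definitions a crisp size of 33 m² is fully “Small”, while 50 m² is “Small” only to degree 0.5.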
4 Fuzzy Meta-knowledge Base
As we have seen in Section 3, some data types need additional information in order to be manipulated correctly. The fuzzy meta-knowledge base (FMB) contains, in an efficient and organized way, the additional information required by the system. Unlike what was proposed in FIRST (in GEFRED, by [14] and [7]), the Aliança database does not store its fuzzy meta-knowledge in tables and relations within the database. The FMB in Aliança is described in XML format, which makes it easier to understand and maintain. The information stored for each of the types previously presented is shown in Table 1. As can be seen from Table 1, data types 0, 1, 2 and 3 do not store any additional information in the FMB.

Table 1 Information stored in the FMB
Type of Data                                    Type  Information Stored
Crisp Data                                      0     None
Unknown                                         1     None
Undefined                                       2     None
Null                                            3     None
Linguistic Label with Possibility Distribution  4     Labels and their defining characteristics
Possibility Interval                            5     Minimum and maximum
Approximate Value                               6     Margin
Linguistic Label with Similarity                7     Pairs (a, b), degree of similarity between a and b
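Table 1 can be captured as a small lookup structure. The sketch below is a hypothetical Python rendering, with identifier names chosen here for illustration; only the type codes and the stored information come from the table.

```python
from enum import IntEnum


class FuzzyType(IntEnum):
    """Type codes 0 to 7 as listed in Table 1 (member names are illustrative)."""
    CRISP = 0
    UNKNOWN = 1
    UNDEFINED = 2
    NULL = 3
    LABEL_POSSIBILITY = 4
    INTERVAL = 5
    APPROXIMATE = 6
    LABEL_SIMILARITY = 7


# What Table 1 says the FMB stores for each type (None means nothing is stored).
FMB_INFO = {
    FuzzyType.CRISP: None,
    FuzzyType.UNKNOWN: None,
    FuzzyType.UNDEFINED: None,
    FuzzyType.NULL: None,
    FuzzyType.LABEL_POSSIBILITY: "labels and their defining characteristics",
    FuzzyType.INTERVAL: "minimum and maximum",
    FuzzyType.APPROXIMATE: "margin",
    FuzzyType.LABEL_SIMILARITY: "pairs (a, b) with their degree of similarity",
}


def needs_fmb(fuzzy_type: FuzzyType) -> bool:
    """True when values of this type rely on information kept in the FMB."""
    return FMB_INFO[fuzzy_type] is not None


print([t.name for t in FuzzyType if needs_fmb(t)])
```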
Fig. 4 An example of linguistic label for the concept “Small” (trapezoid over Size in m², with a = 20, b = 30, c = 40, d = 60)
Fig. 5 An example of possibility interval (Size in m², with m = 10 and n = 30)

4.1 Structure of the FMB
The fuzzy meta-knowledge base is where all additional information necessary to handle the database transactions is stored. Aliança defines a directory structure for the FMB, in which the root directory is named after the database. Each database table has a subdirectory in the root directory, and this subdirectory contains one XML file for each attribute.
We will describe, through an example, the internal structure of the fuzzy fields and the new way in which data related to the fuzzy attributes are stored in the FMB of Aliança. The example is a database called Real Estate that stores a relation containing information about apartments for sale in Copacabana, a beach in Rio de Janeiro, Brazil. Figure 7 shows the FMB directory structure, while Table 2 lists the apartments used as examples.

Table 2 List of Apartments
Id  Bedrooms  Price      Size            Conservation
01  1         95000,00   33              Bad
02  2         600000,00  #140 (a)        Unknown
03  4         650000,00  Large           Regular
04  1         145000,00  Small           Regular
05  2         270000,00  78              Bad
06  4         800000,00  [130, 150] (b)  Bad
07  2         480000,00  Large           Excelent
08  3         360000,00  Unknown         Good
(a) The symbol # means “approximately”
(b) [m, n] is an interval and means between m and n
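The directory layout and the notation used in Table 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the way Aliança itself is implemented: the base directory `fmb`, the directory name `RealEstate` and the table name `Apartments` are hypothetical, and `classify` simply reads the textual notation of Table 2 for attributes over an ordered domain such as Price and Size (any other token is assumed to be a type 4 linguistic label).

```python
import re
from pathlib import PurePosixPath


def fmb_file(database: str, table: str, attribute: str, base: str = "fmb") -> PurePosixPath:
    """Root directory named after the database, one subdirectory per table,
    one XML file per attribute."""
    return PurePosixPath(base) / database / table / f"{attribute}.xml"


def classify(value: str) -> int:
    """Map the notation of Table 2 to a type code of Table 1 (ordered domains only)."""
    value = value.strip()
    if value == "Unknown":
        return 1
    if value == "Undefined":
        return 2
    if value == "Null":
        return 3
    if value.startswith("#"):
        return 6  # "#" means "approximately"
    if re.fullmatch(r"\[\s*\d+\s*,\s*\d+\s*\]", value):
        return 5  # [m, n] is a possibility interval
    if re.fullmatch(r"\d+([.,]\d+)?", value):
        return 0  # plain number, crisp
    return 4      # assumed to be a linguistic label such as "Large"


print(fmb_file("RealEstate", "Apartments", "Size"))
print([classify(v) for v in ["33", "#140", "Large", "[130, 150]", "Unknown"]])
```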
Fig. 6 An example of approximate value (triangle over Size in m², centred at 20 with margin = 10, spanning 10 to 30)
Fig. 7 FMB structure

The attributes of Table 2 are described below, according to the classification criteria used in Aliança for the different fuzzy types.
• Id: An integer serial field filled in automatically by the DBMS; it is the primary key of the table.
• Bedrooms: The number of bedrooms of each apartment. It is a numeric field of crisp type.
• Price: The price of each apartment. It is an attribute amenable to fuzzy treatment, so it can be filled with values of the types presented in Table 1. This attribute needs extra information stored in the FMB. The value of the margin (type 6) was defined as 5000. The definitions of the linguistic labels that use possibility distributions (type 4) are shown in Figure 8. It is therefore possible to store data according to the information available and, depending on the necessity or the possibility, users may reach conclusions based on this data. For instance, it is possible to store precise prices, should they be available, or imprecise prices, such as an apartment of average price or a price of approximately R$ 500000.00. For this kind of attribute, the ability to deal with fuzzy information may be very important during the negotiation process.
• Size: Stores the size of each apartment. It is also an attribute that can store fuzzy data, so the types defined in Table 1 can be used. The value of the margin for this attribute was defined as 5. The linguistic labels are presented in Figure 9.
• Conservation: Stores the state of conservation of each apartment. It is another attribute that can be treated as a fuzzy quantity. Unlike the previous attributes, this one is defined by a similarity relation, so its values are of type 7, and this is the only type that can be considered. The linguistic labels and the similarities between all possible pairs of labels for the attribute Conservation are presented in Table 3. Note that this similarity relation must be symmetric and reflexive.

Fig. 8 Definition of the labels for the attribute Price (labels Low, Average and High over Price in R$ 1000.00; axis ticks at 100, 150, 250, 300, 400, 550, 600, 700, 750, 850 and 1000)
Fig. 9 Definition of the labels for the attribute Size (labels Small, Medium and Large over Size in m²; axis ticks at 20, 30, 40, 50, 60, 70, 80, 90, 120 and 150)

Table 3 Similarity relation sr defined over the attribute Conservation
sr(d, d′)  Bad  Regular  Good  Excelent
Bad        1    0.8      0.5   0.1
Regular    0.8  1        0.7   0.5
Good       0.5  0.7      1     0.8
Excelent   0.1  0.5      0.8   1
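The similarity relation of Table 3 lends itself to a short worked example. The Python sketch below stores the table as a matrix, checks the two properties the text requires (reflexivity and symmetry) and retrieves the apartments of Table 2 whose conservation is similar to a given label. The threshold-based retrieval is an assumption made here for illustration and is not taken from the chapter; the label spellings follow Tables 2 and 3, and apartment 02, whose conservation is Unknown, is left out.

```python
LABELS = ["Bad", "Regular", "Good", "Excelent"]

# Table 3, row by row, in the order given by LABELS.
SR = [
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.7, 0.5],
    [0.5, 0.7, 1.0, 0.8],
    [0.1, 0.5, 0.8, 1.0],
]


def similarity(a: str, b: str) -> float:
    """Degree of similarity sr(a, b) between two Conservation labels."""
    return SR[LABELS.index(a)][LABELS.index(b)]


def is_reflexive() -> bool:
    return all(SR[i][i] == 1.0 for i in range(len(LABELS)))


def is_symmetric() -> bool:
    n = len(LABELS)
    return all(SR[i][j] == SR[j][i] for i in range(n) for j in range(n))


# Conservation column of Table 2 (apartment 02 is Unknown and is omitted).
CONSERVATION = {
    "01": "Bad", "03": "Regular", "04": "Regular", "05": "Bad",
    "06": "Bad", "07": "Excelent", "08": "Good",
}


def similar_to(target: str, threshold: float) -> list:
    """Apartments whose label is similar to target at least to the given degree."""
    return [apt for apt, label in CONSERVATION.items()
            if similarity(label, target) >= threshold]


print(is_reflexive(), is_symmetric())
print(similar_to("Good", 0.7))
```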
The attributes Price and Size are defined over an ordered domain; therefore they can be filled with values of types 0, 1, 2, 3, 4, 5 and 6 from Table 1. To discuss the contents of the XML files, we start with the file Size.xml, which is shown in Example 1.
Example 1 (File Size.xml).