Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2457
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Tatyana Yakhno (Ed.)
Advances in Information Systems Second International Conference, ADVIS 2002 Izmir, Turkey, October 23-25, 2002 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editor Tatyana Yakhno Dokuz Eylul University Computer Engineering Department Bornova, 35100 Izmir, Turkey E-mail:
[email protected]
Cataloging-in-Publication Data applied for. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at http://dnb.ddb.de
CR Subject Classification (1998): H.2, H.3, H.4, I.2, C.2, H.5 ISSN 0302-9743 ISBN 3-540-00009-7 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna e.K. Printed on acid-free paper SPIN: 10871178 06/3142 543210
Preface
This volume contains the proceedings of the Second International Conference on Advances in Information Systems (ADVIS) held in Izmir, Turkey, 23–25 October 2002. This conference was dedicated to the memory of Prof. Esen Ozkarahan. He was a great researcher who made an essential contribution to the development of information systems. Prof. Ozkarahan was one of the pioneers of database machine research and database systems in Turkey.

This conference was organized by the Computer Engineering department of Dokuz Eylul University in Izmir. This department was established in 1994 by Prof. Ozkarahan and he worked there for the last five years of his life.

The main goal of the conference was to bring together researchers from all around the world working in different areas of information systems, to share new ideas and present their latest results. This time we received 94 submissions from 27 countries. The program committee selected 40 papers for presentation at the conference. During the conference a workshop was organized on the topic “New Information Technologies in Education”.

The invited and accepted contributions cover a large variety of topics: general aspects of information systems, databases and data warehouses, information retrieval, multiagent systems and technologies, distributed and parallel computing, evolutionary algorithms and system programming, and new information technologies in education.

The success of the conference was dependent upon the hard work of a large number of people. We gratefully acknowledge the members of the Program Committee who helped to coordinate the process of refereeing all submitted papers. We also thank all the other specialists who reviewed the papers. We would like to express our grateful acknowledgement to the Rector of Dokuz Eylul University, Prof. Dr. Emin Alici, for his support and understanding.

August 2002
Tatyana Yakhno
Organization

Honorary Chair:
Irem Ozkarahan, Dokuz Eylul University, Turkey

Program Committee Chair:
Tatyana Yakhno, Dokuz Eylul University, Turkey

Program Committee:
Sibel Adali, USA; Adil Alpkocak, Turkey; Frederic Benhamou, France; Cem Bozsahin, Turkey; Ufuk Caglayan, Turkey; Fazli Can, USA; Yalcin Cebi, Turkey; Paolo Ciaccia, Italy; Cihan Dagli, USA; Mehmet Dalkilic, Turkey; Dursun Delen, USA; Oguz Dikenelli, Turkey; Asuman Dogac, Turkey; Dan Grigoras, Ireland; Yakov Fet, Russia; Lloyd Fosdick, USA; Fabio Grandi, Italy; Ugur Gudukbay, Turkey; Malcolm Heywood, Canada; Alp Kut, Turkey; Victor Malyshkin, Russia; Eric Monfroy, France; Mario A. Nascimento, Canada; Erich Neuhold, Germany; Selmin Nurcan, France; Gultekin Ozsoyoglu, USA; Meral Ozsoyoglu, USA; Tamer Ozsu, Canada; Marcin Paprzycki, USA; Malcolm Rigg, Luxembourg; Vadim Stefanuk, Russia; Martti Tienari, Finland; Turhan Tunali, Turkey; Ozgur Ulusoy, Turkey; Nur Zincir-Heywood, Canada; Adnan Yazici, Turkey

Additional Referees:
Mustafa Adacal, Emrah Akdag, Mehmet Aydogdu, Aysenur Birturk, Lucas Bordeaux, Souveyet Carine, Sylvie Cazalens, Martine Ceberio, Maria Cobb, Gokhan Dalkilic, Cenk Erdur, Vladimir Evstigneev, Minor Gordon, Fabrice Guillet, Auvo Hakkinen, Heikki Helin, Christine Jacquin, Timo Karvi, Régis Kla, Thomas Klement, Ahmet Koltuksuz, Selcuk Kopru, Erkan Korkmaz, Bora I. Kumova, N. Gokhan Kurt, Lea Kutvonen, Patrick Lehti, Martin Leissler, Vladimir Malukh, Mrinal Mandal, Federica Mandreoli, Alexander Marchuk, Feodor Murzin, Caglar Okat, Çagdas Ozgenç, Brice Pajot, Aldo Paradiso, Marco Patella, Mihaela Quirk, Davood Rafiei, Frank Ramsak, Guillaume Raschia, Wolfgang Schönfeld, Onur T. Sehitoglu, Pinar Senkul, Nikolay Shilov, Arnd Steinmetz, Johnson Thomas, Eleni Tousidou, Meltem Turhan Yöndem, Yuri Zagorulko

Conference Secretary:
Emine Ekin, Turkey

Local Organizing Committee:
Taner Danisman, Serife Sungun, Deniz Turan, Cagdas Ozgenc, Derya Birant, K. Ulas Birant

Sponsors
Support from the following institutions is gratefully acknowledged:
• Foundation of Dokuz Eylul University (DEVAK)
• The Scientific and Technical Research Council of Turkey (TUBITAK)
Table of Contents
Databases and Data Warehouses

Preserving Aggregation in an Object-Relational DBMS
    Johanna Wenny Rahayu, David Taniar
Database Compression Using an Offline Dictionary Method
    Abu Sayed Md. Latiful Hoque, Douglas McGregor, John Wilson
Representation of Temporal Unawareness
    Panagiotis Chountas, Ilias Petrounias, Krassimir Atanassov, Vassilis Kodogiannis, Elia El-Darzi
Scalable and Dynamic Grouping of Continual Queries
    Sharifullah Khan, Peter L. Mott
Uncertainty in Spatiotemporal Databases
    Erlend Tøssebro, Mads Nygård
Integrity Constraint Enforcement by Means of Trigger Templates
    Eladio Domínguez, Jorge Lloret, María Antonia Zapata
Current, Legacy, and Invalid Tuples in Conditionally Evolving Databases
    Ole Guttorm Jensen, Michael Böhlen
Magic Sets Method with Fuzzy Logic
    Karel Ježek, Martin Zíma

Information Retrieval

Information Retrieval Effectiveness of Turkish Search Engines
    Yıltan Bitirim, Yaşar Tonta, Hayri Sever
Comparing Linear Discriminant Analysis and Support Vector Machines
    Ibrahim Gokcen, Jing Peng
Cross-Language Information Retrieval Using Multiple Resources and Combinations for Query Expansion
    Fatiha Sadat, Masatoshi Yoshikawa, Shunsuke Uemura
Extracting Shape Features in JPEG-2000 Compressed Images
    Jianmin Jiang, Baofeng Guo, Pengjie Li
Comparison of Normalization Techniques for Metasearch
    Hayri Sever, Mehmet R. Tolun
On the Cryptographic Patterns and Frequencies in Turkish Language
    Mehmet Emin Dalkılıç, Gökhan Dalkılıç
Automatic Stemming for Indexing of an Agglutinative Language
    Sehyeong Cho, Seung-Soo Han
Pattern Acquisition for Chinese Named Entity Recognition: A Supervised Learning Approach
    Xiaoshan Fang, Huanye Sheng

Information Systems

The Information System for Creating and Maintaining the Electronic Archive of Documents
    Alexander Marchuk, Andrey Nemov, Konstantin Fedorov, Sergey Antyoufeev
KiMPA: A Kinematics-Based Method for Polygon Approximation
    Ediz Şaykol, Gürcan Güleşir, Uğur Güdükbay, Özgür Ulusoy
The Implementation of a Robotic Replanning Framework
    Şule Yıldırım, Turhan Tunalı
A 300 MB Turkish Corpus and Word Analysis
    Gökhan Dalkılıç, Yalçın Çebi
Adaptation of a Neighbor Selection Markov Chain for Prefetching Tiled Web GIS Data
    Dong Ho Lee, Jung Sup Kim, Soo Duk Kim, Ki Chang Kim, Yoo-Sung Kim, Jaehyun Park
Web Based Automation Software for Calculating Production Costs in Apparel Industry
    Ender Yazgan Bulgun, Alp Kut, Güngör Baser, Mustafa Kasap

Multi-agent Technologies and Systems

Knowledge Representation in the Agent-Based Travel Support System
    Marcin Paprzycki, Austin Gilbert, Minor Gordon
Intelligent Agents in Virtual Worlds
    Philippe Codognet
Characterizing Web Service Substitutivity with Combined Deductive and Inductive Engines
    Marcelo A.T. Aragão, Alvaro A.A. Fernandes
Modular-Fuzzy Cooperation Algorithm for Multi-agent Systems
    İrfan Gültekin, Ahmet Arslan
Minimax Fuzzy Q-Learning in Cooperative Multi-agent Systems
    Alper Kilic, Ahmet Arslan
A Component-Based, Reconfigurable Mobile Agent System for Context-Aware Computing
    Hsin-Ta Chiao, Ming-Chun Cheng, Yue-Shan Chang, Shyan-Ming Yuan
A FIPA-Compliant Agent Framework with an Extra Layer for Ontology Dependent Reusable Behaviour
    Rıza Cenk Erdur, Oğuz Dikenelli

Evolutionary Algorithms

Vibrational Genetic Algorithm (Vga) for Solving Continuous Covering Location Problems
    Murat Ermiş, Füsun Ülengin, Abdurrahman Hacıoğlu
Minimal Addition-Subtraction Chains Using Genetic Algorithms
    Nadia Nedjah, Luiza de Macedo Mourelle
Preserving Diversity through Diploidy and Meiosis for Improved Genetic Algorithm Performance in Dynamic Environments
    A. Sima Etaner-Uyar, A. Emre Harmanci
Ant Systems: Another Alternative for Optimization Problems?
    Tatyana Yakhno, Emine Ekin

System Programming

Augmenting Object Persistency Paradigm for Faster Server Development
    Erdal Kemikli, Nadia Erdogan
Power Conscious Disk Scheduling for Multimedia Data Retrieval
    Jungwan Choi, Youjip Won, S.W. Nam
Task Scheduling with Conflicting Objectives
    Haluk Topcuoglu, Can Sevilmis
A Fast Access Scheme to Meet Delay Requirement for Wireless Access Network
    Sung-Ho Hwang, Ki-Jun Han

New Information Technologies in Education

Dokuz Eylul University - Distance Education Utilities Model
    Alp Kut, Sen Çakir, Ender Yazgan Bulgun
Problem-Based Learning as an Example of Active Learning and Student Engagement
    Scott Grabinger, Joanna C. Dunlap
Use of PBL Method in Teaching IT to Students from a Faculty of Education: A Case Study
    Sevinç Gülseçen
Interval Matrix Vector Calculator – The iMVC 1.0
    Ayse Bulgak, Diliaver Eminov

Distributed and Parallel Data Processing

Efficient Code Deployment for Heterogeneous Distributed Data Sources
    Deniz Demir, Haluk Topcuoglu, Ergun Kavukcu
Efficient Parallel Modular Exponentiation Algorithm
    Nadia Nedjah, Luiza de Macedo Mourelle
Interprocedural Transformations for Extracting Maximum Parallelism
    Yu-Sug Chang, Hye-Jung Lee, Doo-Soon Park, Im-Yeong Lee
On Methods’ Materialization in Object-Relational Data Warehouse
    Bartosz Bebel, Zbyszko Królikowski, Robert Wrembel

Author Index
Preserving Aggregation in an Object-Relational DBMS

Johanna Wenny Rahayu 1 and David Taniar 2

1 Department of Computer Science and Computer Engineering, La Trobe University, Australia
[email protected]
2 School of Business Systems, Monash University, PO Box 63B, Clayton, Vic 3800, Australia
[email protected]
Abstract. Aggregation is an important concept in database design, where composite objects can be modelled during the design of database applications. Therefore, preserving the aggregation concept in database implementation is essential. In this paper, we propose models for the implementation of aggregation in an Object-Relational Database Management System (ORDBMS) through the use of index clusters and nested tables. An ORDBMS is a commercial Relational Database Management System (RDBMS), such as Oracle, that supports some object-oriented concepts. We also show how queries can be performed on index clusters and nested tables.
1 Introduction
Aggregation is a composition (part-of) relationship, in which a composite object (“whole”) consists of other component objects (“parts”) (Rumbaugh et al, 1991). The aggregation concept is a powerful tool in database design, and consequently, preserving aggregation in database implementation is essential.

Our previous work in this field was mainly on mapping object-oriented designs, including the mapping of aggregation, for direct implementation in a pure RDBMS (Rahayu, et al, 1996, 1999). The main reason was that the design had to be directly implementable in any commercial RDBMS without any modification to the relational kernel.

Over the last few years, there has been a growing trend for RDBMS vendors (such as Oracle, DB2, Informix, and others) to include some object-oriented concepts (e.g. object types, inheritance, collection types, composite objects) in their products. This is a way of acknowledging the importance of the object-oriented paradigm in database applications without sacrificing the foundation of relational theory. This new era marks the birth of Object-Relational Database Management Systems (ORDBMS) (Stonebraker and Moore, 1996).

At this early stage, the term ORDBMS is often interpreted differently by different people. In this paper, we refer to an ORDBMS as a commercial RDBMS that provides some support for object-oriented concepts, such as Oracle 9i (Muller, 2002). The aim of this paper is to propose models for preserving aggregation in an ORDBMS; in particular, we introduce the use of index clusters and nested tables in Oracle 9i.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 1–10, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Background: Mapping Aggregation to Relational DBMS – Our Previous Work
An object-relational transformation methodology transforms an object-oriented conceptual model into relational tables. Our previous work, outlined in this section, is a methodology for transforming the aggregation concept into relational tables (Rahayu, et al, 1996, 1999).

The first step in transforming an aggregation structure to relational tables is to create a table for each class in the aggregation hierarchy. The Object Identifier (OID) of each class becomes the Primary Key (PK) of the respective table. In addition to these tables, an aggregate table that stores the whole-part relationship is also created. The relationship between the "whole" and the "parts" is intermediated by the aggregate table. Since the aggregate table PK is a composite attribute comprising the PK of the "whole" and the "part" tables, the PK of the "whole" and each of the "parts" relate to the PK of the aggregate.

As an illustration, consider a PC class that consists of several part classes: Harddisk, Monitor, Keyboard, and CPU. Figure 1 shows an example of Create Table statements in SQL to create the aggregation structure using an RDBMS.

    CREATE TABLE PC (
        PCoid    VARCHAR(10) NOT NULL,
        Primary Key (PCoid));

    CREATE TABLE Aggregate (
        PCoid    VARCHAR(10) NOT NULL,
        PartOid  VARCHAR(10) NOT NULL,
        PartType VARCHAR(20),
        Primary Key (PCoid, PartOid),
        Foreign Key (PCoid) References PC (PCoid)
            ON Delete Cascade ON Update Restrict);

    CREATE TABLE Harddisk (
        HDOid    VARCHAR(10) NOT NULL,
        Primary Key (HDOid));

    CREATE TABLE Monitor (
        Moid     VARCHAR(10) NOT NULL,
        Primary Key (Moid));

    CREATE TABLE Keyboard (
        Koid     VARCHAR(10) NOT NULL,
        Primary Key (Koid));

    CREATE TABLE Cpu (
        CPUoid   VARCHAR(10) NOT NULL,
        Primary Key (CPUoid));

Fig. 1. Create Tables for an Aggregation
Notice from Figure 1 that, apart from a table for each class, an Aggregate table is also created. The PartType attribute of table Aggregate represents the type of the part class (i.e. Harddisk, Monitor, Keyboard, or CPU). In this example, the value of the PartType attribute is the full string. It could instead have been encoded as, for example, a single character, such as 'H' for 'Harddisk', 'M' for 'Monitor', and so on.

A PK-FK relationship between the Aggregate table and the part tables is conceptually desirable but not possible in the implementation. A PK-FK relationship must exist between one table (holding the FK) on one side and one table (holding the PK) on the other side; it is not possible to have an FK on one side and multiple PKs on the other. If PartOid were declared as an FK to every part table, each non-NULL PartOid value would have to match a value in each of the part PKs. The only solution to this problem is to remove the PK-FK constraints between all of the "part" tables and the Aggregate table. This means that PartOid in the Aggregate table is not an FK, although the value of a PartOid must be one of the PK values of the part tables. In this case, there is no referential integrity between the aggregate and the part tables. Further restrictions imposed by the insertion of new parts or the deletion of the whole object must be handled through stored procedures.

The referential integrity between the "whole" and the aggregate relationship (i.e. PC – Aggregate) depends on the type of the aggregation structure. If the aggregation is existence-independent, the deletion is a cascade: a deletion of a PC causes a deletion of the matching records in the aggregate table. This, however, does not cause a deletion of any related parts. If the aggregation is existence-dependent, the deletion is also a cascade. However, the cascade operation does not have any impact on the "parts". The update operation adopts a restrict update, as OIDs are not meant to be updated.
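The behaviour described above can be tried out in a small sketch. It is not the authors' implementation: it uses Python's built-in sqlite3 module instead of a commercial RDBMS, keeps only two of the four part tables for brevity, and replaces the stored procedure with a trigger (the trigger name check_part is hypothetical). It shows that PartOid is checked without being a foreign key, and that deleting a PC cascades to the Aggregate table but leaves the parts untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE PC       (PCoid TEXT PRIMARY KEY);
CREATE TABLE Harddisk (HDoid TEXT PRIMARY KEY);
CREATE TABLE Monitor  (Moid  TEXT PRIMARY KEY);
CREATE TABLE Aggregate (
    PCoid    TEXT NOT NULL REFERENCES PC(PCoid) ON DELETE CASCADE,
    PartOid  TEXT NOT NULL,      -- deliberately not a foreign key
    PartType TEXT NOT NULL,
    PRIMARY KEY (PCoid, PartOid)
);
-- Stand-in for the stored procedure (an assumption of this sketch):
-- reject a PartOid that is unknown in the part table named by PartType.
CREATE TRIGGER check_part BEFORE INSERT ON Aggregate
BEGIN
    SELECT RAISE(ABORT, 'unknown part')
    WHERE NOT (
        (NEW.PartType = 'Harddisk' AND EXISTS
            (SELECT 1 FROM Harddisk WHERE HDoid = NEW.PartOid))
        OR
        (NEW.PartType = 'Monitor' AND EXISTS
            (SELECT 1 FROM Monitor WHERE Moid = NEW.PartOid))
    );
END;
""")

conn.execute("INSERT INTO PC VALUES ('PC001')")
conn.execute("INSERT INTO Harddisk VALUES ('HD001')")
conn.execute("INSERT INTO Aggregate VALUES ('PC001', 'HD001', 'Harddisk')")


def try_insert(pcoid, partoid, parttype):
    """Return True if the Aggregate row was accepted, False if rejected."""
    try:
        conn.execute("INSERT INTO Aggregate VALUES (?, ?, ?)",
                     (pcoid, partoid, parttype))
        return True
    except sqlite3.IntegrityError:
        return False


# No monitor 'M999' exists, so the trigger rejects the link.
rejected = not try_insert("PC001", "M999", "Monitor")

# Deleting the "whole" cascades to Aggregate but not to the part table.
conn.execute("DELETE FROM PC WHERE PCoid = 'PC001'")
links_left = conn.execute("SELECT COUNT(*) FROM Aggregate").fetchone()[0]
parts_left = conn.execute("SELECT COUNT(*) FROM Harddisk").fetchone()[0]
```

After the deletion, the Aggregate table is empty while the Harddisk row survives, which matches the cascade behaviour described in the text.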
The cardinality of the relationship between the "whole" and the aggregate (e.g. PC and Aggregate) is one-to-many. However, for the relationship between a "part" and the aggregate table, the cardinality is determined by the type of aggregation structure. If the aggregation structure is exclusive (i.e. non-sharable), the cardinality is one-to-one, but if it is non-exclusive (i.e. sharable), the cardinality is many-to-one (aggregate-part).

An aggregation hierarchy, where one "whole" class consists of several "part" classes, can be repeated over several levels of composition or aggregation. In an aggregation hierarchy involving multiple levels, each level is treated as a one-level aggregation as described above. In addition, a composite table that captures the overall aggregation hierarchy is created. This composite table consists of several attributes, including the OID of the "whole" and a pair of PartOid and PartType attributes for each level of aggregation. The composite table structure is shown in Figure 2.

    CREATE TABLE Composite (
        PCoid      VARCHAR(10) NOT NULL,
        PartOid_1  VARCHAR(10) NOT NULL,
        PartType_1 VARCHAR(20),
        PartOid_2  VARCHAR(10) NOT NULL,
        PartType_2 VARCHAR(20));

Fig. 2. Composite table to hold a multi-level aggregation
Details of our previous work on the transformation of an aggregation hierarchy (i.e. one-level and multiple-level aggregation) to relational tables (using a commercially available RDBMS) can be found in Rahayu et al (1996, 1999).
3 The Proposed Models
In this section, we propose an implementation model for aggregation in an ORDBMS. In an ORDBMS (i.e. Oracle version 9), new structures are in place to support object-oriented concepts. These did not exist in the original versions of RDBMS, simply because they do not comply with the relational concepts. The structures that we use to represent aggregation are: (i) the index clustering technique, and (ii) the nesting technique. In our proposed models, we show how to preserve aggregation concepts using these two structures.

3.1 Using Index Clustering Techniques
In this section, we utilize an index clustering technique available in Oracle 9i in order to preserve an aggregation structure. Figure 3 shows the SQL statements that implement the aggregation.

    CREATE CLUSTER HD_Cluster (
        HDoid VARCHAR2(10));

    CREATE TABLE Harddisk (
        HDoid    VARCHAR2(10) NOT NULL,
        Capacity VARCHAR2(20),
        Primary Key (HDoid))
        Cluster HD_Cluster(HDoid);

    CREATE TABLE HD_Contr (
        HDoid       VARCHAR2(10) NOT NULL,
        HD_Contr_id VARCHAR2(10) NOT NULL,
        Description VARCHAR2(25),
        Primary Key (HDoid, HD_Contr_id),
        Foreign Key (HDoid) References Harddisk (HDoid))
        Cluster HD_Cluster(HDoid);

    CREATE INDEX HD_Cluster_Index ON Cluster HD_Cluster;

Fig. 3. Implementation of a homogeneous aggregation relationship using an Index Cluster
It is clear from the implementation shown in Figure 3 that the index clustering technique supports the existence-dependent type of aggregation only, where the existence of the part object is dependent upon the whole object. It is not possible to have a harddisk controller ("part" object) that does not belong to a harddisk ("whole" object). This is enforced by the existence of the cluster key in all the "part" tables. Moreover, the example also shows a sharable aggregation type, where each "part" object can be owned by more than one "whole" object. For example, the harddisk controller with OID 'HDC1' may belong to harddisk 'HD1' as well as to harddisk 'HD2'.

Depending on the situation, the sharable type may not be desirable. We can enforce a non-sharable type of aggregation by creating a single-attribute primary key for the "part" object and treating the cluster key as a foreign key rather than as part of the primary key. Figure 4 shows an implementation of the example in Figure 3 as a non-sharable type (the implementation of the cluster and the cluster index remains the same).

    CREATE TABLE Harddisk (
        HDoid    VARCHAR2(10) NOT NULL,
        Capacity VARCHAR2(20),
        Primary Key (HDoid))
        Cluster HD_Cluster(HDoid);

    CREATE TABLE HD_Contr (
        HDoid       VARCHAR2(10) NOT NULL,
        HD_Contr_id VARCHAR2(10) NOT NULL,
        Description VARCHAR2(25),
        Primary Key (HD_Contr_id),
        Foreign Key (HDoid) References Harddisk (HDoid))
        Cluster HD_Cluster(HDoid);

Fig. 4. Implementation of a homogeneous aggregation relationship, non-sharable type
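The effect of the primary key choice in Figures 3 and 4 can be checked with a minimal sketch. It assumes Python's built-in sqlite3 module rather than Oracle, so the CLUSTER clause and the cluster index are omitted; only the key constraints are reproduced. The helper names controller_table, can_share and orphan_allowed are hypothetical.

```python
import sqlite3


def controller_table(primary_key):
    """Build Harddisk and HD_Contr with the given primary key on HD_Contr."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement
    conn.executescript(f"""
    CREATE TABLE Harddisk (HDoid TEXT PRIMARY KEY, Capacity TEXT);
    CREATE TABLE HD_Contr (
        HDoid       TEXT NOT NULL REFERENCES Harddisk(HDoid),
        HD_Contr_id TEXT NOT NULL,
        Description TEXT,
        PRIMARY KEY ({primary_key})
    );
    INSERT INTO Harddisk VALUES ('HD1', '2GB'), ('HD2', '6GB');
    INSERT INTO HD_Contr VALUES ('HD1', 'HDC1', 'controller');
    """)
    return conn


def can_share(conn):
    """Try to attach controller HDC1 to a second harddisk, HD2."""
    try:
        conn.execute("INSERT INTO HD_Contr VALUES ('HD2', 'HDC1', 'controller')")
        return True
    except sqlite3.IntegrityError:
        return False


def orphan_allowed(conn):
    """Try to store a controller whose harddisk does not exist."""
    try:
        conn.execute("INSERT INTO HD_Contr VALUES ('HD9', 'HDC9', 'controller')")
        return True
    except sqlite3.IntegrityError:
        return False


sharable = can_share(controller_table("HDoid, HD_Contr_id"))   # key as in Figure 3
non_sharable = can_share(controller_table("HD_Contr_id"))      # key as in Figure 4
orphan = orphan_allowed(controller_table("HD_Contr_id"))       # existence dependence
```

With the composite key the second owner is accepted, with the single-attribute key it is rejected, and in both designs a controller cannot exist without its harddisk.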
Each time a record is inserted into the "part" table (i.e. HD_Contr), the value of the cluster key (HDoid) is stored only once for all rows that share it. The rows of the "whole" table (i.e. Harddisk) and the rows of the "part" table (HD_Contr) are actually stored together physically (see Figure 5). The index is created in order to enhance the performance of the cluster storage.

[Figure: cluster storage with columns HDoid, Capacity, HD_Contr_id and Description; harddisks HD001 (2GB) and HD002 (6GB) are stored together with controller rows Contr11, Contr22, Contr12, Contr13 and Contr14]

Fig. 5. Physical storage of aggregation relationship using index cluster
It is also possible to use a cluster method to implement an aggregation relationship between a "whole" object and a number of "part" objects. Figure 6 demonstrates the implementation of an aggregation between a PC and its Harddisk, Monitor, Keyboard, and CPU.

    CREATE CLUSTER PC_Cluster (
        PCoid VARCHAR2(10));

    CREATE TABLE PC (
        PCoid VARCHAR2(10) NOT NULL,
        Type  VARCHAR2(20),
        Primary Key (PCoid))
        Cluster PC_Cluster(PCoid);

    CREATE TABLE Harddisk (
        PCoid    VARCHAR2(10) NOT NULL,
        HDoid    VARCHAR2(10) NOT NULL,
        Capacity VARCHAR2(20),
        Primary Key (PCoid, HDoid),
        Foreign Key (PCoid) References PC (PCoid))
        Cluster PC_Cluster(PCoid);

    CREATE TABLE Monitor (
        PCoid      VARCHAR2(10) NOT NULL,
        Moid       VARCHAR2(10) NOT NULL,
        Resolution VARCHAR2(25),
        Primary Key (PCoid, Moid),
        Foreign Key (PCoid) References PC (PCoid))
        Cluster PC_Cluster(PCoid);

    CREATE TABLE Keyboard (
        PCoid VARCHAR2(10) NOT NULL,
        Koid  VARCHAR2(10) NOT NULL,
        Type  VARCHAR2(25),
        Primary Key (PCoid, Koid),
        Foreign Key (PCoid) References PC (PCoid))
        Cluster PC_Cluster(PCoid);

    CREATE TABLE Cpu (
        PCoid  VARCHAR2(10) NOT NULL,
        CPUoid VARCHAR2(10) NOT NULL,
        Speed  VARCHAR2(10),
        Primary Key (PCoid, CPUoid),
        Foreign Key (PCoid) References PC (PCoid))
        Cluster PC_Cluster(PCoid);

    CREATE INDEX PC_Cluster_Index ON Cluster PC_Cluster;

Fig. 6. Implementation of an aggregation relationship with multiple "part" objects

    "whole" id   "whole" attribute   "part" id     "part" attribute
    PC001        ...                 Harddisk01    ...
                                     Harddisk02    ...
                                     Monitor01     ...
                                     Keyboard01    ...
                                     Cpu01         ...
    PC002        ...                 Harddisk03    ...
                                     Monitor02     ...
                                     Keyboard02    ...
                                     Cpu02         ...

Fig. 7. Physical storage of multiple aggregation relationships using an index cluster technique
3.2 Using Nested Tables
Another technique for the implementation of aggregation in Oracle is the use of nested tables. In this nesting technique, as in clustering, the "part" information is tightly coupled with the information of the "whole" object. This enforces the existence-dependent type of aggregation, where the existence of a "part" object is fully dependent on the "whole" object. If the data of the "whole" object is removed, all associated "part" objects are removed as well. For this reason, the nested table technique is only suitable for existence-dependent aggregation.

Figure 8 describes the link between the "whole" and the "part" table in a nesting structure, whereas Figure 9 shows the implementation of the homogeneous aggregation using the nested table technique. Note that there is no concept of a primary key or an integrity constraint in the "part" nested table shown in Figure 9. For example, if a particular harddisk controller is used by another harddisk from the "whole" table, then all the details of the harddisk controller are written again as a separate record within the nested table.

[Figure: a "Whole" table with columns HD_id, Capacity and Controller (reference), holding HD001 (2GB) and HD002 (6GB), linked to a "Part" nested table with columns HD_Contr_id and Description, holding Contr11, Contr22, Contr12, Contr13 and Contr14]

Fig. 8. Aggregation relationships using nested table

    CREATE OR REPLACE TYPE HD_Contr AS OBJECT (
        HD_Contr_id VARCHAR2(10),
        description VARCHAR2(30));
    /
    CREATE OR REPLACE TYPE HD_Contr_Table AS TABLE OF HD_Contr
    /
    CREATE TABLE Hard_Disk (
        HD_id      VARCHAR2(10) NOT NULL,
        capacity   VARCHAR2(20),
        controller HD_Contr_Table,
        Primary Key (HD_id))
        NESTED TABLE controller STORE AS HD_Contr_tab;

Fig. 9. Implementation of aggregation relationships using nested table
Figure 9 shows that we do not create a standard table for the HD Controller. We only need to define an HD Controller type, and declare it as a nested table later when we create the Hard_Disk table. The figure also shows that the information of the nested table is stored externally in a table called HD_Contr_tab. This is not a standard table: no additional constraints can be attached to it, and it cannot be accessed directly without going through the Hard_Disk table.

In the nesting technique, every "whole" object can own any "part" object, even if that particular "part" is already owned by another "whole" object. The record of the HD Controller object is simply repeated every time a harddisk claims to own it. This gives a sharable type of aggregation, where a particular "part" object can be shared by more than one "whole" object. Because there is no standard table for the HD Controller, we cannot define a primary key on it, which is what we usually employ to enforce a non-sharable type of aggregation (see the clustering technique in Section 3.1).
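The sharing-by-repetition behaviour can be pictured with a rough in-memory analogy. This is plain Python, not Oracle code: each HardDisk object carries its own list of controller records, which stands in for the nested table column, and the class and field names are assumptions chosen to mirror Figure 9.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HDContr:
    """Plays the role of the HD_Contr object type."""
    hd_contr_id: str
    description: str


@dataclass
class HardDisk:
    """Plays the role of a Hard_Disk row; controller is the nested table."""
    hd_id: str
    capacity: str
    controller: List[HDContr] = field(default_factory=list)


disks = {
    "HD001": HardDisk("HD001", "2GB"),
    "HD002": HardDisk("HD002", "6GB"),
}

# "Sharing" Contr11 means writing its record once per owning harddisk.
disks["HD001"].controller.append(HDContr("Contr11", "shared controller"))
disks["HD002"].controller.append(HDContr("Contr11", "shared controller"))

# The two records are independent copies: editing one leaves the other alone.
disks["HD001"].controller[0].description = "edited copy"
copies = [c for d in disks.values() for c in d.controller
          if c.hd_contr_id == "Contr11"]

# Removing the "whole" removes its nested "part" records with it.
del disks["HD001"]
remaining = [c.description for d in disks.values() for c in d.controller]
```

Nothing in this structure stops the same controller identifier from appearing under several harddisks, which is the reason a non-sharable aggregation cannot be enforced this way.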
4 Queries Using Clustering and Nesting Techniques
The previous section described our proposed models for preserving aggregation through the use of index clusters and nested tables. In this section, we show how queries can be performed on these two data structures.

4.1 Queries Using Clustering Techniques
Queries on aggregation structures are normally categorized as "part queries" and "whole queries". A part query is a query on an aggregation hierarchy that retrieves information from the "part" classes, where the selection predicates originate at the "whole" class. A whole query, on the other hand, retrieves information from the "whole" class, where the selection predicates originate at the "part" class. The SQL structures for part queries and whole queries using a clustering technique are shown in Figures 10 and 11, respectively. Note that in the clustering technique, the queries that access data along an aggregation hierarchy are simply standard queries that join the whole table with its associated part tables.

    SELECT
    FROM
    WHERE
    AND

Fig. 10. Part query structure using a clustering technique

    SELECT
    FROM
    WHERE

Fig. 11. Whole query structure using a clustering technique
An example of a part query is to display the harddisk number and capacity of a given PC (e.g. PC ID = 'PC001'). The SQL statement for this query is shown in Figure 12.

    SELECT H.HDoid, H.Capacity
    FROM Harddisk H, PC
    WHERE H.PCoid = PC.PCoid
    AND PC.PCoid = 'PC001';

Fig. 12. Part query example using a clustering technique
An example of a whole query is to display details of the PCs that have a harddisk with a capacity of less than 2 GB. The SQL statement for this query is shown in Figure 13.

    SELECT PC.PCoid, PC.Type
    FROM PC, Harddisk H
    WHERE PC.PCoid = H.PCoid
    AND H.Capacity < 2;

Fig. 13. Whole query example using a clustering technique
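The two join queries of Figures 12 and 13 can be run on a small data set as a sketch. It assumes Python's built-in sqlite3 module instead of Oracle, so no cluster is created and only the joins are exercised. Two further assumptions of the sketch: the capacity is held in a numeric column named CapacityGB so that the comparison with 2 is well defined, and DISTINCT is added to the whole query so that a PC with several matching harddisks is listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PC (PCoid TEXT PRIMARY KEY, Type TEXT);
CREATE TABLE Harddisk (
    PCoid      TEXT NOT NULL REFERENCES PC(PCoid),
    HDoid      TEXT NOT NULL,
    CapacityGB REAL,               -- numeric column, an assumption of this sketch
    PRIMARY KEY (PCoid, HDoid)
);
INSERT INTO PC VALUES ('PC001', 'Desktop'), ('PC002', 'Server');
INSERT INTO Harddisk VALUES
    ('PC001', 'HD01', 1.0),
    ('PC001', 'HD02', 4.0),
    ('PC002', 'HD03', 6.0);
""")

# Part query: the predicate is on the "whole" (PC), the result is "part" data.
part_rows = conn.execute("""
    SELECT H.HDoid, H.CapacityGB
    FROM Harddisk H, PC
    WHERE H.PCoid = PC.PCoid AND PC.PCoid = ?
    ORDER BY H.HDoid
""", ("PC001",)).fetchall()

# Whole query: the predicate is on the "part" (Harddisk), the result is PC data.
whole_rows = conn.execute("""
    SELECT DISTINCT PC.PCoid, PC.Type
    FROM PC, Harddisk H
    WHERE PC.PCoid = H.PCoid AND H.CapacityGB < ?
""", (2,)).fetchall()
```

Both queries have the same join; only the side carrying the selection predicate and the side being projected are swapped.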
In the clustering technique, whole queries are implemented in a very similar manner to part queries.

4.2 Queries Using Nesting Techniques
The query structures for part and whole queries are shown in Figures 14 and 15, respectively.
SELECT
FROM THE (SELECT “whole” class nested table attribute
          FROM
          WHERE )
Fig. 14. Part query structure using a nesting technique
SELECT
FROM
WHERE
Fig. 15. Whole query structure using a nesting technique
In the nesting technique, nested tables are referred to in queries using the keyword THE. Taking the same part query as in Figure 12, the equivalent SQL using a nested table is shown in Figure 16.
SELECT HDoid, Capacity
FROM THE (SELECT PC.Harddisk
          FROM PC
          WHERE PC.PCoid = 'PC001');
Fig. 16. Part query example using a nesting technique
Similarly, taking the whole query of Figure 13, Figure 17 shows an equivalent whole query using a nesting technique.
SELECT PC.PCoid, PC.Type
FROM PC, TABLE (PC.Harddisk) H
WHERE H.Capacity < 2;
Fig. 17. Whole query example using a nesting technique
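To make the difference from the join-based queries concrete, the nested structure can be imitated outside the database. The sketch below is an in-memory Python analogue only, assuming each PC record carries its harddisks as a nested list; the sample values and the function names part_query and whole_query are hypothetical and are not part of the proposed model.

```python
# Hypothetical stand-in for a PC table whose Harddisk attribute is a nested table.
pcs = [
    {"PCoid": "PC001", "Type": "Desktop",
     "Harddisk": [{"HDoid": "HD01", "Capacity": 1.5},
                  {"HDoid": "HD02", "Capacity": 4.0}]},
    {"PCoid": "PC002", "Type": "Laptop",
     "Harddisk": [{"HDoid": "HD03", "Capacity": 8.0}]},
]

def part_query(pcs, pc_oid):
    """Analogue of Fig. 16: pick the nested table of one PC, then read its rows."""
    return [(hd["HDoid"], hd["Capacity"])
            for pc in pcs if pc["PCoid"] == pc_oid
            for hd in pc["Harddisk"]]

def whole_query(pcs, max_capacity):
    """Analogue of Fig. 17: unnest each PC's harddisks and filter on the part."""
    return [(pc["PCoid"], pc["Type"])
            for pc in pcs
            for hd in pc["Harddisk"]
            if hd["Capacity"] < max_capacity]

print(part_query(pcs, "PC001"))
print(whole_query(pcs, 2))
```

No join condition is needed in either function, because the parts are reached through the whole that owns them.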
5 Conclusions
In this paper we have shown how aggregation hierarchies in a database design that uses an object-oriented paradigm can be preserved in an implementation that uses an Object-Relational Database Management System (i.e. Oracle 9i or above). Preserving aggregation hierarchies at the implementation stage of database development is important in order to maintain the object-oriented concepts found at the design stage. Our proposed models make use of two features, namely index clusters and nested tables, for maintaining aggregation. We have shown the DDL (Data Definition Language) for these two features. We have also presented how queries can be performed on these two data structures. We illustrated aggregate queries by focusing on part queries and whole queries, the two main kinds of query involving aggregation hierarchies.
References
1. Muller, R.J., Oracle 9i Complete: A Comprehensive Reference to Oracle 9, 2002.
2. Rahayu, J.W., Chang, E., Dillon, T.S., and Taniar, D., “Aggregation versus association in object modelling and databases”, Proceedings of the Australasian Conference on Information Systems, Hobart, Tasmania, 1996.
3. Rahayu, J.W., Chang, E., and Dillon, T.S., “Composite Indices as a Mechanism for Transforming Multi-Level Composite Objects into Relational Databases”, The OBJECT Journal, 5(1), 1999.
4. Rahayu, J.W., Chang, E., Dillon, T.S., and Taniar, D., “Performance Evaluation of the Object-Relational Transformation Methodology”, Data and Knowledge Engineering, 38(3), pp. 265–300, 2001.
5. Rumbaugh, J. et al., Object-Oriented Modelling and Design, Prentice-Hall, 1991.
6. Stonebraker, M. and Moore, D., Object-Relational DBMSs: The Next Great Wave, Morgan Kaufmann, 1996.
Database Compression Using an Offline Dictionary Method
Abu Sayed Md. Latiful Hoque, Douglas McGregor, and John Wilson
Department of Computer and Information Sciences, University of Strathclyde, 26, Richmond St, Glasgow, G1 1XH, UK
{Latiful.Hoque,jnw}@cis.strath.ac.uk
[email protected]
Abstract. Off-line dictionary compression is becoming more attractive for applications where compressed data are searched directly in compressed form. While there has been a large body of related work describing specific database compression algorithms, the Hibase [10] architecture is unique in processing queries on compressed data. However, this technique does not compress the representation of strings in the domain dictionaries. Primary keys, data with high cardinality and semi-structured data achieve very little or no compression. To achieve high performance irrespective of the type of data, the string representation must be in compressed form. At the same time, the direct addressability of compressed data must be maintained, so serial compression techniques cannot be used. In this paper, we present a prefix dictionary-based off-line method that can be incorporated into systems like Hibase, where compressed data can be accessed directly without prior decompression. The complexity is O(n) in time and space.
1 Introduction
Since Shannon’s seminal work [12] it has been known that the amount of information is not synonymous with the volume of data. Insight depends on information, not on its data representation. In fact, the volume of information may be regarded as the volume of its theoretically minimal representation. There are general benefits if the data representation is concise. Reducing the volume of data without losing any information is known as loss-less data compression. This is potentially attractive for two reasons: reduction of storage cost and performance improvement. Performance improves because the smaller volume of compressed data may be accommodated in faster memory than its uncompressed counterpart, and because less data needs to be transferred and/or processed to effect any particular operation. A further significant factor is now arising in distributed applications using mobile and wireless communications: low bandwidth is a performance bottleneck and data transfers may be costly. Both of these factors make a data compression architecture exceedingly worthwhile.
The Hibase architecture by Cockshott, McGregor and Wilson [10] is a model of data representation based on information theory. Their technique does not compress the representation of strings in the domain dictionaries. That is why primary keys and data with high cardinality achieve very little or no compression. Semi-structured data such as text fields behave like primary keys in a relational database. To achieve high performance in HIBASE irrespective of the cardinality or type of data, the string representation of the domain dictionaries must be in compressed form. In this paper we present a prefix dictionary-based loss-less data compression scheme that can be incorporated into any database system (like HIBASE) where compressed data can be accessed directly without prior decompression. Decompression is done only when query output is generated. The scheme can be applied to data ranging from highly structured relational data to semi-structured text or XML data. The basis of the approach and the formal definition of the problem are described in section 2. The state of the art is given in section 3. The off-line prefix dictionary method, the derivation of redundant longest disjoint prefixes, the augmentation of prefixes and the deletion of covered prefixes are described in section 4. The implementation and the analysis of complexity are given in section 5. The results, discussion and conclusion are presented in sections 6 and 7.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 11–20, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Basis of the Approach
In the relational approach, the database is a set of relations (tables). A relation is a set of tuples. A tuple (row) in relation r represents a relationship among a set of values. The corresponding values of each tuple belong to a domain, for which there is a set of permitted values. If the domains are $D_1, D_2, \ldots, D_n$ respectively, then relation r is defined as a subset of the Cartesian product of the domains:
$r \subseteq D_1 \times D_2 \times \ldots \times D_n$
Thus, considered as a table, it is divided into rows and columns. Conventionally the rows represent tuples and the columns represent domains. Queries are answered by applying the operations of the relational algebra, usually as embodied in a relational calculus-based language such as SQL. In conventional implementations, relations are stored with the tuples implemented as records that directly map the input form of the information. Though other arrangements are possible, this is the simplest and we shall adopt it for ease of explanation. It is the technique used in IBM’s DB2. A fixed maximum size of storage is allocated to represent each record field. The database system must therefore allocate sufficient storage space for the largest storage representation required by any tuple in the relation, resulting in a considerable waste of storage for all but the extreme field values. For example, to store the relation in fig. 1, we have to make each field of the records big enough to hold the largest value occurring in that field. When database designers do not know exactly how large the individual values are, they must err on the side of caution and make the fields longer than is strictly necessary. For example, if the designer has heard that this year the fashionable colour for cars will be 'vermillion', the designer might specify a width of 10 for car color. With each tuple occupying 35 bytes (first name: 10, surname: 10, dept: 5 and car color: 10), the total space occupied by the relation’s 10 tuples will be 350 bytes.

Uncompressed table                          Compressed table
First Name  Surname  Dept   Car Color       (identifiers)
Billy       Jones    Sales  Red             0 0 0 0
Ann         Smith    Sales  Red             1 1 0 0
John        Fraser   Sales  Green           2 2 0 1
Gordon      Adams    Tech   Blue            3 3 1 2
Jane        Brown    Sales  Blue            4 4 0 2
Ann         Jones    Tech   Red             1 0 1 0
Tim         Wilder   Tech   Red             5 5 1 0
Peter       Brown    Sales  Green           6 4 0 1
Craig       Smith    Tech   Blue            7 1 1 2
John        Mcleod   Tech   Red             2 6 1 0

Dictionary
Id  First Name  Surname  Dept   Car Color
0   Billy       Jones    Sales  Red
1   Ann         Smith    Tech   Green
2   John        Fraser          Blue
3   Gordon      Adams
4   Jane        Brown
5   Tim         Wilder
6   Peter       Mcleod
7   Craig

Fig. 1. An uncompressed and compressed table
The approach to compression adopted by the Hibase architecture involves the creation of domain dictionaries and the reduction of attribute values to minimal tokens. Queries can then be resolved against the tokenised database. Domain dictionaries are produced by generating minimal bit-string tokens to represent dictionary strings. Tuples are compressed by replacing the original field values of the relation by identifiers. The range of the identifiers need only be sufficient to distinguish unambiguously which string of the domain dictionary is indicated. For example, as there are eight First Names, only the eight identifiers 0-7 inclusive are required. This range can be represented by a 3-bit binary number. Hence in the compressed table each tuple requires only 3 bits for First Name, 3 bits for Surname, 1 bit for Department and 2 bits for Car Colour: a total of 9 bits instead of the 280 (35 × 8) bits per tuple of the uncompressed relation. This compresses the table itself by a factor of over 30. We must, however, take account of the space occupied by the domain dictionaries and also by indexes to arrive at a final estimate of the overall compression. Typically, a proportion of domains are present in several relations, which reduces the dictionary overhead by sharing it between different tables. In practice, the early prototype Hibase achieved overall compression factors of between 8 and 10 compared to conventional DBMS systems [2,10]. The fact that, within a domain, a specific identifier always refers to the same field value enables many operations to be carried out by processing only the table component of the data, ignoring the dictionaries until string values are essential (e.g. for output). Queries are carried out directly on the compressed data, without requiring any decompression during processing. The query is translated to compressed form and then processed directly against the compressed form of the relational data.
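The tokenisation described above can be illustrated with a short sketch. The following Python code is not the Hibase implementation; it assumes identifiers are assigned in order of first appearance (which reproduces the dictionary of fig. 1) and uses the rows of fig. 1 as data. The helper names build_dictionaries and bits_needed are hypothetical.

```python
from math import ceil, log2

columns = ["First Name", "Surname", "Dept", "Car Color"]
rows = [
    ("Billy", "Jones", "Sales", "Red"),
    ("Ann", "Smith", "Sales", "Red"),
    ("John", "Fraser", "Sales", "Green"),
    ("Gordon", "Adams", "Tech", "Blue"),
    ("Jane", "Brown", "Sales", "Blue"),
    ("Ann", "Jones", "Tech", "Red"),
    ("Tim", "Wilder", "Tech", "Red"),
    ("Peter", "Brown", "Sales", "Green"),
    ("Craig", "Smith", "Tech", "Blue"),
    ("John", "Mcleod", "Tech", "Red"),
]

def build_dictionaries(rows, n_cols):
    """One dictionary per column: value -> identifier, in order of first appearance."""
    dicts = [{} for _ in range(n_cols)]
    for row in rows:
        for d, value in zip(dicts, row):
            d.setdefault(value, len(d))
    return dicts

def bits_needed(n_values):
    """Minimal number of bits that can distinguish n_values identifiers."""
    return max(1, ceil(log2(n_values)))

dicts = build_dictionaries(rows, len(columns))
encoded = [tuple(d[v] for d, v in zip(dicts, row)) for row in rows]
widths = [bits_needed(len(d)) for d in dicts]

# Query on compressed data: translate the constant once, then compare tokens only.
tech = dicts[2]["Tech"]
tech_rows = [i for i, row in enumerate(encoded) if row[2] == tech]

print(widths, sum(widths))   # bits per column and bits per tuple
print(encoded[7])            # token form of (Peter, Brown, Sales, Green)
print(tech_rows)             # row numbers whose Dept token equals the token of "Tech"
```

The selection on Dept never touches a string after the constant has been translated, which is the property the text relies on for processing queries in compressed form.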
Since this requires the manipulation of fewer bits, it is inherently more efficient than the conventional alternative of processing an uncompressed query against uncompressed data. The final answer has to be converted from the internal identifier form to a normal uncompressed representation. However, the computational cost of this decompression is incurred only for the tuples that are returned in the result, normally a small fraction of those processed.
We can define the problem formally as follows. We have an uncompressed string heap $S_{uc}$ containing $n$ lexemes $L$ over the alphabet $\Sigma$, so $S_{uc} = \{L_1, L_2, L_3, \ldots, L_n\}$ with $L_i \in \Sigma^*$. The average length of a lexeme is $w$, so the length of the uncompressed dictionary heap is $nw$. A word has to be selected from $S_{uc}$ such that its encoding yields the greatest contraction of the heap, while preserving $L = Decode(Encode(L))$. To achieve these goals we first find a general solution to the problem and then apply the above restriction. Let us give an example. We have a string S = “abcabcdabcdefabcdefgabca”. We have to find the set of all longest disjoint sub-strings $Sub$ such that $(Sub = S_{dsj-substr}(S)) \wedge (f_{count}(Sub) \ge f_t)$, where $f_{count}(Sub)$ is the frequency count of $Sub$ in the string S and $f_t$ is the threshold frequency. The threshold frequency is defined as the minimum frequency of a sub-string in S that offers compression, taking into account the storage space of the off-line dictionary and the encode vector. In the above example, if we take the initial length of $Sub$ as 3 and the threshold frequency as 2, then the set of sub-strings over the threshold is {abc, bcd, bca, cde, def}, with frequency counts {5, 3, 2, 2, 2} respectively. The disjoint set is {abc, def}. The longest disjoint sub-string set is {abca, abcdef}. These longest sub-strings form the off-line dictionary. We have to find the set of all disjoint sub-strings in linear time and space, and a storage mechanism using the off-line dictionary has to be developed.
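The first step of the example, counting the length-3 sub-strings of S and keeping those that reach the threshold frequency, can be checked with a few lines of Python. This is a minimal sketch of the counting step only; it counts overlapping occurrences, does not attempt the disjointness or longest-prefix selection, and the function names are hypothetical.

```python
from collections import Counter

def substring_counts(s, length):
    """Count every (possibly overlapping) sub-string of the given length."""
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))

def over_threshold(s, length, threshold):
    """Keep only the sub-strings whose frequency count reaches the threshold."""
    counts = substring_counts(s, length)
    return {sub: c for sub, c in counts.items() if c >= threshold}

S = "abcabcdabcdefabcdefgabca"
frequent = over_threshold(S, 3, 2)

# Highest count first, ties in alphabetical order.
print(sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])))
```

For this input the sketch reports abc with count 5, bcd with count 3, and bca, cde and def with count 2 each, which agrees with the prefix list used in the walk-through of Section 4.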
3 State of the Art
Westmann et al. [15] have shown how compression can be integrated into a relational database system. They present a light-weight compression method that improves the response time of queries. Cormack [4] has described a data compression technique for database systems based on context-dependent Huffman coding [1]; it is part of IBM’s Information Management System. The algorithm of Goldstein et al. [5] is especially effective for records with many low to medium cardinality fields and numeric fields. Bentley and McIlroy [7] have extended the Lempel-Ziv idea [16,14] to represent long common strings that may appear far apart in the input string. The most recent work on off-line dictionary compression is that of Larsson and Moffat [8]. Their method, Re-Pair, consists of recursively replacing the most frequent pair of symbols in the source message by a new symbol. The algorithm is simple and its complexity in time and space is O(n). However, an important drawback of the method is that during the encoding phase, processing must temporarily halt for a compaction phase.
Most related work on database compression has focused on the development of new compression algorithms or on the evaluation of existing compression techniques. Our work differs in the following respects. We are interested in showing how optimal compression (e.g., off-line dictionary methods) can be applied to databases with the same performance as light-weight database compression [15]. We can access the ith tuple in compressed form without accessing any other tuple in the compressed database. Queries can be processed on compressed data using the compressed form of the query. Only the required tuple needs to be decompressed, without decompressing any other tuple. The domain dictionaries can be in either compressed or uncompressed form.
4 Offline Prefix Dictionary Compression
Dictionary-based modelling is a mechanism used in many practical compression schemes. In off-line dictionary compression, the full message (or a large block of it) is used to infer a complete dictionary in advance, and a representation of the dictionary is included as part of the compressed message. Intuitively, the advantage of this off-line approach to dictionary-based compression is that, with access to all of the message, it is possible to optimize the choice of phrases so as to maximize compression performance. Any part of the compressed message can be decompressed without prior decompression of other parts of the message. In off-line prefix dictionary compression, we need to derive all the longest redundant prefixes of the uncompressed string. The amount of compression yielded by a scheme can be measured by the compression ratio, which has been defined in several ways. Cappellini [3] has defined this value as C = (Average message length)/(Average codeword length). Rubin [11] has defined it as $C = (S - O - OR)/S$, where $S$ represents the length of the source string, $O$ the length of the coded message, and $OR$ the size of the output representation. We have adopted Rubin’s definition because in the off-line dictionary method, the encoder must transmit the dictionary explicitly with the coded message to the decoder. Derivation of phrases is very important in off-line compression. The derivation depends on the length of the phrase, the frequency of occurrence of the phrase in the uncompressed string and the representation model of the off-line dictionary. The minimum frequency of occurrence of a phrase that offers compression is referred to as the threshold frequency. Let the frequency of a particular phrase be $f$ and the size of the phrase be $S_p$. The total amount of space occupied by the phrase in the uncompressed string is $f \times S_p$. Let the size of the string in the dictionary be $S_d$ and the size of the code for the phrase be $t$.
The equivalent size of the phrase occurrences in compressed form is $f \times t + S_d$. The compression ratio is $C = 1 - (f \times t + S_d)/(f \times S_p)$. To achieve compression, the ratio must be greater than zero, so $C = 1 - (f \times t + S_d)/(f \times S_p) > 0$, that is, $f$ must be greater than $S_d/(S_p - t)$. This is called the threshold condition. The threshold frequency is the smallest integer that satisfies this condition, $f_t = \lfloor S_d/(S_p - t) \rfloor + 1$, both when $t$ is an exact multiple of bytes and when it is not. The selection of the disjoint prefixes that are most economical for compression needs multiple passes. Though Lempel-Ziv algorithms parse strings in a single pass, the code is not optimal [13]. We have developed a two-pass greedy algorithm that works as shown in fig. 2. The first pass constructs the dictionary and the second pass encodes the string using the dictionary. We can compare our algorithm with the algorithm developed by Larsson and Moffat [8]. Their compression scheme, Re-Pair, uses a recursive pairing algorithm. We can use our algorithm to compress the string singing.do.wah.diddy.diddy.dum.diddy.do, used as an example in [8]. Our algorithm performs better than Re-Pair in the following respects:
Algorithm
1. Find all three char. prefixes and their freq. count
2. Delete all prefixes with frequency count 1
3. Expand the prefix with highest freq. one char. right
4. if(count(augmentedPrefix) >= threshold)
5. Decrement the count of the covered prefix
6. if(count(coveredPrefix) < [...]
7. [...] >= threshold

Example
Input String: singing.do.wah.diddy.diddy.dum.diddy.do
Step  Prefix set
1     {.did, .do, dy., y.d, ing}
2     {.didd, .do, dy., y.d, ing}
3     [...]
4     {.diddy, .diddy., ing}

Fig. 2. Prefix derivation algorithm and an example
- It requires only 4 steps to reach the dictionary, whereas Re-Pair needs 8 steps.
- The data structure is simple: a trie to keep the prefixes with their counts, and a two-dimensional dynamic array to keep the addresses.
- Less memory is required because of the smaller number of prefixes.
- The dictionary code is more compact: there are three prefixes and the code length is 2 bits, whereas Re-Pair needs 9 symbols to be coded and the code length is 4 bits.
- In Re-Pair, the selection of pairs for continuously repeating symbols (fig. 3) is optimal (i.e. $n/2$) only when the string length is $n = 2^{m+1}$, where $m$ is any positive integer. Otherwise, the longest phrase is $2^{m+1}/2$ for $2^{m+1} < n < 2^{m+2}$. In our algorithm, it is $6m$ for a string length $n$ in the range $12m < n < 12(m+1)$.
Initially, we choose the three-character prefixes with maximum frequency count. To achieve optimum compression, we have to consider both the frequency and the length of the prefixes. To obtain the redundant longest prefixes, the prefixes should therefore be augmented to the right for as long as they stay at or above the threshold frequency. As a consequence, some prefixes will be deleted from the list. The augmentation of the prefix with the highest frequency count and the deletion of the prefixes covered by the augmented prefix is the most computationally intensive operation in the development of an off-line prefix dictionary. The augmentation and deletion method is shown in fig. 3.
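The threshold condition derived earlier in this section can be checked numerically. The sketch below assumes that the threshold frequency is the smallest integer f with f > S_d/(S_p - t), and that all sizes are expressed in the same unit; the sizes in the example (a 6-byte phrase, a 6-byte dictionary entry and a 1-byte code) and the function names are hypothetical.

```python
import math

def compression_ratio(f, s_p, s_d, t):
    """C = 1 - (f*t + S_d) / (f*S_p), with all sizes in the same unit."""
    return 1 - (f * t + s_d) / (f * s_p)

def threshold_frequency(s_p, s_d, t):
    """Smallest integer f with f > S_d / (S_p - t); requires S_p > t."""
    if s_p <= t:
        raise ValueError("the code must be shorter than the phrase")
    return math.floor(s_d / (s_p - t)) + 1

# Hypothetical sizes in bytes: phrase, dictionary entry, code.
s_p, s_d, t = 6, 6, 1
f_t = threshold_frequency(s_p, s_d, t)

print(f_t)
print(compression_ratio(f_t, s_p, s_d, t))      # positive: compression is achieved
print(compression_ratio(f_t - 1, s_p, s_d, t))  # not positive: below the threshold
```

With these sizes a phrase has to occur at least twice before replacing it pays for its dictionary entry; a single occurrence makes the representation larger.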
Augmentation of prefixes
Input: "abcabcdabcdefabcdefgabca"
Step 1: abc 0 3 7 13 20; bcd 4 8 14; bca 1 21; cde 9 15; def 10 16
Step 2: abca 0 20; abcd 7 13; cde 9 15; def 10 16
Step 3: abcde 7 13; def 10 16
Step 4: abcdef 7 13
Output: {abca, abcdef}

Repeated symbol
Input string: "aaaaaaaaaaaaaaa"
Longest redundant string:
Prefix method: "aaaaaa"
Re-Pair method: "aaaa"

Fig. 3. Augmentation of prefix
The figure shows the prefixes and the corresponding pointers to the starting point of each occurrence of a prefix. After the deletion of all prefixes with a count of 1, the set of prefixes is {abc, bcd, bca, cde, def} and their frequency counts are {5, 3, 2, 2, 2} respectively. When “abc” is extended, it produces {abca, abcd, abcd, abca}. The last three-character suffixes of the extended prefixes are {bca, bcd, bcd, bca}. These are deleted from the list. The final output of steps 1 to 4 is {abca, abcdef}.
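A single augmentation step of this walk-through can be sketched in Python. This is an illustrative reading of the description, not the authors' implementation: it assumes that an occurrence of the prefix is skipped when it overlaps the previously extended occurrence, and that each extension decrements the count of the three-character prefix it covers. The function names are hypothetical.

```python
from collections import Counter

def initial_prefixes(s, length=3, threshold=2):
    """All prefixes of the given length whose count reaches the threshold."""
    counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
    return Counter({p: c for p, c in counts.items() if c >= threshold})

def augment_once(s, prefix, counts, threshold=2):
    """Extend `prefix` by one character at its non-overlapping occurrences."""
    k = len(prefix)
    extended, next_free = [], 0
    start = s.find(prefix)
    while start != -1:
        # Assumption: skip an occurrence that overlaps the previous extended
        # occurrence or that has no character to its right.
        if start >= next_free and start + k < len(s):
            extended.append(s[start:start + k + 1])
            next_free = start + k + 1
        start = s.find(prefix, start + 1)
    covered = [e[1:] for e in extended]   # last k characters of each extension

    updated = Counter(counts)
    del updated[prefix]                   # replaced by its extensions
    updated.subtract(covered)             # covered prefixes lose occurrences
    updated.update(extended)
    kept = {p: c for p, c in updated.items() if c >= threshold}
    return extended, covered, kept

S = "abcabcdabcdefabcdefgabca"
counts = initial_prefixes(S)
extended, covered, kept = augment_once(S, "abc", counts)
print(extended)
print(covered)
print(sorted(kept))
```

Under these assumptions, extending “abc” yields the four extensions and four covered suffixes listed in the text, and the surviving prefix set is {abca, abcd, cde, def}, the state after the first augmentation in fig. 3.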
5 Implementation
We have used the V.42bis trie data structure to incrementally construct the dictionary. All prefixes of three characters and their corresponding counts are added to the trie. All nodes with a count of one are then deleted, which frees space in the trie for new nodes when the prefixes are extended. The complexity of extending a prefix by one character to the right is O(n) in time and space. Let there be $m$ disjoint prefixes. The upper bound of $m$ is $k(k-1)(k-2)$, where $k$ is the cardinality of the alphabet used in the string. In practice $m$ will be very much less than the upper bound because of the disjoint property and the deletion of nodes with a count of one. Each extension of a prefix removes other candidate prefixes, so the value of $m$ is dynamically reduced. Hence $m$ is a decreasing quantity, and constructing the prefix with maximum count is $O(m \log m)$.
Memory: Initially, the input message is stored as a string of length $n$ bytes. We have implemented the trie data structure as scalar character and integer arrays, so each node of the trie requires 4.25 words. The maximum number of three-character prefixes is bounded by $k(k-1)(k-2)$, where $k$ is the size of the input alphabet. The space required to store all the three-character prefixes is $4.25k(k-1)(k-2)$. The maximum possible number of prefixes whose addresses need to be stored is $n - 2$ words. The total space requirement is $4.25k(k-1)(k-2) + 2n - 2$, which is O(n).
Time: In the first pass, the method addAllPrifix(String input, int len) constructs a trie of all prefixes of length len in O(n) time. The deletion of all prefixes with a count of one is performed in O(1). The most computationally intensive operation is augmentPrefix(String prefix). If the frequency count of the prefix is $u$ and the amount of augmentation is $v$, the number of operations to perform the augmentation is $\sum_{1}^{n} v$. If there are $w$ prefixes to be augmented, the number of operations is $w \sum_{1}^{n} v$. The worst-case number of operations is $\left(w \sum_{1}^{n} v\right)_{\max} \le n$. So the computation of augmentation is O(n).
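The first pass can be pictured with a small trie that stores, for every fixed-length prefix, its count and its start addresses. The sketch below is a dictionary-based Python illustration, not the array-based V.42bis trie of the implementation; add_all_prefixes is a hypothetical analogue of the first-pass method, and prune and leaves are helper names introduced here. Addresses are 0-based.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0        # occurrences of the prefix ending at this node
        self.positions = []   # start addresses of those occurrences

def add_all_prefixes(text, length):
    """Insert every sub-string of `length` characters with its count and address."""
    root = TrieNode()
    for start in range(len(text) - length + 1):
        node = root
        for ch in text[start:start + length]:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
        node.positions.append(start)
    return root

def prune(node, depth, length):
    """Remove prefixes seen only once, and any branch that is left empty."""
    for ch in list(node.children):
        child = node.children[ch]
        if depth + 1 == length:
            if child.count <= 1:
                del node.children[ch]
        else:
            prune(child, depth + 1, length)
            if not child.children:
                del node.children[ch]

def leaves(node, prefix=""):
    """Yield (prefix, count, positions) for every stored prefix, alphabetically."""
    if node.count:
        yield prefix, node.count, node.positions
    for ch, child in sorted(node.children.items()):
        yield from leaves(child, prefix + ch)

root = add_all_prefixes("abcabcdabcdefabcdefgabca", 3)
prune(root, 0, 3)
for entry in leaves(root):
    print(entry)
```

Building the trie visits each of the n - 2 start positions once, which matches the O(n) bound stated for the first pass; in this sketch pruning walks the trie rather than running in constant time.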
6 Results and Discussion
All experiments were run on an 800 MHz AMD Duron™ processor machine with 256 MB of physical memory. The operating system was Microsoft Windows 2000 Professional. We have studied a number of different aspects of the construction of off-line prefix dictionaries. These include 1) the frequency distribution of initial three-character prefixes for different types of data, 2) the size of the dictionary for different approaches, 3) the growth of the off-line dictionary and 4) the construction time. The initial experiment was performed on three files: test.txt, a simple text file (32 KB); Trie4.java, 41 KB of Java source code of the off-line dictionary compression system; and tmpns.txt, a file containing 554 KB of IP address data. The frequency distribution of the initial three-character prefixes is shown in fig. 4(a). We see that the average frequency count decreases rapidly after roughly 50% of the phrases have evolved. Most of the longest disjoint prefixes remain within the first region of the frequency spectrum.
[Figure 4: four plots. (a) Average frequency against range of number of phrases (0-500 to 3001-3500) for database, simple text and Java source data. (b) Dictionary size (KB) against size of input file (MB) for FastaH and DNAH. (c) Time (sec) against size of input file (Mb) for FastaH and DNAH. (d) Size (Mb) against size of input file (Mb) for dictionary and code of FastaH, dictionary and code of DNAH, and Hibase.]
Fig. 4. (a) Average frequency count decreases rapidly as the number of phrases increases (b) The growth of off-line dictionary (c) Off-Line dictionary construction time (d) The overall size of Hibase and off-line dictionary
Table 1 compares the size of the dictionary in the LZW and off-line prefix methods for three test files: test.txt, a simple text file; testFasta, a segment of pir.fasta; and testDNA, a segment of a DNA database. The sizes of the files are 32KB, 392KB and 90KB respectively. We find that the LZW dictionary is a factor of three to eight bigger than the off-line prefix dictionary. At the same time, the LZW code will be two to three bits wider than the off-line code.

Table 1. The size of dictionaries in LZW and Off-line prefix methods
File name    File size (Bytes)   LZW dictionary (Bytes)   Off-line dictionary (Bytes)   LZW/Off-line
test.txt     32000               3734                     832                           3.2
test.Fasta   392000              10604                    3897                          2.7
test.DNA     96000               6078                     773                           7.8
Each DNA and Fasta sequence has a header that identifies the sequence. Creating a dictionary on this data in Hibase is like creating a dictionary on a primary key. We have extracted the headers from two files: pir.Fasta, containing Fasta data, and ensembl, containing DNA data. The extracted data are stored in FastaH.txt and DNAH.txt. The growth of the dictionary reflects the randomness of the data. The growth for the DNA header file is much higher than for the Fasta header file (fig. 4(b)). In both cases, the growth is linear, i.e. O(n). The dictionary construction time (fig. 4(c)) for DNAH is higher than for FastaH for the same input file size. This is because the size and the growth of the DNAH dictionary are higher than those of FastaH. Finally, we have compared the size of the dictionary in the Hibase approach with our approach (fig. 4(d)). In both cases, we have used the same size of Hibase dictionary for FastaH and DNAH. The overall size in our approach is the sum of the sizes of the off-line dictionary and the coded input string. The size in our approach is two to four times smaller than in Hibase.
7 Conclusion
Off-line dictionary compression is becoming more attractive for applications where compressed data are searched directly in compressed form. While there has been a large body of related work describing specific database compression algorithms, our approach concentrates on improving compression whilst preserving direct addressability. We have achieved a factor of 2 to 4 compression over the uncompressed dictionary heap in Hibase. Compared to LZW, the off-line dictionary is a factor of 3 to 8 smaller. Incorporating our compression scheme into HIBASE will enhance the capability of the system. This scheme is a general-purpose approach and can be used in many other applications such as digital libraries, XML and data broadcasting in mobile communications. We have applied the approach to the storage and querying of high-dimensional sparsely populated data [6]. The application of the scheme to XML data is under active research [9]. In mobile applications, the distribution of dictionaries to clients, processing queries on compressed data or broadcasting compressed data will improve the use of costly wireless bandwidth. Further research is needed to show the effectiveness of the scheme in the above application areas. The use of multiple dictionaries for different classes of data, with a reference to the corresponding dictionary, makes the code-word smaller. Further study is needed to explore the consequences of these additional compression techniques.
Acknowledgements: This work was supported by the Commonwealth Scholarship Commission of the UK.
References
1. Huffman D. A. A method for the construction of minimum-redundancy code. Proc. IRE, 40:9:1098–1101, 1952.
2. W.P. Cockshott, D. McGregor, N. Kotsis, and J. Wilson. Data compression in database systems. IDEAS’98, Cardiff, July, 1998.
3. G. P. Copeland and S. Khoshafian. A decomposition storage model. In Proceedings of the 1985 ACM SIGMOD, Austin, Texas, May 1985.
4. G.V. Cormack. Data compression on a database system. Communications of the ACM, 28:12, 1985.
5. J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing relations and indexes. In Proc. IEEE Conf. on Data Engineering, 1998.
6. McGregor D. R. and Hoque A. S. M. L. Storage and querying high dimensional sparsely populated data in compressed representation. Accepted in EurAsia-ICT, Tehran, Iran, October 2002.
7. Bentley J. and McIlroy D. Data compression using long common strings. http://www.bell-labs.com.
8. Larsson N. J. and Moffat A. Off-line dictionary-based compression. Proceedings of the IEEE, 88:11, 2000.
9. M. Neumuller. Compact data structure for querying xml. Proceedings of EDTB, 2002.
10. Cockshott W. P., McGregor D., and Wilson J. High-performance operations using a compressed architecture. The Computer Journal, 41:5:283–296, 1998.
11. F. Rubin. Experiments in text compression. Commun. ACM, 19, November 1976.
12. D. Salomon. Data Compression: The Complete Reference. Springer-Verlag New York, Inc., 1998.
13. J.A. Storer and T.G. Szymanski. Data compression via textual substitution. Journal of the ACM, 29:928–951, 1982.
14. Welch T.A. A technique for high-performance data compression. Computer, 17:6:8–19, 1984.
15. T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. SIGMOD Record, 29:3:55–67, 2000.
16. J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, IT-23(3):337–343, 1978.
Representation of Temporal Unawareness
Panagiotis Chountas¹, Ilias Petrounias², Krassimir Atanassov³, Vassilis Kodogiannis¹, and Elia El-Darzi¹
¹ Department of Computer Science, University of Westminster, Watford Road, Northwick Park, London, HA1 3TP, UK
[email protected]
² Department of Computation, University of Manchester Institute of Science & Technology, PO Box 88, Manchester M60 1QD, UK
³ CLBME – Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Bl. 105, Sofia-1113, Bulgaria
Abstract. In this paper we are concerned with the problem of designing a conceptual framework for the construction of a flexible query answering mechanism that is adequate for selecting useful answers to queries in a multi-source community of conflicting information sources. We put forward a proposal to provide representations with that kind of functionality through flexible reasoning mechanisms for handling query descriptions and retrieving perspective answers. We start by positioning the conceptual schema and the database with regard to this problem. We then identify evidential assessment as the central task in query answering, considering both the conceptual and the instance level, always with respect to the dynamism dimension of the data.
1 Introduction
An intelligent query-answering system is able to apply a query, which is some indication of a user’s needs, to an information base and convert it into an answer of useful information. The focus has been on extending the relational model and query language to accommodate uncertain or imprecise data with respect to value imperfection [1], [2], [3], [10] or temporal imperfection [5], [6], [8]. In the case of value imperfection the focus is on the representation of null values or partial values. In temporal imperfection, by contrast, the focus is on the representation of indefinite temporal information. Another paradigm for query answering systems has been the logic or deductive approach, which aims to minimize failing queries [4]. A query is said to fail whenever its evaluation produces the empty answer set. With reference to the above post-relational query answering systems, some useful observations can be made:
- Current approaches consider temporal or value imperfection as part of a single-source environment and mainly as a data-dependent characteristic.
- Temporal or value imperfection can be a property of the query answering system, apart from being a data-oriented characteristic. This may occur because a set of information bases or sources may provide different answers to the same query.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 21–30, 2002. © Springer-Verlag Berlin Heidelberg 2002
P. Chountas et al.
- However, not all answers are assumed to be equally good. At this point it is quite valid to assume that all answers are perfect. Thus there may be no uncertain descriptions.
In conclusion, conflicting information sources may be responsible for queries with no answers or with conflicting answers. Considering the dynamism or dimensionality of the application data and a multi-source environment, a query may fail to retrieve a unique answer for a particular fact, a “factual conflict” (e.g. Is John an employee?), or fail to determine through its evaluation the time interval over which a fact holds in the real world, a “temporal conflict” (e.g. When did John work as an employee?). In this paper the authors demonstrate how conflicting yet certain answers may generate uncertainty either at the conceptual level (metadata level) or at the instance level.
2
Motivation
The process of integration relates to source integration, schema integration and data integration. The underlying assumption in [9] is that the notion of time and its validity is limited to the most recently recorded information. In this paper we deal first with conceptual schema conflict representation, i.e. the process of encoding conflict in semantically rich temporal schemas. The original motivation is the argument that, in order to build applications with their own information bases out of existing ones, conflicting or not, we need to raise the abstraction of operations on conceptual schemas and mappings. After representing conflict at the conceptual level in terms of the structural components involved in conflict resolution, it seems appropriate to resolve conflict at the instance or information level. We propose a set of relational algebra operations that let users extract information from a conflicting multi-source environment. Thus we let users view and query the data level from their own conceptual background. Technically, we permit different domains to claim the same fact instance(s). The algebraic operations are based on a theoretical model imported from the area of evidential reasoning.
3
Reconciling Conflicting Sources
To define a common integrated info-base, we shall assume that there exists a single (hypothetical) info-base that represents the real world. This ideal info-base includes only perfect descriptions. We now formulate two assumptions. These assumptions are similar to the Universal Scheme Assumption and the Universal Instance Assumption [9], although their purpose here is quite different. These two assumptions are statements of reconciliation of the given Info-bases. The Scheme Soundness Principle (SSP). All conceptual schemes are derivatives of the real world scheme. That is, in each conceptual scheme, every structural component is a view of the real world scheme. The meaning of this assumption is that the different ways in which reality is modelled are all correct; i.e., there are no modelling errors, only modelling differences. To put it in yet a different way, all
Representation of Temporal Unawareness
intensional inconsistencies among the independent conceptual schemes are reconcilable. The Instance Soundness Principle (ISP). All database-fact instances are derivatives of the real world instance. The meaning of this assumption is that the information stored in info-bases is always correct; i.e., there is no erroneous information, only different semantic representations of alike facts. It is suggested by [10] that in query-answering systems one central task is to compare two information items, one from the client and the other from the info-base. In our framework a client can be defined as:
- An application that is trying to build its own info-base out of the existing reconcilable info-bases according to the Scheme Soundness (SSP) and Instance Soundness (ISP) principles.
- An unaware hypothetical user trying to extract information similar to its request from the info-base that is closest to the single (hypothetical, perfect) database which represents the real world according to the SSP and ISP principles.
The question to be answered at this point is how the argument of [7] is realized in a multi-source environment that is reconcilable although conflicting. We first introduce a semantically rich temporal conceptual model that is used by the single (hypothetical, ideal) info-base to represent the real temporal world. Afterwards we propose an evidential model for resolving conflict at the instance level through the query mechanism.
4
The Meta-model for Capturing Uncertainty & Conflict
The meta-model for the proposed formalism specifies the basic concepts of the conceptual model that are involved in uncertainty-conflict reasoning and is application independent. Time facts or temporal phenomena are presented as objects in a 3 dimensional space spanned by the time axis, the uncertainty axis, and the axis along which facts are presented. A fact is a true logical proposition about the modelled world. The building blocks are facts in which entity types play roles. Each entity type is a set of elements and contains all concepts regardless of the different roles that they play in any of the different facts. A role played by an entity type in a fact defines a subset of an entity type. A fact type Fx among n entity types (E1, E2,…,En) is a set of associations among elements from these types. A label is used to reference an entity type and refers to elements from the domain of an entity type. Factual uncertainty is related to possible semantic propositions. This leads to possible associations between entity instances and not between entity types. Factual uncertainty is the result of ambiguity at the instance level. In our approach there is no ambiguity at the level of concepts. An irreducible fact type is defined in the real world over a time interval or time period (valid time). This is called a timestamped fact. Temporal uncertainty in the information context specifies that a semantic proposition [6] in the real world be defined over a time interval with no explicit duration. Comparing the definition of factual uncertainty and temporal uncertainty, it can be concluded that these two forms of uncertainty are independent. Certain fact
propositions may be defined over an uncertain time, or uncertain fact instances may exist during a certain time, and vice versa. Timestamped facts are defined as cognisable since uncertainty about time and uncertainty about fact are independent. If a time fact is cognisable, either the time fact or the independent time and fact representations are sufficient to represent all knowledge, provided that correct 2-dimensional representations can be derived from the time fact (3-dimensional space) and not the other way around. In this case the range of possible time points which correspond to the earliest possible instantiation Fs of the fact Fx is constrained, as is the range of time corresponding to the latest instantiation Fe. Similarly, the range of the possible instantiations of a fact corresponding to the earliest time point instantiation ts of Fx is constrained, as are those fact instantiations corresponding to the latest time te. These constraints are summarised in equation (1).

$$F_s \to [t_s, t_1], \quad F_e \to [t_2, t_e], \quad t_s \to [F_s, F_1], \quad t_e \to [F_2, F_e] \qquad (1)$$

In equation (1) ts, t1, t2, te are linear points and are applied as constraints to the time fact instantiations. A temporal phenomenon is defined over the union of the time intervals specified by the temporal constraints in equation (1). Therefore a temporal phenomenon is defined over a temporal element. This is described in equation (2).

$$F_x = \{F_s, \ldots, F_e\} \to \{[t_s, t_1], \ldots, [t_2, t_e]\} \qquad (2)$$
All the above axioms and concepts are represented in the uncertainty meta-model of Fig. 1. The meta-model also describes the different types of uncertainty and the different temporal phenomena. The latter may be unique, recurring or periodic. Conflicting information may generate two types of uncertainty. One is introduced by queries that refer to concepts at a lower level than those that exist at the instance level of the database. The other arises from the use in the query of an element that is a member of more than one high-level concept. In terms of dimensions, uncertainty generated by conflicting information may be either factual or temporal. If the duration of a fact can be expressed with more than one time granularity, then a “technical temporal conflict” is defined and temporal uncertainty is generated, since the query answering mechanism is not in a position to determine which time granularity is the preferred one, i.e. the one that fully satisfies the temporal selection condition. If a reference label is claimed by more than one high-level concept, then a “factual conflict” is defined and factual uncertainty is generated, since the query answering mechanism is not in a position to determine which reference label is the preferred one, i.e. the one that fully satisfies the factual selection condition. The rest of the paper addresses integration issues that may arise through query resolution in a flexible manner.
Fig. 1. Uncertainty & Conflict Meta-model
5
Problems in Estimating Flexible Answers
Let us consider the following description of the travellers ‘Ann’ and ‘Liz’ (Table 1).

Table 1. Relation R representing the schedules of Ann and Liz

R    Person   Concept               VT(R)
X1   Ann      Brazil                [01/03/00,29/05/00]
X2   Ann      Southern Hemisphere   [01/06/00,29/08/00]
X3   Ann      Northern Hemisphere   {[01/06/00,29/08/00], [01/09/00,29/11/00]}
X4   Liz      Brazil                [01/03/00,29/05/00]
X5   Liz      Northern Hemisphere   {[01/06/00,29/08/00], [01/09/00,29/11/00]}
Consider the following conflicting queries and whether or not they can be answered with authority:
- Q1: When did either Liz or Ann visit Brazil?
- Q2: Derive all the people who visited either the Southern Hemisphere or the Northern Hemisphere.
These queries are not easy to answer because of the following inconsistencies. If we consider a lattice-structured domain, then Brazil has two parents (Northern Hemisphere and Southern Hemisphere, as shown in Fig. 2). Therefore Q2 is not easy to answer. We cannot estimate with precision the exact date of arrival and stay in Brazil for both travellers (Liz, Ann). Therefore Q1 is not easy to answer either. These inconsistencies are presented in Table 1 through relation R.

[Figure: lattice with the high-level concepts Southern Hemisphere and Northern Hemisphere above the base concepts USA, Canada, Mexico, Brazil and Chile.]
Fig. 2. Lattice Structured Domain
An analytical observation of Table I raises the following requirements and issues in terms of data representation. First, a time model and representation is needed that presents temporal information in terms of the following physical measurements: duration D (e.g. tuples {X1, X2, X4}) and frequency K of reappearance if the information is periodical, where D and K may not be known (e.g. tuples {X3, X5}). Second, in our effort to classify the tuples in Table I based on a virtual decision attribute which simply declares that a person has visited Brazil, it can be seen that tuples {X2, X3, X5} cannot be classified with an exclusive Boolean Yes or No. A similar problem would arise if a user requested through an algebraic operation all the countries that Ann or Liz visited in the Northern and Southern Hemispheres. Therefore, there is also a need for algebraic operations that support approximate answers.
6
The Time Representation
The time representation can simultaneously represent the three different types of temporal information (definite, indefinite, infinite). A time interval Δt is presented in the form [C + K·X, C′ + K·X], where C′ = C + D and D ∈ N* (D denotes the property of duration). An interval is thus described as a set of two linear equations defined and mapped over a linear time hierarchy (e.g. H2 = day ⊆ month ⊆ year). The lower time point tEarliest is described by the equation tEarliest = C + K·X. The upper time point tLatest is described by the equation tLatest = C′ + K·X. C is the time point related to an instantaneous event that triggered a fact, and K is the repetition factor, K ∈ N*, or the lexical ‘every’ (infinite-periodical information). X is a random variable, X ∈ N (zero included, corresponding to the first occurrence of a temporal fact or phenomenon), restricted between 0 and n.

$$t_L = t_{Earliest} = C + K X, \quad t_R = t_{Latest} = C' + K X, \quad C' = C + D, \quad 0 \le X \le n, \quad X \in \mathbb{N}, \quad K > 0 \qquad (3)$$
The above time formalism enables an application or a user to express the following types of temporal information.
Definite Temporal Information: The duration (D) of an event is constant (D = t). All times associated with facts are known precisely at the desired level of granularity, assuming that K = 0, since a definite fact occurs only once.
Indefinite Temporal Information: This is defined when the time associated with a fact has not been fully specified. Therefore the duration of a fact is indeterminate or bounded. This may occur, for instance, when the duration of a fact is bounded (GL ≤ D ≤ GR).
Infinite Temporal Information: This is represented when 0 ≤ X ≤ n, X ∈ N* and K > 0. Infinite temporal information is defined when an infinite number of times are associated with a fact. It includes the following types of information:
- Periodic: An activity instance is repeated over a time hierarchy with the following characteristics: a constant frequency of repetition K, an absolute and constant duration D, and a random variable X that denotes the number of reappearances of the activity instance.
- Unknown Periodic: This is generally described in the interval form Δt = [⊥, ⊥]. The intuition is that the duration D of an activity instance is assumed to be known, whereas the frequency of reoccurrence (K = ?) is not known. Recalling the definition, if an activity instance is periodical, then its next reappearance cannot occur before the previous one has ended. It is known that K ≥ 1. Therefore it can be concluded that 1 ≤ D ≤ K.
The product K·X is also mapped to a linear time hierarchy (e.g. H2 = day ⊆ month ⊆ year). D, which represents the duration of a timestamped fact instance, is also mapped to a linear time hierarchy and may lie in the range between a lower and an upper bound, GL ≤ D ≤ GR. Constraints are built from arbitrary linear inequalities.
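The interval form [C + K·X, C′ + K·X] can be made concrete with a small sketch. The code below is illustrative only and is not part of the original formalism: the class name `GeneralisedInterval`, its field names, and the use of plain integers as time points are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class GeneralisedInterval:
    """Interval [C + K*X, C' + K*X] with C' = C + D and 0 <= X <= n.

    Time points are plain integers here (an assumption for illustration).
    """
    c: int  # time point C of the event that triggered the fact
    d: int  # duration D
    k: int  # repetition factor K (0 is used for a fact that occurs once)
    n: int  # upper bound on the occurrence index X

    def occurrence(self, x: int) -> Tuple[int, int]:
        """Return (t_earliest, t_latest) for occurrence index x."""
        if not 0 <= x <= self.n:
            raise ValueError("x must satisfy 0 <= x <= n")
        c_prime = self.c + self.d  # C' = C + D
        return (self.c + self.k * x, c_prime + self.k * x)

    def occurrences(self) -> Iterator[Tuple[int, int]]:
        """Enumerate the intervals for X = 0..n."""
        for x in range(self.n + 1):
            yield self.occurrence(x)


# Definite information: a single occurrence (K = 0, n = 0).
definite = GeneralisedInterval(c=10, d=5, k=0, n=0)
# Periodic information: duration 5, repeated every 30 time units, X = 0..2.
periodic = GeneralisedInterval(c=10, d=5, k=30, n=2)

print(list(definite.occurrences()))  # [(10, 15)]
print(list(periodic.occurrences()))  # [(10, 15), (40, 45), (70, 75)]
```

In this reading a single pair of linear equations stands for every occurrence of a periodic fact, which is why one generalised tuple can denote a potentially unbounded set of ordinary tuples.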
7
Representation of Conflicting & Uncertain Information
Let T be a set of time intervals, T = {[tL, tR] | tL = C + KX, tR = C′ + KX ∧ a1 ≤ X ≤ av}, and D a set of non-temporal values. A generalised tuple of temporal arity x and data arity l is an element of T^x × D^l together with constraints on the temporal elements. In that sense a tuple can be viewed as defining a potentially infinite set of tuples. Each extended relation consists of generalised tuples as defined above. Each extended relation has a virtual tuple membership attribute, formed by a selection predicate (either value or temporal), that models the necessary (Bel) and possible (Pls) degrees to which a tuple belongs to the relation. The domain of the tuple membership attribute is the Boolean set Ω = {true, false}. The possible subsets of Ω are {true}, {false} and Ω.
The support set for tuple membership can be denoted by a pair of numbers (Bel, Pls), where Bel = m({true}) and Pls = m({true}) + m(Ω), with the property 0 ≤ Bel ≤ Pls ≤ 1. A tuple with (Bel, Pls) = [1,1] corresponds to a tuple that qualifies with full certainty. A tuple with (Bel, Pls) = [0,0] corresponds to a tuple that is believed with full certainty not to qualify. A tuple with (Bel, Pls) = (0,1) corresponds to complete ignorance about the tuple’s membership. At this point an issue arises: the estimation of the (Bel, Pls) measures in a lattice-structured domain.
Let e be an element defined by a structured domain L. U(e) is the set of higher-level concepts, i.e. U(e) = {n | n ∈ L ∧ n is an ancestor of e}, and L(e) is the set of lower-level concepts, L(e) = {n | n ∈ L ∧ n is a descendant of e}. If e is a base concept then L(e) = ∅, and if e is a top-level concept then U(e) = ∅. If L is an unstructured domain then L(e) = U(e) = ∅. Considering tuples {X2, X3, X6} and the selection predicate location = ”Brazil”, L(e) and U(e) are defined as follows:
- U(Brazil) = {Southern Hemisphere, Northern Hemisphere}
- L(Brazil) = ∅
Rule 1: If |U(e)| > 1 ∧ L(e) = ∅, e.g. |U(Brazil)| = 2, L(Brazil) = ∅, then a child or base concept has many parents (lattice structure). Therefore a child or base concept acting as a selection predicate can claim any tuple (parent) containing elements found in U(e) as its ancestor, but not with full certainty (Bel > 0, Pls ≤ 1). This is presented by the interval (Bel, Pls) = (0,1].
Now consider the case where the selection predicate is defined as location = ”Southern Hemisphere”. Tuple {X2} fully satisfies the selection predicate and thus (Bel, Pls) = [1,1]. Tuples {X3, X6} do not qualify as an answer and thus (Bel, Pls) = [0,0]. However, it cannot be said with full certainty ((Bel, Pls) = [1,1]) whether {X1, X4, X5, X7} satisfy the selection predicate or not, since Brazil belongs to both concepts {Southern Hemisphere, Northern Hemisphere}. Using the functions L(e) and U(e) this can be deduced as follows:
- U(Southern Hemisphere) = ∅, L(Southern Hemisphere) = {Brazil, Chile}
- B = U(L(Southern Hemisphere) ∧ (R.concept)) = U(Brazil) = {Southern Hemisphere, Northern Hemisphere}.
Formally the function can be defined as B((l1),(l2)) = U(L(l1) ∧ (l2)), where l1 is a high-level concept and l2 is a base concept, both elements of a lattice-structured domain. If both arguments are high-level concepts or both are low-level concepts then B((l1),(l2)) = ∅. The function B((l1),(l2)) is defined only in a lattice-structured domain.
Rule 2: If B((l1),(l2)) is defined and |B((l1),(l2))| > 1, then multiple parents (high-level concepts) are receiving a base concept as their own child. Therefore a parent or high-level concept acting as a selection predicate can claim any tuple (child) containing elements found in (L(l1) ∧ (l2)) as its descendant, but not with full certainty (Bel > 0, Pls ≤ 1), presented by the interval (Bel, Pls) = (0,1].
Similarly, a temporal selection predicate can use the above functions (U(e), L(e), B((l1),(l2))) for imprecise temporal information, representing the time dimension as intervals, by labelling each node in the lattice with a time interval. The use of a lattice-structured domain by an application also permits the representation of temporal information at different levels of granularity. Next an extended relational algebra is defined which operates on our model. The operations differ from the traditional ones in several ways: the selection or join condition of the operations may consist of base concepts or high-level concepts.
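The functions U(e), L(e) and B((l1),(l2)) and the two rules can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the names `CHILDREN`, `upper`, `lower`, `b`, `Support` and `selection_support` are hypothetical, the Northern Hemisphere children are an assumption read off Fig. 2, and treating a single-parent child as fully qualifying is a further assumption that the text does not state.

```python
from enum import Enum
from typing import Dict, Set

# Parent -> children. Southern Hemisphere = {Brazil, Chile} is stated in the
# text; the Northern Hemisphere membership is an assumption based on Fig. 2.
CHILDREN: Dict[str, Set[str]] = {
    "Southern Hemisphere": {"Brazil", "Chile"},
    "Northern Hemisphere": {"USA", "Canada", "Mexico", "Brazil"},
}


def lower(e: str) -> Set[str]:
    """L(e): the lower-level concepts (children) of e."""
    return set(CHILDREN.get(e, set()))


def upper(e: str) -> Set[str]:
    """U(e): the higher-level concepts (parents) of e."""
    return {parent for parent, kids in CHILDREN.items() if e in kids}


def b(l1: str, l2: str) -> Set[str]:
    """B(l1, l2) = U(L(l1) and l2): parents of l2 when l2 lies below l1."""
    return upper(l2) if l2 in lower(l1) else set()


class Support(Enum):
    CERTAIN = "[1,1]"
    PARTIAL = "(0,1]"
    EXCLUDED = "[0,0]"


def selection_support(predicate: str, value: str) -> Support:
    """Support of a tuple whose concept is `value` for `concept = predicate`."""
    if value == predicate:
        return Support.CERTAIN
    # Rule 1: a base concept with several parents claims a parent tuple.
    if len(upper(predicate)) > 1 and not lower(predicate) and value in upper(predicate):
        return Support.PARTIAL
    parents = b(predicate, value)
    # Rule 2: a high-level concept claims a child shared with other parents.
    if len(parents) > 1:
        return Support.PARTIAL
    # Assumption (not in the text): a child with a single parent fully qualifies.
    if len(parents) == 1:
        return Support.CERTAIN
    # Cases outside both rules default to exclusion here (also an assumption).
    return Support.EXCLUDED


concepts = {"X1": "Brazil", "X2": "Southern Hemisphere", "X3": "Northern Hemisphere"}
for name, concept in concepts.items():
    print(name, selection_support("Southern Hemisphere", concept).value)
```

With the predicate location = ”Southern Hemisphere”, the sketch reports partial support for a Brazil tuple, full support for a Southern Hemisphere tuple and no support for a Northern Hemisphere tuple, matching the discussion above.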
8
Representation of Conflicting & Uncertain Information
For illustration purposes we consider three operations: σ (selection), ⋈ (join) and ∪ (set union).
Selection: Selection is defined as σP(R) := {t | t ∈ R ∧ P(t) = true}, where P denotes a selection condition. There are two types of selection condition. A data selection condition (Pd) considers the snapshot relation R in Table 1. The temporal selection condition (Pt) is specified as a function of three arguments, Pt := ⟨K, D, C⟩, which is mapped to the time hierarchy Hr. Temporal constraints are included in the result tuples. The combined predicate over relation VT(R) in Table II is defined as P := Pd | Pt | Pd ∧ Pt. The selection support function Fs(tA1..An, P) returns a (Bel, Pls) pair indicating the support level of tuple t for the selection condition P, where A1..An is the set of attributes, excluding the virtual membership attribute. The selection support function Fs uses the functions (U(e), L(e), B((l1),(l2))) in conjunction with Rule 1 and Rule 2, as defined in Section 7, to estimate the actual support values. Recall that a compound predicate is formed by a conjunction of two or more atomic predicates. In this paper it is assumed that the atomic predicates are mutually independent. The support for the compound predicate P := Pd | Pt | Pd ∧ Pt is computed with the multiplicative rule: Fs(P) = Fs(tA1..An, Pt) ∧ Fs(tA1..An, Pd) = (Bel1 · Bel2 · … · Beln, Pls1 · Pls2 · … · Plsn).
Join: Let R, S be two extended relations, P the join condition and Q the membership threshold condition. The extended join operator is defined as a Cartesian product followed by an extended selection, R ⋈(P,Q) S ≡ σ(P,Q)(R × S), where the tuple membership function is derived by Fs as in the case of the extended select operation. The time interval over which the tuple membership is defined is the intersection of the time intervals over which the sources (A1…An) are defined: Δt(Fs) = Δt1A1 ∩ Δt2A2 ∩ … ∩ ΔtnAn. Assume two intervals with lower bounds tL = C + KX and tL′ = C′ + K1X and upper bounds tR = C1 + KX and tR′ = C1′ + K1X. The time interval for the result is defined by the common lower bound tL″ = C″ + K′X, where C″ = max(C, C′) and K′ = min(K, K1), and the common upper bound tU′ = C3 + K′X, where C3 = max(C1, C1′) and K′ = min(K, K1).
Union: For the set operators, including union, uncertainty can be introduced when relations with different levels of refinement for the same information are combined. Without extra knowledge it is reasonable to choose the information with the finest granularity as the one to be classified with full certainty, (Bel, Pls) = [1,1]. Information not at the finest granularity is classified without full certainty, (Bel, Pls) = (0,1]. Both types of information are part of the result tuples, accompanied by different beliefs. Union is formally defined as R ∪ S = {t | (∃r)(∃s)(r ∈ R ∧ s ∈ S ∧ t.L = r.L = s.L) ∧ t.(Bel, Pls) = Fs(r.(Bel, Pls), s.(Bel, Pls))}, where L is the arity of the relation and Fs denotes the selection support function. Tuples with different valid times are not merged, even if they express the same snapshot tuple.
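The multiplicative rule and the interval intersection used by the join can be illustrated as follows. This is a minimal sketch under stated assumptions: `combine_support` and `intersect` are hypothetical helper names, supports are given as numeric (Bel, Pls) pairs, and `intersect` applies the ordinary intersection of two concrete intervals (largest lower bound, smallest upper bound) rather than the symbolic C + KX construction given in the text.

```python
from math import prod
from typing import Iterable, Optional, Tuple

BelPls = Tuple[float, float]
Interval = Tuple[int, int]


def combine_support(supports: Iterable[BelPls]) -> BelPls:
    """Multiplicative rule for mutually independent atomic predicates."""
    pairs = list(supports)
    for bel, pls in pairs:
        if not 0.0 <= bel <= pls <= 1.0:
            raise ValueError("each pair must satisfy 0 <= Bel <= Pls <= 1")
    return (prod(bel for bel, _ in pairs), prod(pls for _, pls in pairs))


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Ordinary intersection of two closed intervals, or None if disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


# A data predicate that holds with certainty and a temporal predicate
# that holds with belief 0.5 and plausibility 1.0.
print(combine_support([(1.0, 1.0), (0.5, 1.0)]))  # (0.5, 1.0)
print(intersect((10, 20), (15, 30)))              # (15, 20)
```

Because belief and plausibility are multiplied separately, the combined pair still satisfies Bel ≤ Pls whenever every input pair does.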
9
Conclusions
In this section we outline some thoughts that will enable us to represent conflicting information as part of a four-valued characteristic function of a paraconsistent relation, which maps tuples to one of the following values: ⊤ (for contradiction), t (for true), f (for false) and ⊥ (for unknown). This will let us reason about conflict in a quantitative four-valued logic. The elements of the temporal information (temporal facts) can be represented [8] in the form ⟨F, t^F_L, t^F_R⟩, where [t^F_L, t^F_R] is a time interval. Using the ideas for intuitionistic fuzzy expert systems we can estimate any fact F and obtain intuitionistic fuzzy truth-values V(F) = ⟨µF, νF⟩, such that µF, νF ∈ [0,1] and µF + νF ≤ 1. Therefore, the above fact can be represented in the form ⟨F, t^F_L, t^F_R, µF, νF⟩. This form corresponds to the case in which the fact is valid in the interval tF = [t^F_L, t^F_R] and at every moment of that interval the fact has the truth-value ⟨µF, νF⟩. Thus it is required to develop a relational intuitionistic fuzzy model for representing uncertainty and contradiction or conflict as part of a multi-source temporal database environment.
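The tuple form ⟨F, t^F_L, t^F_R, µF, νF⟩ and its constraints can be written down as a small record type. The sketch below is only an illustration: the class name `TemporalFact`, its field names and the integer time points are assumptions, and the `hesitation` method adds the standard intuitionistic fuzzy quantity 1 − µ − ν, which the text above does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalFact:
    """A fact <F, t_L, t_R, mu, nu> that is valid over [t_L, t_R]."""
    fact: str
    t_left: int   # t_L, an integer time point (assumption)
    t_right: int  # t_R
    mu: float     # degree of truth
    nu: float     # degree of falsity

    def __post_init__(self) -> None:
        if self.t_left > self.t_right:
            raise ValueError("t_left must not exceed t_right")
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu + self.nu > 1.0:
            raise ValueError("mu + nu must not exceed 1")

    def hesitation(self) -> float:
        """Remaining degree of indeterminacy, 1 - mu - nu."""
        return 1.0 - self.mu - self.nu


visit = TemporalFact("Ann visited Brazil", t_left=60, t_right=149, mu=0.5, nu=0.25)
print(visit.hesitation())  # 0.25
```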
References
1. Chountas, P., Petrounias, I., Representation of Definite, Indefinite and Infinite Temporal Information, Proceedings of the 4th International Database Engineering & Applications Symposium (IDEAS’2000), IEEE Computer Society, ISBN 0-7695-0789-1, 167–178, 2001
2. Barbará, D., Garcia-Molina, H., Porter, D., The Management of Probabilistic Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 5, 487–502, 1992
3. Dyreson, C.E., Snodgrass, R.T., Support Valid-Time Indeterminacy, ACM Transactions on Database Systems, Vol. 23, No. 1, 1–57, 1998
4. T. Bylander, T. Steel, The Logic of Questions and Answers, Yale University Press, New Haven, CT, 1976
5. F. Kabanza, J-M. Stevenne, P. Wolper, Handling Infinite Temporal Data, Proc. of ACM Symposium on Principles of Database Systems (PODS), 392–403, 1990
6. Chountas, P., Petrounias, I., Representing Temporal and Factual Uncertainty, Proceedings of the Fourth International Conference on Flexible Query Answering Systems, FQAS 2000, Advances in Soft Computing, ISBN 3-7908-1347-8, Physica-Verlag, 161–170, 2000
7. Andreasen, T., Christiansen, H., Larsen, H.L., Flexible Query Answering Systems, Kluwer Academic Publishers, 1997
8. Atanassov, K., Temporal Intuitionistic Fuzzy Relations, Proceedings of the Fourth International Conference on Flexible Query Answering Systems, FQAS 2000, Advances in Soft Computing, ISBN 3-7908-1347-8, Physica-Verlag, 153–160, 2000
9. Motro, A., Anokhin, P., Berlin, J., Intelligent Methods in Virtual Databases, Proceedings of the Fourth International Conference on Flexible Query Answering Systems, FQAS 2000, Advances in Soft Computing, ISBN 3-7908-1347-8, Physica-Verlag, 580–590, 2000
10. Adnan Yazici, Alper Soysal, Bill P. Buckles, Frederick E. Petry, Uncertainty in a Nested Relational Database Model, DKE 30(3), 275–301, 1999
Scalable and Dynamic Grouping of Continual Queries
Sharifullah Khan and Peter L. Mott
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
{khan, pmott}@comp.leeds.ac.uk
Abstract. Continual Queries (CQs) allow users to receive new information as it becomes available. CQ systems need to support a large number of CQs due to the scale of the Internet. One approach to this problem is to group CQs so that they share their computation on the assumption that many CQs have similar structure. Grouping queries optimizes the evaluation of the queries by executing common operations in the group of queries just once. However, traditional grouping techniques are not suitable for CQs because their grouping raises new issues. In this paper we propose a scalable and dynamic CQ grouping technique. Our grouping strategy is incremental in that it scales to a large number of queries. It also re-groups existing grouped queries dynamically to maintain the effectiveness of the groups.
1
Introduction
Continual Queries (CQs) [3,12,13,7] are persistent queries that are issued once and then run at regular intervals, or whenever source data change, until a termination condition is satisfied. A CQ is a typical SQL query with an additional triggering condition (Trg) and termination condition (Trn). The triggering condition specifies when to evaluate the query, while the termination condition specifies when to stop evaluating it. CQs are of two types: change-based and time-based. Change-based CQs are fired when new data arrive at a source, while time-based CQs are executed at regular intervals. CQs relieve users from having to re-issue their queries frequently to obtain new information that matches their queries from a data source [3,12,13,7]. The users receive new information automatically as it becomes available. CQs are particularly useful in an environment like the Internet, which is composed of large amounts of frequently changing information. For example, users might want to issue CQs of the form: “notify me during the next year whenever the BMW stock price drops by more than 5%”. CQ systems need to support a large number of CQs due to the scale of the Internet. One approach to designing CQ systems that scale is to use query grouping to share computation when many queries have a similar structure. This optimizes the evaluation of the queries by executing common operations in a group of queries just once [3,16,18,8]. Consequently, it avoids unnecessary query invocations over autonomous data sources on the Internet. However, the grouping of
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 31–42, 2002. © Springer-Verlag Berlin Heidelberg 2002
S. Khan and P.L. Mott
CQs raises new issues: (1) a CQ system has to handle a large collection of CQs due to the scale of the Internet; (2) CQs are not all made available to the system at the same time; (3) users’ requests are unpredictable and change rapidly; (4) CQs in a group can have different evaluation times. CQ groups are therefore dynamic: they change continually as old queries are deleted and new ones are added. In this case one or more groups may require dynamic re-grouping to maintain their effectiveness [3]. In addition, users may request a similar query with different evaluation times. One user may want to evaluate the query daily, while another may want to evaluate it when the profit increases by 60%. This makes sharing computation difficult [3]. Traditional grouping techniques [16,19,18,10] are not suitable for CQs because they are static and designed to handle a small number of queries [3]. In this paper we propose a novel grouping technique that is both scalable and dynamic. It is scalable because our grouping strategy is incremental: the existing groups are considered as possible choices for a new input query to join, rather than re-grouping all the queries in the system when a new query is added. The new input query is merged into an existing group if it matches that group; otherwise a new group, consisting of just the new query, is created. In addition, there is a dynamic step: as queries are added to and removed from the existing groups, the sizes of the groups change, and all or some of the groups are re-grouped automatically. The technique is based on our new execution model of grouped CQs. This model makes the technique more scalable and also permits a simplified approach to the dynamic re-grouping of queries. The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the new execution model of grouped CQs. Section 4 presents the matching technique in terms of matching conditions and levels. Section 5 describes the grouping management of CQs. It also describes how to select the best group among matched groups. Section 6 concludes the paper.
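The incremental strategy outlined in this introduction (join an existing group if the query matches it, otherwise open a new group) can be sketched as below. The sketch is an assumption-laden illustration: the names `CQ`, `Group`, `add_query` and `same_relations` are hypothetical, and the matching test used here (identical sets of base relations) is a deliberately naive placeholder for the matching conditions that the paper defines in Section 4.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class CQ:
    name: str
    relations: FrozenSet[str]  # base relations referenced by the query


@dataclass
class Group:
    representative: CQ  # stands in for the group query (GQ)
    members: List[CQ] = field(default_factory=list)


def same_relations(query: CQ, group: Group) -> bool:
    """Placeholder matching test: same base relations (an assumption)."""
    return query.relations == group.representative.relations


def add_query(groups: List[Group], query: CQ,
              matches: Callable[[CQ, Group], bool] = same_relations) -> Group:
    """Merge the query into the first matching group, else create a new one."""
    for group in groups:
        if matches(query, group):
            group.members.append(query)
            return group
    new_group = Group(representative=query, members=[query])
    groups.append(new_group)
    return new_group


groups: List[Group] = []
add_query(groups, CQ("qa", frozenset({"r1", "r2"})))
add_query(groups, CQ("qb", frozenset({"r1", "r2"})))
add_query(groups, CQ("qc", frozenset({"r2", "r3"})))
print([[q.name for q in g.members] for g in groups])  # [['qa', 'qb'], ['qc']]
```

Each arrival is compared only against existing groups, so the cost of adding a query depends on the number of groups rather than on the total number of queries already registered.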
2
Related Work
Grouping queries on the basis of common operations has applications in many areas, such as multiple query optimization [18,16], deductive queries [9], materialized view selection [4,8,19], and semantic data caching [2,10]. Multiple query optimization algorithms usually consist of two conceptual phases: identifying common sub-expressions, and constructing a global access plan for a small number of queries. Moreover, all the queries of a group are made available to the grouping system at the time of planning. The work of [9] on the problem of deriving query results from the results of other queries is also related to multiple query optimization, since it provides algorithms for checking common sub-expressions. Multiple query optimization has the advantage that queries can be grouped to share common data. Its disadvantages are that the algorithms are static and that it is expensive to extend them to a large number of queries in the Internet environment [3]. Related to query grouping is the problem of materialized view selection. However, the algorithms for the latter are also static and can only handle a small number of queries. The approach in [8] is interesting in that it manages view selection dynamically, but it is limited to multi-dimensional queries in a data warehousing environment. Similarly, semantic data caching [2,10] optimizes the processing of subsequent queries by using the results of previously asked queries. Further, [3] groups CQs incrementally on the basis of expression signatures. It treats time-based and change-based CQs uniformly. In that model, CQs are grouped on the server and the group query (GQ, the query which represents the group) is evaluated against the data source, while the other CQs of the group are deduced from the result of the GQ. However, this model does not handle dynamic re-grouping to maintain the effectiveness of the groups. In addition, the query-split scheme in the model replicates the group query results on the CQ server, so the model can proliferate versions of group query results under dynamic re-grouping.
3 Execution Model of Grouped CQs
The new execution model of grouped CQs is shown in Figure 1. It is based on differential evaluations of CQs. By differential evaluation, we mean that the CQ is evaluated on the changes that have been made in the base data since its previous evaluation instead of on the whole base data. This technique bypasses the complete evaluation of the CQ after the initial evaluation and subsequently reduces the data transmission [11,3]. We maintain base data changes in differential relations, which are held on the CQ server to make the system scalable. Each CQ is evaluated for the first time against the data sources. After initial evaluation, single relation CQs are evaluated locally against these differential relations. However, multi-relation CQs can never be wholly evaluated against differential relations. They also need to access data drawn from base relations which are known as auxiliary data [17].

Fig. 1. Execution model of continual queries. [Diagram labels: Clients; CQ Server (CQ Grouping, CQ Differential Evaluation, ∆r1, ∆r2, a r1, a r2; queries Q1(r1) and Q6(r2); groups G1:Q2 {Q2(r1, r2), Q3(r1, r2), Q4(r1, r2)} and G2:Q5 {Q5(r1, r2)}); Wrapper; Internet; Data Sources on the Internet (r1, r2).]
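Differential evaluation of a single-relation CQ can be illustrated with a minimal sketch. It assumes that a differential relation is kept as two lists of inserted and deleted tuples and that the CQ is a plain selection; the names `DeltaRelation` and `differential_eval` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DeltaRelation:
    """Hypothetical differential relation: changes since the last evaluation."""
    inserted: List[Row] = field(default_factory=list)
    deleted: List[Row] = field(default_factory=list)


def differential_eval(result: List[Row], delta: DeltaRelation,
                      predicate: Callable[[Row], bool]) -> List[Row]:
    """Refresh a single-relation selection CQ from the delta only."""
    # Deleted base tuples that satisfied the predicate leave the result.
    gone = [row for row in delta.deleted if predicate(row)]
    kept = [row for row in result if row not in gone]
    # Inserted base tuples that satisfy the predicate enter the result.
    return kept + [row for row in delta.inserted if predicate(row)]


# Initial evaluation against the data source (done once).
r1 = [{"a": 15, "b": "c1"}, {"a": 11, "b": "c3"}]
cq = lambda row: row["a"] > 12
result = [row for row in r1 if cq(row)]

# Later evaluations touch only the changes held on the CQ server.
delta_r1 = DeltaRelation(
    inserted=[{"a": 20, "b": "c5"}, {"a": 3, "b": "c6"}],
    deleted=[{"a": 15, "b": "c1"}],
)
result = differential_eval(result, delta_r1, cq)
print(result)  # [{'a': 20, 'b': 'c5'}]
```

A multi-relation CQ could not be refreshed this way, because joining the delta of one relation requires tuples of the other base relations (the auxiliary data).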
S. Khan and P.L. Mott
Multi-relation CQs are grouped on the CQ server. Each group of queries has its own GQ. A GQ is just an ordinary query which is pruned of the projection operator to simplify the insertion of a new query into an existing group dynamically and to make grouping more scalable. To retrieve auxiliary data, we derive auxiliary queries from a group query. These auxiliary queries are run when changes occur in differential relations (see [7] for details). These data are then stored in auxiliary relations on the CQ server. Multi-relational CQs are evaluated against differential and auxiliary relations after an initial evaluation.
4 Matching Technique
In grouping queries, it must be determined whether a given query can be computed from the GQ; query matching is performed to decide this. In this section we define the query matching technique used for query grouping. We formally define a GQ first. Let a database d = {r1, r2, ..., rn} be a set of base relations, ri over the scheme {ai1, ai2, ..., aimi} = attr(ri), where mi is the total number of attributes of ri. Then:

Definition 1 (Group Query). A query Q is a triple (rQ, pQ, aQ), where rQ ⊆ d, aQ ⊆ ⋃_{r ∈ rQ} attr(r), and pQ = C1 ∧ · · · ∧ Ck. Here each Ci is a clause, and each clause can be a simple predicate or a disjunction of simple predicates, e.g. Ck = (a11 op c1 ∨ a11 op c2), where op ∈ {<, ≤, >, ≥, =} and c is a constant or an attribute of a base relation.

We use the term “group query, G,” or simply “group” for a single query which is being considered as a representative of a group of queries. In the above definition, rQ defines the base relations of the query, pQ indicates the selection and join predicates of a query Q, and aQ are the attributes that are involved in the query predicate pQ.

4.1 Query Matching
In query matching, conditions are checked for matching between a GQ and a new CQ. These matching conditions are defined on the query relations, the predicates, and the attributes of the predicates of the group and the new query. This keeps the matching technique simple and usually predicts a matching level at an early step of the matching process. Consequently, the use of these query matching conditions reduces the computational complexity of match detection. We define the conditions under which a query Q = (rQ, pQ, aQ) matches to a group query G = (rG, pG, aG) as follows:

Condition 1 (Relation condition). A Q can be matched to (i.e. computed from) G iff rG ∩ rQ ≠ φ.

Condition 2 (Attribute condition). The relationship between the attributes of Q and G identifies a matching level as follows: (1) if aG = aQ, then G can be an exact match to Q; (2) if aG ⊆ aQ, then G can be a containing match to Q; (3) if aQ ⊆ aG, then G can be a contained match to Q.
Condition 3 (Predicate Condition). A Q can be matched to G iff PQ → PG, or a G can be matched to Q iff PG → PQ.

The problem of checking implication has been investigated for a considerable time and is known as the satisfiability or the implication problem. The solution of this problem involves an algorithm whose computational complexity is generally exponential. However, restrictive algorithms are proposed in [9,2] for formulas that contain only attributes, constants, comparison operators (<, >, =), and logical connectives (AND, OR, NOT). The following lemma, described in [2,1], gives a constructive way of predicate satisfiability checking.

Lemma 1 (Predicate satisfiability check). Let PQ, PG be the predicates of queries Q and G, and Ci, Cj be clauses as in Definition 1. Then: (1) PQ → PG iff PQ → Cj for each Cj of PG; (2) PQ → Cj iff there is a clause Ci of PQ such that Ci → Cj; (3) if {A1, · · · , Ak} are the simple predicates of Ci and {B1, · · · , Bm} are the simple predicates of Cj, then Ci → Cj iff for each A in {A1, · · · , Ak} there is a B in {B1, · · · , Bm} such that A → B.

A → B is false if A and B involve different attributes. Table 1 provides a look-up table for the implication A → B. The entry in Table 1(a) indicates the relationship between constants c and c′ under which (x θ1 c) → (x θ2 c′) is true; a blank entry means that in no situation can it be true. An entry of an operator op (e.g. <) means that the implication holds provided that c op c′. In Table 1(b) a checked entry indicates that (x θ1 y) → (x θ2 y) is always true, where both x and y are attributes.

Table 1. Look-Up Table for A → B

(a) Unary Atoms
θ1 \ θ2 | x = c | < | ≤ | > | ≥
x = c   |   =   | < | ≤ | > | ≥
<       |       | ≤ | ≤ |   |
≤       |       | < | ≤ |   |
>       |       |   |   | ≥ | ≥
≥       |       |   |   | > | ≥

(b) Binary Atoms
θ1 \ θ2 | x = y | < | ≤ | > | ≥
x = y   |   √   |   | √ |   | √
<       |       | √ | √ |   |
≤       |       |   | √ |   |
>       |       |   |   | √ | √
≥       |       |   |   |   | √

Moreover, if a disjunction of simple predicates exists in a query, it can be converted into conjunctive form with elementary logic [1]. We now define query matching formally as follows.

Definition 2 (Query matching). A query Q matches to (i.e. is computed from) a group query G iff conditions 1-3 are satisfied.

Partial Query Matching. Consider queries Q and G with predicates PQ = (x ≤ 4 ∧ y ≥ 7) and PG = (x ≤ 5 ∧ y > 9), and rQ = rG = r over the scheme r = (x, y, z). The queries fulfill the relation and attribute conditions, but PQ ↛ PG. However, the predicates are obviously related in some way. Here, we define a new notion of partial matching in the interest of grouping scalability.
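The implication test of Lemma 1 combined with Table 1(a) can be sketched as follows. This is only an illustration: it covers unary atoms (attribute compared with a constant), writes ≤ and ≥ as `<=` and `>=`, and represents a predicate as a list of clauses, each clause being a list of (attribute, operator, constant) triples. These representations and all function names are assumptions.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}

# Table 1(a): TABLE_1A[theta1][theta2] is the comparison that must hold
# between c and c' for (x theta1 c) -> (x theta2 c'); a missing entry
# means the implication never holds.
TABLE_1A = {
    "=":  {"=": "=", "<": "<", "<=": "<=", ">": ">", ">=": ">="},
    "<":  {"<": "<=", "<=": "<="},
    "<=": {"<": "<", "<=": "<="},
    ">":  {">": ">=", ">=": ">="},
    ">=": {">": ">", ">=": ">="},
}


def atom_implies(a, b):
    """A -> B for simple predicates given as (attribute, op, constant)."""
    (x1, theta1, c1), (x2, theta2, c2) = a, b
    if x1 != x2:  # atoms over different attributes never imply each other
        return False
    cond = TABLE_1A[theta1].get(theta2)
    return cond is not None and OPS[cond](c1, c2)


def clause_implies(ci, cj):
    """Lemma 1(3): every disjunct of Ci implies some disjunct of Cj."""
    return all(any(atom_implies(a, b) for b in cj) for a in ci)


def predicate_implies(pq, pg):
    """Lemma 1(1)-(2): each clause of PG is implied by some clause of PQ."""
    return all(any(clause_implies(ci, cj) for ci in pq) for cj in pg)


PQ = [[("x", "<=", 4)], [("y", ">=", 7)]]
PG = [[("x", "<=", 5)], [("y", ">", 9)]]
PG_RELAXED = [[("x", "<=", 5)], [("y", ">=", 7)]]

print(predicate_implies(PQ, PG))          # False
print(predicate_implies(PQ, PG_RELAXED))  # True
```

The first call reproduces the partial-matching example: (x ≤ 4) implies (x ≤ 5), but (y ≥ 7) does not imply (y > 9).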
Definition 3 (Partial query matching). A query Q partially matches to (i.e. may be partially computed from) a group query if the matching conditions on relations and attributes are satisfied but the predicate condition is not satisfied.

In some cases a partial matching can be converted into a matching by satisfying the predicate condition. The predicate condition can be satisfied by increasing or decreasing the attribute values in either of the query predicates. Those cases where the predicate condition can be satisfied are shown in Table 2.

Table 2. Look-Up Table for (Ai ↛ Aj), Unary Atoms
θ1 \ θ2 | x = c | < | ≤ | > | ≥
x = c   |       | ≥ | > | ≤ | <
<       |       | > | > |   |
≤       |       | ≥ | > |   |
>       |       |   |   | < | <
≥       |       |   |   | ≤ | <

We formally define a conversion function as follows.
Definition 4 (Conversion Function). Let (x θ1 c) and (x θ2 c′) be predicates of queries Q and G respectively. (x θ1 c) ↛ (x θ2 c′) means that the implication does not hold, but the predicates can be converted so that (x θ1 c) → (x θ2 c′) as follows.

IF θ1 is = and θ2 is (< or ≤) THEN replace (x θ2 c′) by (x ≤ c)
ELSE IF θ1 is = and θ2 is (> or ≥) THEN replace (x θ2 c′) by (x ≥ c)
ELSE replace (x θ2 c′) by (x θ1 c)
ENDIF

In our example, PQ = (x ≤ 4 ∧ y ≥ 7) ↛ PG = (x ≤ 5 ∧ y > 9) can be converted into a satisfied implication by changing (y > 9) to (y ≥ 7). Then PQ = (x ≤ 4 ∧ y ≥ 7) → PG = (x ≤ 5 ∧ y ≥ 7) and Q matches to G.

4.2 Matching Levels
This sub-section formally presents group matching levels. We have defined these levels in terms of relation, attribute and predicate conditions. They are shown in Table 3 and graphically illustrated in Figure 2. Similar definitions are given in [10] for query matching in semantic data caching but in the terms of query containment. Cases 4-6 repeat cases 1-3 but with only a partial match: these cases (4-6) will also be referred to as overlap matches.
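The partial levels (cases 4-6) rely on the conversion function of Definition 4, which might be sketched as below. Atoms are written as (attribute, operator, constant) triples with `<=` and `>=` standing for ≤ and ≥; this representation and the name `convert` are assumptions, and the sketch presumes it is called only for the convertible cases of Table 2.

```python
def convert(q_atom, g_atom):
    """Definition 4: rewrite G's atom so that Q's atom implies it.

    q_atom is (x, theta1, c) from the query Q, g_atom is (x, theta2, c')
    from the group query G. Returns the replacement for g_atom.
    """
    (x, theta1, c), (x_g, theta2, _c_prime) = q_atom, g_atom
    if x != x_g:
        raise ValueError("atoms must constrain the same attribute")
    if theta1 == "=" and theta2 in ("<", "<="):
        return (x, "<=", c)
    if theta1 == "=" and theta2 in (">", ">="):
        return (x, ">=", c)
    return (x, theta1, c)


# The running example: (y > 9) in G is relaxed to (y >= 7).
print(convert(("y", ">=", 7), ("y", ">", 9)))   # ('y', '>=', 7)
# An equality in Q against an upper bound in G.
print(convert(("a", "=", 12), ("a", "<", 10)))  # ('a', '<=', 12)
```

In both cases the group predicate is widened just far enough for the query predicate to imply it, which is what broadens the group.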
Table 3. Query matching levels and their conditions
Match. Levels         | Relation cond. | Attribute cond. | Predicate cond.
1. Exact              | rQ ∩ rG ≠ φ    | aQ = aG         | pQ → pG
2. Containing         | rG ∩ rQ ≠ φ    | aG ⊆ aQ         | pQ → pG
3. Contained          | rQ ∩ rG ≠ φ    | aQ ⊆ aG         | pG → pQ
4. Partial Exact      | rQ ∩ rG ≠ φ    | aQ = aG         | pQ ↛ pG
5. Partial Containing | rG ∩ rQ ≠ φ    | aG ⊆ aQ         | pQ ↛ pG
6. Partial Contained  | rQ ∩ rG ≠ φ    | aQ ⊆ aG         | pG ↛ pQ

Fig. 2. Matching levels. [Six panels, Case 1 to Case 6, each showing the relative extent of the Group and the Query.]

5 Grouping Management

In this section we describe our grouping technique. Generally there are a large number of GQs. We use a group index (e.g., a memory-resident hash table) to store these groups. The group index is formally defined as follows.

Definition 5 (Group Index). I = {<K1, G1>, <K2, G2>, ..., <Kn, Gn>} is an index for a set of group queries {G1, G2, · · · , Gn}, where Ki ≠ Kj if i ≠ j. Ki is the group identifier key. The structure of an entry in the group index is (K, rG, pG, aG).

The following example illustrates the group index.

Example 1. The given relations p (a, b, c) and q (x, y, z) are shown in Table 4. We also present a group index in Table 4 that contains GQs defined over these relations. This table represents the relations, predicates and predicate attributes of GQs, which are used for matching.

Table 4. Relations p, q and a group index of Example 1

Relation p
a  | b  | c
15 | c1 | Lds
11 | c3 | Hul
...

Relation q
x  | y  | z
c1 | 50 | c
c3 | 50 | s
...

Group index
K  | rG   | pG                         | aG
G1 | p, q | a > 10 ∧ y < 50 ∧ b = x    | a, b, x, y
G2 | p, q | a < 10 ∧ y > 50 ∧ b = x    | a, b, x, y
G3 | p, q | a > 20 ∧ c = Lds ∧ b = x   | a, c, x, y
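As an illustration, the group index of Table 4 can be held in a Python dictionary (a hash table), and the relation and attribute conditions can be used to find the candidate level before any predicate is examined. The function name `candidate_level` is hypothetical, predicates are stored as plain text because they are not evaluated here, and returning `None` when none of the three attribute relationships holds is an assumption.

```python
# Group index of Table 4: key -> (relations, predicate text, attributes).
group_index = {
    "G1": ({"p", "q"}, "a > 10 AND y < 50 AND b = x", {"a", "b", "x", "y"}),
    "G2": ({"p", "q"}, "a < 10 AND y > 50 AND b = x", {"a", "b", "x", "y"}),
    "G3": ({"p", "q"}, "a > 20 AND c = Lds AND b = x", {"a", "c", "x", "y"}),
}


def candidate_level(r_q, a_q, r_g, a_g):
    """Apply Conditions 1 and 2 only; the predicate condition comes later."""
    if not (r_q & r_g):      # Condition 1: the relations must intersect
        return None
    if a_q == a_g:           # Condition 2(1)
        return "exact"
    if a_g <= a_q:           # Condition 2(2): aG is a subset of aQ
        return "containing"
    if a_q <= a_g:           # Condition 2(3): aQ is a subset of aG
        return "contained"
    return None


# A new query over p and q whose predicate uses attributes a, b, x, y.
r_q, a_q = {"p", "q"}, {"a", "b", "x", "y"}
levels = {key: candidate_level(r_q, a_q, r_g, a_g)
          for key, (r_g, _pred, a_g) in group_index.items()}
print(levels)  # {'G1': 'exact', 'G2': 'exact', 'G3': None}
```

Whether a candidate ends up as a full or a partial match (Table 3) then depends on the predicate condition alone.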
Our grouping technique treats time-based and change-based CQs uniformly to make the technique more scalable. Hence the triggering and termination conditions are not considered in the grouping decision process. They are tested against source data and can also be grouped in order to reduce trigger processing cost [13]. A simple approach to CQ grouping would be to apply traditional grouping techniques by re-grouping all queries whenever a new query is added. This approach is not applicable to a large dynamic environment because of the
associated performance overhead [3] and so an incremental grouping strategy is proposed here. When a CQ is added to the system, the CQ is first pruned of any projections that it uses. The pruned CQ is then compared with the existing group representatives to see which group it joins before its projections are restored. If it matches to any GQ that group is selected, otherwise a new group is created. In the case that the matched group is exact or containing, the CQ simply joins the group. In the case that the matched group is contained then the GQ is replaced by the new CQ so as to accommodate this CQ in the group. This replacement broadens the group size. This means that the values of certain attributes in the GQ are increased or decreased to maximize the number of tuples in the intermediate relation of the GQ. Similarly, when the matched group is either partial exact or partial containing, the group size is broadened with the help of the conversion function. When the matched group is partial contained, the GQ is replaced by the input CQ and furthermore, the input CQ is extended to the GQ with the help of the conversion function. Consequently, the group size is broadened twice in the partial contained case. In the case that a CQ does not match to an existing GQ in the index, a group key is created by the system for a new group. The predicate of the query, pQ , base relation of the query, rQ , and the attributes that are involved in the query predicates, aQ are entered in the new group. The new group for the query is < K, (rG , pG , aG ) >. In the case of multiple matched GQs, it is necessary to select the best matched GQ. Otherwise, it may lead to a non-optimal grouping strategy [16,2]. The process of selecting a single GQ from among the matched GQs can be conceptually divided into two phases: matching and optimization. The matching phase detects a matching level of a GQ to an input CQ. The optimization phase searches for the best GQ among the matched GQs. 
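The per-level actions described above can be summarised in a small look-up table. This is only a restatement of the prose in code form; the level names and the `ACTIONS` table are illustrative assumptions.

```python
# Match level -> (GQ replaced by the new CQ, conversion function applied).
ACTIONS = {
    "exact":              (False, False),
    "containing":         (False, False),
    "contained":          (True,  False),
    "partial exact":      (False, True),
    "partial containing": (False, True),
    "partial contained":  (True,  True),
}


def broadenings(level):
    """Number of times the group is broadened when a CQ joins at this level."""
    replaced, converted = ACTIONS[level]
    return int(replaced) + int(converted)


for level in ACTIONS:
    print(level, broadenings(level))
```

The table makes the cost ordering visible: exact and containing matches leave the GQ untouched, while a partial contained match broadens it twice.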
The simplest approach to the selection procedure of a GQ is to perform the matching phase and then the optimization phase separately. However, this approach may entail navigating an extremely large search space in selecting the best group. An alternative is to integrate the matching with the optimization. This approach unifies the search space and avoids duplicate effort [2]. We employ this alternative in our selection strategy with the help of the graph searching algorithm described in [15].

5.1 Selection Criterion
In this subsection, we present a criterion for the selection of the best matching group among the matched groups. In distributed query processing, two types of costs are involved: local processing costs and communication costs. Communication costs are frequently measured by the number of bytes transmitted from one site to another [14,5]. When communication costs are present they are predominant [5,19]. Therefore our criterion for the best matching GQ is to minimize communication cost. Our main objective here is to select the GQ that transmits the least data over the network. Consequently, this selection criterion also improves query responsiveness and reduces storage consumption on the CQ server.
In general, we cannot determine exactly how many tuples an intermediate relation has without computing it. We are forced to estimate the size of the intermediate relation. The problem of developing methods for query size estimation in database query optimization has been studied for over fifteen years [14]. Interested readers may refer to [6] for a brief description of these methods. In this research we assume that a query size function is available, as follows.

Definition 6 (Size Function). Let Q be a query defined over a database. Size(Q) estimates the size of the intermediate relation that is produced by the query.

5.2 Selection Rules
Here we describe rules for the selection of the best group on the basis of our selection criterion. We merge these rules into the graph searching algorithm to search for the best group among the matched groups (see [6] for the complete algorithm). The rules are:
(1) The exact matched group is the best selection.
(2) The containing match level should be preferred over contained and all overlap matches.
(3) If two matched groups are containing matches, select the smaller group.
(4) If two matched groups are contained matches, select the larger group.
(5) The contained match level should be preferred over a partial contained match.
(6) If two matched groups are overlap or one is overlap and the other is contained, then find the sizes of each group query and the query, and select the group that has minimum size difference from the query.
It is noted that these rules consider all of the possible forms of multiple matched groups (some of the forms are not possible mathematically) and select a group on the basis of minimum data transmission. The rules prefer an exact match if there is one available so as to avoid broadening of the group size and the consequent additional data transmission. If an exact match is not possible, a containing match is preferred over contained/overlap matches, also in order to avoid broadening of the group size. When there is more than one group available at the preferred level, the choice between them is made by reference to the Size function. If this level is containing then the group with the smallest size is chosen to avoid additional data transmission. If the match level is contained then the group with the largest size is chosen to minimize the broadening of the group size. If there are two matches at the best level and one is contained and the other is partial contained, then the contained group is chosen so as to broaden the group size once instead of twice.
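One possible reading of rules (1)-(6) as a selection routine is sketched below. It is not the graph searching algorithm of [15]; it assumes that every matched group already carries its match level and a size estimate from the Size function, that the list of matches is non-empty, and that rule (6) is applied whenever an overlap level other than partial contained is present. All names are hypothetical.

```python
def select_group(matches, query_size):
    """Pick one group key from matches = [(key, level, estimated_size), ...]."""
    by_level = {}
    for key, level, size in matches:
        by_level.setdefault(level, []).append((key, size))

    if "exact" in by_level:                        # rule 1
        return by_level["exact"][0][0]
    if "containing" in by_level:                   # rules 2 and 3: smallest
        return min(by_level["containing"], key=lambda m: m[1])[0]

    contained = by_level.get("contained", [])
    others = set(by_level) - {"contained", "partial contained"}
    if contained and not others:                   # rules 4 and 5: largest
        return max(contained, key=lambda m: m[1])[0]

    # rule 6: overlap matches, possibly mixed with contained ones
    return min(matches, key=lambda m: abs(m[2] - query_size))[0]


print(select_group([("G1", "contained", 400),
                    ("G2", "containing", 900),
                    ("G3", "containing", 700)], query_size=500))   # G3
print(select_group([("G1", "contained", 400),
                    ("G2", "partial contained", 300),
                    ("G4", "contained", 450)], query_size=500))    # G4
print(select_group([("G1", "partial exact", 800),
                    ("G2", "contained", 450)], query_size=500))    # G2
```

The three calls exercise rule 3, rules 4 and 5, and rule 6 respectively.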
If both match levels are overlap or one is overlap and the other is contained, the group which has the minimum difference with the input query is chosen so as to broaden either the group query or input query size at least. To summarize, the group that transmits minimum data is selected dynamically in order to reduce data traffic over the network. Consequently, the query responsiveness should be improved and the storage consumption on the CQ server reduced. We now give an example to illustrate the concepts introduced earlier. Example 2. Assume that we have two relations: p (a, b, c) and q (x, y, z) shown in Table 4. Assume further that we have three groups G1, G2 and G3 shown in Table
4. Two CQs to be grouped, where projection is omitted in the interest of clarity, are: Q1 = σa>15∧y […]

[…] , · · · , <Keyn, Grop, Trmintd, Dscripsn>} is a list for the set of queries {Q1, Q2, ..., Qn}, where Keyi ≠ Keyj if i ≠ j. Keyi is a query identifier; Grop is the group identifier for the group that includes the query. Single relation queries have ‘Nil’ in this field because they do not have a group. Trmintd is a Boolean which is true when the query is no longer active. Dscripsn holds the description of the CQ (i.e., (Q, Trg, Trn)). The following example is used as an illustration.

Example 3. We present a query list in Table 5 that contains CQs defined over the relations p (a, b, c) and q (x, y, z) shown in Table 4. Projection is again omitted for the sake of clarity.

Continuous insertions and deletions of CQs from a group reduce the quality of the group in terms of data transmission and storage space on the CQ server. For example, unnecessary data is transmitted and stored on the server after the termination of Q2 in group G2 of Table 5. The query list is examined periodically. If the terminated CQ is a GQ, then it is deleted from the group index and all the non-terminated CQs of the group in the query list are re-grouped. Otherwise we simply remove the CQ from the query list. For example, after the termination of Q2 in group G2 of Table 5, the CQs in G2 need re-grouping. On the other hand, after the termination of Q3 in group G1 of Table 5, this query is simply removed. We do not remove a GQ immediately after it is terminated, but continue to use it as the group representative until such time as we can conveniently re-group. We believe that this periodic strategy of re-grouping is cheaper than immediate re-grouping in uniform query grouping in terms of system performance overhead.
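The periodic examination of the query list might look like the following sketch. The dictionary layout, the `representative` map from a group to the key of the CQ that currently serves as its GQ, and the choice of Q1 as the representative of G1 are all assumptions made for illustration; only the Key, Grop and Trmintd values follow Table 5.

```python
def sweep(query_list, representative):
    """Remove terminated CQs; return the keys of CQs that need re-grouping.

    query_list:     key -> {"group": ..., "terminated": ...}
    representative: group -> key of the CQ acting as its GQ (assumed)
    Both arguments are modified in place.
    """
    to_regroup = []
    terminated = [k for k, q in query_list.items() if q["terminated"]]
    for key in terminated:
        group = query_list.pop(key)["group"]
        if representative.get(group) == key:
            # The terminated CQ was the GQ: drop the group and re-group
            # every CQ of that group that is still active.
            del representative[group]
            to_regroup += [k for k, q in query_list.items()
                           if q["group"] == group and not q["terminated"]]
    return to_regroup


query_list = {
    "Q1": {"group": "G1", "terminated": False},
    "Q2": {"group": "G2", "terminated": True},
    "Q3": {"group": "G1", "terminated": True},
    "Q4": {"group": "G2", "terminated": False},
}
representative = {"G1": "Q1", "G2": "Q2"}

pending = sweep(query_list, representative)
print(pending)             # ['Q4']
print(sorted(query_list))  # ['Q1', 'Q4']
```

Q3 is simply dropped, whereas the termination of Q2 dissolves G2 and leaves Q4 to be re-grouped, as in the example above.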
Table 5. Example of Query List
Key | Grop | Trmintd | Descripsn
Q1  | G1   | No      | σa>10∧y…
Q2  | G2   | Yes     | …
Q3  | G1   | Yes     | …
Q4  | G2   | No      | …

[…]
The z in this formula is a member of whatever type or set of types the function f accepts as input values. This means that support is defined for all uncertain types, whether they are spatial or not. All the types for uncertain spatial data from [12] store the probability that the object itself exists separately. This probability is retrieved by the Existence operator such that Existence(A) is the probability that object A exists. All the uncertain data types and operations used in this paper rely on probabilities or probability density functions.

4.1 Base Types
An uncertain number can easily be modelled by a probability function. For a real number, this function would have to be defined as a probability density function, whereas for integers, it might be just a collection of probabilities for the number having particular values. The type for an uncertain number is called AUNumber. Many queries in spatial databases return Boolean values for data without uncertainty. Because a single Boolean value cannot indicate uncertainty, different ways of
E. Tøssebro and M. Nygård
answering these queries must be found. The simplest alternative is to introduce a third “Boolean” value, Maybe, which indicates that the answer is not known. This is described in [5]. However, a better approach is often to return the probability of the answer being true. Thus, values that would have been Boolean in the crisp case are probabilities in the uncertain case in this paper. This type is called AProb.

4.2 Uncertain Points
An uncertain point (AUPoint) is a point for which an exact position is not known. However, one usually knows that the point is within a certain area. One may also know in which parts of this area the point is most likely to be. An uncertain point is therefore defined as a probability density function P(x, y) on the plane. The support of this function is the area in which the point may be. To be able to model crisp points, P(x, y) must be allowed to be a Dirac delta function. A possible uncertain point is shown in Figure 1a. The central point in Figure 1a shows the expected position of the point. A set of uncertain points (AUPoints) is needed to implement operations that may return more than a single point value, even if they only get a single point as input. This type is the uncertain version of APoints from [6].
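A discretised illustration of an uncertain point is given below. It replaces the density P(x, y) by probabilities attached to grid cells, so a crisp point becomes a single cell carrying all the mass (a discrete stand-in for the Dirac delta). The dictionary representation and the function names are assumptions, not part of the model.

```python
def support(p):
    """Cells in which the point may be located (probability above zero)."""
    return {cell for cell, prob in p.items() if prob > 0}


def expected_position(p):
    """Probability-weighted mean position of the point."""
    total = sum(p.values())
    ex = sum(x * prob for (x, _y), prob in p.items()) / total
    ey = sum(y * prob for (_x, y), prob in p.items()) / total
    return ex, ey


# An uncertain point spread over three cells, and a crisp point.
uncertain = {(0, 0): 0.25, (1, 0): 0.5, (2, 0): 0.25, (5, 5): 0.0}
crisp = {(3, 4): 1.0}

print(sorted(support(uncertain)))    # [(0, 0), (1, 0), (2, 0)]
print(expected_position(uncertain))  # (1.0, 0.0)
print(expected_position(crisp))      # (3.0, 4.0)
```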
Fig. 1. Uncertain Spatial Types: (a) Uncertain Point, (b) Uncertain Curve, (c) Uncertain Face. [Diagram labels: P, PC, ∆P, Gradient.]
4.3 Uncertain Lines
The line type as defined in [6] is a set of curves where each member is a simple curve. The first step in developing a model for an uncertain line is therefore to create a model for an uncertain curve (AUCurve). An uncertain curve is a curve for which the exact shape, position or length is not known, but it is known in which area the curve must be. An example of an uncertain curve is shown in Figure 1b. It may also be known where in this area the curve is most likely to be. The dotted line in Figure 1b exemplifies this. When seen along a line crossing it, a crisp curve would look like a point, or a set of points in the case of multiple crossings. When seen along the same line, an uncertain curve should be a probability density function indicating where the curve is most likely to cross. This function may apply along the line marked “Gradient” in Figure 1b.
Uncertainty in Spatiotemporal Databases
When seen along its length, a probability function determines the probability of existence in each point. The line type (AULine) is a set of curves for the same reasons as given for points.

4.4 Uncertain Regions
An uncertain region (AURegion) is a set of uncertain faces. An uncertain face (AUFace) is one where the location of the boundary or even the existence of the face itself is uncertain. This may be modelled as a probability function P(x,y) which gives the probability that the point (x,y) belongs to the face. Support(P) must be a valid crisp face. Figure 1c shows an example of an uncertain face where the black area is the area in which the face is certain to exist and the grey area is the area of uncertainty.
5 The Time Type and Temporal Uncertainty
Temporal uncertainty has been studied in [3]. However, their model is based on discrete time. To be consistent with the way the spatial dimension has been modelled, the temporal dimension should also be continuous. Therefore, the model defined here uses continuous time. A crisp time instant can be modelled as a real number. An uncertain time instant may therefore be modelled by an uncertain real. The modelling of time intervals requires a type for a set of uncertain intervals. This type takes the Arange type constructor from [6] and adds uncertainty to it. An uncertain interval (AUInterval) is a probability function P(x) over the real number line which indicates the likelihood that the interval includes the number x. Figure 2a shows one example of such a function. In this example, there is uncertainty about the length of the time interval, represented by the two time intervals marked dT. Because an interval should be continuous, the probability function should not have areas in the middle where it is 0. This property is expressed by the NoDip function. This function is defined as follows:

$$\mathit{NoDip}(f) \equiv \forall x\, \forall y\, \forall z : \big( (f(x) > 0) \wedge (f(z) > 0) \wedge (x < y < z) \big) \rightarrow f(y) > 0$$
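On a sampled probability function the NoDip property reduces to checking that the positive samples form one contiguous run. The sketch below assumes that f is given as a list of values at increasing x; the name `no_dip` is hypothetical, and a sampled check can only approximate the continuous definition.

```python
def no_dip(samples):
    """True if no zero sample lies between two positive samples."""
    positive = [i for i, value in enumerate(samples) if value > 0]
    if not positive:
        return True  # the condition holds vacuously
    first, last = positive[0], positive[-1]
    return all(samples[i] > 0 for i in range(first, last + 1))


print(no_dip([0, 0.5, 1, 1, 0.5, 0]))  # True: a single uncertain interval
print(no_dip([0, 1, 0, 1, 0]))         # False: the function dips to 0
```

A function that fails the check describes more than one interval and would belong to the range type rather than the interval type.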
Fig. 2. Examples of uncertain time intervals: (a) Core and Support, (b) Uncertainty as to the number of intervals. [Axis labels: probability up to 1 against time t, with uncertain ends marked dt.]
The interval must be defined over a particular type of number α, such as integer or real.

Definition 1: An uncertain interval is defined as follows.

$$A_{UInterval(\alpha)} \equiv \{\, IF_\alpha \mid \mathit{ProbFunc}(IF_\alpha) \wedge \mathit{PieceCont}(IF_\alpha) \wedge \mathit{Support}(IF_\alpha) \in A_{Interval(\alpha)} \wedge \mathit{NoDip}(IF_\alpha) \,\}$$

AInterval(α) is the set of all continuous crisp intervals over the type α. This defines a single interval, but many operations may return a set of such intervals. This requires the AURange type.

Definition 2: The uncertain range is defined as a set of uncertain intervals.

$$A_{URange(\alpha)} \equiv \{\, UR \subseteq A_{UInterval(\alpha)} \mid \mathit{Finite}(UR) \wedge \forall (a_i \in UR)\, \forall (b_i \in UR) : (a_i \neq b_i \rightarrow \mathit{Disjoint}(a_i, b_i)) \,\}$$
To find the time instant at which the time interval starts, take the derivative of the probability function of the time interval in the dT intervals. Each disjoint region on the real number line in which this derivative is not zero is a time instant which bounds the time interval. Our types for time instants and intervals are chosen because they can model both uncertainty as to the length of a time interval and uncertainty as to exactly how many time intervals there are in a set. The last can be done by using a probability function like that shown in Figure 2b. In this case, the set of bounding time instants described in the previous paragraph will also contain some uncertain time instants corresponding to the area of uncertainty in the middle of Figure 2b.

5.1 Turning Spatial Types into Spatiotemporal Types
Types for uncertain temporal data may be derived from the non-temporal data types. If A is a data type, let TA be its temporal version. The value of TA in each time instant in which it exists must be a valid member of A. It is therefore natural to define TA as a function from time to A. Uncertainty about when an object first appeared or ceased to exist may be indicated by making the existence of the A’s close to the start or end uncertain. The uncertainty of existence of TA becomes a function of time that indicates the temporal uncertainty. The uncertain spatial data types in Section 4 may be transformed into spatiotemporal types using the “UMoving” type constructor.

Definition 3: The UMoving type constructor is defined as follows:

$$A_{UMoving(\alpha)} \equiv \{\, f^{M}_{\alpha} \mid f^{M}_{\alpha} : A_{instant} \rightarrow \bar{A}_{\alpha} \wedge \mathit{PartFunc}(f^{M}_{\alpha}) \wedge \mathit{Finite}(\mathit{Components}(f^{M}_{\alpha})) \wedge \mathit{PieceCont}(f^{M}_{\alpha}) \,\}$$

This type constructor creates a temporal type from a non-temporal type. For instance, AUMoving(AUPoint) is the carrier set for a temporal, or moving, uncertain point. This carrier set contains all functions from time to AUPoint which satisfy the above criteria.
In this definition, f^M_α is the function from TA to A. The first two conditions in the definition are the same as those of the moving(α) type constructor from [6], and are there to ensure implementability. PartFunc(f) is true iff f is a partial function. The difference between A_α and Ā_α is that Ā_α does not include the empty set. The final condition ensures that the probability values or probability densities are piecewise continuous in time as well as in space. This is necessary for implementability as well as to ensure that the return values of some operations are valid. The function f^M_α is piecewise continuous if the probability values in any single crisp point, for the values returned by f at different time instants, are piecewise continuous. An example of a moving type is the moving uncertain point, AUMoving(AUPoint). This type is a function f from Ainstant to AUPoint, that is, for any time instant in which f is valid, a value of type AUPoint is returned. Notice that f takes a crisp time instant, not an uncertain one. Because this function must be piecewise continuous, there may not be any time instant where the point is at a completely different location from the time instants immediately before and after it. This method of extending a model for indeterminate spatial data to spatiotemporal data may also be used for some other models. It may, for instance, be used to extend Schneider’s model for vague regions from [10] with only minor modifications.
6 Operations on Uncertain Data
An important part of a set of data types is a general definition of the operations that can be applied to them. The operations described in [6] for spatiotemporal data have been evaluated. The operations from [6] that the evaluation showed to be most important are described. In this paper, the emphasis is on the temporal dimension. The operations are divided into two categories, those that are applied to data with no uncertainty, but which are meaningless for uncertain data, and those that can be applied to both kinds of data.
Table 1. Type Designators (all of these stand for uncertain data types except for CX)
Letter | Type
Po     | Point
Ps     | Points
L      | Line
F      | Face
Re     | Region
S      | Spatial (Point, Curve or Face)
N      | Number (Real or Time Instant)
I      | Interval (of Number)
Ra     | Range (of Number)
NI     | Non-Spatial (Number or Range)
T      | Any non-spatial or spatial type
Pr     | Probability
MOV(X) | Moving (X)
CX     | Crisp X
In this section, the letter name of the variable describes its type as given in Table 1. A signature of the type S × S → S means that both the inputs must be of the same type, and the output is of the same type as the input. In a signature of the type S × Re → Pr, the S input is neither limited to the type of the other input nor of the output. In the semantics for the operations, the letter R is used for the result, A for the first input and B for the second input.

6.1 Lifting
Like operations for crisp data, operations for uncertain data can be lifted into operations for uncertain spatiotemporal data using the method described in [6]. The basic idea is that the lifted operation takes temporal versions of the normal input parameters as input and returns a temporal version of the normal output. The prefix MOV is used to indicate a moving, or temporal, type as opposed to a non-moving type.

6.2 Operations on Crisp Data which Are Meaningless for Uncertain Data
There are two operations for temporal data which cannot be determined when there is uncertainty about the time. They are listed in Table 2.
Table 2. Operations which are meaningless for temporal uncertain data
Operation
Initial
Final
Initial (MOV(T) → T): If there is uncertainty about when an object first appeared, its initial value cannot be determined. As an example, a region B is known to exist at time 3, and to overlap region A. It is also known that region B did not overlap region A before time 2, and that region B did not exist prior to time 1. In this case determining an initial shape is impossible. If B began to exist before time 2, it could not have overlapped region A at the start, but if it came into being after time 2, it could have overlapped region A right from the start.

Final (MOV(T) → T): If there is uncertainty about when an object ceased to exist, its final value cannot be determined. The argument is the same as for Initial.

Because Initial and Final cannot be answered, replacement strategies are needed. One method would be to have operations that return the shape of the object at the first and last time instants in which it is known to exist. Another is to take the shape of the object at an early time, such as when its probability of existence is 0.5, or even when the probability of existence becomes greater than zero.

Projections: Many operations in both spatial and spatiotemporal databases involve projecting a data type down into a universe with fewer dimensions than the universe in which the argument was defined. The exact probabilities of the projection cannot
Uncertainty in Spatiotemporal Databases
51
3
be determined without knowledge of the underlying probability model . Therefore, the best one can do is to create approximations. One could for instance project the support of the data type to get an outline of the projected result. Alternatively, for each projected value one might use the maximal value of all the values projected into it. One example of this would be if one wanted to know where a moving region has been at any point in time, one might use the maximal probability for each point in space. Table 3. Temporal restriction operations
Operation    Signature                 Semantics
At           MOV(NI) × I → MOV(NI)     R(t)(x) = A(t)(x) ⋅ B(x)
At           MOV(S) × F → MOV(S)       R(t)(x,y) = A(t)(x,y) ⋅ B(x,y)
At_Instant   MOV(T) × CN → T           R(v) = A(B)(v)
At_Instant   MOV(T) × N → T            R(v) = ∫_t A(t)(v) ⋅ B(t) dt
6.3
Operations which May Be Used on Both Crisp Data and Uncertain Data
Temporal restriction operations. These operations restrict a spatiotemporal object to certain times. The semantics of these operations are given in Table 3.
At: This operation limits A to the times and places in which it intersects B. Checking At with a second argument other than a face or interval does not make sense in the uncertain case because the probability that the point or curve A intersects the point or curve B is 0.
At_Instant: This operation returns the shape of the object at a time instant. For the uncertain case an integral is used because one must consider the shape of the object in the entire period of uncertainty.
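The integral used by At_Instant in the uncertain case, R(v) = ∫_t A(t)(v) ⋅ B(t) dt, can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the moving object is encoded as a function from time to a dictionary of value probabilities, the uncertain instant as a probability density function, and the integral is approximated with the midpoint rule. The names at_instant, moving_value and uniform_density, and the toy data, are hypothetical and not taken from the paper.

```python
def at_instant(moving, density, t_lo, t_hi, steps=1000):
    """Approximate R(v) = integral over t of A(t)(v) * B(t) dt (midpoint rule).

    moving(t)  -> dict mapping each value v to A(t)(v)   (hypothetical encoding)
    density(t) -> B(t), the probability density of the uncertain instant
    """
    dt = (t_hi - t_lo) / steps
    result = {}
    for i in range(steps):
        t = t_lo + (i + 0.5) * dt          # midpoint of the i-th sub-interval
        weight = density(t) * dt           # probability mass of that sub-interval
        for value, prob in moving(t).items():
            result[value] = result.get(value, 0.0) + prob * weight
    return result


def moving_value(t):
    # Toy moving object: "north" with probability t, "south" otherwise.
    return {"north": t, "south": 1.0 - t}


def uniform_density(t):
    # Uncertain instant spread uniformly over [0.2, 0.6]; density is 1 / 0.4.
    return 2.5 if 0.2 <= t <= 0.6 else 0.0


shape = at_instant(moving_value, uniform_density, 0.2, 0.6)
print(shape)
```

With these toy inputs the result for "north" is approximately 0.4, the average of A(t)("north") = t over the period of uncertainty, which is what the integral prescribes.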
Table 4. Projections from space-time into space or time
Operation    Signature            Semantics
Def_Time     MOV(T) → Ra          This is a lifted version of Existence
Locations    MOV(Po) → Ps         R = {p | ∃t : (p = A(t)) ∧ ∃(ε > 0) ∀a ∀b : (t-ε < a < t+ε ∧ t-ε < b < t+ε) → (A(a) = A(b))}
Present      MOV(T) × N → Pr      Existence(At_Instant(A, B))
Trajectory   MOV(P) → L           Projection (See section 6.2)
Projections from space-time. These operations create projections of a spatiotemporal object into either space or time. The semantics of these operations are defined in Table 4.
Footnote 3. The underlying probability model describes what exactly it is that varies. In a lake with varying water level, there is really only one random variable, the water level. For a geological feature with uncertain measurements, however, the location and shape of different parts of the boundary are independent random variables.
Def_Time: This operation returns the uncertain time interval in which the object A exists. Because this is the same as would be returned by a lifted version of Existence, no new definition is needed for this operation.
Locations: This operation returns the portions of the projection of an uncertain moving point which are points themselves. The projection of an uncertain moving point will only be a point when it stands still. This operation is therefore defined to return all the uncertain locations in which the point has stood still.
Present: This function must return a probability rather than a Boolean value, because it may be uncertain whether the object is present or not.
Trajectory: This operation returns the portions of the projection of an uncertain moving point which are lines. These are the projections of the uncertain point when it is moving. It contains the entire projection except for those periods in which the point has been standing still.
7
Discussion
Our work introduces types for continuous uncertain time instants and time intervals using the type system described in [12] as a base and uses these to create types for spatiotemporal data with uncertainty both in the spatial and temporal dimensions. This is done by using a type constructor that creates spatiotemporal types from spatial ones. This means that our method can also be used to extend other abstract type systems for spatial data into type systems for spatiotemporal data. Additionally, some of the operations from [6] are evaluated for use in the uncertain case. Some of the operations are meaningless for uncertain data, but many can be used for both crisp and uncertain data. Some operations may also be used in new ways or on new data types when the data is uncertain. For uncertain Boolean values, the operations instead return the likelihood of the answer being true. We also describe how to use the concept of lifting from [6] in the uncertain case. This means that any operation for uncertain spatial data can also be used on spatiotemporal data. Unfortunately, the introduction of uncertainty also means that some operations cannot be answered. This applies to Initial, Final and many projections. In this paper, we have tried to come up with strategies or operations which could be used to replace these operations, but such replacements are either inaccurate or they have somewhat different semantics.
References
1. E. Clementini and P. Di Felice: An Algebraic Model for Spatial Objects with Indeterminate Boundaries. In Geographic Objects with Indeterminate Boundaries, GISDATA series vol. 2, Taylor & Francis, 1996, pages 155–169.
2. G. Cohn and N. M. Gotts: The ‘Egg-Yolk’ Representation of Regions with Indeterminate Boundaries. In Geographic Objects with Indeterminate Boundaries, GISDATA series vol. 2, Taylor & Francis, 1996, pages 171–187.
3. E. Dyreson and R. T. Snodgrass: Supporting Valid-time Indeterminacy. In ACM Trans. on Database Systems, vol. 23, no. 1, pages 1–57.
4. M. Erwig, R. H. Güting, M. Schneider, M. Vazirgiannis: Abstract and Discrete Modeling of Spatio-Temporal Data Types. ACM-GIS 1998, pages 131–136.
5. M. Erwig and M. Schneider: Vague Regions. In Proc. 5th Symp. on Advances in Spatial Databases (SSD), LNCS 1262, pages 298–320, 1997.
6. R. H. Güting, M. F. Böhlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, M. Vazirgiannis: A Foundation for Representing and Querying Moving Objects. In ACM Transactions on Database Systems 25(1), 2000.
7. P. Lagacherie, P. Andrieux and R. Bouzigues: Fuzziness and Uncertainty of Soil Boundaries: From Reality to Coding in GIS. In Geographic Objects with Indeterminate Boundaries, GISDATA series vol. 2, Taylor & Francis, 1996, pages 155–169.
8. K. Lowell: An Uncertainty-Based Spatial Representation for Natural Resources Phenomena. SDH’94 vol. 2, pages 933–944.
9. M. Schneider: Modelling Spatial Objects with Undetermined Boundaries using the Realm/ROSE Approach. In Geographic Objects with Indeterminate Boundaries, GISDATA series vol. 2, Taylor & Francis, 1996, pages 155–169.
10. M. Schneider: Uncertainty Management for Spatial Data in Databases: Fuzzy Spatial Data Types. In Proc. 6th Int. Symp. on Advances in Spatial Databases (SSD), LNCS 1651, Springer Verlag, pages 330–351, 1999.
11. E. Tøssebro and R. H. Güting: Creating Representations for Continuously Moving Regions from Observations. In Proc. 7th Int. Symp. on Advances in Spatial and Temporal Databases (SSTD), LNCS 2121, pages 321–344, 2001.
12. E. Tøssebro and M. Nygård: Representing Uncertainty in Spatial Databases. Submitted for publication.
13. L. A. Zadeh: Fuzzy sets. Information and Control, 8, pages 338–353, 1965.
Integrity Constraint Enforcement by Means of Trigger Templates
Eladio Domínguez, Jorge Lloret, and María Antonia Zapata
Dpt. de Informática e Ingeniería de Sistemas. Universidad de Zaragoza. E-50009 – Zaragoza. Spain. fax: +34.976.76.11.32
{noesis,jlloret,mazapata}@posta.unizar.es
Abstract. The specification of data integrity controls in DBMS, and particularly support for triggers, is one of the most important features for database developers and administrators. However, it is recognized that the specification of a correct set of triggers is a difficult and error-prone task. Our proposal aims to facilitate such a task by suggesting a different method for determining constraints and triggers that check constraints when database updates take place. Specifically, the method proposes to define trigger templates in order to enforce constraints imposed in a schema pattern and to store them in a database. When the analyst specifies a particular conceptual schema (which matches the schema pattern) the associated triggers are automatically generated from the information stored in the trigger template database.
Keywords: Active database systems; Integrity constraint enforcement; Trigger generation; Trigger template
1
Introduction
Within the active databases field, triggers are considered a useful and powerful artifact for constraint enforcement [4,2]. However, although there is no question about their utility, it is recognized that the specification of a correct set of triggers is a difficult and error-prone task, which hampers their widespread use [10]. For this reason, the definition of methods and the construction of tools which facilitate the specification of triggers are current topics of interest [2].
Several approaches to trigger design assistance can be found in the literature. Some aim at providing tools or wizards which facilitate the task of trigger specification [10,2], while in other cases the aim is to generate triggers which enforce a set of integrity constraints [14,3]. Our proposal fits in this second approach. More specifically, our goal is to determine a procedure which, starting from a conceptual schema provided with a set of semantic integrity constraints, automatically generates a logical schema together with declarative constraints and triggers which enforce the semantic constraints imposed. With respect to the triggers, it is worth noting that we have concentrated on using triggers that check constraints but that do not cascade repairing actions.
To develop the procedure, we introduce the notions of schema pattern and trigger template. A schema pattern is a conceptual schema whose elements are denoted by means of symbols (they do not have any particular meaning), resulting in a generic schema which can be used as a pattern for specifying different specific conceptual schemas. A trigger template, on the other hand, is a parameterized trigger from which the actual code of a trigger can be obtained by substituting the parameters of the template with the specific values associated with the situation at hand.
Making use of these two notions, we propose to store in a so-called trigger template database the trigger templates that must be generated to enforce constraints of different fixed types imposed in a particular schema pattern. Once this database is constructed, every time the analyst designs a conceptual schema matching a schema pattern and with fixed type constraints, the triggers which enforce the constraints are automatically generated from the information stored in the database. Besides, every time a constraint is enabled or disabled, triggers are automatically regenerated in order to adjust them to the new situation.
The paper is organized as follows. In the next section the integrity constraint specification at the conceptual level is explained and, at the same time, a case study is presented. In Section 3 the procedure we propose to construct a trigger template database is presented and, in Section 4, the way in which such a database is used in a particular case is shown. Related work is discussed in Section 5 and, finally, conclusions and future work are stated in Section 6.
This work has been partially supported by DGES, projects TIC2000-1368-C03-01 and PB-96-0098-C04-01, and by University of Zaragoza, project UZ-00-TEC-04.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 54–64, 2002. © Springer-Verlag Berlin Heidelberg 2002
2
Conceptual Constraint Specification
The trigger generation procedure we propose in this paper can be applied using different conceptual techniques. In this paper we use the ER model provided with a formal constraint specification language (in particular the version proposed in [8]). Within the ER–model, the analyst represents by means of a schema the entity types, relationship types and attributes (s)he perceives in the situation of the real world under consideration. For example, the ER–schema of Figure 1(a) (specified following the notation proposed in [8]) represents a company in which it is perceived that the projects are developed within the departments of the company, that the departments are managed by bosses and that the projects are assigned to bosses.
[Figure 1: (a) ER–schema with entity types boss (id_boss:int), project (id_proj:int) and department (id_depart:int), and relationship types assigned_to, manages and developed_in; (b) triangular schema with entity types A1, A2, A3 and relationship types v1, v2, v3.]
Fig. 1. Example of ER–schema and triangular schema
With regard to the integrity constraints, that is, the conditions that data must hold at all times [1], an extended ER–Calculus is proposed in [8] in such a way that a constraint can be formally specified by means of formulas without free variables. For our purposes of trigger generation, we concentrate in this paper on inclusion dependency constraints (see [6]), and cardinality, connectivity and primary key constraints. An example of inclusion dependency constraint associated to the schema of Figure 1(a) is a constraint stating that the projects assigned to a boss are developed in the department managed by the boss. This constraint can be formulated as follows:
∀(p : Project) ∀(b : Boss) ∀(d : Department) assigned_to(p, b) ∧ manages(b, d) ⇒ developed_in(p, d)
It must be noted that the developed_in relationship type is not redundant, since a project can be developed in a department and not be assigned to the boss of the department.
An ER–schema, as we have said, represents aspects perceived in a real world situation. However, in order to obtain theoretical results, symbolic schemas (whose elements are denoted by means of symbols without any particular meaning) are used in the literature. The aim is to tackle the problems at a generic level and to achieve results which can be applied to different particular situations. With this aim, we consider symbolic ER–schemas, which we call ER–schema patterns. To be concrete, we use as case study the ER–schema pattern of Figure 1(b), referred to as the triangular schema, where we consider that every entity type of the triangular schema has a primary key associated to it (named pk_A1, pk_A2 and pk_A3, respectively) and that the analyst can only specify four types of relationship constraints: primary key of relationship type, cardinality, connectivity and inclusion dependency. With these restrictions only 30 different constraints (listed in Table 1) can be specified. In this table, {Z1, ..., Zr}, {Y1, ..., Ys} and {X1, ..., Xt} stand for, respectively, the primary key of relationship types v1, v2 and v3, and CNT stands for the function that counts the instances of a set.
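To make the inclusion dependency of the company example concrete, the following minimal sketch checks the formula assigned_to(p, b) ∧ manages(b, d) ⇒ developed_in(p, d) over small in-memory relations. It is only an illustration: the encoding of relationship types as sets of tuples, the function name violations and the sample data are assumptions, not part of the paper.

```python
def violations(assigned_to, manages, developed_in):
    """Return the (project, boss, department) triples that break the constraint
    assigned_to(p, b) AND manages(b, d) => developed_in(p, d)."""
    return [
        (p, b, d)
        for (p, b) in sorted(assigned_to)
        for (b2, d) in sorted(manages)
        if b == b2 and (p, d) not in developed_in
    ]


# Hypothetical instances of the three relationship types.
assigned_to = {("p1", "b1"), ("p2", "b1")}
manages = {("b1", "d1")}
developed_in = {("p1", "d1"), ("p2", "d2"), ("p3", "d1")}

bad = violations(assigned_to, manages, developed_in)
print(bad)  # [('p2', 'b1', 'd1')]
```

In this sample data, project p3 is developed in department d1 without being assigned to the boss of d1, which the constraint allows; this is the situation that makes developed_in non-redundant.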
3
Integrity Constraint Enforcement
The ER–schema specified by the analyst at the conceptual level has to be translated into a relational schema, for which different satisfactory translation algorithms can be found in the literature [7]. In particular, we consider the algorithm which produces a relational schema containing a table for each entity type and a table for each relationship type regardless of its cardinality. Integrity constraint enforcement is achieved by incorporating in the logical database schema statements that specify the constraints imposed at the conceptual level [12]. However, there is not a generally accepted way of performing such a translation, this being a subject of current research [2].
Table 1. Integrity constraints
PRIMARY KEY OF RELATIONSHIP TYPE
R1   ∀ (v, w : v1) Z1(v) = Z1(w) ∧ Z2(v) = Z2(w) ∧ ··· ∧ Zr(v) = Zr(w) ⇒ v = w
R2   ∀ (v, w : v2) Y1(v) = Y1(w) ∧ Y2(v) = Y2(w) ∧ ··· ∧ Ys(v) = Ys(w) ⇒ v = w
R3   ∀ (v, w : v3) X1(v) = X1(w) ∧ X2(v) = X2(w) ∧ ··· ∧ Xt(v) = Xt(w) ⇒ v = w
CARDINALITY
R4   ∀ (a1 : A1) CNT −[v | (v : v1) ∧ v.A1 = a1] ≤ 1
R5   ∀ (a1 : A1) CNT −[v | (v : v3) ∧ v.A1 = a1] ≤ 1
R6   ∀ (a2 : A2) CNT −[v | (v : v1) ∧ v.A2 = a2] ≤ 1
R7   ∀ (a2 : A2) CNT −[v | (v : v2) ∧ v.A2 = a2] ≤ 1
R8   ∀ (a3 : A3) CNT −[v | (v : v2) ∧ v.A3 = a3] ≤ 1
R9   ∀ (a3 : A3) CNT −[v | (v : v3) ∧ v.A3 = a3] ≤ 1
CONNECTIVITY
R10  ∀ (a1 : A1) CNT −[v | (v : v1) ∧ v.A1 = a1] ≥ 1
R11  ∀ (a1 : A1) CNT −[v | (v : v3) ∧ v.A1 = a1] ≥ 1
R12  ∀ (a2 : A2) CNT −[v | (v : v1) ∧ v.A2 = a2] ≥ 1
R13  ∀ (a2 : A2) CNT −[v | (v : v2) ∧ v.A2 = a2] ≥ 1
R14  ∀ (a3 : A3) CNT −[v | (v : v2) ∧ v.A3 = a3] ≥ 1
R15  ∀ (a3 : A3) CNT −[v | (v : v3) ∧ v.A3 = a3] ≥ 1
INCLUSION DEPENDENCY
R16  ∀ (a1 : A1) ∀ (a2 : A2) (v1(a1, a2) ⇒ (∃ a3 : A3) v2(a2, a3))
R17  ∀ (a1 : A1) ∀ (a2 : A2) (v1(a1, a2) ⇒ (∃ a3 : A3) v3(a1, a3))
R18  ∀ (a2 : A2) ∀ (a3 : A3) (v2(a2, a3) ⇒ (∃ a1 : A1) v1(a1, a2))
R19  ∀ (a2 : A2) ∀ (a3 : A3) (v2(a2, a3) ⇒ (∃ a1 : A1) v3(a1, a3))
R20  ∀ (a1 : A1) ∀ (a3 : A3) (v3(a1, a3) ⇒ (∃ a2 : A2) v1(a1, a2))
R21  ∀ (a1 : A1) ∀ (a3 : A3) (v3(a1, a3) ⇒ (∃ a2 : A2) v2(a2, a3))
R22  ∀ (a1 : A1) ∀ (a2 : A2) (v1(a1, a2) ⇒ (∃ a3 : A3) v2(a2, a3) ∧ v3(a1, a3))
R23  ∀ (a1 : A1) ∀ (a2 : A2) (v1(a1, a2) ⇒ (∃ a3 : A3) v2(a2, a3) ∨ v3(a1, a3))
R24  ∀ (a2 : A2) ∀ (a3 : A3) (v2(a2, a3) ⇒ (∃ a1 : A1) v1(a1, a2) ∧ v3(a1, a3))
R25  ∀ (a2 : A2) ∀ (a3 : A3) (v2(a2, a3) ⇒ (∃ a1 : A1) v1(a1, a2) ∨ v3(a1, a3))
R26  ∀ (a1 : A1) ∀ (a3 : A3) (v3(a1, a3) ⇒ (∃ a2 : A2) v1(a1, a2) ∧ v2(a2, a3))
R27  ∀ (a1 : A1) ∀ (a3 : A3) (v3(a1, a3) ⇒ (∃ a2 : A2) v1(a1, a2) ∨ v2(a2, a3))
R28  ∀ (a1 : A1) ∀ (a2 : A2) ∀ (a3 : A3) (v1(a1, a2) ∧ v2(a2, a3) ⇒ v3(a1, a3))
R29  ∀ (a1 : A1) ∀ (a2 : A2) ∀ (a3 : A3) (v1(a1, a2) ∧ v3(a1, a3) ⇒ v2(a2, a3))
R30  ∀ (a1 : A1) ∀ (a2 : A2) ∀ (a3 : A3) (v2(a2, a3) ∧ v3(a1, a3) ⇒ v1(a1, a2))
Within the relational model, all the integrity constraints could be enforced by means of triggers, but this is not suitable for performance reasons [15]. Declarative statements (such as not null, unique or foreign key) must be used whenever possible. For example, within our case study, following the criteria given in [15], which take into account the enforcement capabilities of current commercial database systems, the primary key and cardinality constraints (from R1 to R9) should be translated into declarative constraints, whereas the connectivity and the inclusion dependency constraints (from R10 to R30) must be enforced by means of triggers.
The enforcement by means of declarative constraints is, more or less, straightforward [15]. However, this is not the case when triggers come into play [2]. From now on, we focus our attention on the enforcement by means of triggers, using as an example the enforcement of the connectivity and inclusion dependency constraints. In order to do this, our approach proposes the construction of a database in which trigger templates are stored. Before presenting the schema of this database, we explain what trigger templates are and how they are constructed.
3.1
Trigger Template Construction Process
A trigger template is a trigger in which some values are established as parameters so that different particular triggers can be obtained by giving different values to the parameters. In particular, given a trigger template that enforces an integrity constraint of a schema pattern, when the analyst determines a specific conceptual schema (which matches the schema pattern) the actual code of the trigger for enforcing that constraint is obtained by substituting the parameters of the template with the specific values associated to the particular schema.
The code of the trigger templates is obtained by first establishing the (standard) triggers and then transforming them into trigger templates by determining and specifying their parameters. Unlike other approaches for trigger generation [3,5,11,14], which are based on an analysis of each constraint individually, we think that the trigger template code can be improved if an analysis of the constraints as a whole is done. With this aim, we propose to follow a different method which takes as input all the conceptual constraints the analyst can specify and a logical schema (obtained from an ER–schema pattern). Our method strives for the generation of improved trigger templates for each pair (operation o, table t) through the following steps.
Step 1. Establishment of the affected constraints. The goal of this step is to identify, for each pair (operation o, table t), those constraints enforced by means of triggers that can be violated as a consequence of the operation o on table t. We call these constraints affected constraints. For example, in our case study, the affected constraints for the pair (insert, v1) are R16, R17, R22, R23, R28 and R29.
Step 2. Specification of standard triggers. For each affected constraint, a trigger that checks the constraint (considered in isolation) against the pair (operation o, table t) is specified. As an example, a trigger which checks the constraint R28 for the pair (insert, v1) is shown in Figure 2 (in this work, we specify the triggers by using the Oracle specification [16]).

CREATE OR REPLACE TRIGGER R28_trigger
AFTER INSERT ON v1 FOR EACH ROW
DECLARE
  excep exception;
  constr_violation boolean;
  found_row boolean;
  cursor cursor_c2 IS SELECT pk_A2, pk_A3 FROM v2;
  cursor cursor_c3 IS SELECT pk_A1, pk_A3 FROM v3;
  c2 cursor_c2%rowtype;
  c3 cursor_c3%rowtype;
BEGIN
  constr_violation := false;
  -- Scan v2 for rows related to the inserted A2 value
  OPEN cursor_c2;
  FETCH cursor_c2 INTO c2;
  WHILE (cursor_c2%found) AND NOT (constr_violation) LOOP
    IF c2.pk_A2 = :new.pk_A2 THEN
      -- Look for the matching (A1, A3) row in v3
      found_row := false;
      OPEN cursor_c3;
      FETCH cursor_c3 INTO c3;
      WHILE (cursor_c3%found) AND (NOT found_row) LOOP
        IF c3.pk_A1 = :new.pk_A1 AND c3.pk_A3 = c2.pk_A3 THEN
          found_row := true;
        END IF;
        FETCH cursor_c3 INTO c3;
      END LOOP;
      CLOSE cursor_c3;
      IF NOT found_row THEN
        constr_violation := true;
      END IF;
    END IF;
    FETCH cursor_c2 INTO c2;
  END LOOP;
  CLOSE cursor_c2;
  IF constr_violation THEN
    RAISE excep;
  END IF;
EXCEPTION
  WHEN excep THEN
    RAISE_APPLICATION_ERROR(-20001, 'Insertion not accepted');
END;

Fig. 2. Standard Trigger
Step 3. Specification of improved triggers.
The aim of this step is to improve the code of the triggers obtained in the previous step. To do this, the idea is to determine the triggers trying to take the maximum advantage of the information derived from the other constraints imposed by the analyst. For example, the code of the R28 trigger can be improved if some of the cardinality, connectivity or primary key of relationship type constraints are imposed. In particular, when the cardinality constraints R5, R7 and the connectivity constraints R13, R15 are also imposed, the R28 trigger can be substituted by the improved trigger of Figure 3(a), which takes into account the constraints imposed.

(a)
CREATE OR REPLACE TRIGGER improved_R28_trigger
AFTER INSERT ON v1 FOR EACH ROW
DECLARE
  excep exception;
  c2 v2%rowtype;
  c3 v3%rowtype;
BEGIN
  SELECT pk_A2, pk_A3 INTO c2.pk_A2, c2.pk_A3
    FROM v2 WHERE pk_A2 = :new.pk_A2;
  SELECT pk_A1, pk_A3 INTO c3.pk_A1, c3.pk_A3
    FROM v3 WHERE pk_A1 = :new.pk_A1;
  IF c3.pk_A3 != c2.pk_A3 THEN
    RAISE excep;
  END IF;
EXCEPTION
  WHEN excep THEN
    RAISE_APPLICATION_ERROR(-20001, 'Insertion not accepted');
END;

(b)
-- Parameter names in angle brackets are inferred from the trigger in (a)
TEMPLATE(<v1>, <v2>, <v3>, <pk_A1>, <pk_A2>, <pk_A3>)
CREATE OR REPLACE TRIGGER improved_R28_trigger
AFTER INSERT ON <v1> FOR EACH ROW
DECLARE
  excep exception;
  c2 <v2>%rowtype;
  c3 <v3>%rowtype;
BEGIN
  SELECT <pk_A2>, <pk_A3> INTO c2.<pk_A2>, c2.<pk_A3>
    FROM <v2> WHERE <pk_A2> = :new.<pk_A2>;
  SELECT <pk_A1>, <pk_A3> INTO c3.<pk_A1>, c3.<pk_A3>
    FROM <v3> WHERE <pk_A1> = :new.<pk_A1>;
  IF c3.<pk_A3> != c2.<pk_A3> THEN
    RAISE excep;
  END IF;
EXCEPTION
  WHEN excep THEN
    RAISE_APPLICATION_ERROR(-20001, 'Insertion not accepted');
END;

Fig. 3. Improved Trigger and Trigger Template
This example shows that the knowledge that certain constraints are imposed can lead to an improvement of the triggers. We capture this idea by means of the notion of influence. Given a pair (table t, operation o), a set of constraints {C1, C2, ..., Cn} influences a constraint C if the trigger which enforces C (considered in isolation) can be improved by simplifying its code due to the fact that {C1, C2, ..., Cn} are also imposed. We denote this fact as {C1, C2, ..., Cn} → C. For instance, as we have seen in the previous example, {R5, R7, R13, R15} → R28 with respect to the pair (insert, v1). We claim that the triggers which enforce the imposed constraints have to be determined taking the influences into account. Although with this idea some triggers need to be rewritten whenever a single constraint is disabled or enabled, we think this is not to the detriment of our approach because the rewriting is done automatically inside our approach.
All the detected influences which are considered worthwhile must be taken into account in order to determine improved triggers. Since in a concrete case only one of these triggers has to be generated, it is important to determine a total preference order among them so that t < t′ if the code of trigger t′ is considered an improvement on the code of trigger t.
Step 4. Specification of trigger templates. The goal of this step is to specify the trigger templates related to the different triggers obtained in the two previous steps. The generic elements of the code have to be detected and represented by means of parameters.
For the moment, the analyst has to detect the generic elements manually, although we are elaborating an algorithm which is a first approach to the automation of this process. An example of trigger template related to the improved R28 trigger is shown in Figure 3(b).
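The substitution of template parameters by the values of a specific schema can be sketched as plain text replacement. This is only an illustrative sketch, not the authors' tool: the `<name>` placeholder syntax, the function instantiate and the dictionary values are assumptions; the concrete table and column names are those of the company example.

```python
import re

# Hypothetical textual form of the improved R28 trigger template.
# The <name> placeholder syntax is an assumption made for this sketch.
TEMPLATE = """\
CREATE OR REPLACE TRIGGER improved_R28_trigger
AFTER INSERT ON <v1> FOR EACH ROW
DECLARE
  excep exception;
  c2 <v2>%rowtype;
  c3 <v3>%rowtype;
BEGIN
  SELECT <pk_A2>, <pk_A3> INTO c2.<pk_A2>, c2.<pk_A3>
    FROM <v2> WHERE <pk_A2> = :new.<pk_A2>;
  SELECT <pk_A1>, <pk_A3> INTO c3.<pk_A1>, c3.<pk_A3>
    FROM <v3> WHERE <pk_A1> = :new.<pk_A1>;
  IF c3.<pk_A3> != c2.<pk_A3> THEN RAISE excep; END IF;
EXCEPTION
  WHEN excep THEN
    RAISE_APPLICATION_ERROR(-20001, 'Insertion not accepted');
END;
"""


def instantiate(template, values):
    """Replace every <name> placeholder by values[name].

    A KeyError is raised if the template uses a parameter that has no value.
    """
    return re.sub(r"<(\w+)>", lambda match: values[match.group(1)], template)


# Values taken from the company schema of the case study.
values = {
    "v1": "project_boss", "v2": "boss_department", "v3": "project_department",
    "pk_A1": "id_proj", "pk_A2": "id_boss", "pk_A3": "id_depart",
}
trigger = instantiate(TEMPLATE, values)
print(trigger)
```

Failing on an unknown parameter, rather than leaving the placeholder in the output, is a design choice of this sketch: a partially instantiated trigger would not compile in the DBMS anyway.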
3.2
Trigger Template Database
The trigger templates generated following the method we have just described for a schema pattern have to be stored in the trigger template database and they will be retrieved for use within a particular situation. In order to do this, we propose to include in the database a table with the code and an identification of each trigger template.
trigger_template(ident_trigger, trigger_template_code)
In our case study this table contains all the feasible trigger templates obtained following the process explained previously. For example, there will be a row containing the code of the R28 trigger template and another with the code of the improved R28 trigger template.
In addition, in order to know which triggers must be generated for each particular situation, another table is included in the database. This table contains for each operation, table and affected constraint a set of influential constraints and the identification of the improved trigger template from which the trigger has to be generated. The preference order number of the templates which enforce the same constraint with respect to the same operation and table is also stored.
generated_trigger(operation, table, affected_constraint, influential_constraint_set, trigger_template_identification, order_number)
For example, there will be a row specifying that, for the operation insert on table v1 and the affected constraint R28, if no other influential constraints are imposed then the trigger template generated is the R28 trigger template. There will also be another row establishing that, for the same operation, table and affected constraint, if the constraints R5, R7, R13 and R15 are imposed then the improved R28 trigger template must be used.
4
Trigger Generation for a Specific Schema
Once the trigger template database has been created it can be used for automatically generating the triggers for a particular schema. In order to do this, the analyst has to design a schema which matches the given schema pattern and (s)he has to specify the constraints of the fixed types that must be imposed.
Before enforcing the imposed constraints, it must be checked whether they are consistent and non-redundant [15]. Consistency and redundancy checking algorithms can be found, for example, in [15]. To be specific, if a given set C of imposed constraints is inconsistent, then no enforcing analysis has to be done. If the set C is consistent, it is advisable to obtain a set Cmin by eliminating the redundancies of C (Cmin contains the constraints to be enforced) and a set Cmax which contains C and all the constraints logically implied by the constraints of C (Cmax will be used for obtaining an improved trigger).
The logical schema associated to the given conceptual schema is obtained using the method described in Section 3. The constraints of Cmin which can be enforced declaratively are stated at the logical level and the triggers which enforce the other constraints of Cmin are determined by means of the trigger template database. For each one of these constraints, the rows for which the constraint is indicated as affected have to be retrieved from the generated_trigger table, grouped by the operation and table columns. For each group obtained, the row with the highest preference order number whose influential constraints are contained in Cmax is chosen in order to obtain the trigger template to be used. For each chosen trigger template, the actual code is obtained by substituting the parameters of the template with the specific values associated with the situation at hand.
For example, let us suppose that the analyst specifies the ER–schema of Figure 1(a) with the constraints R5, R7, R13, R15 and R28. Then the row chosen is the row in which it is specified that, with respect to the insert operation on v1, the R28 constraint can be enforced by means of the improved R28 trigger template (we suppose that it has the highest preference order number according to the imposed constraints). This template is transformed into a trigger by substituting the parameters with the actual values. The trigger we obtain is shown in Figure 4.

CREATE OR REPLACE TRIGGER improved_R28_trigger
AFTER INSERT ON project_boss FOR EACH ROW
DECLARE
  excep exception;
  c2 boss_department%rowtype;
  c3 project_department%rowtype;
BEGIN
  SELECT id_boss, id_depart INTO c2.id_boss, c2.id_depart
    FROM boss_department WHERE id_boss = :new.id_boss;
  SELECT id_proj, id_depart INTO c3.id_proj, c3.id_depart
    FROM project_department WHERE id_proj = :new.id_proj;
  IF c3.id_depart != c2.id_depart THEN
    RAISE excep;
  END IF;
EXCEPTION
  WHEN excep THEN
    RAISE_APPLICATION_ERROR(-20001, 'Insertion not accepted');
END;

Fig. 4. Concrete Trigger
This process is performed for each chosen trigger template, obtaining the triggers which check the affected constraints specified in the specific schema.
Let us remember that our triggers do not cascade repairing actions. As a consequence, there are no problems of termination or confluence. Let us explain why there is no problem of termination. The data modification operations issued against the system fire our check triggers which, in turn, accept or reject the operations and finish their execution but do not fire other triggers. Once every trigger finishes, the whole process finishes and termination is achieved.
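The selection rule described in this section (for each affected constraint, operation and table, choose the row with the highest preference order number whose influential constraints are contained in Cmax) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the rows of the generated_trigger table are modelled as tuples, and the template identifiers, the order numbers 1 and 2 and the function name choose_templates are hypothetical.

```python
# Hypothetical rows of the generated_trigger table:
# (operation, table, affected constraint, influential constraint set,
#  trigger template identification, order number)
GENERATED_TRIGGER = [
    ("insert", "v1", "R28", frozenset(), "R28_template", 1),
    ("insert", "v1", "R28", frozenset({"R5", "R7", "R13", "R15"}),
     "improved_R28_template", 2),
]


def choose_templates(rows, to_enforce, c_max):
    """Pick one template per (operation, table, affected constraint).

    to_enforce: constraints of Cmin that must be enforced by triggers
    c_max:      imposed constraints plus those logically implied by them
    """
    best = {}
    for operation, table, affected, influential, template_id, order in rows:
        if affected not in to_enforce:
            continue                      # constraint not imposed by the analyst
        if not influential <= c_max:
            continue                      # some influential constraint is missing
        key = (operation, table, affected)
        if key not in best or order > best[key][1]:
            best[key] = (template_id, order)
    return {key: template_id for key, (template_id, _) in best.items()}


# The analyst imposes R5, R7, R13, R15 and R28: the improved template wins.
chosen = choose_templates(
    GENERATED_TRIGGER, {"R28"}, {"R5", "R7", "R13", "R15", "R28"})
print(chosen)

# With R28 alone, only the standard template is applicable.
fallback = choose_templates(GENERATED_TRIGGER, {"R28"}, {"R28"})
print(fallback)
```

The sketch takes Cmin and Cmax as given; computing them requires the consistency and redundancy analysis referred to above, which is outside the scope of this illustration.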
5
Related Work
The work [12] presents a very detailed comparison of different current methods for integrity constraint enforcement and view update. We use several aspects and characteristics of the comparison proposed in that paper in order to compare our work with other related works. Since in this paper we do not deal with view update, we do not take into account those elements of comparison proposed in [12] which are related to this issue.
Table 2. Related work
Article    Constraint enforcement   Definition language   Abstraction level   Predefined types   User participation
[3]        Maintain                 Relational, logic     Logical             No                 Yes
[9]        Maintain                 Relational, logic     Logical             No                 Yes
[5]        Maintain, check          O-O                   Conceptual          Yes                No
[11]       Maintain, Restore        Logic                 Logical             No                 No
[14]       Maintain                 Relational            Conceptual          No                 No
Our work   Check                    Logic                 Conceptual          Yes                No
Four different techniques for integrity constraint enforcement are distinguished in [12]. Of these, our work fits into the one based on the generation and execution of active rules or triggers. Therefore we compare our work only with those works that use the same constraint enforcement technique [3,9,5,11,14]. Table 2 summarizes this comparison.
The two most widely used approaches are integrity constraint checking and integrity constraint maintenance [12]. While the former simply rejects any transaction that violates some constraint, the latter tries to facilitate the work of the analyst by modifying the transaction (by means of repairing actions) in order to avoid any constraint violation (second column of Table 2). We have presented the trigger template generation process using the checking approach in the belief that this would be an easier way of explaining it. In any case, we think that this process can be applied independently of the chosen approach, so that it can also be useful for the constraint maintenance approach.
As for the language used to define integrity constraints, we have chosen logic (from [8]), as does [11] (third column). We have not used a relational language (like [3,9,14]) in order to allow the analyst to work at a conceptual level instead of a logical level (fourth column), with the aim of facilitating the schema design and constraint specification. We considered using an object-oriented language (like [5]), in particular the UML, but we discarded this option because of the lack of formality of the UML constraint specification language [13].
Another aspect of our work (similar to [5]) is that the analyst can only specify constraints of some predefined types (fifth column). As a consequence, one very critical decision when applying our approach is the selection of the types of constraints if a wide scope of application is required.
With regard to the user participation with respect to constraint enforcement (sixth column), our work does not require his/her intervention. The aim is to determine a complete automatic generation process. Finally, in addition to the above–mentioned comparison aspects extracted from [12], we want to comment on two aspects of our approach which are important to show its contribution to the constraint enforcement issue. On the one hand, the trigger generation is carried out trying to take the maximum advantage of the information derived from the other constraints imposed by the analyst, whereas other papers propose to generate the triggers considering each constraint independently [12]. The advantage of our approach is that it allows
Integrity Constraint Enforcement by Means of Trigger Templates
the improvement of the generated code. On the other hand, we wish to emphasize the notion of trigger template as a way of generating code which can be re–used in different circumstances. The advantage of using templates is also considered in [5] but is used for encoding rule templates instead of trigger templates.
6 Conclusions
In this paper, we propose a procedure for generating constraints and triggers which check constraints when database updates take place. The constraints are specified in an ER-conceptual schema by the analyst (from a predefined family of commonly used constraints) and then automatically translated into constraints and triggers. In order to do this, the notions of schema pattern and trigger template are introduced. In addition, the proposed trigger generation process takes into account all the imposed constraints as a whole with the aim of obtaining improved code.
Several lines of future research are open. It would be of interest to apply the trigger generation process to other schema patterns and other types of constraints, as well as to analyze the application of our approach to a general E/R schema containing instances of multiple schema patterns. Furthermore, ways of reducing the size of the trigger template database deserve to be investigated.
References
1. S. Van Baelen, J. Lewi, E. Steegmans, B. Swennen, Constraints in Object-Oriented Analysis, in S. Nishio, A. Yonezawa (Eds.), Object Technologies for Advanced Software, LNCS 742, Springer-Verlag, 1993, 393–407.
2. S. Ceri, R. J. Cochrane, J. Widom, Practical Applications of Triggers and Constraints: Successes and Lingering Issues, Proceedings of the 26th International Conference on VLDB, 2000, 254–262.
3. S. Ceri, P. Fraternali, S. Paraboschi, L. Tanca, Automatic Generation of Production Rules for Integrity Maintenance, ACM TODS, 19, 3, 1994, 367–422.
4. S. Ceri, J. Widom, Deriving Production Rules for Constraint Maintenance, Proceedings of the 16th International Conference on VLDB, 1990, 566–577.
5. I. A. Chen, R. Hull, D. McLeod, An execution model for limited ambiguity rules and its application to derived data update, ACM TODS, 20, 4, 1995, 365–413.
6. D. Dey, V. C. Storey, T. M. Barron, Improving Database Design through the Analysis of Relationships, ACM TODS, 24, 4, 1999, 453–486.
7. C. Fahrner, G. Vossen, A Survey of Database Design Transformations Based on the Entity-Relationship Model, Data & Knowledge Engineering, 15, 1995, 213–250.
8. M. Gogolla, An Extended Entity-Relationship Model, LNCS 767, Springer-Verlag.
9. M. Gertz, Specifying Reactive Integrity Control for Active Databases, Proceedings of RIDE'94, 62–70.
10. D. Lee, W. Mao, W. W. Chu, TBE: Trigger-By-Example, Conceptual Modeling – ER 2000, LNCS 1920, Springer, 2000, 112–125.
11. S. Maabout, Maintaining and Restoring Database Consistency with Update Rules, Workshop Dynamics'98 (postconference of the workshop JICLSP98), 1998.
12. E. Mayol, E. Teniente, A Survey of Current Methods for Integrity Constraint Maintenance and View Updating, in P. P. Chen, D. W. Embley, J. Kouloumdjian, S. W. Liddle, J. F. Roddick (Eds.), ER '99 – Workshop on Evolution and Change in Data Management, LNCS 1727, Springer, 1999, 62–73.
13. M. Richters, M. Gogolla, On formalizing the UML Object Constraint Language OCL, in T. W. Ling, S. Ram, M. L. Lee (Eds.), Conceptual Modeling – ER'98, Springer, 1998, 449–464.
14. K. D. Schewe, Consistency Enforcement in Entity-Relationship and Object-Oriented Models, Data & Knowledge Engineering, 28, 1, 1998, 121–140.
15. C. Turker, M. Gertz, Semantic Integrity Support in SQL-99 and Commercial (Object-)Relational Database Management Systems, U. C. Davis Computer Science Technical Report CSE-2000-11, 2001.
16. S. Urman, Oracle8 PL/SQL Programming, Oracle Press, 1998.
Current, Legacy, and Invalid Tuples in Conditionally Evolving Databases
Ole Guttorm Jensen and Michael Böhlen
Department of Computer Science, Aalborg University, Frederik Bajers Vej 7E, DK–9220 Aalborg Øst, Denmark, {guttorm | boehlen}@cs.auc.dk
Abstract. After the schema of a relation has evolved some tuples no longer fit the current schema. The mismatch between the schema a tuple is supposed to have and the schema a tuple actually has is inherent to evolving schemas, and is the defining property of legacy tuples. Handling this mismatch is at the very core of a DBMS that supports schema evolution. The paper proposes tuple versioning as a structure for evolving databases that permits conditional schema changes and precisely keeps track of schema mismatches at the level of individual tuples. Together with the change history this allows the DBMS to correctly identify current, legacy, and invalid tuples. We give an algorithm that classifies tuples, in time and space proportional to the length of the change history. We show how tuple versioning supports a flexible semantics needed to accurately answer queries over evolving databases.
1 Introduction
A conditionally evolving database permits schema changes that apply to selected tuples only. We define conditional schema changes to selectively evolve the schema of tuples that satisfy a change condition. Different change conditions are orthogonal to each other, and select tuples based on their attribute values. An immediate consequence of conditional schema changes is that there are multiple current versions. After n conditional schema changes there will be up to 2^n current versions. The exponential growth of the number of versions has several consequences. For example, it is infeasible to compute all current versions, let alone the entire evolution history. Similarly, an application cannot be expected to explicitly specify the version to be used when updating or querying the database. The DBMS must handle versions transparently, and all computations must be based on the change history, i.e., the sequence of conditional schema changes that have been applied to the original schema. We define evolving schemas to provide a precise semantics for conditionally evolving databases. When the schema evolves, the intended schema of a tuple and the actual schema of the tuple are no longer synchronized. For example, let Employee =
This research was supported in part by the Danish Technical Research Council through grant 9700780 and Nykredit, Inc.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 65–82, 2002. © Springer-Verlag Berlin Heidelberg 2002
(Name, Unit) model the name and unit of university employees, and let a conditional schema change add a Group attribute to the employees in the database unit. Obviously, all tuples that have been recorded before the schema change have no Group attribute. Because missing attribute values cannot be recovered in the general case, the only accurate solution is to precisely keep track of the mismatch. We introduce tuple versioning, where each tuple is associated with two schemas to keep track of actual and intended schema, respectively. In our example, we want to assert (Name, Unit) as the actual schema and (Name, Unit, Group) as the intended schema. Note that this only applies to tuples that model employees in the database unit. Tuples modeling employees in other units will have (Name, Unit) as their actual and intended schema. Tuple versioning eliminates the need to migrate data in response to schema changes. Migrating the data is a common approach to keep the extension and intension synchronized (cf. [17]). Unfortunately, data migration leads to subtle problems [14,19], and a general solution requires an advanced and complex management of multiple null values [19]. To illustrate the problems of data migration consider our initial example and the following queries: 1) Are employee tuples supposed to have a group attribute? 2) Are there employee tuples that are recorded with a group attribute? and 3) Are there employee tuples that are supposed to have a group attribute but are recorded without one? Note that these queries can be issued either before or after the schema change. To answer the first query the intended schema has to be considered. To answer the second query the actual schema has to be considered. To answer the third query both the intended and actual schema have to be considered. 
Committing to one of the schemas does not make it possible to answer all queries accurately (unless schema semantics is encoded in the attribute values and query evaluation is changed accordingly).
An evolving relation schema also requires the DBMS to handle legacy tuples. There are several sources for such tuples. If tuples are not migrated to make them fit the new schema, legacy tuples are already stored in the database. Legacy applications, which continue to issue queries and assertions after a schema change has been committed, are another source for legacy tuples. Based on the schema it is natural to distinguish three types of tuples: a current tuple is a tuple that fits the schema, a legacy tuple is a tuple that at some point in the past fitted the schema, and an invalid tuple is a tuple that is neither a current nor a legacy tuple.
In the presence of evolving schemas it is one of the key tasks of a DBMS to correctly identify current, legacy, and invalid tuples, and to resolve the mismatch between intended and actual schema for legacy tuples. In this paper we investigate the first issue. The correct classification of tuples is non-trivial and is needed to identify mismatches. We show that besides the current state of the evolving schema it is necessary to consider the history of the evolving schema to correctly identify current, legacy, and invalid tuples. The correct classification of tuples is reduced to path properties in the evolution tree. We give the ClassifyTuple algorithm to classify tuples without materializing the evolution tree. Given an initial schema and a change history the algorithm classifies tuples in time and space proportional to the length of the change history.
The paper proceeds as follows. Section 2 gives an overview of our solution and sets up a running example. Evolving schemas and conditional schema changes are defined in Sections 3 and 4, respectively. In Section 5 the properties of evolving schemas are explored. Section 6 defines change histories and evolution trees. The evolution tree defines the semantics of a change history and can be used to identify legacy tuples. In Section 7 we show how to accurately answer queries over evolving schemas by providing a flexible semantics for resolving attribute mismatches. Section 8 discusses related work. Conclusions and future work are given in Section 9.
2 Overview and Running Example
We use an evolving employee schema for illustration purposes throughout. Detailed definitions follow in the next sections. For simplicity we consider a database with a single relation. The solution is easily generalized and applies to each relation individually. We also restrict the discussion to schema changes. Updates of the data are completely orthogonal. Updates of the data change the attribute values and possibly the actual schema of tuples.
Assume a schema that models university employees: Employee = (Name, Position, Unit, Rank). This schema requires a name, position, unit, and rank for each employee. The rank is a number that determines the salary of an employee. As the number of members in the database unit grows it is decided to sub-divide the database unit into groups. We use a conditional schema change to change the schema for employees in the database unit to (Name, Position, Unit, Group, Rank). The schema (Name, Position, Unit, Rank) remains valid for all employees in other units. Note that we are left with two current and equally important schemas. A qualifier is associated with each schema to determine the association between tuples and schemas. A (schema, qualifier)-pair is a schema segment.
The second schema change allows full professors to negotiate individual salaries with the university. As a result full professors need to be recorded with a Salary attribute instead of a Rank attribute. Obviously, full professors can be members of any unit. Therefore, the schema change applies to the schema of the database unit as well as to the schema of the other units. This yields an evolving schema with a total of four schema segments, S, S', S'', and S''', as illustrated in Figure 1.
In Figure 1 each tuple is associated with an actual and an intended schema, respectively. The actual schema of a tuple (indicated by the solid line) is the schema that was used when the tuple was added to the database.
The intended schema of a tuple (indicated by the dashed line) denotes the schema the tuple is supposed to have according to the initial DDL statement and the subsequent schema changes. For example, tuples t1 and t2 in Figure 1 have (Name, Position, Unit, Rank)
Evolving Schema
Segment | Schema    | Qualifier
S       | N P U R   | ¬(U = db) ∧ ¬(P = full)
S'      | N P U G R | U = db ∧ ¬(P = full)
S''     | N P U S   | ¬(U = db) ∧ P = full
S'''    | N P U G S | U = db ∧ P = full
Tuples
t1 = (N/tom, P/full, U/db, R/28)
t2 = (N/rita, P/asst, U/is, R/19)
t3 = (N/john, P/asso, U/db, G/dw, R/22)
t4 = (N/kim, P/full, U/db, G/dw, R/31)
t5 = (N/anne, P/full, U/is, S/90.000)
Legend: recorded schema (solid line), conceptual schema (dashed line)
Fig. 1. The Evolving Employee Schema
as their actual schema because that was the schema that was used when they were asserted. The intended schema of t1 is (Name, Position, Unit, Group, Salary) because t1 models a full professor in the database unit and, according to our schema changes, this means that the tuple is supposed to have a Group attribute and the Rank attribute should be replaced by a Salary attribute. The intended schema of t2 is the same as its actual schema because Rita is not in the database unit and is not a full professor. Note that the actual schema of a tuple is never affected by schema changes. In contrast, the intended schema of a tuple changes whenever a conditional schema change applies to the tuple.
3 Evolving Schemas
An evolving schema, E = {S_1, ..., S_n}, generalizes a relation schema and is defined as a set of schema segments. A schema segment, S = (𝒜, P), consists of a schema 𝒜 and a qualifier P. Throughout, we write 𝒜_S (P_S) to directly refer to the schema (qualifier) of segment S. As usual, a schema is defined as a set of attributes: 𝒜 = {A_1, ..., A_n}. A tuple t is a set of attributes where each attribute is a name/value pair: {A_1/v_1, ..., A_n/v_n}. The value must be an element of the domain of the attribute, i.e., if dom(A) denotes the domain of attribute A then ∀A, v(A/v ∈ t ⇒ v ∈ dom(A)).
An attribute constraint is a predicate of the form Aθc or ¬(Aθc), where A is an attribute, θ ∈ {<, ≤, =, ≠, ≥, >} is a comparison predicate, and c is a constant. Note that because of attribute deletions some attributes may be missing from the schema. To allow for this, Aθc is an abbreviation for ∃v(A/v ∈ t ∧ vθc), and ¬(Aθc) is an abbreviation for ¬∃v(A/v ∈ t ∧ vθc). Note that this implies that the constraints ¬(A = c) and A ≠ c are not equivalent: a tuple without attribute A satisfies the former but not the latter.
A qualifier P is either true, false, or a conjunction of attribute constraints. A tuple t qualifies for a segment S, qual(t, S), iff t satisfies the qualifier P_S. A tuple satisfies a qualifier iff the qualifier is true or the tuple satisfies each
attribute constraint in the qualifier. An attribute constraint ∃v(A/v ∈ t ∧ vθc) is satisfied iff the tuple makes the attribute constraint true under the standard interpretation. A tuple t matches a segment S iff the schema of S and t are identical: match(t, S) iff ∀A(A ∈ 𝒜_S ⇔ ∃v(A/v ∈ t)). A tuple t fits a segment S, fit(t, S), iff qual(t, S) and match(t, S). 𝒜_S is the intended schema of t iff qual(t, S). 𝒜_S is the actual schema of t iff match(t, S).
Example 1. We use the Employee schema from Figure 1 to illustrate these definitions.
– Employee is an evolving schema with four segments, S, S', S'', and S'''.
– Segment S' = ({N, P, U, G, R}, U = db ∧ ¬(P = full)) states that the schema for employees who are working in the database unit and are not full professors is (N, P, U, G, R).
– Tuple t1 qualifies for S''' but not for any other segment: ¬qual(t1, S), ¬qual(t1, S'), ¬qual(t1, S''), and qual(t1, S'''). Thus, the schema of S''', (N, P, U, G, S), is the intended schema of t1.
– Tuple t1 matches segment S but does not match any of the other segments: match(t1, S), ¬match(t1, S'), ¬match(t1, S''), and ¬match(t1, S'''). Thus, (N, P, U, R), the schema of S, is the actual schema of t1.
– Tuple t1 does not fit any segment: ∀S_i(S_i ∈ Employee ⇒ ¬fit(t1, S_i)).
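The definitions of qual, match, and fit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tuples are modelled as dictionaries, an attribute constraint as a hypothetical 4-tuple (attribute, operator, constant, negated), and a qualifier as a list of such constraints (the empty list standing for true). None of these encodings come from the paper.

```python
import operator

# Comparison predicates; this operator table is an assumption of the sketch.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

def holds(t, constraint):
    """Evaluate an attribute constraint (attr, op, const, negated) on tuple t."""
    attr, op, const, negated = constraint
    # A θ c abbreviates ∃v(A/v ∈ t ∧ v θ c), so a missing attribute never satisfies it.
    positive = attr in t and OPS[op](t[attr], const)
    return positive != negated

def qual(t, segment):
    """t qualifies for the segment iff it satisfies every constraint of the qualifier."""
    _schema, qualifier = segment
    return all(holds(t, c) for c in qualifier)

def match(t, segment):
    """t matches the segment iff its attribute names equal the segment schema."""
    schema, _qualifier = segment
    return set(t) == set(schema)

def fit(t, segment):
    return qual(t, segment) and match(t, segment)

db, full = ("U", "=", "db", False), ("P", "=", "full", False)
not_db, not_full = ("U", "=", "db", True), ("P", "=", "full", True)

# The four segments of the Employee schema in Figure 1 (one letter per attribute).
segments = {
    "S":    ("NPUR",  [not_db, not_full]),
    "S'":   ("NPUGR", [db, not_full]),
    "S''":  ("NPUS",  [not_db, full]),
    "S'''": ("NPUGS", [db, full]),
}
t1 = {"N": "tom", "P": "full", "U": "db", "R": 28}

qualifying = [name for name, seg in segments.items() if qual(t1, seg)]
matching = [name for name, seg in segments.items() if match(t1, seg)]
fitting = [name for name, seg in segments.items() if fit(t1, seg)]
```

For t1 the sketch reproduces Example 1: the only qualifying segment is S''', the only matching segment is S, and no segment fits. It also shows why ¬(A = c) and A ≠ c differ: on a tuple without attribute G, the constraint ("G", "=", "dw", True) holds while ("G", "!=", "dw", False) does not.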
4 Conditional Schema Changes
A conditional schema change is an operation that changes the set of segments of an evolving schema. The condition C determines the tuples that are affected by the schema change. A condition is either true, false, or an attribute constraint. For the purpose of this paper we consider two conditional schema changes: adding an attribute, α_A, and deleting an attribute, β_A. An extended set of primitives that includes mappings between attributes and a discussion of their completeness can be found elsewhere [10].
An attribute A is added to the schemas of all segments that do not already include the attribute. For each such segment two new segments are generated: a segment with a schema that does not include the new attribute and a segment with a schema that includes the new attribute. Segments with a schema that already includes A are not changed.
An attribute A is deleted from the schemas of all segments that include the attribute. For each such segment two new segments are generated: a segment with a schema that still includes the attribute and a segment with a schema that does not include the attribute. Segments with a schema that does not include A are not changed.
Example 2. Let E = { ({N, P, U, R}, true) } be an evolving schema, and let β_R(E, P = full) be the conditional schema change that deletes the Rank attribute for full professors. This yields the evolving schema E' = { ({N, P, U, R}, ¬(P = full)), ({N, P, U}, P = full) }.
A ∈ 𝒜: (𝒜, P) → (𝒜, P)
A ∉ 𝒜: (𝒜, P) → (𝒜 ∪ {A}, P ∧ C), (𝒜, P ∧ ¬C)
α_A(∅, C) = ∅
α_A({(𝒜, P)} ∪ E, C) = {(𝒜, P)} ∪ α_A(E, C)   iff A ∈ 𝒜
α_A({(𝒜, P)} ∪ E, C) = {(𝒜 ∪ {A}, P ∧ C), (𝒜, P ∧ ¬C)} ∪ α_A(E, C)   iff A ∉ 𝒜
Fig. 2. Adding Attribute A on Condition C: α_A(E, C)
Fig. 3. Deleting Attribute A on Condition C: β_A(E, C)
Note that conditional schema changes properly subsume regular (i.e., unconditional) schema changes. It is always possible to have the condition (true) select the entire extent of a relation.
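The two schema changes can be sketched as set transformations. The following Python fragment is a minimal illustration, assuming a segment is a pair (frozenset of attribute names, frozenset of constraints) and a constraint is a hypothetical 4-tuple (attribute, operator, constant, negated). The deletion case mirrors the addition case of Figure 2 and is written to agree with Example 2; the function names are not from the paper.

```python
def negate(constraint):
    """Turn A θ c into ¬(A θ c) and vice versa."""
    attr, op, const, negated = constraint
    return (attr, op, const, not negated)

def add_attribute(E, attr, cond):
    """Sketch of α_A(E, C): split every segment whose schema lacks attr."""
    out = set()
    for schema, qualifier in E:
        if attr in schema:
            out.add((schema, qualifier))                      # unchanged
        else:
            out.add((schema | {attr}, qualifier | {cond}))    # (𝒜 ∪ {A}, P ∧ C)
            out.add((schema, qualifier | {negate(cond)}))     # (𝒜, P ∧ ¬C)
    return out

def delete_attribute(E, attr, cond):
    """Sketch of β_A(E, C): split every segment whose schema includes attr."""
    out = set()
    for schema, qualifier in E:
        if attr not in schema:
            out.add((schema, qualifier))                      # unchanged
        else:
            out.add((schema - {attr}, qualifier | {cond}))    # attribute removed where C holds
            out.add((schema, qualifier | {negate(cond)}))     # attribute kept where ¬C holds
    return out

# Example 2: delete Rank (R) for full professors; the empty qualifier stands for true.
is_full = ("P", "=", "full", False)
E = {(frozenset("NPUR"), frozenset())}
E_prime = delete_attribute(E, "R", is_full)
expected = {
    (frozenset("NPUR"), frozenset({negate(is_full)})),
    (frozenset("NPU"), frozenset({is_full})),
}
```

Representing qualifiers as sets of constraints makes the conjunction order-independent, so E_prime can be compared directly with the two segments listed in Example 2.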
5 Properties of Evolving Schemas
This section establishes a set of properties for evolving schemas. In the next section these properties are generalized to path properties in evolution trees. Path properties make it possible to correctly classify tuples, and they are used to derive an efficient classification algorithm. A tuple t is current iff it fits a segment in the evolving schema. A tuple t is a legacy tuple iff it is not a current tuple, but at some point in the past (i.e., before some schema changes were applied) was current. If a tuple is neither a current nor a legacy tuple, it is an invalid tuple. Note that a conditional schema change may change the status of a current tuple to become a legacy tuple. However, a current or legacy tuple never becomes invalid.
Lemma 1. An evolving schema cannot identify legacy tuples.
Proof. The proof is by counterexample. Let E = {({A_1}, true)} be an evolving schema, and assume two attribute additions: α_{A_2} and α_{A_3}. Let E_1 = α_{A_3}(α_{A_2}(E, true), true) and E_2 = α_{A_2}(α_{A_3}(E, true), true).
Clearly, E_1 = E_2. However, the tuple t = (A_1/c_1, A_2/c_2) is a legacy tuple of E_1 but an invalid tuple of E_2.
We show that conditional schema changes preserve matching and qualification, but do not preserve fitting. Thus, tuples that match and qualify for a segment are guaranteed to match and qualify for a segment after the schema has evolved. However, they are not guaranteed to match and qualify for the same segment (since schema changes do not preserve fitting).
Lemma 2. Let E be an evolving schema, t be a tuple, and C be a condition. A conditional schema change Γ is
1. match-preserving: ∀E, S, t, C, Γ(S ∈ E ∧ match(t, S) ⇒ ∃S'(S' ∈ Γ(E, C) ∧ match(t, S')))
2. qual-preserving: ∀E, S, t, C, Γ(S ∈ E ∧ qual(t, S) ⇒ ∃S'(S' ∈ Γ(E, C) ∧ qual(t, S')))
3. not fit-preserving: ∀E, S, t, C, Γ(S ∈ E ∧ fit(t, S) ⇏ ∃S'(S' ∈ Γ(E, C) ∧ fit(t, S')))
Proof. Match-preservation: The definition in Figure 2 guarantees that when adding an attribute to a segment a new segment with the exact same schema is included in the result. Thus, tuple matching is preserved. The same holds for attribute deletion (cf. Figure 3).
Qual-preservation: If P is the qualifier for a segment then an attribute addition or deletion yields either a segment with the qualifier P or two segments with qualifiers P ∧ C and P ∧ ¬C, respectively. Clearly, if P holds then either P ∧ C or P ∧ ¬C holds as well.
To show that conditional schema changes are not fit-preserving it is sufficient to give a counterexample. Assume an evolving schema with one segment: S_1 = ({A_1}, true), and a tuple t = (A_1/v) that fits S_1. Adding attribute A_2 on condition true yields an evolving schema with two segments: S_2 = ({A_1}, ¬true) and S_3 = ({A_1, A_2}, true). Clearly, t fits neither segment because it does not qualify for S_2 and does not match S_3.
Lemma 3. Let E and E' be evolving schemas, C be a condition, and Γ be a conditional schema change such that Γ(E, C) = E'. Let t be a tuple that fits a single segment in E.
1. The qualifying segment of t in E' is unique.
2. The matching segment of t in E' may not be unique.
Proof. To prove that the qualifying segment is unique we show that 1) if t does not qualify for a segment S then t cannot qualify for any segment that results from applying Γ to S, and 2) if t qualifies for a segment S then the application of Γ results in exactly one segment that t qualifies for. Note that if P is the
qualifier for a segment, then the application of Γ yields either a segment with the qualifier P or two segments with qualifiers P ∧ C and P ∧ ¬C, respectively. It follows that if P does not hold then neither P ∧ C nor P ∧ ¬C holds, and if P holds then so does either P ∧ C or P ∧ ¬C, but not both.
To prove that the matching segment is not unique assume an evolving schema with two segments: S = ({A_1}, C_1) and S' = ({A_1, A_2}, ¬C_1), and a tuple t matching S'. The addition of A_2 on condition C_2 yields three segments: α_{A_2}({S, S'}, C_2) = {({A_1, A_2}, C_1 ∧ C_2), ({A_1}, C_1 ∧ ¬C_2), ({A_1, A_2}, ¬C_1)}. Clearly, the matching segment of t is no longer unique.
6 Histories of Evolving Schemas
In the previous section we saw that an evolving schema cannot be used to identify legacy tuples. This section shows that we also have to consider the change history of an evolving schema to correctly classify legacy tuples.
6.1 Change Histories and Evolution Trees
Definition 1. (change history) Let E = {S} be an evolving schema with a single segment, and [Γ_1, ..., Γ_n] be a sequence of conditional schema changes. H = (E, [Γ_1, ..., Γ_n]) is a change history for E.
The semantics of a change history is defined in terms of an evolution tree. An evolution tree is a binary tree (we assume an evolving schema with only one segment in the beginning) that models the current state of an evolving schema and keeps track of all prior states.
Definition 2. (evolution tree) Let H = (E, [Γ_1, ..., Γ_n]) be a change history with change conditions C_i. T_E = (S*, E*) is the evolution tree of E, iff
S* = ⋃_{i=0}^{n} E_i, where E_i = Γ_i(E_{i−1}, C_i)
E* = {e(S, S') | S ∈ E_{i−1} ∧ E' = Γ_i({S}, C_i) ∧ E' ≠ {S} ∧ S' ∈ E'}
S* denotes the vertices and E* the edges in the evolution tree. Each vertex is a segment. An edge, e(S, S'), relates parent segment S to child segment S'. The leaf segments of an evolution tree define the current segments, i.e., the evolving schema (cf. Figure 1).
Figure 4 illustrates the evolution tree for the change history of the Employee schema. Segment S_0 with the initial schema (N, P, U, R) and qualifier true is the root of the evolution tree. The first schema change sub-divides the database unit into groups. This splits segment S_0 into segments S_x and S_y. The second schema change replaces the Rank attribute with a Salary attribute for full professors. This change leads to the four segments shown at the bottom of the tree. They define the current state of the evolving Employee schema.
S_0: N P U R, true
    S_x: N P U R, ¬(U = db)
        S: N P U R, ¬(U = db) ∧ ¬(P = full)
        S'': N P U S, ¬(U = db) ∧ P = full
    S_y: N P U G R, U = db
        S': N P U G R, U = db ∧ ¬(P = full)
        S''': N P U G S, U = db ∧ P = full
Fig. 4. The Employee History and its Evolution Tree.
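To make the construction of S* and E* concrete, the following Python sketch materializes an evolution tree for a history that consists of attribute additions only. It is an illustration under assumptions: conditions are opaque labels, a negated condition is encoded as the pair ("not", C), and the helper names split and evolution_tree are hypothetical. It also shows why materializing the tree is expensive: n additions of fresh attributes produce 2^n leaves.

```python
def split(segment, attr, cond):
    """Sketch of Γ_i({S}, C_i) for an attribute addition applied to one segment."""
    schema, qualifier = segment
    if attr in schema:
        return [segment]                       # segment unchanged, no edge is created
    return [
        (schema | {attr}, qualifier + (cond,)),
        (schema, qualifier + (("not", cond),)),
    ]

def evolution_tree(root, changes):
    """Return (vertices, edges, leaves) for a history of attribute additions."""
    vertices, edges, leaves = [root], [], [root]
    for attr, cond in changes:
        next_leaves = []
        for parent in leaves:
            children = split(parent, attr, cond)
            if children != [parent]:           # corresponds to E' ≠ {S} in Definition 2
                vertices.extend(children)
                edges.extend((parent, child) for child in children)
            next_leaves.extend(children)
        leaves = next_leaves
    return vertices, edges, leaves

root = (frozenset({"A0"}), ())
changes = [("A1", "C1"), ("A2", "C2"), ("A3", "C3")]
vertices, edges, leaves = evolution_tree(root, changes)
```

With three additions the tree has 15 vertices, 14 edges, and 8 leaves, the leaves being the current segments of the evolving schema.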
6.2 Path Properties
A path p = [S_1, ..., S_k] in an evolution tree T_E is a sequence of vertices in T_E such that S_i is the parent of S_{i+1} for 1 ≤ i < k. We generalize the properties of an evolving schema to path properties. Path properties allow us to identify legacy tuples and provide the basis for an efficient algorithm.
Definition 3. (qualifying path) Let T_E = (S*, E*) be an evolution tree, p = [S_1, ..., S_n] be a non-empty path in T_E, and t be a tuple. Path p is a qualifying path for tuple t, p_Q(t, p, T_E), iff
∀S((S ∈ S* ∧ S ∈ p) ⇒ qual(t, S)) ∧
∀S((S ∈ S* ∧ S ∉ p ∧ (e(S, S_1) ∈ E* ∨ e(S_n, S) ∈ E*)) ⇒ ¬qual(t, S))
Thus, a qualifying path is a maximal path of segments a tuple qualifies for. The first line requires that the tuple qualifies for all segments in the path and the second line guarantees that the path is maximal.
Lemma 4. Let T_E = (S*, E*) be an evolution tree and t be a tuple. The qualifying path of t is unique: ∀t, p, p', T_E(p_Q(t, p, T_E) ∧ p_Q(t, p', T_E) ⇒ p = p').
Proof. Follows by induction from Lemma 2 (qual-preservation) and Lemma 3 (unique qualifying segment).
Lemma 5. Let T_E be an evolution tree and t be a tuple. The qualifying path of t extends from the root to a leaf: ∀t, T_E, S_1, ..., S_n(p_Q(t, [S_1, ..., S_n], T_E) ⇒ root(S_1, T_E) ∧ leaf(S_n, T_E)).
Proof. On the one hand, qualification is propagated to the leaves because of qual-preservation (Lemma 2). On the other hand, if the qualifier of a parent segment is P then the qualifier of the two child segments is P ∧ C and P ∧ ¬C, respectively. Obviously, if a tuple satisfies P ∧ C (or P ∧ ¬C) it also satisfies P. Thus the qualifying path always extends to the root of the evolution tree.
Definition 4. (matching path) Let T_E = (S*, E*) be an evolution tree, p = [S_1, ..., S_n] be a non-empty path in T_E, and t be a tuple. Path p is a matching path, p_M(t, p, T_E), iff
∀S((S ∈ S* ∧ S ∈ p) ⇒ match(t, S)) ∧
∀S((S ∈ S* ∧ S ∉ p ∧ (e(S, S_1) ∈ E* ∨ e(S_n, S) ∈ E*)) ⇒ ¬match(t, S))
In other words, a matching path for a tuple is a maximal path of segments the tuple matches.
Lemma 6. Let T_E be an evolution tree and t be a tuple. Each matching path of t includes a leaf: ∀t, T_E, S_1, ..., S_n(p_M(t, [S_1, ..., S_n], T_E) ⇒ leaf(S_n, T_E)).
Proof. By induction from Lemma 2. If t matches a non-leaf segment then t matches one of the children of that segment (match-preservation).
Note that if a tuple matches the schema of a segment it does not necessarily match the schema of the parent segment (cf. Figure 4). Thus, the root may not be included in the matching path.
A matching path is not necessarily unique. Different sequences of schema changes may lead to the same schema in different branches of the evolution tree (cf. Lemma 3). For example, assume that the sub-division of units into groups is also applied to the information systems unit. This means we also add the Group attribute on condition U = is. The conditional schema change applies to S and S'' in Figure 4 and results in segments with the same schema as S' and S''', respectively.
Qualifying and matching paths can be used to identify current, legacy, and invalid tuples. Consider the three evolution trees in Figure 5. The qualifying path is indicated by the thin line from the root to a leaf, whereas the matching path is indicated by the bold line.
[Figure: three evolution trees, each showing a qualifying path P_Q and a matching path P_M. T_1: Invalid Assertion. T_2: Legacy Assertion. T_3: Current Assertion.]
Fig. 5. Possible Relationships between Qualifying and Matching Paths
In T1 the qualifying and matching paths do not overlap. This means that, at no point in the evolution of the evolving schema, there was a segment that
t matches and qualifies for. Thus, t does not fit any segment in the evolution tree and is therefore invalid. In T_2 the matching and qualifying paths partially overlap. This means that t fits all segments that are part of both the matching and the qualifying path. Because t does not fit a leaf node it is a legacy tuple. In T_3 tuple t also fits a leaf, which makes it a current tuple.
Theorem 1. Let T_E = (S*, E*) be an evolution tree. Tuple t is invalid iff the qualifying and matching paths of t do not share any segment, i.e., ∀t, T_E, p, p'((p_Q(t, p, T_E) ∧ p_M(t, p', T_E) ∧ ¬∃S(S ∈ p ∧ S ∈ p')) ⇔ ¬∃S(S ∈ S* ∧ fit(t, S))).
Proof. Tuple t is invalid iff t does not fit any segment S in the evolution tree. From fit(t, S) ⇔ qual(t, S) ∧ match(t, S) and the fact that the qualifying path of t is unique (Lemma 4), it follows that t is invalid iff no matching path of t shares a segment with the qualifying path of t.
Theorem 1 leads to an algorithm for tuple classification. The ClassifyTuple algorithm in Figure 6 takes as input a tuple and a change history and determines the class of the tuple. The complexity of the algorithm is linear in the length of the history.
ClassifyTuple(t, H) → {Current, Legacy, Invalid}
Input: tuple t
       history H = ({S_0}, [Γ_1, ..., Γ_n])
Output: assertion Current, Legacy, or Invalid
Method:
    if ¬qual(t, S_0) return Invalid
    let i := 1
    let S := S_0
    let overlap := match(t, S_0)
    while i ≤ n do
        S := the unique S' with S' ∈ Γ_i({S}, C_i) ∧ qual(t, S')
        overlap := overlap ∨ match(t, S)
        i := i + 1
    end while
    if match(t, S) return Current
    else if overlap return Legacy
    else return Invalid
Fig. 6. The ClassifyTuple Algorithm
The algorithm unfolds the qualifying path of the given tuple and evolution tree by applying the conditional schema changes in sequence to the unique qualifying segment. By following the qualifying path and checking for overlap with
a matching path, the algorithm can determine whether the tuple fitted a prior version of the evolving schema. If the tuple matches the current qualifying segment, it is a current tuple. Otherwise, an overlap with a matching path indicates a legacy tuple, and if no overlap was found the tuple is invalid.
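The walk along the qualifying path can be sketched in Python. This is only an illustration under several assumptions of ours: a segment is modelled as an attribute set plus a qualifier predicate, a conditional change splits a segment into a changed part (condition holds) and an unchanged part (condition fails), and match means that the tuple's attribute set equals the segment's schema. The names Segment, Change, and apply_change, as well as the two example changes, are hypothetical. One further assumption: the overlap flag is initialised with match(t, S0), so that a tuple matching only the root segment is reported as legacy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Segment:
    attrs: FrozenSet[str]             # schema of the segment
    qualifier: Callable[[Row], bool]  # which tuples belong to the segment


@dataclass(frozen=True)
class Change:
    condition: Callable[[Row], bool]  # condition C_i of the schema change
    add: FrozenSet[str] = frozenset()
    drop: FrozenSet[str] = frozenset()


def qual(t: Row, s: Segment) -> bool:
    return s.qualifier(t)


def match(t: Row, s: Segment) -> bool:
    # Assumption: a tuple matches a segment iff their attribute sets coincide.
    return frozenset(t) == s.attrs


def apply_change(s: Segment, c: Change) -> List[Segment]:
    """Split s into the part where the condition holds (schema changed)
    and the part where it does not (schema unchanged)."""
    changed = Segment((s.attrs - c.drop) | c.add,
                      lambda t: s.qualifier(t) and c.condition(t))
    unchanged = Segment(s.attrs,
                        lambda t: s.qualifier(t) and not c.condition(t))
    return [changed, unchanged]


def classify_tuple(t: Row, s0: Segment, changes: List[Change]) -> str:
    if not qual(t, s0):
        return "Invalid"
    s = s0
    overlap = match(t, s0)  # our assumption: the root segment counts as well
    for c in changes:
        # Exactly one successor segment qualifies (unique qualifying path).
        s = next(x for x in apply_change(s, c) if qual(t, x))
        overlap = overlap or match(t, s)
    if match(t, s):
        return "Current"
    return "Legacy" if overlap else "Invalid"


# Hypothetical history: add Group for unit 'db'; then, for full professors,
# drop Rank and add Salary.
S0 = Segment(frozenset({"Name", "Position", "Unit", "Rank"}), lambda t: True)
HISTORY = [
    Change(lambda t: t.get("Unit") == "db", add=frozenset({"Group"})),
    Change(lambda t: t.get("Position") == "full",
           add=frozenset({"Salary"}), drop=frozenset({"Rank"})),
]
```

Each change is applied to a single segment only, so the loop performs one step per entry of the history, in line with the linear complexity stated above.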
7 Attribute Mismatches

This section investigates the mismatch between the actual and intended schema of tuples. We illustrate the four kinds of mismatches that may occur at the level of individual attributes, and use a series of examples to argue for the need of a parametric approach to resolve mismatches according to the need of the application.

7.1 Mismatch Types
Given an attribute A and the intended schema Ac and actual schema Ar of a tuple t there are four possible types of attribute mismatches:

Table 1. The Four Types of Attribute Mismatches

            A ∈ Ar                A ∉ Ar
A ∈ Ac      M1 (No mismatch)      M2 (Not recorded)
A ∉ Ac      M3 (Not available)    M4 (Not applicable)
– No mismatch (M1): The attribute appears in both the intended and actual schema of a tuple. For example, for Tom (tuple t1 in Figure 1) there is no mismatch on attributes N, P, and U.
– Not recorded (M2): The attribute appears only in the intended schema of the tuple. These mismatches occur when schema changes add new attributes. For example, Tom was not recorded with a Group or a Salary attribute, which are attributes that were added after the tuple was inserted into the database.
– Not available (M3): A tuple is recorded with the attribute, but it does not appear in the intended schema of the tuple. Mismatches of this kind are caused by attribute deletions. E.g., Tom has a rank of 28, but according to its intended schema is not supposed to have this attribute.
– Not applicable (M4): The attribute appears in neither the actual nor the intended schema of a tuple. Such mismatches occur for tuples that do not satisfy the condition for an attribute addition. For example, the Group and Salary attributes are not applicable to Rita.

Table 2 shows all attribute mismatches for each employee tuple in Figure 1.
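The four cases reduce to two membership tests. The following sketch makes this explicit; the function name mismatch_type is ours, and the two schemas are an assumption chosen to agree with the description of Tom above (Figure 1 itself is not reproduced here).

```python
def mismatch_type(attribute, intended_schema, actual_schema):
    """Return M1..M4 for one attribute of one tuple (cf. Table 1)."""
    in_intended = attribute in intended_schema  # A in Ac
    in_actual = attribute in actual_schema      # A in Ar
    if in_intended and in_actual:
        return "M1"  # no mismatch
    if in_intended:
        return "M2"  # not recorded
    if in_actual:
        return "M3"  # not available
    return "M4"      # not applicable


# Assumed schemas for Tom, consistent with the text: Group and Salary were
# added after insertion, Rank was deleted from the intended schema.
tom_intended = {"Name", "Position", "Unit", "Group", "Salary"}
tom_actual = {"Name", "Position", "Unit", "Rank"}

for attr in ("Name", "Group", "Rank", "Salary"):
    print(attr, mismatch_type(attr, tom_intended, tom_actual))
```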
Table 2. Attribute Mismatches of the Evolving Employee Schema

Name       Position   Unit     Group    Rank     Salary
M1/tom     M1/full    M1/db    M2       M3/28    M2
M1/rita    M1/asst    M1/is    M4       M1/19    M4
M1/john    M1/asso    M1/db    M1/dw    M1/22    M4
M1/kim     M1/full    M1/db    M1/dw    M3/31    M2
M1/anne    M1/full    M1/is    M4       M4       M1/90.000

7.2 Mismatch Resolution
When querying an evolving database, the DBMS has to systematically resolve attribute mismatches. We discuss three sensible and intuitive resolution strategies (it would be easy to add other strategies).

– Projection: Resolves the mismatch by using the stored attribute value. Clearly, this is only possible if the attribute appears in the actual schema of the tuple. Therefore, projection can only be used to resolve M1 and M3 attribute mismatches.
– Replacement: Resolves the mismatch by replacing the (missing) attribute value with a specified value.
– Exclusion: Resolves the mismatch by excluding the tuple entirely for the purpose of the query.

To illustrate the resolution strategies we provide a series of examples.

Example 3. Assume we want to see all the ranks ever recorded for employees. This means that we also want to see the ranks of employees who had a rank before they were allowed to negotiate individual salaries. With the resolutions M1:projection, M2:exclusion, M3:projection, and M4:exclusion, the query π[R](employees) answers the query. The result is {28, 19, 22, 31}.

Example 4. Assume we want to count the number of employees recorded with a salary but who are not supposed to have one. This means that we have to count the number of tuples with an M3 mismatch for the Salary attribute. The query π[cnt(S)](employees) with the resolutions M1:exclusion, M2:exclusion, M3:projection, and M4:exclusion answers the query. The result is 0 (cf. Table 2).

Example 5. Assume we want to count the number of employees who are allowed to negotiate individual salaries. This means that we want to count employees with either M1 or M2 mismatches for the Salary attribute. With the resolutions M1:projection, M2:replacement, M3:exclusion, and M4:exclusion, the query π[cnt(S)](employees) answers the query. The result is 3.

Note the generality of our approach. The key advantage of the proposed resolution approach is that it decouples schema definition and querying phases. This
means that the above examples do not depend on the specifics of the database schema. For example, the exact reason why someone should (not) be in a group does not matter. Queries and resolution strategies are conceptual solutions that do not depend on the conditional schema changes. This is a major difference from approaches that exploit the conditions of the schema changes (or other implicit schema information) to formulate special-purpose queries for the example queries.

The examples illustrate that it is entirely possible to apply different resolution strategies to the exact same query, yielding different results. The resolutions provide the semantics of the query. Therefore, an approach that resolves mismatches when the schema is changed (i.e., one with a fixed resolution strategy) has to make compromises with respect to the queries that it supports. We propose a parametric approach where attribute mismatches are resolved according to the need of the application. The above queries cannot be answered correctly with a fixed semantics for resolving attribute mismatches when a schema is changed. A flexible and context-dependent semantics is required to correctly answer queries in the presence of schema evolution. Tuple versioning supports such semantics in a natural way.
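Examples 3 to 5 can be replayed with a few lines of Python. The sketch below is not the authors' query processor; it assumes that each attribute value is stored together with its mismatch type, uses the Rank and Salary columns of Table 2 as data, and introduces a hypothetical helper resolve whose strategy argument maps M1..M4 to one of the three resolutions.

```python
# (mismatch type, stored value or None), taken from Table 2
EMPLOYEES = {
    "tom":  {"Rank": ("M3", 28),   "Salary": ("M2", None)},
    "rita": {"Rank": ("M1", 19),   "Salary": ("M4", None)},
    "john": {"Rank": ("M1", 22),   "Salary": ("M4", None)},
    "kim":  {"Rank": ("M3", 31),   "Salary": ("M2", None)},
    "anne": {"Rank": ("M4", None), "Salary": ("M1", "90.000")},
}


def resolve(rows, attribute, strategy, replacement=None):
    """Project one attribute, resolving each mismatch as the strategy says."""
    result = []
    for row in rows.values():
        kind, value = row[attribute]
        action = strategy[kind]
        if action == "exclusion":
            continue  # the tuple is dropped for this query
        if action == "projection":
            if kind not in ("M1", "M3"):
                raise ValueError("projection needs a stored value (M1 or M3)")
            result.append(value)
        elif action == "replacement":
            result.append(replacement)
        else:
            raise ValueError(f"unknown resolution: {action}")
    return result


example3 = set(resolve(EMPLOYEES, "Rank", {
    "M1": "projection", "M2": "exclusion",
    "M3": "projection", "M4": "exclusion"}))
example4 = len(resolve(EMPLOYEES, "Salary", {
    "M1": "exclusion", "M2": "exclusion",
    "M3": "projection", "M4": "exclusion"}))
example5 = len(resolve(EMPLOYEES, "Salary", {
    "M1": "projection", "M2": "replacement",
    "M3": "exclusion", "M4": "exclusion"}, replacement="unknown"))
print(example3, example4, example5)
```

The same data and the same projection yield three different answers; only the strategy dictionary changes, which is the point of the parametric approach.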
8 Related Work
Conditional schema evolution has been investigated in the context of temporal databases, where proposals have been made for the maintenance of schema versions along one [13,18,21] or more time dimensions [7]. Because schema change conditions are restricted to time, the exponential explosion of the number of schema segments is avoided. The reason is that the qualifier of a segment only contains predicates on the timestamp attribute, thus effectively defining an interval. Any schema change with a condition based on a point in time will therefore fall within exactly one interval, splitting it into two intervals. So, while several schema segments may change due to a schema change, the number of segments will only increase by one. As shown in this paper, schema changes with conditions on different attributes are orthogonal and result in an exponential growth of the number of segments. The effect is that existing techniques, such as query processing strategies that require a user to specify a target schema version, do not apply to conditional schema changes.

Unconditional schema evolution has received much attention from both the temporal and the object-oriented community (cf. [17,19] for an overview and an annotated bibliography). In temporal databases schema evolution has been analyzed in the context of temporal data models [8,1], and schema changes are applied to explicitly specified versions [23,6]. This requires an extension to the query language and forces schema semantics (such as missing or inapplicable information) down into attribute values [19,14]. In order to preserve the convenience of a single global schema for each relation, null values have been used [14,9]. In particular, it is possible to use inapplicable nulls if an attribute does not apply to a specific
tuple [2,3,11,12,25,26]. This leads to completed schemas [19] with an enriched semantics for null values. The approach does not scale to an ordered sequence of conditional evolution steps with multiple current schemas. It is also insufficient for attribute deletions if we do not want to overwrite (and thus lose) attribute values. In response to this it has been proposed to activate and deactivate attributes rather than to delete them [20]. Also, note that the completed schema by itself, i.e., the unification of all attributes ever to appear in the schema, is insufficient, because the change history will be lost. This makes the correct tuple classification impossible.

Temporal schema evolution proposes two different ways of coordinating schemas and instances. In synchronous management data can only be updated through the schema with the same temporal pertinence. This constraint is lifted in asynchronous management, where data can be updated through any schema. To illustrate synchronous and asynchronous management assume a schema S = (A, B, T), where T is a timestamp attribute. At t = 3 attribute B is dropped and attribute C is added. This yields the schema S' = (A, C, T). Consider the following four tuple assertions: t1 = (a, b, 2), t2 = (a, c, 4), t3 = (a, b, 5), and t4 = (a, c, 1). Under synchronous management t1 is asserted to S, because its timestamp is 2 and S was the current schema until time 3. Similarly, t2 is asserted to S'. Both t3 and t4 are invalid assertions under synchronous management, since their attributes correspond to one schema and their timestamps indicate the other. However, note that t3 is a legacy assertion, i.e., it was a valid assertion to S before the schema change was applied. Thus, under synchronous management only assertions to the current schema are valid, and legacy assertions are rejected. In asynchronous management t1 and t3 are asserted to S while t2 and t4 are asserted to S'.
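The four assertions can be checked mechanically. The sketch below is a minimal illustration, not a model of any particular temporal DBMS; tuples are dictionaries keyed by attribute name, the function names are ours, and the treatment of a timestamp exactly equal to 3 (assigned to S' here) is an assumption, since none of the four tuples exercises that boundary.

```python
CHANGE_TIME = 3
S = frozenset({"A", "B", "T"})        # schema before the change
S_PRIME = frozenset({"A", "C", "T"})  # schema after the change


def matching_schema(t):
    """Name of the schema whose attributes the tuple carries, if any."""
    attrs = frozenset(t)
    if attrs == S:
        return "S"
    if attrs == S_PRIME:
        return "S'"
    return None


def asynchronous(t):
    # The timestamp constraint is disregarded; only the attributes count.
    return matching_schema(t)


def synchronous(t):
    # The tuple must match the schema with the same temporal pertinence.
    pertinent = "S" if t["T"] < CHANGE_TIME else "S'"
    return pertinent if matching_schema(t) == pertinent else None


t1 = {"A": "a", "B": "b", "T": 2}
t2 = {"A": "a", "C": "c", "T": 4}
t3 = {"A": "a", "B": "b", "T": 5}
t4 = {"A": "a", "C": "c", "T": 1}

for name, t in (("t1", t1), ("t2", t2), ("t3", t3), ("t4", t4)):
    print(name, "sync:", synchronous(t), "async:", asynchronous(t))
```

A result of None stands for a rejected (invalid) assertion.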
In effect, the constraint based on the timestamp attribute is disregarded and the assertion is based solely on matching the tuples with the schema. While asynchronous management supports both current and legacy assertions it also permits the assertion of invalid tuples. For example, the assertion of t4 is not a legacy assertion because it does not match S, and is not a current assertion because according to the schema change only tuples with a timestamp attribute larger than 3 have schema S'.

Unconditional schema evolution has also been investigated in the context of OODBs, where several systems have been proposed. Orion [4], CLOSQL [15], and Encore [22] all use a versioning approach. Typically, a new version of the object instances is constructed along with a new version of the schema. The Orion schema versioning mechanism keeps versions of the whole schema hierarchy instead of the individual classes or types. Every object instance of an old schema can be copied or converted to become an instance of the new schema. The class versioning approach CLOSQL provides update/backdate functions for each attribute in a class to convert instances from the format in which the instance is recorded to the format required by the application. The Encore system provides exception handlers for old types to deal with new attributes that are missing from the instances. This allows new applications to access undefined fields of
legacy instances. In general, the versioning approach for unconditional schema changes cannot be applied to conditional schema changes, because the number of versions that has to be constructed grows exponentially.

Views have been proposed as an approach to schema evolution in OODBs [5,24]. Ra and Rundensteiner [16] propose the Transparent Schema Evolution approach, where schema changes are specified on a view schema rather than the underlying global schema. They provide algorithms to compute the new view that reflects the semantics of the schema change. The approach allows for schema changes to be applied to a single view without affecting other views, and for the sharing of persistent data used by different views. Since a view has to be selected for each schema, applying conditional schema changes results in an exponential number of views. Clearly, this is not tractable. New techniques, such as the ClassifyTuple algorithm proposed in this paper, that rely on the history of schema changes rather than the result of the conditional schema changes are required to ensure tractable support.
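The contrast drawn in this section between time-based conditions (one additional segment per change) and conditions on different attributes (exponential growth) can be made concrete by counting segments. The sketch assumes, as a simplification of ours, that time-based changes use distinct points on a single time axis and that every attribute-based condition is independent of all earlier ones, so that it splits each existing segment in two.

```python
def segments_time_only(change_times):
    """Each time point falls into exactly one interval and splits it."""
    cut_points = sorted(set(change_times))
    return len(cut_points) + 1


def segments_orthogonal(num_changes):
    """Each independent condition splits every existing segment."""
    segments = [()]  # one segment, described by the outcomes of no conditions
    for _ in range(num_changes):
        segments = [seg + (holds,)
                    for seg in segments
                    for holds in (True, False)]
    return len(segments)


for n in (1, 3, 10):
    print(n, segments_time_only(range(n)), segments_orthogonal(n))
```

For ten changes this gives 11 segments in the temporal case and 1024 in the orthogonal case, which is why one version or one view per segment is not a workable strategy for conditional schema changes.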
9 Conclusions and Future Research
In this paper we introduce conditional schema changes that can be applied to a subset of the extension of a relation. Conditional schema changes result in evolving schemas with several current schemas. We present tuple versioning to precisely keep track of schema mismatches at the level of individual tuples. Tuple versioning eliminates the need to migrate the data in response to schema changes. A key task of the DBMS is to correctly identify current, legacy, and invalid tuples. We show that this cannot be done solely based on the current evolving schema. The change history is needed to correctly identify current, legacy, and invalid tuples. Finally, we give an algorithm, which, given a tuple and a change history, determines the classification of that tuple. The algorithm exploits the properties of change histories to provide a time and space complexity proportional to the length of the history. Ongoing work includes processing queries over evolving schemas. We are investigating a parameterized semantics for query processing based on the strategy for resolving attribute mismatches presented in Section 7. Part of this research focuses on the problem of view evolution where both the underlying relations and the view itself are subject to evolution.
References

1. G. Ariav. Temporally Oriented Data Definitions: Managing Schema Evolution in Temporally Oriented Databases. Data Knowledge Engineering, 6(6):451–467, 1991.
2. P. Atzeni and V. de Antonellis. Relational Database Theory. Benjamin/Cummings, 1993.
3. P. Atzeni and N. M. Morfuni. Functional dependencies in relations with null values. Information Processing Letters, 18(4):233–238, 1984.
4. J. Banerjee, W. Kim, H.-J. Kim, and H.F. Korth. Semantics and Implementation of Schema Evolution in Object-Oriented Databases. In ACM SIGMOD International Conference on Management of Data, pages 311–322. ACM Press, 1987.
5. Elisa Bertino. A view mechanism for object-oriented databases. In Alain Pirotte, Claude Delobel, and Georg Gottlob, editors, Advances in Database Technology EDBT'92, 3rd International Conference on Extending Database Technology, Vienna, Austria, March 23-27, 1992, Proceedings, volume 580 of Lecture Notes in Computer Science, pages 136–151. Springer, 1992.
6. C.D. Castro, F. Grandi, and M.R. Scalas. On Schema Versioning in Temporal Databases. In: Recent Advances in Temporal Databases. Springer, 1995.
7. C.D. Castro, F. Grandi, and M.R. Scalas. Schema Versioning for Multitemporal Relational Databases. Information Systems, 22(5):249–290, 1997.
8. J. Clifford and A. Croker. The Historical Relational Data Model (HRDM) and Algebra based on Lifespans. In 3rd International Conference of Data Engineering, Los Angeles, California, USA, Proceedings, pages 528–537. IEEE Computer Society Press, 1987.
9. G. Grahne. The Problem of Incomplete Information in Relational Databases. In Springer LNCS No. 554, 1991.
10. O. G. Jensen and M. H. Böhlen. Evolving Relations. In Database Schema Evolution and Meta-Modeling, 9th International Workshop on Foundations of Models and Languages for Data and Objects, Springer LNCS 2065, page 115 ff., 2001.
11. A. M. Keller. Set-theoretic problems of null completion in relational databases. Information Processing Letters, 22(5):261–265, 1986.
12. N. Lerat and W. Lipski. Nonapplicable Nulls. Theoretical Computer Science, 46:67–82, 1986.
13. L.E. McKenzie and R.T. Snodgrass. Schema Evolution and the Relational Algebra. Information Systems, 15(2):207–232, 1990.
14. R. van der Meyden. Logical Approaches to Incomplete Information: a Survey. In: Logics for Databases and Information Systems (chapter 10). Kluwer Academic Publishers, 1998.
15. Simon R. Monk and Ian Sommerville. Schema Evolution in OODBs using Class Versioning. SIGMOD Record, 22(3):16–22, 1993.
16. Young-Gook Ra and Elke A. Rundensteiner. A transparent object-oriented schema change approach using view evolution. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pages 165–172. IEEE Computer Society, 1995.
17. J.F. Roddick. Schema Evolution in Database Systems - An Annotated Bibliography. ACM SIGMOD Record, 21(4):35–40, 1992.
18. J.F. Roddick. SQL/SE - A Query Language Extension for Databases Supporting Schema Evolution. ACM SIGMOD Record, 21(3):10–16, 1992.
19. J.F. Roddick. A Survey of Schema Versioning Issues for Database Systems. Information Software Technology, 37(7):383–393, 1995.
20. J.F. Roddick, N.G. Craske, and T.J. Richards. A Taxonomy for Schema Versioning based on the Relational and Entity Relationship Models. In 12th International Conference on Entity-Relationship Approach, Arlington, Texas, USA, December 15-17, 1993, Proceedings, pages 137–148. Springer-Verlag, 1993.
21. J.F. Roddick and R.T. Snodgrass. Schema Versioning. In: The TSQL92 Temporal Query Language. Norwell, MA: Kluwer Academic Publishers, 1995.
22. Andrea H. Skarra and Stanley B. Zdonik. The Management of Changing Types in an Object-Oriented Database. In OOPSLA, 1986, Portland, Oregon, Proceedings, pages 483–495, 1986.
23. R.T. Snodgrass et al. TSQL2 Language Specification. ACM SIGMOD Record, 23(1), 1994.
24. Markus Tresch and Marc H. Scholl. Schema transformation without database reorganization. SIGMOD Record, 22(1):21–27, 1993.
25. Y. Vassiliou. Null values in Database Management: A Denotational Semantics Approach. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 162–169, 1979.
26. Y. Vassiliou. Functional Dependencies and Incomplete Information. In International Conference on Very Large Databases, pages 260–269, 1980.
Magic Sets Method with Fuzzy Logic

Karel Ježek and Martin Zíma

Department of Computer Science and Engineering, University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic
Tel.: +420-19-7491212 Fax.: +420-19-7491213
{jezek ka,zima}@kiv.zcu.cz
Abstract. This paper describes a method for efficient query evaluation when uncertainty is involved in a deductive database system. A deductive system enriched with fuzzy logic can serve better as a knowledge system. Speeding up its execution makes the system practically useful.
1 Introduction
Deductive database systems make it possible to deduce new facts not contained among the facts of the original (extensional) database. These new facts are derived on the basis of deduction rules (intensional database). If the facts of the extensional database and/or the rules describing the intensional database are vague, evaluation with uncertainty has to be used. Uncertain information may arise in databases and knowledge bases in various ways. Imprecise information may be expressed at the attribute value level, at the level of the predicate applicability or at the tuple (fact) level. Only the last two are considered in this paper.

The efficiency of query evaluation in database systems (including deductive ones) is crucial. Sophisticated query optimization techniques, such as magic sets and counting methods, have existed in the deductive database literature for many years [1][3] but have yet to be studied in detail for fuzzy deductive databases. Introducing uncertainty into a deductive program enforces changes in the program evaluation as well as changes in the commonly used Magic Sets Method.

The structure of this paper is as follows. The fundamental Magic Sets Method [3] is briefly recalled in Section 2. Section 3 defines the principles of fuzzy logic used here and the evaluation of deductive programs with fuzzy logic. Goal query optimization is described in Section 4, where we introduce the extended Magic Sets Method. It enables the efficient evaluation of logic programs when uncertainty is involved in the deduction process. Further planned extensions of the Magic Sets Method and some experimental results are summarized in the Conclusions.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 83–92, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 The Magic Sets Method
Let us briefly recall the principle of the Magic Sets Method. This method transforms the original program into a program called the magic program. The magic program is equivalent to (gives the same result as) the original one but its evaluation is usually substantially shorter. Constant arguments of the goal query are utilized for its efficient evaluation.

Note 1. Suppose P denotes a set of rules and D denotes a set of facts belonging to a logical program $\mathcal{P} = P \cup D$.

Definition 1. Let us have two different logical programs $P_1 \cup D$, $P_2 \cup D$ and a query Q. We say that $P_1$ and $P_2$ are equivalent sets of rules if for every possible extensional database D the programs $P_1 \cup D$ and $P_2 \cup D$ produce the same answer to the query Q. We call the two programs equivalent as well.

The first step of the Magic Sets Method is the adornment of the original program. It excludes all the rules of the program which do not participate in the goal query evaluation. Only derived predicates are adorned, because only they have to be computed. The adornment strings become part of the predicate names. The so-called sideways information passing graph is created for each rule together with the adornment. It describes the binding of arguments among individual predicates and consequently the flow of restrictive information among the predicates of the rule.

Definition 2. An adornment for an n-ary predicate p is a string a of length n over the alphabet {b, f}, where b stands for bound and f stands for free. We assume a fixed order of the arguments of the predicate. Intuitively, an adorned occurrence of the predicate, $p^a$, corresponds to a computation of the predicate with some arguments bound to constants, and the other arguments being free, where the bound arguments are those indicated by the adornment.

Example 1. Let us suppose there is some intensional predicate p having 3 arguments in the head of rule r. In this case the adornment string of length 3 is attached to the predicate p.
Let us now suppose the string bff is attached to the occurrence of the predicate p(X, Y, Z) when the rule r is adorned. Then the resulting literal shall be p_bff(X, Y, Z). Its first argument (the variable X) is bound and the next two arguments are free.

In the second step the so-called magic program is constructed. This step consists of three parts: initialization, construction of magic rules and construction of modified rules.

Definition 3. A magic predicate $m\_p^a$ is created as a projection of an original $p^a$ predicate on its bound arguments. The number of b characters in an adornment string (it has to be at least one) indicates the arity of the created magic predicate. The prefix m is used to identify the magic predicate.
Example 2. Let us suppose the body of the adorned rule $r^{ad}$ contains the adorned predicate p_bfb(X, Y, Z). Then we can create the magic version of the predicate p. The resulting magic predicate will have two arguments, since the corresponding adornment string has just two b characters. Therefore the resulting literal has the form m_p_bfb(X, Z).

In the course of initialization the magic fact (seed) is created. It contains the constants of the goal query. These constants shall be propagated through the resulting program with the help of magic predicates. The magic predicates are described in magic rules and determine which values shall participate in the query evaluation. Therefore they are subsequently used in the bodies of the modified original rules to restrict the number of computed tuples. We obtain the modified rules by putting the magic predicates into the bodies of the original rules. Every modified rule can contain a different number of magic predicates in its body.
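The naming conventions of Examples 1 and 2 are easy to mechanise. The sketch below covers only the construction of adorned and magic literals, not the full Magic Sets transformation; the helper names adorn, literal, and magic_literal are ours.

```python
def adorn(predicate, arguments, bound):
    """Return the adorned predicate name and its adornment string."""
    adornment = "".join("b" if arg in bound else "f" for arg in arguments)
    return f"{predicate}_{adornment}", adornment


def literal(name, arguments):
    return f"{name}({', '.join(arguments)})"


def magic_literal(predicate, adornment, arguments):
    """Project an adorned predicate on its bound arguments (Definition 3)."""
    bound_args = [arg for arg, flag in zip(arguments, adornment) if flag == "b"]
    if not bound_args:
        raise ValueError("a magic predicate needs at least one bound argument")
    return literal(f"m_{predicate}_{adornment}", bound_args)


args = ["X", "Y", "Z"]
name, adornment = adorn("p", args, bound={"X"})
print(literal(name, args))               # adorned literal of Example 1
print(magic_literal("p", "bfb", args))   # magic literal of Example 2
```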
3 The Fuzzy Logic
When working with the uncertainty model, we can use the fuzzy logic approach. Possible ways of implementing uncertainty are described in detail in [5][9]. Out of various fuzzy logics, we have investigated Gödel, Łukasiewicz and product fuzzy logic [4][8]. We use a real number from the interval (0, 1] as a confidence factor (CF).

Definition 4. Let p, q be predicates and u and v their argument value vectors, $u = u_1, u_2, \ldots, u_N$, $v = v_1, v_2, \ldots, v_M$, where $u_i$, $i = 1 \ldots N$, and $v_j$, $j = 1 \ldots M$, have to be instantiated with constants. The semantics of Gödel fuzzy conjunction "$\land_G$", fuzzy disjunction "$\lor_G$" and fuzzy negation "$\mathrm{not}_G$" is as follows:
\begin{align*}
CF(p(u) \land_G q(v)) &= \min(CF(p(u)), CF(q(v))) \\
CF(p(u) \lor_G q(v)) &= \max(CF(p(u)), CF(q(v))) \\
CF(\mathrm{not}_G\, p(u)) &= \begin{cases} 1 & \text{if } CF(p(u)) = 0 \\ 0 & \text{otherwise} \end{cases}
\end{align*}
The semantics of Łukasiewicz fuzzy conjunction "$\land_L$", fuzzy disjunction "$\lor_L$" and fuzzy negation "$\mathrm{not}_L$" is as follows:
\begin{align*}
CF(p(u) \land_L q(v)) &= \max(0, CF(p(u)) + CF(q(v)) - 1) \\
CF(p(u) \lor_L q(v)) &= \min(1, CF(p(u)) + CF(q(v))) \\
CF(\mathrm{not}_L\, p(u)) &= 1 - CF(p(u))
\end{align*}
Finally, the semantics of product fuzzy conjunction "$\land_P$", fuzzy disjunction "$\lor_P$" and fuzzy negation "$\mathrm{not}_P$" is as follows:
\begin{align*}
CF(p(u) \land_P q(v)) &= CF(p(u)) \cdot CF(q(v)) \\
CF(p(u) \lor_P q(v)) &= CF(p(u)) + CF(q(v)) - CF(p(u)) \cdot CF(q(v)) \\
CF(\mathrm{not}_P\, p(u)) &= \begin{cases} 1 & \text{if } CF(p(u)) = 0 \\ 0 & \text{otherwise} \end{cases}
\end{align*}
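The nine connectives of Definition 4 translate directly into code. The following sketch operates on plain confidence factors (floats) rather than on predicates; the function names are ours.

```python
# Goedel connectives
def godel_and(a, b): return min(a, b)
def godel_or(a, b): return max(a, b)
def godel_not(a): return 1.0 if a == 0 else 0.0

# Lukasiewicz connectives
def luk_and(a, b): return max(0.0, a + b - 1.0)
def luk_or(a, b): return min(1.0, a + b)
def luk_not(a): return 1.0 - a

# Product connectives
def prod_and(a, b): return a * b
def prod_or(a, b): return a + b - a * b
def prod_not(a): return 1.0 if a == 0 else 0.0


a, b = 0.75, 0.5
for name, conj, disj in (("Goedel", godel_and, godel_or),
                         ("Lukasiewicz", luk_and, luk_or),
                         ("product", prod_and, prod_or)):
    print(name, conj(a, b), disj(a, b))
```

For the same pair of confidence factors the three conjunctions are ordered: the Łukasiewicz value is the smallest and the Gödel value the largest.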
Note 2. We will use CF(p) instead of CF(p(u)) for brevity in the equations introduced below.

Proposition 1. If $p_1, p_2, \ldots, p_n$ are predicates and $CF(p_1), CF(p_2), \ldots, CF(p_n)$ their confidence factors, then the Gödel fuzzy conjunction and fuzzy disjunction of these predicates are:
\begin{align*}
CF\Big(\sideset{}{_G}\bigwedge_{i=1}^{n} p_i\Big) &= \min(CF(p_1), CF(p_2), \ldots, CF(p_n)) \\
CF\Big(\sideset{}{_G}\bigvee_{i=1}^{n} p_i\Big) &= \max(CF(p_1), CF(p_2), \ldots, CF(p_n))
\end{align*}
and the Łukasiewicz connectives are introduced as:
\begin{align*}
CF\Big(\sideset{}{_L}\bigwedge_{i=1}^{n} p_i\Big) &= \max\Big(0, \sum_{i=1}^{n} CF(p_i) - n + 1\Big) \\
CF\Big(\sideset{}{_L}\bigvee_{i=1}^{n} p_i\Big) &= \min\Big(1, \sum_{i=1}^{n} CF(p_i)\Big)
\end{align*}
The last formula represents the semantics of the product fuzzy conjunction:
\[
CF\Big(\sideset{}{_P}\bigwedge_{i=1}^{n} p_i\Big) = \prod_{i=1}^{n} CF(p_i)
\]

Proof. Proof is obvious.
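Although the proof of Proposition 1 is omitted, the closed forms can be compared numerically with repeated application of the binary connectives of Definition 4. The check below is a sanity test on one hypothetical list of confidence factors, not a proof; the values are dyadic fractions so that floating-point arithmetic is exact.

```python
from functools import reduce
from math import prod

cfs = [0.75, 0.5, 1.0, 0.875]  # hypothetical confidence factors
n = len(cfs)

# Binary connectives folded over the list ...
godel_and = reduce(min, cfs)
godel_or = reduce(max, cfs)
luk_and = reduce(lambda a, b: max(0.0, a + b - 1.0), cfs)
luk_or = reduce(lambda a, b: min(1.0, a + b), cfs)
prod_and = reduce(lambda a, b: a * b, cfs)

# ... and the closed forms of Proposition 1.
closed = {
    "godel_and": min(cfs),
    "godel_or": max(cfs),
    "luk_and": max(0.0, sum(cfs) - n + 1),
    "luk_or": min(1.0, sum(cfs)),
    "prod_and": prod(cfs),
}
print(closed)
```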
Note 3. In the case of product fuzzy disjunction it is not possible to derive a simple universal formula as in the case of the other introduced fuzzy logics. The resulting expanded formula would be excessively long.

Definition 5. Let a rule r be given with a head predicate p and body literals $L_1, L_2, \ldots, L_m$. Let v be the confidence factor explicitly assigned to r. Then the resulting confidence factor of p w.r.t. r shall be the product of v and the confidence of the body of rule r.

Example 3. In the case of Gödel fuzzy logic, the head predicate p of a rule r has the confidence factor value:
\[
CF(p) = CF\Big(\sideset{}{_G}\bigwedge_{i=1}^{m} L_i\Big) \cdot v = \min(CF(L_1), CF(L_2), \ldots, CF(L_m)) \cdot v
\]
Definition 6. Let rules $r_1, r_2, \ldots, r_m$ contain the predicate p in their heads. The resulting confidence factor of the predicate p w.r.t. rules $r_1, r_2, \ldots, r_m$ is given by the respective fuzzy disjunction of the following values:

– confidence factor of p w.r.t. $r_1$,
– confidence factor of p w.r.t. $r_2$,
– ...,
– confidence factor of p w.r.t. $r_m$.
Example 4. Let us suppose the predicate p has received three confidence factors $v_1$, $v_2$ and $v_3$ from three rules. If the Łukasiewicz fuzzy logic is used the resulting confidence factor of p is:
\[
CF(p) = CF\Big(\sideset{}{_L}\bigvee_{i=1}^{3} v_i\Big) = \min(1, v_1 + v_2 + v_3)
\]
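Definitions 5 and 6 describe a two-stage computation: a conjunction over the body of each rule, scaled by the rule's own confidence factor, followed by a disjunction over all rules with the same head. A minimal sketch, with function names and numbers of our own choosing:

```python
from functools import reduce


def rule_cf(body_cfs, v, conj):
    """Definition 5: confidence of the body, multiplied by the rule's CF v."""
    return reduce(conj, body_cfs) * v


def predicate_cf(per_rule_cfs, disj):
    """Definition 6: fuzzy disjunction over all rules deriving the predicate."""
    return reduce(disj, per_rule_cfs)


def luk_or(a, b):
    return min(1.0, a + b)


# Example 3 (Goedel): body literals with CF 0.5 and 0.75, rule CF v = 0.5.
cf_example3 = rule_cf([0.5, 0.75], 0.5, min)

# Example 4 (Lukasiewicz): three rules contribute v1, v2, v3.
cf_example4 = predicate_cf([0.25, 0.25, 0.25], luk_or)
cf_saturated = predicate_cf([0.5, 0.5, 0.25], luk_or)
print(cf_example3, cf_example4, cf_saturated)
```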
4 Adaptation of the Magic Sets Method
Introducing uncertainty into a program requires a modified definition of program equivalence. The extended Magic Sets Method has to produce a magic program which is equivalent to the original one.

Definition 7. Let $P_1$ and $P_2$ be different sets of rules. $P_1$ and $P_2$ are equivalent if, for any extensional database D, the programs $P_1 \cup D$ and $P_2 \cup D$ evaluated by means of an arbitrary fuzzy logic produce the same answer, including the uncertainties of the resulting tuples, for any given goal query Q.

The introduction of program equivalence for an arbitrary fuzzy logic in Definition 7 requires a definition of the general fuzzy conjunction.

Definition 8. Let p, q and r be predicates, CF(p), CF(q) and CF(r) their respective uncertainties, and let κ be a function of two variables with the following properties:

1. κ is commutative
2. κ is associative
3. $\kappa(CF(p), 0) = 0$
4. $\kappa(CF(p), 1) = CF(p)$
5. $\max(0, CF(p) + CF(q) - 1) \le \kappa(CF(p), CF(q)) \le \min(CF(p), CF(q))$

We call κ the function of an arbitrary fuzzy conjunction "$\land_K$".

Definition 9. Let $p_1, p_2, \ldots, p_{n-1}, p_n$ be predicates, $CF(p_1), CF(p_2), \ldots, CF(p_{n-1}), CF(p_n)$ their uncertainties, and $K_n$ the function of an arbitrary fuzzy conjunction of n variables "$\bigwedge_K$". We define "$\bigwedge_K$" as follows:
\begin{align*}
CF\Big(\sideset{}{_K}\bigwedge_{i=1}^{n} p_i\Big) &= K_n(CF(p_1), CF(p_2), \ldots, CF(p_{n-1}), CF(p_n)) \\
&= \kappa(CF(p_1), \kappa(CF(p_2), \ldots, \kappa(CF(p_{n-1}), CF(p_n)) \ldots))
\end{align*}
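The five properties of Definition 8 can be tested on a finite grid of confidence values, which gives a quick plausibility check that the Gödel, Łukasiewicz, and product conjunctions are all instances of κ. The sketch is a numerical check on dyadic grid points (so that floating-point comparisons are exact), not a proof; is_fuzzy_conjunction and K are hypothetical helper names.

```python
from functools import reduce
from itertools import product

GRID = [0.0, 0.25, 0.5, 0.75, 1.0]


def is_fuzzy_conjunction(kappa):
    """Check properties 1-5 of Definition 8 on all grid points."""
    for a, b, c in product(GRID, repeat=3):
        if kappa(a, b) != kappa(b, a):                       # commutative
            return False
        if kappa(a, kappa(b, c)) != kappa(kappa(a, b), c):   # associative
            return False
        if not max(0.0, a + b - 1.0) <= kappa(a, b) <= min(a, b):
            return False                                     # bounds
    return all(kappa(a, 0.0) == 0.0 and kappa(a, 1.0) == a for a in GRID)


def K(kappa, cfs):
    """K_n of Definition 9: kappa(CF1, kappa(CF2, ... kappa(CFn-1, CFn)))."""
    *init, last = cfs
    return reduce(lambda acc, x: kappa(x, acc), reversed(init), last)


godel = min
lukasiewicz = lambda a, b: max(0.0, a + b - 1.0)
prod_conj = lambda a, b: a * b

for name, kappa in (("Goedel", godel), ("Lukasiewicz", lukasiewicz),
                    ("product", prod_conj), ("max (a disjunction)", max)):
    print(name, is_fuzzy_conjunction(kappa))
```

The last line of output is False: max violates property 3, which confirms that the check does discriminate between conjunctions and other binary functions.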
Note 4. We do not define the general properties of fuzzy disjunction, because it is not used in this text.

Theorem 1 (Soundness). Let r be a rule with uncertainty and $m\_p_1, m\_p_2, \ldots, m\_p_m$ be magic predicates. The modified rule $r_{mod}$ shall be created by inserting the mentioned magic predicates into the body of the rule r. If all tuples of the magic predicates $m\_p_1, m\_p_2, \ldots, m\_p_m$ satisfy the condition $CF(m\_p_j) = 1$, then the rules r and $r_{mod}$ are equivalent (i.e. produce the same tuples including CF coefficients) for any fuzzy logic.

Proof. The proof shows that rules r and $r_{mod}$ produce the same tuples (inclusive of CF coefficients) if all tuples of the magic predicates $m\_p_1, m\_p_2, \ldots, m\_p_m$ satisfy the equality $CF(m\_p_j) = 1$. Suppose the rules r and $r_{mod}$ have the forms:
    r:      p(X) :- p1(X1), p2(X2), ..., pn(Xn) CF v.
    r_mod:  p(X) :- p1(X1), ..., pn(Xn), m_p1(X1^b), ..., m_pm(Xm^b) CF v.

The following two formulas show the evaluation of the head predicates of rules r and $r_{mod}$ if the Gödel fuzzy conjunction is used. The double underscored results of both formulas indicate the validity of the Theorem in the case of Gödel fuzzy logic.
\[
CF(p) = CF\Big(\sideset{}{_G}\bigwedge_{i=1}^{n} p_i\Big) \cdot v = \underline{\underline{\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) \cdot v}}
\]
\begin{align*}
CF(p) &= CF\Big(\sideset{}{_G}\bigwedge_{i=1}^{n} p_i \;\land_G\; \sideset{}{_G}\bigwedge_{j=1}^{m} m\_p_j\Big) \cdot v \\
&= \min(CF(p_1), \ldots, CF(p_n), CF(m\_p_1), \ldots, CF(m\_p_m)) \cdot v \\
&= \min(CF(p_1), \ldots, CF(p_n), 1, \ldots, 1) \cdot v \\
&= \underline{\underline{\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) \cdot v}}
\end{align*}
If the Łukasiewicz fuzzy conjunction is used, the next formulas give the CF evaluation. The double underscored results show the validity of the Theorem in the case of Łukasiewicz fuzzy logic.
\[
CF(p) = CF\Big(\sideset{}{_L}\bigwedge_{i=1}^{n} p_i\Big) \cdot v = \underline{\underline{\max\Big(0, \sum_{i=1}^{n} CF(p_i) - n + 1\Big) \cdot v}}
\]
\begin{align*}
CF(p) &= CF\Big(\sideset{}{_L}\bigwedge_{i=1}^{n} p_i \;\land_L\; \sideset{}{_L}\bigwedge_{j=1}^{m} m\_p_j\Big) \cdot v \\
&= \max\Big(0, \sum_{i=1}^{n} CF(p_i) + \sum_{j=1}^{m} CF(m\_p_j) - (n + m) + 1\Big) \cdot v \\
&= \max\Big(0, \sum_{i=1}^{n} CF(p_i) + \sum_{j=1}^{m} 1 - (n + m) + 1\Big) \cdot v \\
&= \max\Big(0, \sum_{i=1}^{n} CF(p_i) + m - n - m + 1\Big) \cdot v \\
&= \underline{\underline{\max\Big(0, \sum_{i=1}^{n} CF(p_i) - n + 1\Big) \cdot v}}
\end{align*}
The next two formulas show the evaluation of the head predicates of rules $r$ and $r_{mod}$ if arbitrary fuzzy conjunction (see Definitions 8 and 9) is used. The double underscored results of both formulas indicate the Theorem validity in the case of arbitrary fuzzy logic.

$$CF(p) = CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i\Big) * v = \underline{\underline{K_n(CF(p_1), CF(p_2), \ldots, CF(p_n)) * v}}$$

$$\begin{aligned}
CF(p) &= CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i \wedge_K \mathop{{\bigwedge}_K}\limits_{j=1}^{m} m\_p_j\Big) * v \\
&= K_{n+m}(CF(p_1), \ldots, CF(p_n), CF(m\_p_1), \ldots, CF(m\_p_m)) * v \\
&= K_{n+m}(CF(p_1), \ldots, CF(p_n), 1, \ldots, 1) * v \\
&= \kappa(K_n(CF(p_1), \ldots, CF(p_n)), K_m(1, \ldots, 1)) * v \\
&= \kappa(K_n(CF(p_1), \ldots, CF(p_n)), 1) * v \\
&= \underline{\underline{K_n(CF(p_1), CF(p_2), \ldots, CF(p_n)) * v}}
\end{aligned}$$

The last formula summarizes the result of the whole proof.

$$CF(p) = CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i\Big) * v = CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i \wedge_K \mathop{{\bigwedge}_K}\limits_{j=1}^{m} m\_p_j\Big) * v$$
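The soundness argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the product t-norm merely stands in for an "arbitrary" fuzzy conjunction, and the sketch assumes that 1 is the neutral element of every conjunction (which is how the derivations above use CF = 1 for magic predicates).

```python
import math
from functools import reduce

# Binary fuzzy conjunctions (t-norms). Names are illustrative only.
def godel(a, b):
    return min(a, b)

def lukasiewicz(a, b):
    return max(0.0, a + b - 1.0)

def product(a, b):
    # Stand-in for an arbitrary conjunction; an assumption, not from the paper.
    return a * b

def head_cf(conjunction, body_cfs, v, magic_cfs=()):
    """CF of the head: n-ary conjunction of all body CFs, multiplied by v."""
    all_cfs = list(body_cfs) + list(magic_cfs)
    # Fold the binary conjunction over the body, starting from the neutral element 1.
    return reduce(conjunction, all_cfs, 1.0) * v

body = [0.9, 0.8, 0.7]   # hypothetical CFs of p_1..p_3
v = 0.5                  # hypothetical CF of the rule
for conj in (godel, lukasiewicz, product):
    original = head_cf(conj, body, v)
    modified = head_cf(conj, body, v, magic_cfs=[1.0, 1.0])
    # Magic predicates with CF = 1 leave the head CF unchanged.
    print(conj.__name__, math.isclose(original, modified, abs_tol=1e-12))
```

Comparisons use `math.isclose` because the Łukasiewicz fold adds and subtracts 1 repeatedly, which can introduce floating-point rounding.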
Theorem 2 (Completeness). Let $r$ be a rule with uncertainty and let $m\_p_1, m\_p_2, \ldots, m\_p_m$ be magic predicates. Let the modified rule $r_{mod}$ be created by inserting these magic predicates into the body of the rule $r$. Then, for any fuzzy logic, there exists an uncertainty (CF coefficient) for every magic predicate such that rules $r$ and $r_{mod}$ produce the same tuples, including their CF coefficients.

Proof. The proof determines the values of the CF coefficients for all tuples of magic predicates $m\_p_1, m\_p_2, \ldots, m\_p_m$ when the fuzzy logics introduced above are used. We assume that the rule $r$ with uncertainty has the form:

$$p(X) \mathrel{:-} p_1(X_1),\, p_2(X_2),\, \ldots,\, p_n(X_n) \quad CF\ v.$$

The resulting value of the confidence factor for predicate $p$ is

$$CF(p) = CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i\Big) * v \tag{1}$$

The body of the modified rule contains the predicates $p_1, p_2, \ldots, p_n$ and may contain the magic predicates $m\_p_1, m\_p_2, \ldots, m\_p_m$. The resulting value of the confidence factor for the modified rule is

$$CF(p) = CF\Big(\mathop{{\bigwedge}_K}\limits_{i=1}^{n} p_i \wedge_K \mathop{{\bigwedge}_K}\limits_{j=1}^{m} m\_p_j\Big) * v \tag{2}$$
The original program and its corresponding magic program have to be equivalent (including the resulting values of CF coefficients). This implies that the values of equations (1) and (2) must be identical. We reach this equivalence easily for Gödel fuzzy logic where the fuzzy conjunction is expressed as the minimum.

$$\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) = \min(CF(p_1), \ldots, CF(p_n), CF(m\_p_1), \ldots, CF(m\_p_m))$$

The solution of this equation is the following system of inequalities:

$$\begin{aligned}
\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) &\le CF(m\_p_1) \le 1 \\
\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) &\le CF(m\_p_2) \le 1 \\
&\;\;\vdots \\
\min(CF(p_1), CF(p_2), \ldots, CF(p_n)) &\le CF(m\_p_m) \le 1
\end{aligned}$$

where one solution may be, e.g., $CF(m\_p_i) = 1$ for all $i \in [1..m]$.

We specify possible values of CF coefficients of magic predicates in the body of the modified rule $r_{mod}$ if Łukasiewicz fuzzy logic is used. The resulting CF coefficient value of the original rule head predicate $p$ in the case of Łukasiewicz fuzzy logic has the form:

$$CF(p) = \max\Big(0, \sum_{i=1}^{n} CF(p_i) - n + 1\Big) * v \tag{3}$$

and the resulting CF coefficient value of the modified rule head predicate $p$ in the case of Łukasiewicz fuzzy logic has the form:

$$CF(p) = \max\Big(0, \sum_{i=1}^{n} CF(p_i) + \sum_{j=1}^{m} CF(m\_p_j) - n - m + 1\Big) * v \tag{4}$$
In this case the resulting values of the CF coefficients in equations (3) and (4) have to be identical, too. Equating them yields a simple equation with a unique solution: because every CF value is at most 1, the sum of $m$ such values equals $m$ only if each of them equals 1.

$$\sum_{j=1}^{m} CF(m\_p_j) = m \iff \begin{cases} CF(m\_p_1) = 1 \\ CF(m\_p_2) = 1 \\ \;\;\vdots \\ CF(m\_p_m) = 1 \end{cases}$$
We assume that $\kappa$ resp. $K_n$ is an arbitrary fuzzy conjunction of two resp. $n$ variables (see Definitions 8 and 9). The CF coefficient of the predicate $p$ w.r.t. original rule $r$ has the resulting value:

$$CF(p) = K_n(CF(p_1), CF(p_2), \ldots, CF(p_n)) * v \tag{5}$$

and the CF coefficient of the predicate $p$ w.r.t. modified rule $r_{mod}$ of the magic program has the resulting value:

$$CF(p) = K_{n+m}(CF(p_1), \ldots, CF(p_n), CF(m\_p_1), \ldots, CF(m\_p_m)) * v \tag{6}$$

CF coefficients produced by the equations (5) and (6) must have identical values.

$$\begin{aligned}
K_n(CF(p_1), CF(p_2), \ldots, CF(p_n)) &= K_{n+m}(CF(p_1), \ldots, CF(p_n), CF(m\_p_1), \ldots, CF(m\_p_m)) \\
&= \kappa(K_n(CF(p_1), \ldots, CF(p_n)), K_m(CF(m\_p_1), \ldots, CF(m\_p_m)))
\end{aligned}$$

If we utilize the 4th property of an arbitrary fuzzy conjunction (see Definition 8) we obtain a simple equation from which a unique solution results.

$$K_m(CF(m\_p_1), \ldots, CF(m\_p_m)) = 1 \iff \begin{cases} CF(m\_p_1) = 1 \\ CF(m\_p_2) = 1 \\ \;\;\vdots \\ CF(m\_p_m) = 1 \end{cases}$$
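The difference between the Gödel and Łukasiewicz cases in this proof can be illustrated with a small numeric sketch. The helper names and the CF values below are hypothetical; the n-ary conjunctions follow the closed forms used in equations (3) and (4) and the minimum used for Gödel logic.

```python
import math

def godel_n(values):
    # n-ary Goedel conjunction: the minimum of all CF values.
    return min(values)

def lukasiewicz_n(values):
    # n-ary Lukasiewicz conjunction: max(0, sum - n + 1).
    return max(0.0, sum(values) - len(values) + 1)

def head_cf(conjunction, body_cfs, magic_cfs, v):
    """CF of the head predicate for a rule body extended by magic predicates."""
    return conjunction(list(body_cfs) + list(magic_cfs)) * v

def preserves_cf(conjunction, body_cfs, magic_cf, v=1.0):
    """True if one magic predicate with CF = magic_cf leaves the head CF unchanged."""
    original = head_cf(conjunction, body_cfs, [], v)
    modified = head_cf(conjunction, body_cfs, [magic_cf], v)
    return math.isclose(original, modified, abs_tol=1e-12)

body = [0.9, 0.8]  # hypothetical CFs of p_1 and p_2
for magic_cf in (1.0, 0.85, 0.5):
    print(magic_cf,
          preserves_cf(godel_n, body, magic_cf),
          preserves_cf(lukasiewicz_n, body, magic_cf))
```

Under these assumptions, Gödel logic tolerates any magic CF that is not below the body minimum (here 0.8), whereas Łukasiewicz logic preserves the head CF only for a magic CF of exactly 1, which matches the solutions derived in the proof.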
Now we can look at the consequences of the proofs of Theorems 1 and 2: When we wish to apply the Magic Sets Method together with a fuzzy logic we must set the CF values to 1 for all tuples of magic predicates. This requirement can be easily achieved if we use classic two-valued logic for evaluation of the magic rules, i.e. we do not respect CF coefficients of the non-magic predicates from the bodies of magic rules. Theorem 3. Let P be a logic program with uncertainty and P m be its corresponding magic program. If we use classic two-valued logic for the magic rules evaluation and any fuzzy logic for modified rules evaluation then programs P and P m are equivalent according to Definition 7. Proof. The proof results from the proofs of Theorems 1 and 2.
5 Conclusions
This paper described the application of the Magic Sets Method to fuzzy logic programs. We presented a correct treatment of fuzzy operations that yields a magic program equivalent to the original one. We focused on fuzzy conjunction because it directly influences the evaluation of modified rules.

The theory introduced in the paper was verified experimentally on the deductive database system EDD [7] and applied to a database of bank clients. The logic program appraises bank clients according to several aspects, which are described by rules and their CFs, and separates the clients into safe and unsafe ones. The original program consists of 62 rules and requires more than 30 GB of disk space; its running time might exceed one year. The resulting magic program consists of 125 rules and generates 500 KB of data in the database during evaluation; its running time was less than 2 minutes.

The next challenging extension of the Magic Sets Method concerns negation. A program with negation has to be stratified [6]. A non-stratified program can generate wrong results, or its evaluation may not terminate. Applying the Magic Sets Method to a stratified program does not always produce a stratified magic program. General conditions under which the resulting magic program is stratified or non-stratified have not yet been formulated.

Acknowledgement. This work was supported by the Research Project of Czech Ministry of Education No. MSM235200005 “Information Systems and Technology”.
References

1. Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J.D.: Magic Sets and Other Strange Ways to Implement Logic Programs. Proc. 5th ACM Symp. Principles of Database Systems (1986) 1–15
2. Berka, P.: PKDD ’99 Discovery Challenge: Guide to the Financial Set. http://lisp.vse.cz/pkdd99/Challenge/berka.htm
3. Beeri, C., Ramakrishnan, R.: On the Power of Magic. Proc. 6th ACM Symp. Principles of Database Systems (1987) 269–283
4. Hájek, P., Godo, L.: Deductive Systems of Fuzzy Logic. Tatra Mountains Mathematical Publications 13 (1997) 37–68
5. Lakshmanan, L.V.S., Shiri, N.: A Parametric Approach to Deductive Database with Uncertainty. IEEE Transactions on Knowledge and Data Engineering 13 (2001) 554–570
6. Meskesh, M.: Extending the Stratification Approach for Deductive Database Systems. PhD Thesis, Technische Hochschule, Aachen (1996)
7. Toncar, V.: Datalog Extensions and the Use of Datalog for Large Data Analysis. PhD Thesis, University of West Bohemia, Pilsen (2000)
8. Vojtáš, P.: Uncertain Reasoning with Floating Connectives. Proc. 3rd International Workshop on Artificial Intelligence and Technics (1996) 31–40
9. Wagner, G.: Foundations of Knowledge Systems with Applications to Databases and Agents. Kluwer Academic Publishers (1998)
Information Retrieval Effectiveness of Turkish Search Engines

Yıltan Bitirim¹, Yaşar Tonta², and Hayri Sever³

¹ Department of Computer Engineering, Eastern Mediterranean University, Famagusta, T.R.N.C. (via Mersin 10 Turkey)
[email protected]
² Department of Library Science, Hacettepe University, 06532 Beytepe, Ankara, Turkey
[email protected]
³ Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA
[email protected]
Abstract. This is an investigation of information retrieval performance of Turkish search engines with respect to precision, normalized recall, coverage and novelty ratios. We defined seventeen query topics for Arabul, Arama, Netbul and Superonline. These queries were carefully selected to assess the capability of a search engine for handling broad or narrow topic subjects, exclusion of particular information, identifying and indexing Turkish characters, retrieval of hub/authoritative pages, stemming of Turkish words, correct interpretation of Boolean operators. We classified each document in a retrieval output as being “relevant” or “nonrelevant” to calculate precision and normalized recall ratios at various cut-off points for each pair of query topic and search engine. We found the coverage and novelty ratios for each search engine. We also tested how search engines handle meta-tags and dead links. Arama appears to be the best Turkish search engine in terms of average precision and normalized recall ratios, and the coverage of Turkish sites. Turkish characters (and stemming as well) still cause bottlenecks for Turkish search engines. Superonline and Netbul make use of the indexing information in metatag fields to improve retrieval results.
1 Introduction
According to the web usage statistics that we compiled from various Internet resources and articles¹, the numbers of Internet users, host sites, and documents worldwide are at least 419 million, 120 million, and one billion, respectively. Furthermore, it is estimated that the numbers of hosts and web documents double every year [1]. These facts alone make apparent the need for search engines to retrieve information on the web, along with other types of search systems and web mining agents. There are numerous works in the literature on different perspectives of search engines, such as their functional and architectural views [2], design issues [3,4,5], and performance evaluation [6,7,8,9,10].

It is misleading to regard information retrieval systems and search engines as the same just because both strive to satisfy the information needs of users. Noticeable differences for search engines emerge from raw facts such as the dynamic nature of the web (40 in every month [1]), poor information value (or sparse data) [11], poor authoring quality [1,12], duplication, and meta representation of web pages. These have given rise to new components (or software agents in general), namely crawling and spying components (link checker, page change monitor, validator, and new page detector), and to new retrieval strategies (utilization of multiple evidence on the relevance of web pages [4,13]).

For the above reasons, the development of search engines has its own challenges, even if it continues to benefit from many solid research results of the information retrieval field. A formal evaluation of Turkish search engines is hence an apparent need and valuable for the development of Turkish search engines. In this paper, we discuss the method and configuration of the experiment and summarize the main results of an extensive six-month study of four search engines, namely Arabul, Arama, Netbul, and Superonline.

¹ See the sites at (1) http://www.nua.com/surveys, (2) http://www.netsizer.com/index.html and (3) http://www.inktomi.com/new/press/2000/billion.html for web usage statistics.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 93–103, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Method
We aimed to answer several research questions with regard to the effectiveness of popular Turkish search engines in handling broad as well as narrow information needs of users, in excluding some particular information, and in satisfying one- or two-word query expressions. In addition, we looked at qualitative issues such as the incorporation of a stemming facility for both indexing documents and search terms, correct implementation of Boolean operators, incorporation of a sound-like operator to properly handle accented Turkish characters, and the update (visit) period of crawlers for indexed (existing) links. To assess the coverage and novelty aspects of the Arama, Arabul, Netbul, and Superonline engines, we picked some terms from the top 10 list of the most frequently used search terms. We identified the following seventeen query topics²:

(1) Internet and ethics,
(2) baroque music and its characteristics,
(3) information about the medicine “Prozac” (but not about the rock group “Prozac”),
(4) related works about the evaluation of Turkish search engines on the Internet,
(5) mp3 copies of songs of “Barış Manço” (note the spelling),
(6) mp3 copies of songs of “Baris Manco”,
(7) what is DPT (State Planning Organization)?
(8) general information about “alien”,
(9) general information about “aliens” (note the plural form),
(10) documents mentioning the former president of the Republic of Turkey, “Süleyman Demirel”, AND the current president, “Ahmet Necdet Sezer”,
(11) documents mentioning the former president of the Republic of Turkey, “Süleyman Demirel”, OR the current president, “Ahmet Necdet Sezer”,
(12) approaches of the presidents of the Republic of Turkey “Süleyman Demirel” OR “Ahmet Necdet Sezer” to TEMA³,
(13) information about space,
(14) information about universe,
(15) information about space OR universe⁴,
(16) the relation between “Mustafa Kemal Atatürk” and “Fikriye Hanım”,
(17) information about the Speaker of the Turkish Parliament, “Ömer İzgi”.

² We refer the reader to the site at cmpe.emu.edu.tr/bitirim/engine for further information on our work, e.g., the original Turkish query topic statements or their query expressions for search engines.

The above 17 query topics were converted to structured formal queries to simplify the parsing process for each search engine. Different search engines have different syntactic rules to parse and process search queries. For instance, each search engine treats Boolean operators (AND, OR, NOT) in a somewhat different way.

2.1 Evaluation of Relevance
The relevance judgment was binary, though the normalized recall metric would allow more relevance levels. Documents with the same content but different web addresses (i.e., mirror pages) were considered different documents. Duplicated information items were counted as a single item for relevance judgment. If a retrieved document was not accessible for some reason, or was in a language other than Turkish or English, it was considered a non-relevant information item.

2.2 Performance Measurement
Effectiveness in information retrieval is typically measured by precision and recall, which can be defined for a user query as the proportion of retrieved and relevant documents over the retrieval output and over the relevant documents, respectively. It has been common practice in the evaluation of search engines to exclude recall values for obvious reasons, though there were some recommendations in the literature for estimating the average recall value [14] by the pooling method [15]. We used, however, precision at different sizes of retrieval output, i.e., cut-off points. The precision at different cut-offs can be used to roughly see how the scores of relevant documents are distributed over their ranks. Note that, in our evaluation, when the number of documents retrieved was smaller than the cut-off point at hand, precision was calculated over the total number of documents retrieved.

³ The Turkish Foundation for Combating Soil Erosion, for reforestation and the protection of natural habitats.
⁴ “space” and “universe” are synonymous words.

This score-rank curve is closely related to the normalized recall, say $R_{norm}$ [16]. The normalized recall is based on the optimization of expected search length [17]. In other words, it takes the viewpoint that a retrieval output $\Delta_1$ is better than another one $\Delta_2$ if the user gets fewer non-relevant documents with $\Delta_1$ than with $\Delta_2$. We calculated normalized recall at four cut-off points for each query per search engine in order to parallel the precision values. $R_{norm}$ is defined as

$$R_{norm}(\Delta) = \frac{1}{2}\left(1 + \frac{R^{+} - R^{-}}{R^{+}_{max}}\right)$$

where $R^{+}$ is the number of document pairs where a relevant document is ranked higher than a non-relevant document, $R^{-}$ is the number of document pairs where a non-relevant document is ranked higher than a relevant one, and $R^{+}_{max}$ is the maximal number of $R^{+}$.

We strongly believe that the effectiveness figures of search engines should be accompanied by their coverage, novelty, and recency ratios to obtain a complete picture. The coverage ratio is the proportion of the number of documents (which are previously known as relevant to the user) retrieved to the total number of documents known as relevant. The novelty ratio is the proportion of the relevant documents (which are not previously known to the user) retrieved to the relevant documents retrieved [18,19]. The recency factor is simply the percentage of dead links.

The coverage and novelty ratios of Turkish search engines were measured using the five most frequently searched one-word queries on the Arabul search engine, namely “mp3”, “oyun” (game), “sex”, “erotik” (erotic), and “porno” (porn). These top search terms appear to be consistent with those of globally known search engines [20,21]. We used a pool of 1000 documents for each of the above search terms. All three measures, i.e., coverage, novelty, and recency, were normalized across the corresponding values of the search engines.

Finally, we tested whether Turkish search engines make use of metadata to retrieve documents. For each search engine, we selected a web document containing the meta tags “keyword” and “description”. Then we used the terms that appeared in the document’s metatags and performed a search on each search engine to determine whether metatags are used for retrieval purposes.
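The measures defined in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and sample data are hypothetical, relevance is binary as in Section 2.1, $R^{+}_{max}$ is taken to be the number of relevant documents times the number of non-relevant documents, and the values returned when $R^{+}_{max}$ is zero are an assumed convention.

```python
def precision_at(judgments, k):
    """Precision over the first k results; judgments is a rank-ordered list of bools.
    If fewer than k documents were retrieved, divide by the number retrieved."""
    top = judgments[:k]
    if not top:
        return 0.0
    return sum(1 for rel in top if rel) / len(top)

def normalized_recall(judgments):
    """R_norm = (1 + (R+ - R-) / R+max) / 2 for a strictly ranked, binary-judged output."""
    n_rel = sum(1 for rel in judgments if rel)
    n_non = len(judgments) - n_rel
    if n_rel == 0:
        return 0.0  # assumed convention: empty output or no relevant document
    if n_non == 0:
        return 1.0  # assumed convention: every retrieved document is relevant
    r_plus = r_minus = 0
    non_seen = 0  # non-relevant documents ranked above the current position
    for rel in judgments:
        if rel:
            r_minus += non_seen          # pairs where a non-relevant document is higher
            r_plus += n_non - non_seen   # pairs where this relevant document is higher
        else:
            non_seen += 1
    return 0.5 * (1 + (r_plus - r_minus) / (n_rel * n_non))

def coverage_ratio(retrieved, known_relevant):
    """Share of the previously known relevant documents that were retrieved."""
    return len(set(retrieved) & set(known_relevant)) / len(known_relevant)

def novelty_ratio(retrieved_relevant, known_relevant):
    """Share of the retrieved relevant documents that were not previously known."""
    new = set(retrieved_relevant) - set(known_relevant)
    return len(new) / len(retrieved_relevant)

ranking = [True, False, True, False, False]  # hypothetical judgments for one query
print(precision_at(ranking, 5), normalized_recall(ranking))
```

The single pass in `normalized_recall` avoids enumerating all document pairs, which keeps the computation linear in the length of the retrieval output.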
2.3 Analysis of Data
We analyzed the precision and normalized recall ratios for each search engine to determine whether they differ significantly in terms of retrieval effectiveness. As the distributions of the precision and normalized recall ratios were not normal, we used the nonparametric Kruskal-Wallis and Mann-Whitney statistical tests, which are used for ordinal data. Kruskal-Wallis (H) was used to test whether the precision and normalized recall ratios of each search engine at different cut-off points differed from the others and whether the difference was statistically significant. If so, pair-wise Mann-Whitney tests were applied to determine which search engines accounted for the difference. The same statistical tests were also applied to identify whether there was any difference among the search engines in terms of their recency values. Pearson’s (r) was used to test whether there was any relationship between the precision and normalized recall ratios.
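For readers unfamiliar with these statistics, the Kruskal-Wallis H statistic and Pearson's r can be sketched with the Python standard library alone. This is a hedged illustration rather than the procedure actually run for the paper: all names and sample values are hypothetical, and no p-value is computed (that would additionally require the chi-square distribution with k − 1 degrees of freedom for k groups).

```python
import math

def average_ranks(values):
    """Ranks 1..N over the pooled sample; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H with the standard tie correction."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical precision values for three engines, one list per engine.
print(kruskal_wallis_h([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]))
```

In practice a statistics package would normally be used; the hand-written version only makes the rank-based nature of the test explicit.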
3 Experiment Results
Findings of our experiment and the analysis of results are discussed below.

3.1 The Number of Documents Retrieved by Search Engines
The number of zero retrievals (i.e., no documents retrieved) or retrievals that contain no relevant documents (i.e., the precision ratio is zero) can be used to evaluate the retrieval performance of search engines. In our experiment, Arabul could not retrieve any document for 6 of 17 queries. The same figure was 1 out of 17 for Arama. Although Netbul and Superonline retrieved at least one document for each query, Netbul could not retrieve any relevant documents for 8 of 17 queries. Superonline, Arabul, and Arama could not retrieve any relevant documents for 5, 5, and 3 queries out of 17, respectively. If we examine both zero retrievals and retrievals with no relevant documents, Arabul could not retrieve any relevant documents for 11 out of 17 queries (65%), Netbul for 8 (47%), Superonline for 5 (29%), and Arama for 4 (24%).

The number of relevant documents retrieved for each query on each search engine is given in Table 1. The first number in the row labelled “Total” shows the total number of relevant documents retrieved and the second one (in parentheses) shows the total number of documents retrieved by each search engine. As Table 1 shows, Arama retrieved the highest number of relevant documents for 17 queries (64). Arama’s mean number of relevant documents per search query (3.8) was also higher than those of the other search engines. The total number of documents retrieved by all search engines was 971, of which 169 were relevant. To put it differently, about 5 in 6 documents retrieved by search engines were not relevant.

Table 1. Number of relevant documents retrieved

Query    Arabul    Arama     Netbul    Superonline
1        0         3         0         0
2        1         2         0         2
3        0         3         4         10
4        0         0         0         0
5        0         3         0         0
6        0         4         3         1
7        2         12        1         1
8        6         4         3         4
9        9         4         2         3
10       0         5         2         8
11       0         6         4         10
12       0         0         0         0
13       5         3         2         2
14       0         0         0         1
15       2         8         5         6
16       0         7         0         6
17       0         0         0         0
Total    25 (119)  64 (277)  26 (273)  54 (302)
Average  1.5       3.8       1.5       3.2

3.2 Precision Ratios

Mean precision values of search engines at various cut-off points (for the first 5, 10, 15, and 20 documents retrieved) are shown in Figure 1. Arama has the highest precision ratios at all cut-off points (mean 28%), followed by Superonline (20%), Arabul (15%), and Netbul (11%). Tests showed no statistically significant difference between the precision values of the search engines at cut-off points 10, 15, and 20. Yet a statistically significant difference was observed at cut-off point 5: Arama retrieved more relevant documents per search query than Arabul and Netbul. Otherwise, the search engines scored similar precision ratios at higher cut-off points.

3.3 Normalized Recall Ratios
As we pointed out earlier, the normalized recall ratio measures whether search engines display relevant documents in the top ranks of the retrieval outputs. If a search engine could not retrieve any documents for a search query, the normalized recall value for that query is zero. Mean normalized recall values of search engines at various cut-off points (for the first 5, 10, 15, and 20 documents retrieved) are shown in Figure 2. As Figure 2 shows, Arama has the highest normalized recall values (mean 54%) at all cut-off points. The mean normalized recall value of Arabul is the lowest, as it could not retrieve any documents for 6 of 17 queries.

Tests showed no statistically significant difference between the normalized recall values of the search engines at cut-off point 20. In other words, none of the search engines displayed relevant documents in distinguishably higher ranks than the others in general. Yet Arama scored better than Arabul (at cut-off points 5, 10, and 15) and Netbul (at cut-off point 10), and the differences were statistically significant.

Fig. 1. Mean precision ratios (y-axis: precision; x-axis: cut-off points 5, 10, 15, 20). Legend: arabul (0.0825), arama (0.275), netbul (0.1075), superonline (0.2).

3.4 The Relation between Precision and Normalized Recall Ratios
The relationship between mean precision and mean normalized recall ratios was statistically significant. In other words, when the mean precision value was high, the mean normalized recall value was also high, although the relationship weakened as the cut-off point increased (cut-off 5: Pearson’s r = .97; cut-off 10: r = .89; cut-off 15: r = .70; cut-off 20: r = .61). It appears that the number of relevant documents in the retrieval output tends to decrease as one goes down the list.

3.5 Turkish Character Usage in Search Engines
We used the fifth and sixth queries to examine how Turkish search engines respond to search queries that contain Turkish characters. Both queries were on mp3 copies of songs of the late Turkish pop singer Barış Manço. His name was spelled in two different forms: one with the proper Turkish characters “ı”, “ş” and “ç”, the other with the unaccented versions of the same characters, “i”, “s” and “c”. As shown in Table 1, Arama, Netbul, and Superonline handled the fifth and sixth query expressions in different ways. In the case of Arabul, the corresponding retrieval sets differed, which apparently indicates that the system also responds differently to these two queries.

Fig. 2. Mean normalized recall ratios (y-axis: normalized recall; x-axis: cut-off points 5, 10, 15, 20). Legend: arabul (0.1935), arama (0.5341), netbul (0.2965), superonline (0.3722).

3.6 Coverage, Novelty, and Recency Ratios
The four search engines retrieved a total of 9944 unique relevant documents for all five queries (“mp3”, “oyun” (game), “sex”, “erotik” (erotic), and “porno” (porn)), which were collected in the pool. Superonline retrieved about 40% of all unique documents (Arabul 31%, Arama 24%, and Netbul 7%). As shown in Figure 3, Superonline’s coverage ratio was much higher than that of the other search engines for the most frequently searched queries on the Turkish search engines⁵. Furthermore, it was evident that Netbul had poor coverage of the web domain. We see the same trend for the novelty ratios of the search engines, with slightly changed values.

As shown in the same figure, Arama has the largest ratio (39%) of all broken links. Arabul and Superonline follow Arama with 28% each. Netbul has the lowest ratio (5%). The average numbers of broken links per query were as follows: Arama: 5.1, Superonline: 2.8, Arabul: 1.4, and Netbul: 0.7. In general, one in six (17%) documents retrieved by search engines contained broken links.

⁵ Arama is the indisputable leader in covering documents with Turkish addresses, though we did not publish the results in regard to the .tr domain because of space limitation.

Fig. 3. Macro average of normalized coverage and novelty measures of search engines for a pool of 1000 documents for each of top five search terms. Normalized recency was calculated over seventeen queries. (Bars: Coverage, Novelty, Recency; engines: Netbul, Arama, Arabul, Superonline.)

3.7 Metadata Usage
Web documents contain indexing information in their metatag fields (i.e., metatags for “author”, “description”, “keywords”, and so on). Some search engines make use of indexing information that appears in metatag fields of documents to increase the likelihood of retrieving more relevant documents. To test the use of metadata for retrieval purposes by Turkish search engines, we first identified certain documents with metatag “keywords” field filled. We further checked if these documents were already indexed by each search engine. We found one document for each search engine satisfying both conditions. We then used the key words that appeared in metatag field “keywords” as search queries and tested if each search engine was able to find the document in question. It appears that Arama and Arabul have not benefited from metadata for retrieval purposes whereas Netbul and Superonline took advantage of the indexing information contained in metatags to upgrade the retrieval status of documents.
4 Conclusions
The major findings of our research can be summarized as follows. On average, one in six documents retrieved by Turkish search engines was not available due to dead or broken links. Netbul retrieved fewer documents with dead or broken links than the other search engines did. Some search engines retrieved no documents (so-called “zero retrievals”) or no relevant documents for some queries. On average, five in six documents retrieved by search engines were not relevant. Average precision ratios of the search engines ranged between 11% (Netbul) and 28% (Arama), with Superonline at 20% and Arabul at 15%. Arama retrieved more relevant documents than Arabul and Netbul in the first five documents retrieved. Search engines do not seem to make every effort to retrieve and display the relevant documents in the higher ranks of retrieval results. Average normalized recall ratios ranged between 20% (Arabul) and 54% (Arama), with Superonline at 37% and Netbul at 30%. Arama displayed relevant documents in higher ranks than Arabul and Netbul did. The strong positive correlation between the precision and normalized recall ratios weakened as the cut-off value increased. Search engines were less successful in finding relevant documents for specific queries or queries that contained broad terms. Although nonrelevant documents were higher in number, search engines were more successful in single-term queries or queries with the Boolean “OR” operator. The use of Turkish characters such as “ç”, “ı”, and “ş” in queries still creates problems for Turkish search engines, as retrieval results differed for such queries. For retrieval purposes, Netbul and Superonline seem to index and make use of metadata fields contained in HTML documents under the “keywords” and “description” meta tags. We did not encounter any anomalies in the implementation of Boolean operators.
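Several of the summary figures above can be cross-checked against Table 1. The sketch below transcribes the per-query counts and retrieval totals from that table; the variable and function names are hypothetical and the code is only an arithmetic check, not part of the original study.

```python
# Relevant documents retrieved per query (queries 1-17), transcribed from Table 1.
TABLE_1 = {
    "Arabul":      [0, 1, 0, 0, 0, 0, 2, 6, 9, 0, 0, 0, 5, 0, 2, 0, 0],
    "Arama":       [3, 2, 3, 0, 3, 4, 12, 4, 4, 5, 6, 0, 3, 0, 8, 7, 0],
    "Netbul":      [0, 0, 4, 0, 0, 3, 1, 3, 2, 2, 4, 0, 2, 0, 5, 0, 0],
    "Superonline": [0, 2, 10, 0, 0, 1, 1, 4, 3, 8, 10, 0, 2, 1, 6, 6, 0],
}
# Total documents retrieved per engine (the parenthesized values in Table 1).
RETRIEVED = {"Arabul": 119, "Arama": 277, "Netbul": 273, "Superonline": 302}

def summarize(table):
    """Per-engine totals, means, and number of queries with no relevant document."""
    return {
        engine: {
            "relevant": sum(counts),
            "mean_per_query": round(sum(counts) / len(counts), 1),
            "queries_without_relevant": sum(1 for c in counts if c == 0),
        }
        for engine, counts in table.items()
    }

summary = summarize(TABLE_1)
total_relevant = sum(s["relevant"] for s in summary.values())
total_retrieved = sum(RETRIEVED.values())
nonrelevant_share = 1 - total_relevant / total_retrieved

for engine, stats in summary.items():
    print(engine, stats)
print(total_relevant, total_retrieved, round(nonrelevant_share, 3))
```

The totals reproduce the 169 relevant documents out of 971 retrieved reported in Section 3.1, which corresponds to roughly five in six retrieved documents being non-relevant.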
References

1. M. Kobayashi and K. Takeda. Information retrieval on the web. ACM Computing Surveys, 32(2):144–172, June 2000.
2. J. Jansen. Using an intelligent agent to enhance search engine performance. First Monday, 1996. http://www.firstmonday.dk/issues/issue2 3/jansen/index.html.
3. W. Mettrop and P. Nieuwenhuysen. Internet search engines: Fluctuations in document accessibility. Journal of Documentation, 57:623–651, 2001.
4. W.B. Croft and H. Turtle. A retrieval model for incorporating hypertext links. In Proceedings of ACM Hypertext Conference, pages 213–224, New Orleans, LA, November 1989.
5. V.N. Gudivada, V.V. Raghavan, W.I. Grosky, and R. Kasanagottu. Information retrieval on the world wide web. IEEE Internet Computing, 1(5):58–68, 1997.
6. H. Chu and M. Rosenthal. Search engines for the world wide web: A comparative study and evaluation methodology. In Steve Hardin, editor, Proceedings of the 59th ASIS Annual Meeting, pages 127–135, Baltimore, Maryland, October 1996.
7. H.V. Lerghton and J.V. Srivastava. First 20 precision among WWW search services. Journal of the American Society for Information Science, 50:870–881, 1999.
8. C. Oppenheim, A. Morris, and C. McKnight. The evaluation of WWW search engines. Journal of Documentation, 56:190–211, 2000.
9. J. Savoy and J. Picard. Retrieval effectiveness on the web. Information Processing and Management, 37:543–569, 2001.
10. M. Gordon and P. Pathak. Finding information on the WWW: The retrieval effectiveness of search engines. Information Processing and Management, 35:141–180, 1999.
11. J. S. Deogun, H. Sever, and V. V. Raghavan. Structural abstractions of hypertext documents for web-based retrieval. In Roland R. Wagner, editor, Proceedings of Ninth International Workshop on Database and Expert Systems Applications, (in conjunction with DEXA’98), pages 385–390, Vienna, Austria, August 1998.
12. M.E. Küçük, B. Olgun, and H. Sever. Application of metadata concepts to discovery of internet resources. In Tatyana Yakhno, editor, Advances in Information Systems (ADVIS’00), volume 1909, pages 304–313. Springer Verlag, Berlin, GR, October 2000.
13. J.M. Kleinberg. Authoritative source in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 668–677, 1998.
14. S. C. Clarke and P. Willet. Estimating the recall performance of web search engines. Aslib Proceedings, 49(7):184–189, July/August 1997.
15. D. Hawking, N. Craswell, P. Thislewaite, and D. Harman. Results and challenges in web search evaluation. In D. Harman, editor, Proceedings of the 8th Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, November 1999.
16. Y.Y. Yao. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46:133–145, 1995.
17. W.B. Cooper. Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19:30–41, 1968.
18. S. Lawrence and C.L. Giles. Searching the world wide web. Science, 280(5360):98–100, 3 April 1998. http://www.neci.nec.com/ lawrence/science98.html.
19. R.R. Korfhage. Information Storage and Retireval. Wiley, New York, NY, 1997.
20. B. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: A study of user queries on the web. SIGIR Forum, 32(1):5–17, 1998.
21. C. Silverstein, M. Henziger, H. Marais, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
Comparing Linear Discriminant Analysis and Support Vector Machines
Ibrahim Gokcen and Jing Peng
Dept. of EECS, Tulane University, New Orleans, LA, 70118
Abstract. Both Linear Discriminant Analysis and Support Vector Machines compute hyperplanes that are optimal with respect to their individual objectives. However, there can be vast differences in performance between the two techniques depending on the extent to which their respective assumptions agree with problems at hand. In this paper we compare the two techniques analytically and experimentally using a number of data sets. For analytical comparison purposes, a unified representation is developed and a metric of optimality is proposed.
1 Introduction
The goal of pattern classification is to assign an input feature vector $x = (x_1, \ldots, x_l)$ in $\mathbb{R}^l$, representing an object, to a member of the class set $\{\omega_i\}_{i=1}^{J}$. This goal can be accomplished by inducing a classifier from a given set of training samples. We assume that there are only two classes and the class set can be represented as $\{-1, 1\}$. Different methods for classifier design have been proposed over the years, such as Linear Discriminant Analysis (LDA), Support Vector Machines (SVMs) and Nearest Neighbor (NN) classification [5,6,7,10]. Each such method has its advantages and disadvantages depending on how well its assumptions match reality. For example, both LDA and SVMs are optimal with respect to their individual objectives. However, there can be vast differences in performance between the two methods. While the difference between the two may seem obvious, there is little work comparing the two methods in the literature. There has also been little work analyzing the respective objectives of the two methods to gain better insight into them [2,4]. In this paper, we present a comparative study of local LDA and local SVMs in terms of their empirical evaluation on a number of classification tasks and their optimality in terms of having maximal margin. We also consider the case in which the objectives of the two methods can be achieved at the same time. This can be analyzed by finding the conditions under which the normals computed by the two methods are the same. Optimality in terms of margin can be analyzed by comparing the magnitudes of the normals computed by the methods. This study provides a basis upon which to exploit their differences and similarities for practical application areas such as Information Retrieval, Pattern Recognition and Data Mining.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 104–113, 2002. © Springer-Verlag Berlin Heidelberg 2002
The paper is organized as follows. In the next section we give the theory behind the two statistical methods, SVMs and LDA, and briefly explain the construction of local decision functions. A rigorous comparison between the methods is given in Sect. 3, together with a unified way to represent the normals computed by the methods. In Sect. 4 we give the properties of the UCI data, explain how they are split into training and test sets, and report the classification accuracies. Several runs were performed to make the results statistically meaningful. The paper concludes with a discussion of the results and future research directions.
2 Theoretical Framework
The basic tools used in this paper are SVMs and LDA, so we first give a brief account of the theory behind them.
2.1 LDA
In LDA, we project data onto a single dimension, where class assignment is made for a given test point [4]. This dimension is computed according to
$$w = \Sigma^{-1}(\mu_1 - \mu_2) \tag{1}$$
where $\Sigma$ is the covariance matrix and $\mu_i$ are the means of the two classes. The vector $w$ represents the same direction as the discriminant in the Bayes classifier along which the data has the maximum separation, provided that the two classes have the same covariance. The Bayes classifier is known to optimize the average generalization error over all learning algorithms that can be defined in the hypothesis space [8]. Therefore, the hyperplane determined by $w$ and the likelihood of the two classes is optimal. In the sphered space (where the radius of the ball that contains the training points is 1), (1) reduces to the difference between the means. That is, if we transform $x$ through
$$\tilde{x} = \Sigma^{-1/2} x, \tag{2}$$
we have
$$w = \mu_1 - \mu_2 \tag{3}$$
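The relation between (1) and (2)-(3) can be checked numerically. The following is a minimal sketch for 2-D data, assuming a pooled covariance with equal weight per point (the helper names and the toy data are illustrative, not taken from the paper). It projects a query with $w$ from (1) in the original space and with the difference of means in the sphered space; the two scores should agree.

```python
import math

def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def pooled_covariance(class1, class2):
    """Pooled within-class covariance (equal weight per point; an assumption)."""
    scatter = [[0.0, 0.0], [0.0, 0.0]]
    for points in (class1, class2):
        m = mean(points)
        for p in points:
            d = [p[0] - m[0], p[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    scatter[a][b] += d[a] * d[b]
    n = len(class1) + len(class2)
    return [[scatter[a][b] / n for b in range(2)] for a in range(2)]

def inverse(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def inverse_sqrt(m):
    """Sigma^(-1/2) for a symmetric positive definite 2x2 matrix.

    Uses the closed form sqrt(M) = (M + s*I) / t with s = sqrt(det M)
    and t = sqrt(trace M + 2*s).
    """
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    root = [[(m[0][0] + s) / t, m[0][1] / t],
            [m[1][0] / t, (m[1][1] + s) / t]]
    return inverse(root)

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_direction(class1, class2):
    """Equation (1): w = Sigma^(-1) (mu1 - mu2)."""
    m1, m2 = mean(class1), mean(class2)
    sigma = pooled_covariance(class1, class2)
    return matvec(inverse(sigma), [m1[0] - m2[0], m1[1] - m2[1]])

def sphered_direction(class1, class2):
    """Equations (2)-(3): difference of the class means after sphering."""
    transform = inverse_sqrt(pooled_covariance(class1, class2))
    m1 = mean([matvec(transform, p) for p in class1])
    m2 = mean([matvec(transform, p) for p in class2])
    return [m1[0] - m2[0], m1[1] - m2[1]], transform

# Toy data, chosen only for illustration.
class1 = [(2.0, 3.0), (3.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
class2 = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
query = (2.5, 2.0)

w = lda_direction(class1, class2)
w_sphered, transform = sphered_direction(class1, class2)
score_original = dot(w, query)
score_sphered = dot(w_sphered, matvec(transform, query))
```

The agreement follows from $\Sigma^{-1/2}\Sigma^{-1/2} = \Sigma^{-1}$, so sphering changes the coordinates but not the projection scores.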
In practice, data are often highly non-linear. In such situations, LDA cannot be expected to work well. As such, $w$ is unlikely to provide any useful discriminant information. On the other hand, piecewise local hyperplanes can approximate any decision boundary, thereby enabling $w$ to capture local discriminant information. In fact, more often than not, local linear decision surfaces provide better approximations than non-linear global decision surfaces. Vapnik states that local risk minimization has an advantage when, on the basis of the given structure on the set of functions, it is impossible to approximate the desired function well using a given number of observations. However, it may be possible to
provide a reasonable local approximation to the desired function at any point of interest [14]. In our approach, the covariance matrices are determined locally, using some neighborhood around the test point. That is,
$$\hat{\Sigma} = \sum_{j=1}^{J} \sum_{y_i = j} p_i (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^t, \tag{4}$$
where $\hat{\mu}_j$ represent the class means, and $p_i$ the relative occurrence of $x_i$ in class $j$. While we do not have a strong theory on which to base the selection of the number of nearest neighbors $K_L$ that determines the neighborhood, cross-validation can be used to determine its optimal value. In the experiments reported here, $K_L$ is determined empirically.
In case $\hat{\Sigma}$ is singular, we compute the pseudo-inverse of $\hat{\Sigma}$ by
$$\hat{\Sigma}^{+} = V \begin{pmatrix} \Lambda^{-1} & 0 \\ 0 & 0 \end{pmatrix} V^t, \tag{5}$$
where $\Lambda = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_r)$. In case $\hat{\Sigma}$ is ill-conditioned, we set a threshold of 0.05 to zero small singular values.
2.2 SVMs
SVMs are based on the Structural Risk Minimization (SRM) principle and aim at maximizing the margin between the points of two classes by solving a convex quadratic programming problem. The solution to that problem gives us the hyperplane with the maximum margin attainable between the two classes. SRM is a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. Vapnik showed that the generalization error is bounded by a number proportional to the ratio $R^2/\gamma^2$, where $R$ is the radius of the sphere that contains all training examples and $\gamma$ is the margin [14]. Therefore, in order to have a tighter bound and better generalization, we need to reduce the radius while at the same time maximizing the margin, which is the goal of the SRM principle and of SVMs. The convex quadratic programming problem of SVMs (convexity implies that the optimum solution is global and unique) can be represented with
$$L(w, b, \xi, \alpha) = \frac{1}{2}\langle w \cdot w \rangle + \frac{C}{2}\sum_{i=1}^{l} \xi_i^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 + \xi_i \right], \tag{6}$$
where the first term is the objective function (inverse of margin with $\frac{1}{2}$ in front for convenience) to be minimized and the second and third terms denote the constraints that have to be satisfied [3]. $\alpha_i \ge 0$ are the Lagrange multipliers, $C$ is the misclassification cost and $\xi_i$ are the slack variables.
To solve (6), we have to set the derivative of $L$ with respect to $w$, $b$ and $\xi$ to 0. The resulting equations give us two important equalities
$$w = \sum_{i=1}^{l} y_i \alpha_i x_i \tag{7}$$
$$0 = \sum_{i=1}^{l} y_i \alpha_i \tag{8}$$
Using (7) to compute $w$, we can determine the decision boundary as
$$f(x) = w \cdot x + b = \sum_{i=1}^{l} y_i \alpha_i \, x_i \cdot x + b \tag{9}$$
Instead of the dot product in (9), we can use a kernel $K$ [13] that complies with Mercer’s Theorem [11]. The simplest such kernel would be the linear kernel ($K(x_i, x) = x_i \cdot x$). Another kernel that we can use is the so-called Radial Basis (RB) kernel ($K(x_i, x) = e^{-\|x_i - x\|^2/\sigma^2}$). The optimality conditions that accompany (7) and (8) imply that either a point is a support vector or its corresponding Lagrange coefficient is 0. This observation leads to the sparseness property of SVMs. We do not need to use all of the training points to come up with a decision function that maximizes the margin between classes. Support vectors are sufficient to find the weight vector, which is normal to the decision boundary. The exact position of the decision surface is determined by the value of $b$, which is given in [3]. Equations (7) and (8) can be used in understanding some important characteristics of the computed hyperplane and comparing it with hyperplanes computed by other methods.
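As a minimal sketch of (9) with the dot product replaced by a kernel, the following evaluates the decision function from dual coefficients and skips points whose coefficient is zero, which is the sparseness property in action. The function names, the toy points and the coefficients are assumptions for illustration: the coefficients were worked out by hand for a hard-margin problem on two points, not produced by the soft-margin solver used in the paper.

```python
import math

def linear_kernel(a, b):
    """K(a, b) = a . b"""
    return sum(ai * bi for ai, bi in zip(a, b))

def rb_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / sigma^2)"""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / sigma ** 2)

def decision_function(x, points, labels, alphas, b, kernel=linear_kernel):
    """Equation (9) with a kernel: f(x) = sum_i y_i alpha_i K(x_i, x) + b."""
    total = b
    for x_i, y_i, alpha_i in zip(points, labels, alphas):
        if alpha_i == 0.0:
            continue  # non-support vectors do not contribute
        total += y_i * alpha_i * kernel(x_i, x)
    return total

# Hand-picked toy problem (hypothetical values): the first two points are
# support vectors, the third lies beyond the margin and has alpha = 0.
points = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
labels = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]
b = 0.0

f_pos = decision_function((1.0, 1.0), points, labels, alphas, b)
f_neg = decision_function((-1.0, -1.0), points, labels, alphas, b)
f_rb = decision_function((1.0, 1.0), points, labels, alphas, b, kernel=rb_kernel)
```

With the linear kernel the two support vectors sit exactly on the margin ($f = \pm 1$), and the coefficients satisfy (8) because the positive and negative classes carry the same total weight.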
3 LDA vs. SVMs
While both LDA and SVMs compute optimal hyperplanes with respect to their individual objectives, they differ significantly in several aspects. The hyperplane computed by LDA is optimal only when the covariance matrices for all of the classes are identical, an assumption that is often violated in practice. Computationally, if the dimension of the feature space is large, there will be insufficient data to locally estimate the $O(n^2)$ elements of the within sum-of-squares matrices, thereby making them highly biased. Moreover, in high dimensions the within sum-of-squares matrices tend to be ill-conditioned. In such situations, one must decide at what threshold to zero small singular values in order to obtain better solutions. SVMs, on the other hand, compute optimal hyperplanes with respect to margin maximization without making such an assumption. Furthermore, the generalization ability of the hyperplanes determined by SVMs is high even in high dimensions. This is because SVMs only have to calculate the inner products between feature vectors, as evidenced by (9).
Now consider the normals computed by LDA and SVMs. Because the optimal hyperplane (maximizing the margin and thus minimizing the magnitude of the normal) is unique, we have
$$\|w_{SVM}\| \le \|w_{LDA}\| \tag{10}$$
For the equal covariance case, we know that the normal computed by LDA is also optimal in the Bayesian sense. We can argue that the inequality holds in general, since non-equal covariances result in different normals, which would not be equal to the optimum value. The equality, however, holds only for some special cases. To see what these cases are and to have a better understanding of the methods, we can adopt a representation based on vectors and matrices. We will show that such a representation simplifies the equations and gives insight into the solution by reducing the problem to two simultaneous equations. In the sphered space, the equality between the normals can be written as
$$\mu_1 - \mu_2 = \sum_{i} y_i \alpha_i x_i \tag{11}$$
We assume that all the input points are transformed according to (2) before proceeding. Therefore we are going to use $x$ instead of $\tilde{x}$ from now on. If we rearrange both sides of (11) we have
$$\frac{1}{N_1}\sum_{y=1} x_i - \frac{1}{N_2}\sum_{y=-1} x_j = \sum_{y=1} \alpha_i x_i - \sum_{y=-1} \alpha_j x_j \tag{12}$$
where $N_i$ is the number of training points that belong to class $\omega_i$ ($N_1 + N_2 = l$, where $l$ is the total number of training points), $\alpha_{i,j}$ are the Lagrange coefficients (only support vectors have non-zero coefficients, so only the support vectors contribute to the right-hand sides of the above equations) and $x_{i,j}$ are the training points that are used to compute the decision boundary. Also from (8) we have
$$\sum_{y=1} \alpha_i - \sum_{y=-1} \alpha_j = 0 \tag{13}$$
If we rearrange the equation we get
$$\sum_{y=1} \alpha_i = \sum_{y=-1} \alpha_j \tag{14}$$
The right-hand side of (12) comes from the SVMs and includes the sum $\sum_{i=1}^{N_i} \alpha_i x_i$. This sum can be represented as
$$\begin{pmatrix} \alpha_{i1} & \alpha_{i2} & \cdots & \alpha_{iN_i} \end{pmatrix} \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{iN_i} \end{pmatrix}^T = \alpha_i X_i \tag{15}$$
where $\alpha_i$ are the $1 \times N_i$ Lagrange coefficient vectors and $X_i$ are the $N_i \times d$ training point matrices for classes $\omega_i$. Here $d$ represents the dimensionality of the training data. All the vectors are assumed to be column vectors in all the equations. On
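the other hand, the left-hand side of (12) comes from LDA and includes the term $\frac{1}{N_i}\sum x_i$, which can be represented as
$$\begin{pmatrix} \frac{1}{N_i} & \frac{1}{N_i} & \cdots & \frac{1}{N_i} \end{pmatrix} \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{iN_i} \end{pmatrix}^T = \frac{1}{N_i}\Gamma_i X_i \tag{16}$$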
where $\Gamma_k$ is the $1 \times k$ ($k = N_i$) vector of ones (transpose of $\Gamma_0$ in [6]) for which the number of ones is equal to $k$ and $X_i$ are the matrices defined above. With these, (12) becomes
$$\alpha_1 X_1 - \alpha_2 X_2 = \frac{1}{N_1}\Gamma_1 X_1 - \frac{1}{N_2}\Gamma_2 X_2 \tag{17}$$
The right-hand side of this equation depends only on the training data and the values $N_1$ and $N_2$, so once the training points are drawn according to a distribution $P_X$, it is a constant. Another equality that comes from (13) can be written as
$$\alpha_1 \Gamma_1^T - \alpha_2 \Gamma_2^T = 0 \tag{18}$$
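Conditions (17) and (18) can be evaluated for any candidate coefficient vectors. The sketch below computes their residuals with plain lists; the function names and the toy matrices are illustrative assumptions, and the uniform coefficients $1/N_i$ used in the example are simply one choice that satisfies both equations when $N_1 = N_2$, not the output of an SVM solver.

```python
def row_times_matrix(row, matrix):
    """Multiply a 1 x N row vector by an N x d matrix, giving a 1 x d row."""
    d = len(matrix[0])
    return [sum(r * x[k] for r, x in zip(row, matrix)) for k in range(d)]

def residuals(alpha1, alpha2, X1, X2):
    """Return (residual of (17) as a d-vector, residual of (18) as a scalar)."""
    n1, n2 = len(X1), len(X2)
    gamma1, gamma2 = [1.0] * n1, [1.0] * n2  # 1 x N_i vectors of ones
    lhs = [a - b for a, b in zip(row_times_matrix(alpha1, X1),
                                 row_times_matrix(alpha2, X2))]
    rhs = [a / n1 - b / n2 for a, b in zip(row_times_matrix(gamma1, X1),
                                           row_times_matrix(gamma2, X2))]
    r17 = [l - r for l, r in zip(lhs, rhs)]
    r18 = sum(alpha1) - sum(alpha2)  # alpha_1 Gamma_1^T - alpha_2 Gamma_2^T
    return r17, r18

# Toy data: two points per class, rows are training points (hypothetical).
X1 = [[2.0, 3.0], [4.0, 5.0]]
X2 = [[0.0, 0.0], [2.0, 1.0]]

uniform = residuals([0.5, 0.5], [0.5, 0.5], X1, X2)
skewed = residuals([0.9, 0.1], [0.5, 0.5], X1, X2)
```

Here `uniform` has zero residuals for both equations, while `skewed` still satisfies (18) but violates (17), so the corresponding normal would differ from the LDA normal.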
With (17) and (18) we represented (12) and (14) in a different way. This representation reduces the equations to a set of simultaneous equations. Once the training data is drawn according to some probability distribution, there are only two unknowns: $\alpha_1$ and $\alpha_2$. The values of the unknowns are found at the end of the Lagrangian optimization. Given a training set, if all Lagrange coefficients are equal to each other, the SVM normal reduces to the LDA normal. This can be seen by assigning $\frac{2}{l}$ to all the elements of both $\alpha_1$ and $\alpha_2$, in the case $N_1 = N_2 = \frac{l}{2}$. An example distribution where the SVM and LDA hyperplanes are coincident is given in Fig. 1. Two different classes are shown with circles and crosses and the line between the training points is the hyperplane computed by the SVM. For more general cases, it suffices to put the resulting $\alpha_1$ and $\alpha_2$ in (17) and (18) to find the relationship between the respective normals. In those cases, the following metric can be used to determine how optimal the LDA hyperplane is in terms of margin maximization:
$$\beta = \frac{\|w_{SVM}\|}{\|w_{LDA}\|} \tag{19}$$
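Fig. 1. Example distribution
In (19), $\beta$ is 1 if the LDA hyperplane maximizes the margin as much as the SVM hyperplane. In other cases, $\beta$ is between 0 and 1. We already know that the SVM hyperplane gives us the maximum margin and that the margin is proportional to the inverse of the magnitude of the normal $w_{SVM}$. Using (3), (7), the symmetry of kernels and the identity $w \cdot w = \|w\|^2$, we can incorporate kernels into the metric $\beta$ as shown in (20)-(23).
$$\beta = \sqrt{\frac{\sum_{i,j} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j}{(\mu_1 - \mu_2) \cdot (\mu_1 - \mu_2)}} \tag{20}$$
$$= \sqrt{\frac{\sum_{i,j} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j}{\frac{1}{N_1^2}\sum_{i,i'=1}^{N_1} x_i \cdot x_{i'} - \frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} (x_i \cdot x_j + x_j \cdot x_i) + \frac{1}{N_2^2}\sum_{j,j'=1}^{N_2} x_j \cdot x_{j'}}} \tag{21}$$
$$= \sqrt{\frac{\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij}}{\frac{1}{N_1^2}\sum_{i,i'=1}^{N_1} K_{ii'} - \frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} (K_{ij} + K_{ji}) + \frac{1}{N_2^2}\sum_{j,j'=1}^{N_2} K_{jj'}}} \tag{22}$$
$$= \sqrt{\frac{\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij}}{\frac{1}{N_1^2}\sum_{i,i'=1}^{N_1} K_{ii'} - \frac{2}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} K_{ij} + \frac{1}{N_2^2}\sum_{j,j'=1}^{N_2} K_{jj'}}} \tag{23}$$
In the denominators, points that belong to the first class are indexed by $i$ and $i'$ and points that belong to the second class are indexed by $j$ and $j'$; the sums in the numerators run over all pairs of training points. $K$ is the symmetric kernel and $K_{ij} := K(x_i, x_j)$. Equation (23) allows us to represent the optimality metric $\beta$ in a compact way using kernels.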
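The metric can be computed from kernel values alone. The sketch below is one way to do so, under the assumptions that the data are already sphered, that labels are $\pm 1$, and that the squared norm of the difference of the class means is expanded over all pairs of points within and across the classes. The function names, data and coefficients are illustrative; the coefficients are set by hand to $1/N_i$ so that the SVM-style normal coincides with the difference of the means.

```python
import math

def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def beta(points, labels, alphas, kernel=linear_kernel):
    """Ratio ||w_SVM|| / ||w_LDA|| computed from kernel evaluations only."""
    l = len(points)
    # ||w_SVM||^2 = sum over all pairs of y_i y_j alpha_i alpha_j K_ij
    numerator = sum(labels[i] * labels[j] * alphas[i] * alphas[j]
                    * kernel(points[i], points[j])
                    for i in range(l) for j in range(l))
    class1 = [p for p, y in zip(points, labels) if y == 1]
    class2 = [p for p, y in zip(points, labels) if y == -1]
    n1, n2 = len(class1), len(class2)
    # ||mu1 - mu2||^2 expanded into within-class and cross-class kernel sums
    k11 = sum(kernel(a, b) for a in class1 for b in class1)
    k22 = sum(kernel(a, b) for a in class2 for b in class2)
    k12 = sum(kernel(a, b) for a in class1 for b in class2)
    denominator = k11 / n1 ** 2 - 2.0 * k12 / (n1 * n2) + k22 / n2 ** 2
    return math.sqrt(numerator / denominator)

# Toy data (hypothetical), two points per class.
points = [(2.0, 3.0), (4.0, 5.0), (0.0, 0.0), (2.0, 1.0)]
labels = [1, 1, -1, -1]

beta_equal = beta(points, labels, [0.5, 0.5, 0.5, 0.5])
beta_half = beta(points, labels, [0.25, 0.25, 0.25, 0.25])
```

With coefficients equal to $1/N_i$ the two normals coincide and the ratio is 1; halving all coefficients halves the numerator norm, so the ratio drops to 0.5.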
4 Empirical Evaluation
In this section we compare the classification methods using a number of data sets. The methods are: (1) LDA - local linear classifier computed by LDA. For each input query, we use the $K_L$ nearest neighbors of the query to compute (1), from which the local linear classifier for classification is obtained. (2) SVM-L - local linear classifier computed by SVMs. For each input query, a linear SVM from the $K_L$ nearest points to the query is built to classify the training points. We use SVMlight [9]. (3) SVM-R - SVM classifier using radial basis kernels. In all the experiments, the features were first normalized over the training data to have zero mean and unit variance, and the test data features were normalized using the corresponding training mean and variance. Procedural parameters
for each method were determined empirically through cross-validation. Also, in all the experiments where SVMlight was involved, we did not attempt to estimate optimal values for ε. We instead used its default value (0.001). The values of $\gamma$ in the radial basis kernel $\exp(-\gamma\|x - x'\|^2)$ and $C$ (trade-off between training error and margin) that affect the performance of SVM-R were chosen through cross-validation for each problem. Similarly, optimal procedural parameters for each method were selected through experiments for each problem. In Sect. 4.2, the classification accuracies for the different methods are given in tabular format.
4.1 Data Sets
We use 10 data sets to assess the performance of the methods. The first 9 are taken from the UCI Repository of Machine Learning Databases [12]. The last one is a simulated data set that is created according to a known probability distribution. Being able to adjust different parameters of the data enables us to better compare the methods. Splitting the data into training and test sets is done based on the number of data points in a given data set. We use the following splitting schemes:
– For the data sets that have fewer than 400 data points (Ionosphere, Hepatitis, Vote, Sonar, Iris, and Liver data), we randomly chose 60% as the training points and 40% as the test points.
– For the data sets that have more than 400 data points (Pima, Cancer and Oq data), we randomly chose 200 training points and 200 test points. Training and test sets are mutually exclusive.
– For the simulated data, we create 100 training and test points for each class using different means for each dimension ($\mu_{d1}$ and $\mu_{d2}$). Moreover, we can vary the first two eigenvectors of each distribution ($v_{d1}$ and $v_{d2}$) to end up with different distribution shapes for different classes. Means and eigenvectors constitute parameter sets $P_i = \{\mu_{d1}, \mu_{d2}, v_{d1}, v_{d2}\}$ and we used two such sets, $P_1$ and $P_2$, to test the methods. $P_1$ denotes a distribution with more overlapping classes than $P_2$.
4.2 Results
Table 1 shows that local linear SVMs outperform local linear LDA in all standard benchmark data sets except for pima and the simulated data. Some of the data sets are not linearly separable, which is the reason for using soft margin SVMs. The results are significant in terms of both error percentages and error variances over 20 independent runs. SVMs tend to perform worse for linearly non-separable cases even in a local setting. This can be seen by looking at the results on $P_1$ and $P_2$. $P_1$ has more overlapping classes and therefore SVMs perform worse. However, they still perform better than LDA, as can be seen in the cases of the liver and hepatitis data. Although similar comparisons were made before, to our knowledge, no previous study has focused on computing the decision boundaries locally. For both methods, we use the number of neighbors that gives the best
classification results among many possible choices to compute the covariance matrices and decision boundaries. The number of neighbors can be chosen to be any number up to the total number of training points in the data set. The RB kernel is used only as a reference point; our actual goal is to compare local linear SVMs and local linear LDA. The optimality metric we propose provides a way to compare different LDA solutions with respect to the maximal margin classifier. The greater the value of the metric, the better the LDA solution is. In the formulas we assume that sphering is done in the feature space, and since the kernel matrix can be considered as a covariance matrix, sphering can be done without further effort [1]. The unified matrix representation developed in Sect. 3 allows us to see important characteristics of hyperplane coincidence. The LDA and SVM solutions become coincident when all the points are support vectors, that is, when the features are not correlated. In these types of problems, the LDA solution is good enough. In other cases, the goodness of the LDA solution can be determined using the optimality metric. Thus, we have provided both an empirical and an analytical comparison of the two methods and demonstrated their relationship in the margin maximization sense, using a simple metric.
Table 1. Results
Data SVM-L µ σ SVM-R µ σ LDA µ σ
iris ion hep liver sonar vote cancer oq pima simul. P1 P2
4.88 5.39 15.24 26.97 14.40 3.71 2.97 3.07 23.50
9.39 20.31 17.30 30.28 26.58 8.33 9.09 9.87 24.37
4.75 6.70 14.60 26.56 13.27 3.82 3.07 3.60 24.60
2.61 1.43 4.52 2.77 3.71 1.75 0.78 1.59 2.26
44.07 43.15
2.32 45.92 3.89 45.47
3.11 0.96 3.93 2.46 5.06 1.78 0.62 1.35 2.41
4.77 3.28 4.86 2.87 4.71 2.09 2.10 2.57 1.99
3.45 44.62 2.91 3.55 51.97 8.73
5 Summary and Conclusion
In this paper we systematically compare classifiers based on LDA and SVMs using a number of data sets. The experimental results show that local linear classifiers based on SVMs outperform local linear classifiers based on LDA. Furthermore, the improvement in performance is statistically significant over most data sets. A comparison of the normals computed by the two methods in the optimality
sense is also made by considering the case, for which the normals are coincident. This is done by adopting a unified representation. If all points equally contribute to the decision boundary computation, SVMs reduce to LDA. More general cases are also considered and an optimality metric that uses kernels is proposed. Further research on optimality comparison will be our future work.
References
1. Bach, F. R., Jordan, M. I.: Kernel Independent Component Analysis. Technical Report No. UCB/CSD-01-1166, UC Berkeley (2001)
2. Burges, C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121–167
3. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines. Cambridge University Press (2000)
4. Duda, R. O., Hart, P. E.: Pattern Classification and Scene Analysis. John Wiley & Sons Inc. (1973)
5. Friedman, J. H.: Flexible Metric Nearest Neighbor Classification. Tech. Report, Dept. of Statistics, Stanford University (1994)
6. Fukunaga, K.: Statistical Pattern Recognition. Academic Press (1990)
7. Hastie, T., Tibshirani, R.: Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. on Pattern Analysis and Machine Intelligence 18:6 (1996) 607–615
8. Herbrich, R., Graepel, T., Campbell, C.: Bayes Point Machines. Journal of Machine Learning Research 1 (2001) 245–279
9. Joachims, T.: Making large-scale SVM-learning practical. Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA (1999) 169–184
10. McLachlan, G. J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
11. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Royal Soc. London, A 209 (1909) 415–446
12. Merz, C. J., Murphy, P. M., Aha, D. W.: UCI repository of machine learning databases. Department of Information and Computer Science, University of California at Irvine, Irvine, CA (1997)
13. Schölkopf, B., Smola, A. J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
14. Vapnik, V. N.: The Nature of Statistical Learning Theory. Springer-Verlag New York Inc. (1995)
Cross-Language Information Retrieval Using Multiple Resources and Combinations for Query Expansion
Fatiha Sadat¹, Masatoshi Yoshikawa¹,², and Shunsuke Uemura¹
¹ Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
² Information Technology Center, Nagoya University
{fatia-s, yosikawa, uemura}@is.aist-nara.ac.jp
Abstract. As Internet resources become accessible to more and more countries, there is a need to develop efficient methods for information retrieval across languages. In the present paper, we focus on query expansion techniques to improve the effectiveness of information retrieval. A combination of dictionary-based translation and statistics-based disambiguation is indispensable to overcome translation ambiguity. We propose a model that uses multiple sources for query reformulation and expansion to select expansion terms and retrieve the information needed by a user. Relevance feedback, thesaurus-based expansion, and a new feedback strategy based on the extraction of domain keywords to expand the user’s query are introduced and evaluated. We tested the effectiveness of the proposed combined method by applying it to French-English information retrieval. Experiments using the CLEF data collection showed that the proposed combined query expansion techniques are highly effective.
1 Introduction
With the explosive growth of international users, distributed information and the availability of linguistic resources for research, all accessible through the World Wide Web, information retrieval has become a crucial task: it must fulfill users’ needs to find, retrieve and understand relevant information, in whatever language and form. Cross-Language Information Retrieval (CLIR) consists of providing a query in one language and searching document collections in one or multiple languages. Therefore, some form of translation is required. In this paper, we focus on query translation using bilingual Machine Readable Dictionaries (MRDs), combined with statistics-based disambiguation to avoid polysemy after translation. Automatic query expansion, which is known to be among the most important methods for overcoming the word mismatch problem in information retrieval, is a major interest of this work. The proposed study is general across languages in information retrieval; however, we have conducted experiments and evaluations with an application to the French and English languages.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 114–122, 2002. © Springer-Verlag Berlin Heidelberg 2002
The rest of this paper is organized as follows: Section 2 gives an overview for query translation and disambiguation. Query expansion techniques with different combinations are introduced in Section 3. Experiments and evaluations are discussed in Section 4. Section 5 concludes the paper.
2 Query Translation / Disambiguation in CLIR
In our approach, query terms first go through a simple stemming process that replaces each term with its inflectional root, removes most plural word forms, replaces each verb with its infinitive form and removes stop words and stop phrases. The next step is a term-by-term translation using a bilingual machine-readable dictionary. Words that are missing from the dictionary but essential for the correct interpretation of the query can be handled by automatic compensation through a synonym dictionary for that language or through an existing monolingual thesaurus. This requires an extra step: when a query term is missing from the bilingual machine-readable dictionary, it is looked up in the synonym dictionary or thesaurus to find equivalent terms or synonyms before the dictionary translation.
A disambiguation method [5] that filters the candidates, selects the best translations and creates a target query to retrieve documents is described as follows. Suppose $Q$ is a source query with $n$ terms $\{s_1, s_2, \ldots, s_n\}$.
1. Make all possible combinations between terms of one source query: $(s_1, s_2), (s_1, s_3), \ldots, (s_{n-1}, s_n)$.
2. Rank all combinations by their co-occurrence tendencies, which are based on the Log-Likelihood Ratio (LLR) [2], as follows:
$$-2\log\lambda = K_{11}\log\frac{K_{11}N}{C_1R_1} + K_{12}\log\frac{K_{12}N}{C_1R_2} + K_{21}\log\frac{K_{21}N}{C_2R_1} + K_{22}\log\frac{K_{22}N}{C_2R_2} \tag{1}$$
where $N = K_{11} + K_{12} + K_{21} + K_{22}$, $K_{11} = f(w_1, w_2)$, $K_{12} = f(w_1) - K_{11}$, $K_{21} = f(w_2) - K_{11}$, $K_{22} = N - K_{11} - K_{12} - K_{21}$, $C_1 = K_{11} + K_{12}$, $C_2 = K_{21} + K_{22}$, $R_1 = K_{11} + K_{21}$, $R_2 = K_{12} + K_{22}$, $f(w_1, w_2)$ is the number of times both $w_1$ and $w_2$ occur within a window of a given size in a corpus, and $f(w)$ is the number of times word $w$ occurs in the corpus.
3. Select the combination $(s_i, s_j)$ with the highest co-occurrence tendency for which the translation of at least one of the source terms has not been fixed yet.
4. Retrieve all translations related to this combination from the bilingual dictionary.
5. Apply a two-term disambiguation process to all possible translation candidates.
6. Fix the best target translations for this combination and discard the other translation candidates.
7. Go to the combination with the next highest co-occurrence tendency, and repeat steps 3 to 7 until the translation of every source query term is fixed.
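Step 2 above can be sketched as follows. The function follows the contingency-table formula as stated in the text, treating $0 \log 0$ as 0; Dunning's statistic is usually written with an additional overall factor of 2, which does not change the ranking. The function names, the total count `n` passed as a parameter, and all frequencies in the example are hypothetical.

```python
import math

def llr(f12, f1, f2, n):
    """Co-occurrence tendency of two words from their corpus counts.

    f12: number of windows in which both words occur
    f1, f2: number of occurrences of each word
    n: total number of observations (assumed to be known)
    """
    k11 = f12
    k12 = f1 - k11
    k21 = f2 - k11
    k22 = n - k11 - k12 - k21
    c1, c2 = k11 + k12, k21 + k22
    r1, r2 = k11 + k21, k12 + k22
    total = 0.0
    for k, c, r in ((k11, c1, r1), (k12, c1, r2), (k21, c2, r1), (k22, c2, r2)):
        if k > 0:  # a zero cell contributes nothing
            total += k * math.log(k * n / (c * r))
    return total

def rank_pairs(pair_counts, term_counts, n):
    """Sort term pairs by decreasing co-occurrence tendency."""
    scored = [(pair, llr(f12, term_counts[pair[0]], term_counts[pair[1]], n))
              for pair, f12 in pair_counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical counts for three source query terms.
term_counts = {"energie": 100, "nucleaire": 100, "politique": 100}
pair_counts = {
    ("energie", "nucleaire"): 90,
    ("energie", "politique"): 10,
    ("nucleaire", "politique"): 12,
}
ranking = rank_pairs(pair_counts, term_counts, 1000)
```

A pair whose joint count equals what independence would predict (10 out of 1000 here) scores zero, so it is processed last by the disambiguation loop.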
An overview of the proposed information retrieval system is shown in Fig.1. Query expansion is completed through a monolingual thesaurus, a relevance feedback (interactive or automatic) or a domain-based feedback.
Fig. 1. Overview of Cross-Language Information Retrieval system
3 Query Expansion in CLIR
Query expansion has proved effective in improving the performance of information retrieval [1], [3]. We use an approach of combined automatic query expansion before and after translation, with expansion terms extracted through the following techniques: relevance feedback, with the selection of the best terms; domain-based feedback, with the extraction of domain keywords to add to the original query; and thesaurus-based expansion, with the retrieval of synonyms and multiple word senses from a monolingual thesaurus.
3.1 Pseudo Relevance Feedback
A fixed number of term concepts will be extracted from the top retrieved documents, and their co-occurrence in conjunction with the original query terms will be computed. However, any query expansion must be handled very carefully, since selecting an arbitrary expansion term could be dangerous. Therefore, our selection is based on statistical co-occurrence frequency in conjunction with all terms in the original query, rather than with just one query term.
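The selection rule described above, in which a candidate is scored against all query terms rather than a single one, can be sketched as follows. The function names and the co-occurrence values are hypothetical; any co-occurrence estimate, such as a log-likelihood ratio, could be plugged in.

```python
def rank_expansion_candidates(query_terms, candidates, cooccurrence):
    """Score each candidate by its summed co-occurrence with every query term."""
    scores = {}
    for candidate in candidates:
        scores[candidate] = sum(cooccurrence(term, candidate)
                                for term in query_terms)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical co-occurrence table: (query term, candidate) -> tendency.
TABLE = {
    ("nuclear", "reactor"): 8.0, ("energy", "reactor"): 6.0,
    ("nuclear", "weapon"): 9.0, ("energy", "weapon"): 0.5,
    ("nuclear", "solar"): 0.2, ("energy", "solar"): 7.0,
}

def lookup(term, candidate):
    return TABLE.get((term, candidate), 0.0)

ranking = rank_expansion_candidates(["nuclear", "energy"],
                                    ["reactor", "weapon", "solar"], lookup)
best_term = ranking[0][0]
```

In this toy table "weapon" has the strongest link to a single query term, but "reactor" is selected because it co-occurs with both terms, which is the behaviour the summed score is meant to encourage.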
Assume that we have a query $Q$ with $n$ terms $\{term_1, \ldots, term_n\}$. A ranking factor based on the co-occurrence frequency between each term in the query and the expansion term candidate is evaluated as
$$\mathrm{Rank}(expterm) = \sum_{j=1}^{n} \text{co-occurrence}(term_j, expterm) \tag{2}$$
where co-occurrence$(term_j, expterm)$ represents the co-occurrence tendency between a query term $term_j$ and an expansion term candidate $expterm$, and can be evaluated by any estimate such as a log-likelihood ratio. Thus, all co-occurrence values are computed and summed over all query terms, and the expansion candidate with the highest rank is selected as an expansion term for the concerned query.
3.2 Domain-Based Feedback
We introduce domain-based feedback [6] as a query reformulation strategy. Domain-based feedback extracts domain keywords from a set of top retrieved documents, using standard relevance feedback, in order to expand the original query. Web directories, such as Yahoo!¹ or AltaVista², are human-constructed and designed for human web browsing. They provide a hierarchical category scheme, and documents are sorted into the given scheme. Our strategy relies on term extraction using classical relevance feedback, with the condition that these terms represent a directory or category, which is denoted by a keyword describing its content and is thus considered a specific domain for a collection of documents. The process is as follows:

- Extract some terms, or seed words, using relevance feedback together with a ranking strategy to select the expansion terms, as explained in the previous section. This set is denoted by set_1.
- Collect domain keyword candidates from the categories and directories of hierarchical web directories such as Yahoo!¹, AltaVista² or Open Directory³. This set is denoted by set_2.
- Select a fixed number of domain keywords, i.e. seed words from set_1 that are also candidates in set_2. If the intersection contains a large number of terms, a statistical process is applied to rank the resulting domain keywords and select the best ones.

The resulting set of terms is used for domain-based feedback, which may involve many expansion terms or just a subset of them.

3.3 Thesaurus-Based Expansion
This approach expands a query with a fixed number of relevant terms from a structure derived from a lexical database. WordNet [9] for English queries and EuroWordNet [10], [11] for French queries can be seen as powerful tools to study lexical semantic resources and their language-specificity [8], [11]. Our suggestion is that words with semantic relations to a query term can be used as expansion term candidates for the original query. Following the research reported by Voorhees [9] on the use of WordNet lexical relations for query expansion, we first proceed by a simple look-up to find the synsets of the full query. If the full query does not exist in the lexical database, we proceed by a term-by-term search. Also, a simple one-term query can be represented by a compound synonym. In this case, we make a conjunction between the simple terms of the synonym concerned. For example, the simple term "war" is expanded by the compound synonym "military action" and replaced by "war military action". Moreover, a statistical frequency might be used for ranking and selection, to avoid words that do not occur frequently with the original terms, such as "reckoner", which is removed from the synset list of "computer".

An appropriate weighting scheme allows a smooth integration of these related terms by reducing their influence on the query [4]. All terms recovered from the thesaurus are therefore given weights expressing their similarity to the original query terms, based on their position in the conceptual hierarchy (depth = 1) as well as on the number of terms accompanying them in the same synset. Some strategies have been proposed [4], [9] for sense disambiguation and weight assignment to synonyms and other terms in a thesaurus. In this study, the weight assigned to any synonym of a synset is based on an envelope of 0.5 divided by the number of terms in the corresponding synset, so that it is proportional within the same synset. Following these assumptions, the query "computer" expanded with synonyms would contain: {computer, data processor, electronic computer, information processing system, calculator, figurer, estimator}.

¹ http://www.yahoo.com/
² http://www.altavista.com/
³ http://dmoz.org/
The proposed weighting factor for expansion terms retrieved from synsets of a conceptual hierarchy in the WordNet thesaurus is described as follows:

$$\mathrm{Weight}(term, exp_j) = \frac{0.5 \times \mathrm{Sim}(term, exp_j)}{M} \tag{3}$$
where M is the number of terms that belong to the same synset, and Sim(term, exp_j) is the similarity between the term and an expansion candidate exp_j. It can be estimated by any similarity measure, such as the cosine measure [7]:

$$\mathrm{Sim}(term, exp_j) = \frac{\sum_i v_{si}\, v_{ti}}{\sqrt{\sum_i v_{si}^2 \, \sum_i v_{ti}^2}} \tag{4}$$
where v_si and v_ti are the frequencies of term and exp_j in a corpus, respectively. However, expanding a query with any of these weighted synonyms requires careful selection and ranking, based on the statistically highest-weighted terms in conjunction with all query terms, not just a single query term. For a query Q with k terms {term_1, term_2, ..., term_k}, weight factors are computed for an expansion term candidate and summed over all query terms (1 ≤ i ≤ k) if the expansion term appears in the related hierarchy. The highest-weighted term is then selected for query expansion:

$$\mathrm{Weight}(query, expterm) = \sum_{i=1}^{k} \mathrm{Weight}(term_i, expterm) \tag{5}$$
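The selection step behind Eq. (5), which has the same shape as the ranking of Eq. (2), can be sketched in C. This is only an illustration under stated assumptions: the per-term weights are taken as already computed and stored in a row-major k × c matrix, a weight of 0.0 stands for "candidate not in the hierarchy of this term", and the function name `select_expansion_term` is hypothetical rather than part of the authors' system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper).
 * weights is a row-major k x c matrix: weights[i * c + e] holds the
 * weight of candidate e with respect to query term i, or 0.0 when the
 * candidate does not appear in the hierarchy of that term.
 * Returns the index of the candidate with the largest summed weight and,
 * if best_score is not NULL, stores that sum there. */
size_t select_expansion_term(const double *weights, size_t k, size_t c,
                             double *best_score)
{
    assert(weights != NULL && k > 0 && c > 0);

    size_t best = 0;
    double best_sum = 0.0;
    for (size_t e = 0; e < c; ++e) {
        double sum = 0.0;
        for (size_t i = 0; i < k; ++i)      /* sum over all query terms */
            sum += weights[i * c + e];
        if (e == 0 || sum > best_sum) {     /* ties keep the first maximum */
            best = e;
            best_sum = sum;
        }
    }
    if (best_score != NULL)
        *best_score = best_sum;
    return best;
}
```

Summing before selecting means that a candidate weakly related to every query term can outrank one strongly related to a single term, which matches the stated aim of ranking "in conjunction with all query terms".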
3.4 Combining Different Approaches
Following the research reported by Ballesteros and Croft [1] on the use of local feedback, adding terms that emphasize query concepts in the pre- and post-translation phases improves precision and recall. This combined method is expected to reduce ambiguity by de-emphasizing irrelevant terms added by translation. The new query Q_new can be defined as follows:

$$Q_{new} = Q_{orig} + a_1 \sum_{bef} T_i + a_2 \sum_{aft} T_j \tag{6}$$

where Q_orig is the original query, and $\sum_{bef} T_i$ and $\sum_{aft} T_j$ represent the sets of terms added before and after translation/disambiguation, respectively. The two parameters a_1 and a_2, which represent the importance of each expansion strategy, are currently set manually through experiments, but could be estimated using an Expectation-Maximization algorithm.
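Under a vector space model, Eq. (6) amounts to a weighted sum of term vectors. The sketch below assumes that the original query and both sets of expansion terms are already expressed as weight vectors over one shared target-language vocabulary of size n; that representation and the name `combine_query` are assumptions made for illustration and are not given in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: q_new[t] = q_orig[t] + a1 * bef[t] + a2 * aft[t].
 * bef and aft hold the accumulated weights of the terms added before
 * and after translation/disambiguation, respectively. */
void combine_query(const double *q_orig, const double *bef, const double *aft,
                   double a1, double a2, double *q_new, size_t n)
{
    assert(q_orig != NULL && bef != NULL && aft != NULL && q_new != NULL);
    for (size_t t = 0; t < n; ++t)
        q_new[t] = q_orig[t] + a1 * bef[t] + a2 * aft[t];
}
```

Keeping a1 and a2 below 1 leaves the original query terms dominant, which is one simple way to limit the influence of expansion terms.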
4 Experiments and Evaluation
Experiments to evaluate the effectiveness of the proposed strategies for query translation, disambiguation and expansion were carried out on French-English information retrieval, i.e. French queries retrieving English documents. The linguistic resources used in this research are as follows:

Test Data. Tipster volume 1 of the TREC⁴ test collection was used for cross-language evaluation. Topics composed of title and description fields were considered in the experiments. Key terms contained in these fields, averaging 5.7 terms per query, were used to generate French and English source queries.

Monolingual Corpus. The Canadian Hansard corpus (parliamentary debates) is a bilingual French-English parallel corpus. It contains more than 100 million words of English text and the corresponding French translations. In this study, we used Hansard as a monolingual corpus for both French and English.

Bilingual Dictionary. The COLLINS⁵ Series 100 French-English dictionary was used to translate source queries. It includes 75,000 references and 110,000 translations, which appears sufficient for this research.

Thesauri. WordNet [9] and EuroWordNet [11] are used for thesaurus-based expansion and for possible compensation where the bilingual dictionary is limited.

Stemmer and Stop Words. Stemming was performed by the Porter⁶ stemmer.

Retrieval System. SMART⁷, an information retrieval system based on the vector space model that has been used in many CLIR studies, was used to retrieve English documents.

⁴ http://trec.nist.gov/data.html

4.1 Experiments and Results
Retrieval with the original English queries is denoted by Mono_Eng. No_DIS is the result of translation without disambiguation, i.e. selecting the first translation as the target for each source query term. N_DIS refers to the translation and disambiguation method (trans_disambiguation). The expansion methods are denoted as follows. Feed.bef and Feed.aft denote relevance feedback before and after trans_disambiguation, respectively, and Feed.bef_aft denotes relevance feedback both before and after trans_disambiguation. Domain-based feedback after trans_disambiguation was evaluated as Feed.dom, and its combination with relevance feedback was tested as Feed.bef_dom. WordNet-based expansion using synsets of the target translations was evaluated as Feed_wn, and EuroWordNet-based expansion on synsets of the source queries as Feed.ewn. Thesaurus-based expansion combined with relevance feedback is represented by Feed.bef_wn and Feed.ewn_aft, and combined with domain-based feedback by Feed.dom_wn and Feed.ewn_dom. Combined thesaurus-based expansion was tested as Feed.ewn_wn. Finally, Feed.ewn_wn_dom represents the combined thesauri with domain-based expansion. Table 1 reports the results of these methods, with average precision and the percentage relative to the monolingual counterpart.

4.2 Discussion
The disambiguation method N_DIS improved average precision to 101.94% of monolingual retrieval, outperforming both No_DIS and the monolingual run itself. Feed.aft reached 101.33% of the monolingual counterpart. The combined relevance feedback before and after trans_disambiguation, Feed.bef_aft, showed a better result, with 102.89% of the monolingual counterpart. This suggests that combined query expansion before and after translation and disambiguation improves retrieval effectiveness. Domain-based feedback showed a drop in average precision compared with the previous methods. However, combined with relevance feedback before and after trans_disambiguation, it obtained a higher result of 103.69% in terms of average precision.

⁵ Collins Series 100 Bilingual Dictionary
⁶ http://www.tartarus.org/~martin/PorterStemmer/
⁷ ftp://ftp.cs.cornell.edu/pub/smart

Table 1. Best results and evaluations for different combinations of query translation/disambiguation and expansion: relevance feedback, domain-based feedback and thesaurus-based expansion

Method            Avg. Prec.   % Mono
Mono_Eng          0.2628       100
No_DIS            0.2214       84.24
N_DIS             0.2679       101.94
Feed.aft          0.2663       101.33
Feed.bef_aft      0.2704       102.89
Feed.bef_dom      0.2725       103.69
Feed_wn           0.2518       95.81
Feed.ewn          0.2579       98.13
Feed.bef_wn       0.2571       97.83
Feed.ewn_aft      0.2588       98.47
Feed.dom_wn       0.2540       96.65
Feed.ewn_dom      0.2545       96.84
Feed.ewn_wn       0.2608       99.23
Feed.ewn_wn_dom   0.2741       104.29
Thesaurus-based expansion with WordNet (Feed_wn) or EuroWordNet (Feed.ewn), as well as its combination with relevance feedback (Feed.bef_wn, Feed.ewn_aft) or domain-based feedback (Feed.dom_wn, Feed.ewn_dom), showed drops in average precision. On the other hand, the combined thesaurus-based expansion Feed.ewn_wn showed a better result, but again a drop in average precision relative to the monolingual counterpart (99.23%). The best result was achieved by combining thesaurus-based expansion with domain-based feedback (Feed.ewn_wn_dom), with 104.29% of the monolingual counterpart in terms of average precision. This suggests that adding domain keywords to general-purpose thesauri improves retrieval effectiveness. The key techniques used in this successful method can be summarized as follows:

- A statistical disambiguation method based on co-occurrence tendency is crucial to avoid wrong sense disambiguation and to select the best target translations.
- Adding domain keywords to the original query and then selecting thesaurus word senses, to avoid wrong sense disambiguation, is an effective approach to retrieval.
- Each type of query expansion has different characteristics. Their combination can therefore provide a valuable resource for query expansion, and it showed the greatest improvement in terms of average precision.
5 Conclusions and Future Works
Linguistic resources are readily available to achieve an efficient and effective query translation method in Cross-Language Information Retrieval. What we proposed and evaluated in this paper can be summarized as follows. First, query disambiguation is a valuable resource for query translation. Second, combined query expansion techniques before and after translation and disambiguation, through relevance feedback or domain-based feedback, have shown their effectiveness compared with monolingual retrieval, simple word-by-word dictionary translation, and translation with disambiguation. Third, thesaurus-based expansion with synonyms combined with domain-based feedback showed the greatest improvement in retrieval.

Our ongoing work involves a deeper investigation of the different relations in the WordNet and EuroWordNet thesauri besides synonymy, and the use of multiple word senses for query expansion. An approach that learns from document categorization or classification, not necessarily of web documents, to extract relevant keywords for query expansion is among our future research topics. Finally, our main interest is to find more effective solutions to meet the needs of information retrieval across languages.
References

1. Ballesteros, L. and Croft, W. B.: Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval. In Proceedings of the 20th ACM SIGIR Conference (1997) 84–91.
2. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, vol. 19, no. 1 (1993) 61–74.
3. Loupy, C., Bellot, P., El-Beze, M. and Marteau, P.-F.: Query Expansion and Classification of Retrieved Documents. In Proceedings of TREC-7. NIST Special Publication (1998).
4. Richardson, R., Smeaton, A.F.: Using WordNet in a Knowledge-based Approach to Information Retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe (1995).
5. Sadat, F., Maeda, A., Yoshikawa, M. and Uemura, S.: Statistical Query Disambiguation, Translation and Expansion in Cross-Language Information Retrieval. In Proceedings of the LREC 2002 Workshop on Using Semantics for Information Retrieval and Filtering: State of the Art and Future Research, Las Palmas, Spain, May-June (2002).
6. Sadat, F., Maeda, A., Yoshikawa, M. and Uemura, S.: Query Expansion Techniques for the CLEF Bilingual Track. In Proceedings of the CLEF 2001 Cross-Language System Evaluation Campaign (2001) 99–104.
7. Salton, G. and McGill, M.: Introduction to Modern Information Retrieval. New York: McGraw-Hill (1983).
8. Yamabana, K., Muraki, K., Doi, S. and Kamei, S.: A Language Conversion Front-End for Cross-Linguistic Information Retrieval. In Proceedings of the SIGIR Workshop on Cross-Linguistic Information Retrieval, Zurich, Switzerland (1996).
9. Voorhees, E. M.: Query Expansion using Lexical-Semantic Relations. In Proceedings of the 17th ACM SIGIR Conference (1994) 61–69.
10. Vossen, P.: EuroWordNet, A Multilingual Database for Information Retrieval. In Proceedings of the DELOS Workshop on Cross-language Information Retrieval, Zurich (1997).
11. Vossen, P.: EuroWordNet, A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers (1998).
Extracting Shape Features in JPEG-2000 Compressed Images

Jianmin Jiang¹, Baofeng Guo², and Pengjie Li³

¹ Department of EIMC, School of Informatics, University of Bradford, Bradford, United Kingdom
[email protected]
² Department of Computer Science, University of Bristol, Bristol, United Kingdom
[email protected]
³ Signal Processing Lab, Institute of Acoustics, Chinese Academy of Sciences, China
[email protected]
Abstract. As the latest effort by JPEG in the international standardization of still image compression, JPEG-2000 contains a range of important functionalities superior to its earlier DCT-based versions. In the expectation that this compression standard will become an important digital format for many images and photographs, we present our recent work on image indexing and retrieval directly in the wavelet domain, which is suitable for retrieving JPEG-2000 compressed images without full decompression. Our methods extract shape and texture features from the significant wavelet coefficients and transform their energy into histogram-based indexing keys for compressed image retrieval. Besides the advantage that full decompression can be avoided, the experiments also indicate that the retrieval accuracy is better than that of existing counterparts.
1 Introduction
Since the launch of the DCT-based JPEG image compression standard in the early nineties, it has proved a great success in providing tools for digital imaging, digital photography, multimedia, and computer vision. As digital technology advances, modern digital imagery is becoming more and more demanding, not only in quality but also in image size, which makes DCT-based JPEG out of date. As a result, a new standard based on the wavelet transform emerged in 2000. It provides a range of features that are important to many high-end and emerging applications by taking advantage of new technologies developed over the past decade. Compared with earlier versions of the JPEG standard, JPEG-2000 offers: (a) superior performance in low bit-rate compression; (b) continuous-tone and bi-level (black and white) compression; (c) integration of lossless and lossy compression; (d) progressive transmission by both pixel precision and resolution; (e) region-of-interest coding; (f) open architecture; (g) robustness to bit errors; and (h) potential for protective image security via digital watermarking, encryption, signatures, etc. Its main operational structure follows the same principle as that of DCT-based JPEG: a digital transform, quantization and entropy coding. However, to achieve the superior performance outlined above, each part of the operation needs completely different methodologies and approaches. These can be briefly summarized as follows:

- To achieve region-of-interest compression, the input image is divided into components (such as the color components R, G, B), and each component is divided into tiles. While the tile size can be chosen freely, all tiles must have the same size, and each tile is independently compressed via wavelet transform, quantization of coefficients and entropy coding.
- To improve compression efficiency, a pre-processing of pixel values is introduced, which includes: (a) DC level shifting by subtracting a fixed constant value from all pixels; (b) component transformation via two types of transformation matrices, one for lossy compression and the other for lossless compression.
- To enable progressive coding, scalability, and high quality at low bit rates, the discrete wavelet transform is adopted to transform the spatial domain into subbands.
- To remove redundancy and achieve a high compression ratio, a dynamic range of quantization steps is used for the quantization of the wavelet coefficients. These include both uniform scalar quantization and trellis coded quantization [3,4].
- Entropy coding is implemented based on arithmetic coding. While the arithmetic coder itself leaves little room for research, how to provide the best possible context to drive the coder remains an important issue. In JPEG-2000, each subband is divided into non-overlapping rectangles and each rectangle is divided into non-overlapping code blocks (no less than 32x32). To accommodate the need for random content access and error-resilient coding, each code block is encoded independently, without reference to any other block or any other subband. For compatibility with earlier standards such as JBIG, each code block is encoded one bit plane at a time, from the most significant (non-zero) bit plane to the least significant bit plane. Similar to the zero-tree technique [5], each bit is encoded by one of three passes (significance propagation, magnitude refinement, and cleanup), inside each of which a context is produced to drive the arithmetic coder.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 123–132, 2002. © Springer-Verlag Berlin Heidelberg 2002
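The pre-processing step in the list above can be illustrated for the lossless path. The paper only mentions "a fixed constant value" and "two types of transformation matrices"; the sketch below fills these in with the offset 2^(depth-1) and the integer reversible component transform as they are commonly described for JPEG-2000, which is an assumption on our part. The function names are hypothetical.

```c
#include <assert.h>

/* Floor division by 4 that is also correct for negative values
 * (plain C integer division truncates toward zero). */
static int floor_div4(int v)
{
    return (v >= 0) ? v / 4 : -((-v + 3) / 4);
}

/* DC level shift of an unsigned sample with the given bit depth:
 * subtract 2^(depth-1), e.g. 128 for 8-bit data (assumed constant). */
int dc_level_shift(int sample, int depth)
{
    assert(depth >= 1 && depth <= 16);
    return sample - (1 << (depth - 1));
}

/* Forward reversible component transform (integer, lossless path). */
void rct_forward(int r, int g, int b, int *y, int *u, int *v)
{
    *y = floor_div4(r + 2 * g + b);
    *u = b - g;
    *v = r - g;
}

/* Inverse transform: recovers r, g and b exactly. */
void rct_inverse(int y, int u, int v, int *r, int *g, int *b)
{
    *g = y - floor_div4(u + v);
    *r = v + *g;
    *b = u + *g;
}
```

The inverse is exact because r + 2g + b = 4g + (u + v), so the floor term computed by the decoder equals the one the encoder absorbed into y.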
Together with other digital compression standards, JPEG-2000 is expected to be one of the popular compressed image formats in the coming years. While compression techniques make image data manageable, content access to large numbers of such compressed images becomes a new challenge to the communities of information systems, multimedia and computer science. To this end, research on content-based image indexing and retrieval has started to prosper, especially on indexing techniques in the compressed domain [1–2, 6]. To extract content information or indexing keys that can characterize the content of JPEG-2000 compressed images, the first step is to perform entropy decoding to reach the significant coefficients inside each subband. Since the entropy decoding designed in JPEG-2000 works in bit planes, it is well suited to extracting content information progressively, and the extraction can be stopped at any point without touching the remaining bit stream of the compressed image. The remainder of the paper is organized as follows. Section 2 describes shape extraction in the wavelet domain, Section 3 reports our experimental results, and Section 4 makes concluding remarks.
2 Shape Extraction in Wavelets Domain
Since the core processing component in JPEG-2000 is the wavelet transform, our proposed shape extraction can focus on processing the decoded wavelet coefficients along the line of decompression. As each bit-plane entropy decoding scan is performed, isolated significant wavelet coefficients start to emerge. By recording the addresses of these emerging significant coefficients, a so-called significance map can be constructed [6], which is essentially a bit-level image block with 1 representing a significant coefficient and 0 a non-significant coefficient. Inspired by shape description in the pixel domain, moments can be calculated inside the significance map to characterize the shape information. Specifically, given a significance map f(i, j), its moments of various orders can be defined as:

$$m_{pq} = \sum_i \sum_j i^p j^q f(i,j), \qquad p, q = 0, 1, 2, \ldots \tag{1}$$

and the centre of gravity of the shaped region can be given as:

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}} \tag{2}$$
Further, the central moments are defined as:

$$\mu_{pq} = \sum_i \sum_j (i - \bar{x})^p (j - \bar{y})^q f(i,j), \qquad p, q = 0, 1, 2, \ldots \tag{3}$$
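Equations (1) to (3) translate almost directly into code. The following is a minimal sketch, assuming the significance map is stored row-major with one byte per coefficient (non-zero meaning significant); the storage layout and the function names `raw_moment` and `central_moment` are illustrative choices, not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Integer power x^e for small non-negative exponents. */
static double ipow(double x, unsigned e)
{
    double r = 1.0;
    while (e-- > 0)
        r *= x;
    return r;
}

/* Raw moment m_pq of a rows x cols binary map, Eq. (1). */
double raw_moment(const unsigned char *map, size_t rows, size_t cols,
                  unsigned p, unsigned q)
{
    assert(map != NULL);
    double m = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            if (map[i * cols + j])          /* f(i, j) = 1 */
                m += ipow((double)i, p) * ipow((double)j, q);
    return m;
}

/* Central moment mu_pq about the centre of gravity, Eqs. (2) and (3). */
double central_moment(const unsigned char *map, size_t rows, size_t cols,
                      unsigned p, unsigned q)
{
    double m00 = raw_moment(map, rows, cols, 0, 0);
    assert(m00 > 0.0);                      /* need at least one significant point */
    double xbar = raw_moment(map, rows, cols, 1, 0) / m00;
    double ybar = raw_moment(map, rows, cols, 0, 1) / m00;

    double mu = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            if (map[i * cols + j])
                mu += ipow((double)i - xbar, p) * ipow((double)j - ybar, q);
    return mu;
}
```

For a map that is symmetric about its centre of gravity, the odd-order central moments vanish, which gives a quick sanity check on an implementation.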
To characterize the shape information, 9 moments can be calculated: $\bar{x}$, $\bar{y}$, $\mu_{20}$, $\mu_{11}$, $\mu_{02}$, $\mu_{30}$, $\mu_{21}$, $\mu_{12}$ and $\mu_{03}$. As each level of wavelet decomposition contains three detail subbands, a three-level decomposition would require a total of 81 moments [6]. The three subbands LH, HL and HH essentially represent horizontal, vertical and diagonal decomposition. It is therefore possible to combine all the directional wavelet coefficients into a single subband by using wavelet-based edge detection techniques, in which the wavelet-based edge is obtained from the modulus of the horizontal and vertical coefficients [9]. Correspondingly, the edge points can be obtained by searching for the local maxima in these wavelet coefficients. As a result, the original 9 subbands (excluding the LL subband) are combined into 3 subbands, and a total of 27 moments are needed for the extraction of shapes, which represents a histogram with 27 elements or float numbers.

To accommodate the need for region-of-interest coding, JPEG-2000 applies tiling to divide the image into rectangular tiles of the same size. In order to connect the coefficients from different tiles into meaningful regions inside the original image, we apply several simple but effective morphological operators [9] before the moments are constructed as shape descriptors [6]. To generate a continuous and closed object contour, closing operators [9] are applied to the edge-oriented map extracted from the wavelet coefficients. The closing operator is performed simply by a dilation followed by an erosion [9], which proves to be effective and efficient for simple objects. Its computation is fast, since only a few simple steps are required. After the closing operation, the originally isolated edge points are linked together to form the contour curves of objects. To finalize the object region, a flood-fill algorithm [10] is further adopted in our scheme, based on the closed edges.
Flood-fill is a computer graphics algorithm for binary maps that is widely used for area filling. It can be briefly outlined by the following pseudo code in C:

    void floodFill4(int x, int y, int fill, int background)
    {
        /* Recurse only while the current pixel is still background. */
        if (getPixel(x, y) == background) {
            setObject(fill);    /* select the label to write */
            setPixel(x, y);     /* mark this pixel with that label */
            floodFill4(x + 1, y, fill, background);
            floodFill4(x - 1, y, fill, background);
            floodFill4(x, y + 1, fill, background);
            floodFill4(x, y - 1, fill, background);
        }
    }

As a result, the task of extending isolated significant points to meaningful regions is completed. The whole process is illustrated in Figure 1, which also shows the improvement obtained via the morphological operations. The original image is given in part (a), and the edge map extracted from the wavelet coefficients is shown in part (b). The images resulting from the closing and filling operations are shown in parts (c) and (d), respectively.
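The closing step that precedes the flood fill (a dilation followed by an erosion) can be sketched as follows. The paper does not state the structuring element or the border handling, so the 3x3 square element and the treatment of out-of-map pixels as background are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One pass of binary dilation (dilate != 0) or erosion (dilate == 0)
 * with a 3x3 square structuring element. Pixels outside the map count
 * as background, so objects touching the border may shrink under
 * erosion; pad the map first if that matters. */
static void morph3x3(const unsigned char *src, unsigned char *dst,
                     int rows, int cols, int dilate)
{
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            int any = 0, all = 1;
            for (int di = -1; di <= 1; ++di) {
                for (int dj = -1; dj <= 1; ++dj) {
                    int y = i + di, x = j + dj;
                    int inside = (y >= 0 && y < rows && x >= 0 && x < cols);
                    int set = inside && src[y * cols + x] != 0;
                    any |= set;     /* dilation: some neighbour is set */
                    all &= set;     /* erosion: every neighbour is set */
                }
            }
            dst[i * cols + j] = (unsigned char)(dilate ? any : all);
        }
    }
}

/* Closing = dilation followed by erosion. tmp is caller-provided
 * scratch space of rows * cols bytes; src, dst and tmp must not alias. */
void close3x3(const unsigned char *src, unsigned char *dst,
              unsigned char *tmp, int rows, int cols)
{
    assert(src != NULL && dst != NULL && tmp != NULL);
    assert(rows > 0 && cols > 0);
    morph3x3(src, tmp, rows, cols, 1);
    morph3x3(tmp, dst, rows, cols, 0);
}
```

With this element, two edge points separated by a one-pixel gap end up connected, which is the linking effect the text relies on before filling.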
(a) (b) (c) (d)

Fig. 1. Significant points and mathematical morphological operations
3 Experimental Results and Assessment
To evaluate the performance of our proposed shape extraction scheme in content-based image retrieval, we designed a series of experiments in comparison with an existing technique implemented as a benchmark. To ensure a fair comparison, the nearest benchmark we could find is the wavelet-based shape extraction given in reference [6], which is also the inspiration for our work. In the WaveGuide technique [6], Liang and Kuo proposed a number of indexing key construction algorithms based on wavelet transform coefficients, including a texture descriptor, a color descriptor and a shape descriptor. Existing shape feature extraction in the pixel domain requires a binary shape image to be generated, on which shape analysis is carried out via: (a) moment-based approaches; (b) parametric curve distance measures; and (c) turning angle matching, etc. Since the embedded zero-tree compression technology [5] produces a significance map to exploit the correlation between subbands, and the significance map is essentially a binary record of the addresses of the significant coefficients, the moment-based approach becomes the natural choice to describe the shape feature via such a binary shape image (significance map). Based on the definitions given in (1)-(3), WaveGuide adopted nine moments (two means, three variances and four skewnesses) to describe the shape feature at each subband. As the wavelet transform is limited to three scales, there are altogether 81 elements in an indexing key, and all the elements are essentially float numbers. Compared with this, our proposed algorithm has only 27 elements per indexing key, a storage saving of more than 60%.
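The saving quoted above follows directly from the two key lengths, assuming every element is stored with the same number of bytes (an assumption; the paper only says that the elements are float numbers):

```latex
\[
  9 \times 9 = 81, \qquad 9 \times 3 = 27, \qquad
  1 - \frac{27}{81} = \frac{2}{3} \approx 66.7\%
\]
```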
Fig. 2. The 20 objects contained in the COIL-20 database, with complex geometric, appearance and reflectance characteristics
For the test database, we chose the Columbia Object Image Library (COIL-20) [11] to assess the retrieval accuracy of both our proposed scheme and the benchmark scheme (ftp site: zen.cs.columbia.edu). The images contained in this database have a wide variety of complex geometric and reflectance characteristics [11], which provides a useful test bed for evaluating shape descriptors for image retrieval. Figure 2 illustrates all the objects in the database. In addition, there are a large number of samples with different poses for every object. For example, the toy duck is rotated to produce samples with various poses, as shown in Figure 3. As a result, the COIL-20 database contains 1440 grey-scale images corresponding to 20 objects (72 images per object). Although the database is smaller than generally expected, the evaluation remains reliable compared with large databases of, say, more than 10000 images with different themes. This is because COIL-20 contains objects with a certain level of similarity. Even when it is mixed with other images of general themes, the difficulty of retrieving accurate images is still confined to the images inside COIL-20. As a matter of fact, we tried the toy duck object mixed with 10000 other JPEG compressed images, and the retrieval results were exactly the same. Figure 4 illustrates a few samples of the images mixed with COIL-20 to enlarge the test database.
Fig. 3. Ten samples for toy duck object in COIL-20 database
Fig. 4. Samples of general images with various themes
In our experiments, the database is first compressed using a wavelet-based JPEG-2000 codec [12], in which three-level pyramid decompositions are carried out and the tile size is set to the size of the input image. As a result, the wavelet coefficients consist of ten subbands. Moment-based shape feature vectors are computed via both the method described in WaveGuide [6] and the proposed method. We compare the performance of the two algorithms in terms of retrieval precision, which is derived by averaging the retrieval precision over all individual tests. This is designed to reflect the overall retrieval performance of the two algorithms. The retrieval precision is defined as follows:

$$P = \frac{num_c}{num_c + num_f} = \frac{num_c}{M} \tag{4}$$

where num_c, num_f and M are the numbers of correct retrievals, false retrievals, and first retrieval candidates with the smallest matching distance, respectively. Based on the above design, all retrieval results are summarised in Table 1 and Figure 5.
Table 1. Overall retrieval precision comparison (first matching number = 5)

Algorithm        Precision (percentage)
WaveGuide [6]    P = 27.40%
The proposed     P = 41.44%
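The precision of Eq. (4) for a single query can be sketched as below. The matching distance itself is not specified in this section, so the sketch takes precomputed distances as input; the relevance flags (whether a database image shows the same object as the query) and the function name `retrieval_precision` are likewise assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* dist[i]     : matching distance between the query key and image i
 * relevant[i] : non-zero if image i shows the same object as the query
 * Returns num_c / m for the m candidates with the smallest distance,
 * or -1.0 if scratch memory cannot be allocated. */
double retrieval_precision(const double *dist, const unsigned char *relevant,
                           size_t n, size_t m)
{
    assert(dist != NULL && relevant != NULL && m > 0 && m <= n);

    unsigned char *taken = calloc(n, 1);
    if (taken == NULL)
        return -1.0;

    size_t num_c = 0;
    for (size_t r = 0; r < m; ++r) {
        size_t best = n;                    /* n means "none chosen yet" */
        for (size_t i = 0; i < n; ++i)
            if (!taken[i] && (best == n || dist[i] < dist[best]))
                best = i;
        taken[best] = 1;                    /* next-closest candidate */
        if (relevant[best])
            ++num_c;
    }
    free(taken);
    return (double)num_c / (double)m;
}
```

Repeated selection of the minimum costs O(n·m), which is adequate for a small first matching number such as 5; sorting the distances once would be the natural alternative for larger m.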
From the results given in Table 1, it can be seen that the overall retrieval accuracy of the proposed scheme is significantly higher than that of WaveGuide [6]. As analyzed earlier, this indicates that the strategy of extending isolated significant points to meaningful regions improves the shape characterization capability of the moment-based descriptors. It should also be noted that the proposed algorithm achieves these results with only one third of the memory required by WaveGuide [6]. In other words, the indexing keys of the proposed algorithm contain 27 elements instead of 81.
4 Conclusions
In this paper, we described a scheme to extract shape features directly in the JPEG-2000 compressed domain for content-based image retrieval. While many unsolved issues remain in content-based image retrieval, our approach proves to be efficient and effective in comparison with existing efforts in this area. Since our proposed scheme focuses only on shape feature extraction, the test bed was carefully chosen for evaluating shape-based retrieval, and it has been adopted by many other researchers in the published literature. As JPEG-2000 provides a new framework and an integrated toolbox for a range of useful functionalities, its application and popularity are expected to grow in the next few years. The work described here should therefore serve as a good starting point for further research, exploration and development towards reliable, robust and efficient image indexing and retrieval.
Fig. 5. Experimental results summary
Finally, the authors wish to acknowledge the financial support from The Council for Museums, Archives & Libraries in the UK under the research grant LIC/RE/082 and the Chinese Natural Science Foundation.
References

1. Liu C. and Mandal M. 'Image indexing in the JPEG-2000 framework' Proceedings of SPIE: Internet Multimedia Management Systems, Vol 4210, Nov 5-8, 2000, pp 272–280.
2. Bhalod J., Fahmy G.F. and Panchanathan S. 'Region based indexing in the JPEG-2000 framework' Proceedings of SPIE: Internet Multimedia Management Systems II, Vol 4519, 2001, pp 91–96.
3. 'Progressive lossy to lossless core experiment with a region of interest: results with the S, S+P, two-ten integer wavelets and with the difference coding method', ISO/IEC JTC1/SC29/WG1 N741, March 1998.
4. 'Core experiment on improving the performance of the DCT: results with the visual quantization method, deblocking filter and pre/post processing', ISO/IEC JTC1/SC29/WG1 N742, March 1998.
5. J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Signal Processing, Vol. 41, pp. 3445–3462, Dec. 1993.
6. K. C. Liang and C.C. Kuo 'Waveguide: a joint wavelet-based image representation and description system' IEEE Trans. Image Processing, Vol 8, No 11, November 1999, pp 1619–1629.
7. S.D. Servetto et al. 'Image coding based on a morphological representation of wavelet data' IEEE Trans. Image Processing, Vol 8, No 9, 1999, pp 1161–1174.
8. Stephane Mallat "Wavelets for a Vision", Proceedings of the IEEE, Vol. 84, No. 4, April 1996, pp. 604–614.
9. J. Serra. Image Analysis and Mathematical Morphology. Academic Press, 1982.
10. Hearn and Baker, Computer Graphics – C Version, 2nd ed. (1997), ISBN: 0-13-530924-7.
11. Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase, "Columbia Object Image Library (COIL-20)," Technical Report No. CUCS-006-96, Department of Computer Science, Columbia University.
12. Skodras A. et al. 'The JPEG-2000 still image compression standard', IEEE Signal Processing Magazine, September 2001, pp 36–58.
Comparison of Normalization Techniques for Metasearch

Hayri Sever¹ and Mehmet R. Tolun²

¹ Computer Science Department, University of Massachusetts, Amherst, MA 01003, USA
² Department of Computer Engineering, Eastern Mediterranean University, Gazimagusa, TRNC, via Mersin 10, Turkey
Abstract. It is a well-known fact that combining the retrieval outputs of different search systems in response to a query, known as metasearch, improves performance on average, provided that the combined systems (1) have compatible outputs, (2) produce accurate probability of relevance estimates of documents, and (3) are independent of each other. The objective of a normalization technique is to target the first requirement, i.e., document scores of different retrieval outputs are brought into a common scale so that they are comparable across the combined retrieval outputs. This has been a recent subject of research in the metasearch and information filtering fields. In this paper, we present a different perspective on multiple evidence combination and investigate various normalization techniques, mostly ad hoc in nature, with a special focus on SUM, which shifts the minimum score to zero and then scales the sum of the scores to one. This formal approach is equivalent to normalizing the distribution of scores of all documents in a retrieval output by dividing them by their sample mean. We have carried out extensive experiments using the ad hoc tracks of the third and fifth TREC collections and the CLEF'00 database. We argue that (1) the normalization method SUM is consistently better than the other traditionally proposed ones when combining outputs of search systems operating on a single database; and (2) for combining outputs of search systems operating on mutually exclusive databases, SUM is still a valuable alternative to weighting the score distributions of documents by the size of their databases.
1 Introduction
Combination of multiple evidence of document relevance for effective retrieval has been the subject of much past and recent research [6,3,7].

This material is based on work supported in general by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor(s).

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 133–143, 2002. © Springer-Verlag Berlin Heidelberg 2002

The research in
this field has targeted either a single framework (also known as internal combination) or multiple frameworks (also known as external combination or metasearch). In a single framework combination, the various types of evidence available to a retrieval system are incorporated. The source of evidence can be multiple representations of a document [11,6,17], of an information need description [2,3,13], or both. A document can be, for example, indexed automatically or manually. These two types of representations do not have to be alternatives; they can complement one another, as proposed in [17]. In the past, many studies on multiple document representations based on some combination of different structures of a document (e.g., indexing title, abstract, body, and citations, use of controlled vocabulary, etc.) have focused on their relative retrieval effectiveness, though performance differences between individual representations were not remarkable. Katzer et al. [11] investigated document representations in terms of their pairwise overlaps and found that, on average, the overall degree of overlap was quite low among retrieved documents and higher among relevant retrieved ones. As we discuss later, these early findings indicate that different representations target different portions of the database (or document collection). This leads to the case where the evidence from multiple representations with regard to a query expression may be combined linearly¹ [6]. Continuing with the single framework approach to gathering multiple evidence on the relevance of a document, the result of a relevance feedback process can be regarded as a different representation of an information need description (or a query topic), though relevance feedback aims to obtain an optimal query [18].
In fact, Lee, in his study on query combination, used two vector modification methods based upon standard Rocchio and Ide, and three probabilistic feedback formulas, namely conventional probabilistic search weights and two adjusted derivations of them [14]. Belkin et al. [2,3] used the INQUERY retrieval system² over the TREC-1 and TREC-3 data sets to make unweighted and weighted combinations of query expressions formulated by experienced searchers with respect to a query topic. In a separate study reported in the same work [3], different query expressions using natural language vector queries and P-Norm clauses [19] with different weights were created as evidence of relevance. All of the cited studies on query combination consistently stated that more evidence improves retrieval effectiveness on average. Outputs of a number of single frameworks with respect to the same information need description may also be combined in order to provide greater retrieval effectiveness than any single framework involved in the combination, which is widely known as metasearch.

¹ Note that a hypertext page may convey more structural information, such as captions of links and metatags, for evidence of relevance or the type of relationship between pages [5].
² INQUERY was developed at the University of Massachusetts [4]. It is a Bayesian probabilistic inference system, based upon the idea of combining multiple sources of evidence in order to infer plausibly the relevance of a document to a query. For further information, please refer to the publication section of http://ciir.cs.umass.edu

This type of combination naturally becomes a multiple
framework operating on single, overlapping, or distributed database(s)³. The driving force behind this direction can be attributed to a widely held belief that the odds of a document being relevant to a given information need increase in proportion to the number of times it is retrieved by different search agents [20,2]. This is due to the observation, made by several separate studies, that different retrieval systems (or search engines) often retrieve different documents while achieving comparable performance scores⁴. In particular, with respect to a single description of an information need, retrieval systems retrieve similar relevant documents but different nonrelevant documents [14,24,6]. Even if this validates the process of combining search results, there is yet to be a clear description of how external combination of evidence should be handled for optimum effectiveness. There is, however, some evidence that simple combination strategies such as summing or (weighted) averaging might be adequate for information retrieval [22]. Croft, in his survey, stated that the combined retrieval systems should (1) have compatible outputs, (2) produce accurate probability of relevance estimates of documents, and (3) be independent of each other [6]. The objective of a normalization technique is to target the first requirement, i.e., document scores of different retrieval outputs are brought into a common scale so that they become comparable across the combined retrieval outputs. The second requirement means that each system should produce probabilities of relevance with low error. The last requirement can be attributed to the case where the combined systems use different document and information need representations and retrieval algorithms (or even, in general, different retrieval processes). The metasearch problem can be decomposed into a score normalization step followed by a combination step.
While the combination step has been studied by a number of researchers, there has been little work on the normalization techniques to be used. The normalization step can in fact be more critical for performance in many situations [16]. Existing normalization schemes [3,14,16] have been proposed on heuristic grounds and their performance seems to depend somewhat on the dataset used [16]. Thus, the appropriate choice of normalization is not clear. Recently, Manmatha and Sever have argued that the formal way to normalize scores for metasearch is by taking the distributions of the scores into account [15].
2 Method of Experiment
This section is organized around the definition of the problem, the description of the data, and a discussion of the model in parallel with prior work.

³ To be more specific, a multiple framework operating on a single database is called data fusion, and one operating on mutually exclusive distributed databases is called collection fusion.
⁴ This view is for search systems operating on the same database. It may be worth noting, however, that the rationale behind metasearch engines on the Internet has become somewhat different. It is based on the observation that there exists little overlap of web coverage between search engines [12].

2.1 Problem
In contrast to the score distribution model of Swets [21], a mix of two Gaussian distributions⁵ for relevant and nonrelevant documents, researchers have shown [1,24] that the score distributions for a given query may be modeled using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. The non-relevant distributions can be normalized by mapping the minimum score to zero and equalizing the means of the different exponential distributions. Since relevance information is not available in practice, the non-relevant distribution may be approximated by fitting an exponential distribution to the scores of all documents (rather than fitting only the non-relevant documents). This approximation is usually reasonable because the proportion of relevant documents in search engine outputs is usually small. In fact, averaging document distributions, i.e., equalization by mean, is equivalent to the normalization technique SUM proposed by [16]. SUM normalizes the scores of documents in a given retrieval output by shifting the minimum to zero and then scaling the sum of the scores to one (or unity). In this paper, we show that the normalization method SUM (1) is consistently better than the other traditionally proposed ones when combining outputs of search systems operating on a single database (or data fusion); and (2) is a contender for the external combination of the outputs of search systems operating on mutually exclusive databases (or collection fusion).
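The equivalence between equalization by mean and SUM can be sketched in a few lines. The notation here ($s_i$, $n$, $m$, $\hat{s}_i$, $\tilde{s}_i$) is introduced only for illustration and is not taken from [16]; the sketch assumes a single retrieval output with $n$ scored documents whose scores are not all equal.

```latex
% Scores of one retrieval output, shifted so that the minimum is zero:
%   s_1, ..., s_n,   m = \min_j s_j,   s'_i = s_i - m .
\begin{align*}
  \hat{s}_i   &= \frac{s'_i}{\sum_{j=1}^{n} s'_j}
      && \text{(SUM: the shifted scores sum to one)} \\
  \tilde{s}_i &= \frac{s'_i}{\tfrac{1}{n}\sum_{j=1}^{n} s'_j}
      && \text{(division by the sample mean of the shifted scores)} \\
  \tilde{s}_i &= n \, \hat{s}_i
      && \text{(the two differ only by the constant factor } n\text{)}
\end{align*}
% For an exponential density \lambda e^{-\lambda s}, the mean is 1/\lambda,
% so dividing by the sample mean rescales every output to unit mean.
```

Under these assumptions the two normalizations differ only by the list length $n$. When every combined output has the same length, the resulting CombSUM and CombMNZ rankings therefore coincide.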
2.2 Data
TREC⁶ (Text REtrieval Conference) is an annual conference sponsored by the National Institute of Standards and Technology of the USA, in which participants are given very large text corpora and submit the results of their retrieval techniques in a sort of contest. Specifically, in the ad hoc track, each participant submits the top 1000 documents returned by their system in response to 50 topics supplied by NIST, and each participant can submit up to 4 runs. The objective of TREC is to provide a very large test collection for several information retrieval tracks/tasks and to enforce the use of the same evaluation technique(s) for performance measurements. The TREC corpora have been widely used in experimental setups for metasearch research (including normalization and combination) targeting a single database. We also used TREC-3 (topics 151-200, 40 runs) and TREC-5 (topics 251-300, 61 runs) in our experimental configuration.

⁵ This type of model is still widely held by the topic detection and tracking (TDT) community [8]; see http://www.nist.gov/speech/tests/tdt/index.htm for publications and tutorials on TDT.
⁶ We refer the reader to http://trec.nist.gov for further information on TREC.

The objective of the Cross-Language Evaluation Forum (CLEF), the European counterpart of TREC, is to develop and maintain an infrastructure for the testing and evaluation of information retrieval systems operating on European languages, in both monolingual and cross-language contexts,
and to create test suites of reusable corpora that can be employed by system developers for benchmarking purposes (visit http://clef.iei.pi.cnr.it/ for the CLEF Home Page). In our experiments, we compared our collection fusion results with those of the Twenty-One (TNOUT) retrieval system [10], for which visit http://dis.tpd.tno.nl/. Specifically, we used the corpus provided among the TNOUT resources for post hoc CLEF'00 experiments, http://dis.tpd.tno.nl/clef2000.

2.3 Model
As mentioned earlier, in most of the work on metasearch, especially during the first half of the 1990s, the combination step has been the focus of study. Belkin et al., in their extensive study on metasearch, proposed the combination methods given in Table 1.

Table 1. Combining Methods Suggested by Belkin et al.

Name      Method
CombMIN   minimum of individual scores.
CombMAX   maximum of individual scores.
CombMED   median of individual scores.
CombSUM   summation of individual scores.
CombANZ   CombSUM ÷ number of nonzero scores.
CombMNZ   CombSUM × number of nonzero scores.
Of the six combination methods, CombMIN, CombMAX, and CombMED select a single score from the set of scores for a document, in contrast to CombSUM, CombANZ, and CombMNZ, which assign a new score computed from the individual scores of a document (with respect to the same information need, of course). The objectives behind these methods are as follows. CombMIN aims to minimize the probability that a nonrelevant document is highly ranked; CombMAX aims to minimize the number of relevant documents that are poorly ranked; and CombMED is a simple way of avoiding the conflicting objectives of CombMIN and CombMAX. CombSUM, CombANZ, and CombMNZ were all designed to exploit, to some extent, the assertion that the odds of a document being relevant to a query topic increase in proportion to the number of times it is retrieved by different retrieval systems. Instead of weighting retrieval systems equally, a linear weighting scheme may be imposed for each system on the evidence of relevance [9,24]; however, we did not pursue this type of method because it requires substantial training with enough relevance information for a large number of engines, and because normalization methods are taken to be largely neutral with respect to combination methods, though there is some evidence against this assumption [16].
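The six methods of Table 1 can be illustrated with a minimal Python sketch. The function name `combine` and the convention that a score of 0.0 stands for "not retrieved by that system" are assumptions made for this illustration; they are not taken from Belkin et al.

```python
from statistics import median


def combine(scores, method="CombSUM"):
    """Combine the scores that one document received from several systems.

    `scores` holds one (already normalized) score per system. Assumption of
    this sketch: 0.0 means the system did not retrieve the document.
    """
    nonzero = sum(1 for s in scores if s != 0)  # systems that retrieved it
    total = sum(scores)
    if method == "CombMIN":
        return min(scores)
    if method == "CombMAX":
        return max(scores)
    if method == "CombMED":
        return median(scores)
    if method == "CombSUM":
        return total
    if method == "CombANZ":
        # Average over the systems that actually retrieved the document.
        return total / nonzero if nonzero else 0.0
    if method == "CombMNZ":
        # Reward documents retrieved by many systems.
        return total * nonzero
    raise ValueError(f"unknown method: {method}")
```

For a document scored 0.2 and 0.4 by two systems and missed by a third, CombSUM adds the scores while CombMNZ additionally multiplies the sum by two, the number of systems that retrieved it.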
Belkin et al. normalized document scores by dividing them by the largest score across all queries in the ad hoc track of TREC-2⁷. They reported that CombSUM was consistently better than (or at least equal to) the other combination techniques in Table 1. On the other hand, when Lee experimented with metasearch techniques using query-level normalization (called 'Standard' in Table 2) over the TREC-3 corpus, he found that CombMNZ performed slightly better than CombSUM.

Table 2. Normalization Methods Suggested by Montague and Aslam

Name      Method
Standard  Shift min to 0 and scale max to 1.
Sum       Shift min to 0 and scale the sum to 1.
ZMUV      Shift mean to 0 and scale variance to 1.
Montague and Aslam [16] proposed three different normalization schemes for metasearch (Table 2), assuming only that document (or relevance) scores are real values, no matter what they mean (e.g., they may be probability of relevance, odds of relevance, log odds of relevance, or some other measure). Using TREC-3, -5, -9, and a small subset of TREC-5 (based on diversity of runs), they evaluated the combination techniques given in Table 1 with the normalization methods 'Standard', 'Sum', and 'ZMUV'. They combined random groups of {2, 4, ..., 12} runs. We, however, chose to take the best six engines, starting with the best alone (1-way), then the best and second best (2-way), and so on. This was a conscious decision on our part for the following reasons.

1. Scores should reflect accurate probability of relevance information [6,24].
2. The point of diminishing returns for metasearch is reached when about four retrieval systems are used in the combination [23].
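The three schemes of Table 2 can be sketched as follows. This is an illustrative reading rather than the code of [16]: the function names are hypothetical, the population standard deviation is one possible choice for ZMUV, and returning all zeros for a constant score list is a convention adopted here to avoid division by zero.

```python
from statistics import mean, pstdev


def standard(scores):
    """Standard: shift the minimum to 0 and scale the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant list: convention of this sketch
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def sum_norm(scores):
    """Sum: shift the minimum to 0 and scale the sum to 1."""
    lo = min(scores)
    shifted = [s - lo for s in scores]
    total = sum(shifted)
    if total == 0:  # constant list: convention of this sketch
        return [0.0 for _ in scores]
    return [s / total for s in shifted]


def zmuv(scores):
    """ZMUV: shift the mean to 0 and scale the variance to 1."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population deviation: an assumption
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]
```

Each function is applied to the scores of a single retrieval output for a single query, i.e., at the query level, before the outputs are combined.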
3 Evaluation
Vogt and Cottrell [24], along with Lee [14], showed experimentally that a steady improvement from combination is likely when a pair of individual search systems retrieve similar sets of relevant documents and dissimilar sets of nonrelevant documents.

⁷ We call this way of normalization global normalization, in contrast to the case where each document score is normalized with respect to a given retrieval output of a retrieval system for a query expression, which is called query-level normalization.

We picked two runs with similar performance values, namely Westp1 (ranked 10th, with an average precision of 0.3157) and Pircs1 (ranked 12th, with an average precision of 0.3001), from the ad hoc track of TREC-3, and then calculated their macro overlap by Dice's coefficient for retrieved
relevant and nonrelevant documents for each of the 50 queries, using retrieval outputs of different sizes. Dice's coefficient is defined as

$$\frac{2\,|X \cap Y|}{|X| + |Y|},$$

where X and Y are sets of documents; in our case they are the sets of retrieved relevant documents of the two runs and, separately, their sets of retrieved nonrelevant documents. Macro values of Dice's coefficient (averages of per-query averages) are shown in Figure 1. These values were typical of most pairwise combinations of runs with similar performance values over TREC-3 and TREC-5.
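The overlap computation can be sketched as follows. The helper names and the choice to return 0.0 when both sets are empty are assumptions of this illustration; the text does not say how empty sets were handled.

```python
def dice(x, y):
    """Dice's coefficient 2|X ∩ Y| / (|X| + |Y|) of two document sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # convention of this sketch; not specified in the text
    return 2 * len(x & y) / (len(x) + len(y))


def macro_dice(pairs):
    """Average Dice's coefficient over queries.

    `pairs` is a list of (X, Y) tuples, one per query, where X and Y are the
    document identifiers retrieved by the two runs being compared.
    """
    return sum(dice(x, y) for x, y in pairs) / len(pairs)
```

Calling `macro_dice` once with the retrieved relevant sets and once with the retrieved nonrelevant sets, for each retrieval output size, gives one point on each of the two overlap curves.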
[Figure 1: Dice coefficient (vertical axis, 0 to 0.9) plotted against Size No (horizontal axis, 1 to 19), with two curves: Relevancy and Nonrelevancy.]

Fig. 1. Overlap Curve of Relevant and Nonrelevant Documents. Size No corresponds to the retrieval outputs' sizes of 10, 20, ..., 100, 200, ..., 1000
Montague and Aslam [16] focused in their experiments on how well, on average, metasearch methods perform with different normalizations. Over four TREC data sets (or corpora), they reported that (1) ZMUV worked better with CombSUM; (2) SUM performed better with CombMNZ; (3) there was no clear winner between the pairs ZMUV-CombSUM and SUM-CombMNZ; and (4) both pairs consistently outperformed Lee's normalization method (Standard-CombMNZ) [14]. When we repeated this experiment using the top six engines cumulatively over the TREC-3 and TREC-5 data sets, we obtained a different picture, as shown in Tables 3 and 4. Our results clearly indicate that, no matter which metasearch method is employed, the normalization method SUM is consistently better than the other traditionally proposed ones when combining outputs of search systems operating on a single database⁸.

⁸ To evaluate the performance of retrieval systems we computed non-interpolated precision values, one of the evaluation measures used by the TREC community.

[Figure 2: Precision (vertical axis, 0 to 0.9) plotted against Recall (horizontal axis, 0 to 1) for SUM-CombSUM (0.3012), tnoutexsys (0.2655), tnoutnx1 (0.2256), and tnoutex1 (0.2214).]

Fig. 2. Data Fusion for CLEF Runs

Table 3. Non-interpolated precision values for the cumulative combination of TREC-3's top 6 engines

Engine    SUM      ZMUV     Standard  SUM      ZMUV     Standard  Best
          CombSUM  CombSUM  CombSUM   CombMNZ  CombMNZ  CombMNZ   Combination
inq102    0.4226   0.4226   0.4226    0.4226   0.4226   0.4226    0.8165
citya1    0.4551   0.4445   0.4495    0.4503   0.4382   0.4464    0.8982
brkly7    0.4807   0.4593   0.4742    0.4742   0.4553   0.4700    0.9327
inq101    0.4715   0.4471   0.4683    0.4677   0.4438   0.4648    0.9358
assctv2   0.4773   0.4500   0.4750    0.4713   0.4455   0.4692    0.9519
assctv1   0.4730   0.4438   0.4706    0.4665   0.4396   0.4648    0.9552
average   0.4634   0.4446   0.4600    0.4588   0.4408   0.4563    0.9151

There is one point left to mention, though. We
designed an oracle that simply ranks relevant documents ahead of nonrelevant ones for a given retrieval output. We call such an oracle the best combination; it gives the upper bound on the performance of an external combination. (Note that a successful combination should consistently improve on the best of its inputs; hence, a tight lower limit (baseline) for a combination in our study is the performance value of the best system, which is given in the very first row of Tables 3 and 4.) The best combination values show, ironically, that there is plenty of room for optimization techniques, such as relevance feedback [18], to improve the performance of a single engine even if it proves to be the best. We also used the SUM-CombSUM pair to see how a simple-minded but interestingly powerful normalization technique performs against mutually exclusive databases. For this, we experimented on the Twenty-One system's results on CLEF'00, as explained before.

Table 4. Non-interpolated precision values for the cumulative combination of TREC-5's top 6 engines

Engine    SUM      ZMUV     Standard  SUM      ZMUV     Standard  Best
          CombSUM  CombSUM  CombSUM   CombMNZ  CombMNZ  CombMNZ   Combination
ETHme1    0.3165   0.3165   0.3165    0.3165   0.3165   0.3165    0.7394
uwgcx1    0.3804   0.3677   0.3656    0.3814   0.3761   0.3730    0.8648
LNmFull2  0.3790   0.3651   0.3693    0.3810   0.3747   0.3791    0.8946
Cor5M2rf  0.3710   0.3592   0.3654    0.3713   0.3635   0.3681    0.9063
genrl3    0.3727   0.3588   0.3661    0.3786   0.3633   0.3705    0.9278
CLCLUS    0.3833   0.3685   0.3782    0.3888   0.3771   0.3823    0.9398
average   0.3672   0.3560   0.3602    0.3696   0.3619   0.3649    0.8788

Table 5. R-P Table of Collective Fusion on CLEF'00

Recall  SUM      WSUM      SUM      WSUM      SUM      WSUM
        CombSUM  tnoutex1  CombSUM  tnoutnx1  CombSUM  tnoutexsys
0       0.5986   0.6945    0.5916   0.6301    0.6894   0.8031
0.1     0.4326   0.4827    0.4054   0.4402    0.4276   0.5395
0.2     0.3624   0.3919    0.3333   0.3738    0.3314   0.4488
0.3     0.2974   0.3266    0.2843   0.3188    0.285    0.3786
0.4     0.2481   0.2491    0.2339   0.2504    0.2466   0.2934
0.5     0.1971   0.2024    0.2089   0.2178    0.1973   0.2291
0.6     0.1345   0.1432    0.1655   0.1775    0.1581   0.1822
0.7     0.0833   0.0908    0.1232   0.123     0.1139   0.1427
0.8     0.0578   0.0605    0.0904   0.0992    0.0785   0.1074
0.9     0.0285   0.0274    0.0479   0.049     0.0411   0.0509
1       0.0107   0.0091    0.0111   0.0105    0.0123   0.0143
AveP    0.2049   0.2214    0.2093   0.2256    0.2055   0.2655

Here the multilingual run
‘tnoutnx1’ used Dutch as the query language and was composed of 4 bilingual intermediate runs, namely ‘tnoutne1’, ‘tnoutnd1’, ‘tnoutnf1’, and ‘tnoutni1’, with performance values of 0.0734, 0.1009, 0.0625, and 0.0684, respectively. Similarly, the multilingual run ‘tnoutexsys’ used English as the query language and was composed of 4 bilingual intermediate runs, namely ‘en-en’, ‘en-de’, ‘en-fr’, and ‘en-it’, with performance values of 0.1122, 0.0937, 0.0719, and 0.0898, respectively; the multilingual run ‘tnoutex1’ used English as the query language and was composed of 4 bilingual intermediate runs, namely ‘tnoutee2’, ‘tnouted1’, ‘tnoutef1’, and ‘tnoutei1’, with performance values of 0.0991, 0.0798, 0.0528, and 0.0765, respectively. Figure 2 shows how well SUM-CombSUM performed against the individual collection fusion systems when it externally combined all of them. Table 5 shows pairwise performance values (R-P Table) of SUM-CombSUM and the collection fusion runs when they combined the intermediate runs. Here we see that SUM-CombSUM is still promising when it combines the individual runs of collection fusion, but fails to do better than the tnout system when it comes to combining mutually exclusive results.
One possible explanation for this might be that the Twenty-One system performs weighted averaging of document score distributions by exploiting its background knowledge of the single databases (e.g., in language retrieval models, rare terms affect the score calculation differently depending upon the size of the single database [10]).
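The evaluation measure and the oracle used in this section can be illustrated with a short sketch. It assumes the usual TREC convention that non-interpolated average precision is divided by the total number of relevant documents for the topic; the function names are hypothetical, and the oracle simply moves the retrieved relevant documents to the top while preserving their original order.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of a ranked list of document ids.

    Precision is taken at the rank of each relevant document retrieved and
    the sum is divided by the total number of relevant documents, so
    relevant documents that were never retrieved contribute zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def oracle_rerank(ranking, relevant):
    """'Best combination' oracle: relevant documents first, order kept."""
    relevant = set(relevant)
    # sorted() is stable, and False sorts before True.
    return sorted(ranking, key=lambda doc: doc not in relevant)
```

Under this convention the oracle's average precision equals the fraction of the topic's relevant documents that appear anywhere in the retrieval output, which is why it serves as an upper bound for any reranking of that output.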
4 Conclusion
The experiments have shown that the best metasearch performance is attained by equalizing the exponential document score distributions, i.e., by SUM normalization. This type of normalization is still a valuable alternative to weighted normalizations for collection fusion.
References

1. A. Arampatzis and A. van Hameren. Maximum likelihood estimation for filtering thresholds. In Proceedings of the 24th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 285–293, New Orleans, LA, September 2001.
2. N. Belkin, C. Cool, W. Croft, and J. Callan. The effect of multiple query representation on information retrieval system performance. In Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 339–346, Pittsburgh, PA, USA, 1993.
3. N. Belkin, P. Kantor, E. Fox, and J. Shaw. Combining the evidence of multiple query representations for information retrieval. Information Processing & Management, 31(3):431–448, 1995.
4. J. Callan, W. Croft, and S. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert System Applications (DEXA 3), pages 78–83, Berlin, 1992. Springer-Verlag.
5. W. Croft and H. Turtle. A retrieval model for incorporating hypertext links. In Proceedings of ACM Hypertext Conference, pages 213–224, New Orleans, LA, November 1989.
6. W. Croft. Combining approaches to information retrieval. In W. Croft, editor, Advances in Information Retrieval, pages 1–36. Kluwer Academic Publishers, 2000.
7. C. Dwork, S. R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In World Wide Web, pages 613–622, 2001.
8. J. Fiscus and G. R. Doddington. Topic detection and tracking overview. In J. Allan, editor, Topic Detection and Tracking: Event-based Information Organization, pages 17–31. Kluwer Academic Publishers, 2002.
9. S. Gauch, G. Wang, and M. Gomez. Profusion: Intelligent fusion from multiple, distributed search engines. Journal of Universal Computer Science, 2(9):637–649, Sept. 1996. http://www.jucs.org/jucs 2 9/profusion intelligent fusion from.
10. D. Hiemstra, W. Kraaij, R. Pohlmann, and T. Westerveld. Translation resources, merging strategies and relevance feedback for cross-language information retrieval. In C. Peters, editor, Cross-language information retrieval and evaluation, Lecture Notes in Computer Science (LNCS-2069), pages 102–115. Springer Verlag, NY, 2001.
11. J. Katzer, M. J. McGill, J. Tessier, W. Frakes, and P. DasGupta. A study of the overlap among document representations. Information Technology: Research and Development, 1(4):261–274, Oct 1982.
12. S. Lawrence and C. Giles. Searching the World Wide Web. Science, 280(5360):98–100, April 1998.
13. J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In E. A. Fox, editor, Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 180–188, Seattle, WA, July 1995.
14. J. H. Lee. Analyses of multiple evidence combination. In E. A. Fox, editor, Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 267–276, Philadelphia, Pennsylvania, July 1997.
15. R. Manmatha and H. Sever. A formal approach to score normalization for metasearch. In Proceedings of Human Language Technology Conference, San Diego, CA, March 2002.
16. M. Montague and J. Aslam. Relevance score normalization for metasearch. In Proceedings of the ACM 10th Annual International Conference on Information and Knowledge Management (CIKM), pages 427–433, Atlanta, Georgia, November 2001.
17. T. Rajashekar and W. Croft. Combining automatic and manual index representations in probabilistic retrieval. Journal of American Society for Information Science, 46(4):272–283, 1995.
18. G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of American Society for Information Science, 41(4):288–297, 1990.
19. G. Salton, E. Fox, and H. Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
20. T. Saracevic and P. Kantor. A study of information seeking and retrieving. III. Searchers, searches, and overlap. Journal of American Society for Information Science, 39(3):197–216, 1988.
21. J. Swets. Information retrieval systems. Science, 141:245–250, 1963.
22. K. Tumer and J. Ghosh. Linear and order statistics combiners for pattern classification. In A. Sharkey, editor, Combining Artificial Neural Networks, pages 127–162. Springer-Verlag, 1999.
23. C. C. Vogt. How much more is better? Characterizing the effects of adding more IR systems to a combination. In Proceedings of Content-Based Multimedia Information Access (RIAO), pages 457–475, Paris, France, April 2000.
24. C. Vogt and G. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(2-3):151–173, 1999.
On the Cryptographic Patterns and Frequencies in Turkish Language

Mehmet Emin Dalkılıç¹ and Gökhan Dalkılıç²

¹ Ege University International Computer Institute, 35100 Bornova, İzmir, Turkey
[email protected] http://www.ube.ege.edu.tr/~dalkilic
² Dokuz Eylül University, Computer Eng. Dept., 35100 Bornova, İzmir, Turkey
[email protected] http://bornova.ege.edu.tr/~gokhand
Abstract. Although Turkish is a significant language with over 60 million native speakers, its cryptographic characteristics are relatively unknown. In this paper, some language patterns and frequencies of Turkish (such as letter frequency profile, letter contact patterns, most frequent digrams, trigrams and words, common word beginnings and endings, vowel/consonant patterns, etc.) relevant to information security, cryptography and plaintext recognition applications are presented and discussed. The data is collected from a large Turkish corpus and the usage of the data is illustrated through cryptanalysis of a monoalphabetic substitution cipher. A new vowel identification method is developed using a distinct pattern of Turkish—(almost) non-existence of double consonants at word boundaries.
1 Introduction
Securing information transmission in data communication over public channels is achieved mainly by cryptographic means. Many techniques of cryptanalysis use frequency and pattern data of the source language. The cryptographic pattern and frequency data are usually obtained by compiling statistics from a variety of source language texts such as novels, magazines and newspapers. Such data is available for many languages. A good source is the Internet site “Classical Cryptography Course by Lanaki” (http://www.fortunecity.com/skyscraper/coding/379/lesson5.htm), which includes data for English, German, Chinese, Latin, Arabic, Russian, Hungarian, etc. No such data, online or otherwise, could be located for Turkish, a major language used by a large number of people. Turkish, a Ural-Altaic language, is one of the “lesser studied languages” [2] despite being one of the major languages of the world [1]. Even less studied are the information theoretic parameters (e.g., entropy, redundancy, index of coincidence) and cryptographic characteristics (patterns and frequencies relevant to cryptography) of the Turkish language. The earliest work (that we are aware of) on the information theoretic aspects of Turkish was presented by Atlı [3].

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 144–153, 2002. Springer-Verlag Berlin Heidelberg 2002

Although Atlı calculated the digram entropy and gave word length and consonant/vowel probabilities, he could not go
beyond the digram entropy due to insufficient computing resources. More recently, Koltuksuz [4] addressed the issue of cryptanalytic parameters of Turkish, extracting n-gram entropy, redundancy and index of coincidence values up to n = 5. Adopting Shannon’s entropy estimation approach [5], the present authors empirically determined the language entropy upper bound of Turkish as 1.34 bpc (bits per character), with a corresponding redundancy of roughly 70% [6]. The data presented in this paper is obtained from a large Turkish text corpus of size 11.5 Megabytes. The corpus, whose files are filtered so that they consist solely of the 29 letters of the Turkish alphabet and the space, is the union of three corpora: the first is compiled from the daily newspaper Hürriyet by Dalkılıç [7], the second contains 24 novel samples of 22 different authors by Koltuksuz [4], and the last consists mostly of news articles and novels collected by Diri [8]. This paper presents a variety of Turkish language patterns and frequencies (e.g., single letter frequency profile, letter contact patterns, common digrams, trigrams and frequent words, word beginnings, vowel/consonant patterns, etc.) relevant to information security, cryptography and plaintext recognition. The presented data is put to use to solve the following monoalphabetic substitution cipher problem found in [9].

DLKLEĞPFÜ FLMLTU FLLĞ CBÖL ÖIBJL ĞÜNLEPĞ APEYPKÜBÜB SIBPJP MLYLB SÇKZPA DPBNPEPFÜBP SÜĞ. CEĞLÖLYÜ ĞPZPFYCMH PZZÜF LÖLFUBL OPİÜE. ALİV JPZYPBZÜ ĞPYBPJÜ MHZ. FCB ÜDHNH ĞPYBPBÜB MLJELŞUBÖL.

Throughout the cryptanalysis example, uppercase letters are used for ciphertext and lowercase letters for plaintext (typeset in Courier font) to improve readability.
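A first step in attacking such a cipher is to tally how often each ciphertext letter occurs. The following Python sketch does this for the example above; the ciphertext string is a transcription that assumes the Turkish letters Ğ, İ and Ş at the positions where the typeset source is hard to read, and the function name is hypothetical.

```python
from collections import Counter

# Transcription of the example ciphertext (assumes the letters Ğ, İ, Ş).
CIPHERTEXT = (
    "DLKLEĞPFÜ FLMLTU FLLĞ CBÖL ÖIBJL ĞÜNLEPĞ APEYPKÜBÜB SIBPJP "
    "MLYLB SÇKZPA DPBNPEPFÜBP SÜĞ. CEĞLÖLYÜ ĞPZPFYCMH PZZÜF LÖLFUBL "
    "OPİÜE. ALİV JPZYPBZÜ ĞPYBPJÜ MHZ. FCB ÜDHNH ĞPYBPBÜB MLJELŞUBÖL."
)


def letter_counts(text):
    """Count letters only, ignoring spaces and punctuation."""
    return Counter(ch for ch in text if ch.isalpha())


if __name__ == "__main__":
    # Print the ciphertext letters from most to least frequent.
    for letter, count in letter_counts(CIPHERTEXT).most_common():
        print(letter, count)
```

The resulting ordering of ciphertext letters can then be compared with the normal letter frequency profile of Turkish.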
2 Letter Frequencies
The Turkish alphabet contains eight vowels {A, E, I, İ, O, Ö, U, Ü} and twenty-one consonants {B, C, Ç, D, F, G, Ğ, H, J, K, L, M, N, P, R, S, Ş, T, V, Y, Z}, totaling 29 letters. In lower case, the vowels are written {a, e, ı, i, o, ö, u, ü} and the consonants {b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z}. Readers who are not native Turkish speakers may be confused by the fact that {I, ı} and {İ, i} are two different letters. Table 1 shows the individual Turkish letter probabilities when the space is suppressed in the text. The table also gives the frequency ordering of Turkish as AEİNRLIKDMYUTSBOÜŞZGÇHĞVCÖPFJ.

Table 1. Normal Turkish letter frequencies (%) in decreasing order

A 11.82    I 5.12    T 3.27    Z 1.51    C 0.97
E  9.00    K 4.70    S 3.03    G 1.32    Ö 0.86
İ  8.34    D 4.63    B 2.76    Ç 1.19    P 0.84
N  7.29    M 3.71    O 2.47    H 1.11    F 0.43
R  6.98    Y 3.42    Ü 1.97    Ğ 1.07    J 0.03
L  6.07    U 3.29    Ş 1.83    V 1.00
2.1 Letter Groupings

Unlike letter frequencies and their order, which fluctuate considerably, group frequencies are fairly constant in all languages [10]. The following are some useful Turkish letter groups, where each group is arranged in decreasing frequency order.

• Vowels {A, E, İ, I, U, O, Ü, Ö} 42.9%
• High freq. consonants {N, R, L, K, D} 29.7%
• Medium freq. consonants {M, Y, T, S, B} 16.2%
• Low freq. consonants {Ş, Z, G, Ç, H, Ğ, V, C, P, F, J} 11.3%
• High freq. vowels {A, E, İ, I} 34.3%
• Highest freq. letters {A, E, İ, N, R} 43.4%
• High freq. letters {A, E, İ, N, R, L, I, K, D} 63.8%
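The group figures above are sums of the single-letter percentages in Table 1. The following is a minimal sketch of that arithmetic; the names `TABLE1`, `GROUPS` and `group_frequency` are illustrative and not part of the original study. Because Table 1 is itself rounded, the sums may differ slightly from the printed group figures.

```python
# Letter frequencies (%) copied from Table 1 (space suppressed).
TABLE1 = {
    "A": 11.82, "E": 9.00, "İ": 8.34, "N": 7.29, "R": 6.98, "L": 6.07,
    "I": 5.12, "K": 4.70, "D": 4.63, "M": 3.71, "Y": 3.42, "U": 3.29,
    "T": 3.27, "S": 3.03, "B": 2.76, "O": 2.47, "Ü": 1.97, "Ş": 1.83,
    "Z": 1.51, "G": 1.32, "Ç": 1.19, "H": 1.11, "Ğ": 1.07, "V": 1.00,
    "C": 0.97, "Ö": 0.86, "P": 0.84, "F": 0.43, "J": 0.03,
}

# Letter groups as listed in Section 2.1.
GROUPS = {
    "Vowels": "AEİIUOÜÖ",
    "High freq. consonants": "NRLKD",
    "Medium freq. consonants": "MYTSB",
    "Low freq. consonants": "ŞZGÇHĞVCPFJ",
    "High freq. vowels": "AEİI",
    "Highest freq. letters": "AEİNR",
    "High freq. letters": "AEİNRLIKD",
}


def group_frequency(letters):
    """Return the summed Table 1 percentage of the given letters."""
    return sum(TABLE1[ch] for ch in letters)


if __name__ == "__main__":
    for name, letters in GROUPS.items():
        print(f"{name}: {group_frequency(letters):.2f}%")
```

The same helper can be pointed at any other letter set, for example to compare a ciphertext group against the expected share of the vowels.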
2.2 Cryptanalysis Using Letter Patterns

Monoalphabetic substitution ciphers replace each occurrence of a plaintext letter (say a) with a ciphertext letter (say H). Table 2 shows the frequency counts for the example ciphertext given in Section 1. Clearly, an exact match between these counts and the normal plaintext letter frequencies of Table 1 is not expected. Nevertheless, it is very likely that the highest frequency ciphertext letters {P,L,B,Ü} are substitutes for letters from the highest frequency normal letter set {a,e,i,n,r}. Likewise, the low frequency ciphertext symbols {Ç,O,T,V,Ş,G,R} are expected to resolve to letters from the low frequency plaintext letter set {ş,z,g,ç,h,ğ,v,c,p,f,j}.

Table 2. Letter frequencies in the example ciphertext

P 21    E 7    C 4    U 3    T 1
L 20    Y 7    H 4    D 3    V 1
B 16    Z 7    N 3    I 2    Ş 1
Ü 13    J 5    A 3    İ 2    G 0
Ğ  9    M 5    K 3    Ç 1    R 0
F  8    Ö 5    S 3    O 1

3 Letter Contact Patterns
Letter contact data (transition probabilities) are an important characteristic of any language because contacts define letters through their relations with one another. For instance, in Turkish vowels avoid contact with each other and doubles are rare. These behaviors can be observed in Table 3, which holds the normal contact percentage data for Turkish. Table 3 is modeled after a similar table for English by F. B. Carter reproduced in [10]. Taking any one letter, say A: on the left, it was contacted 16% of the time by L, 10% by D, etc., and 99.43% of its total contacts on that side were with consonants. On the right, it was contacted 19% of the time by R, and 0.81% of the time by vowels. Note that the table only contains contacts with a frequency of 3% or more.

The most marked characteristics of Turkish letter contacts are that vowels do not contact vowels on either side and that doubles are rare. Since no consonant avoids vowel contact, vowels and consonants are easily distinguishable. The only significant doubles are TT and LL for consonants and AA for vowels. Doubles for vowels are so rare that even AA did not make it into Table 3.

Table 3. Normal (expected) contact percentages of Turkish digrams

Left V | Left C | Left contacts (%) | Letter | Right contacts (%) | Right V | Right C
00.57 | 99.43 | C3 H4 N5 S5 T5 B6 R8 M8 Y9 K9 D10 L16 | A | R19 N16 L8 K8 Y6 M5 D5 S4 Ş4 T4 Z3 H3 Ğ3 | 00.81 | 99.19
73.18 | 26.82 | L3 M3 N4 R7 E8 İ18 A39 | B | İ38 A22 U16 E13 Ü4 O3 | 98.53 | 01.47
68.05 | 31.95 | Ü3 U3 I4 R4 İ4 O11 A21 E21 N22 | C | A33 E31 U11 İ10 I9 | 96.81 | 03.19
77.36 | 22.64 | K4 U5 R5 N5 Ü7 E13 A17 İ33 | Ç | İ22 I15 A15 E15 O14 Ü5 L4 M3 T3 | 88.22 | 11.78
35.58 | 64.42 | M3 Y3 İ4 L10 E12 A16 R18 N26 | D | E27 A24 İ19 I12 U9 Ü5 O3 | 98.70 | 01.30
00.12 | 99.88 | C4 K4 B4 S5 G5 V5 T6 R6 Y6 N7 M8 D15 L16 | E | R22 N17 L9 K8 T7 M6 D5 Y5 S5 V3 Ğ3 C3 | 00.16 | 99.84
88.82 | 11.18 | U3 Ü4 I4 O4 İ15 E23 A35 | F | A31 E20 İ12 I11 T6 L4 R3 O3 U3 | 80.83 | 19.17
16.57 | 83.43 | A3 İ3 E3 V6 U6 Z8 Y10 L17 R19 N20 | G | E31 İ22 Ö18 Ü15 A6 U3 I3 | 98.37 | 01.63
99.86 | 00.14 | Ü3 Ö3 O10 U11 İ15 I16 E18 A24 | Ğ | İ28 I27 U12 R8 A7 L7 E4 Ü3 | 81.57 | 18.49
85.99 | 14.01 | R3 U5 E9 İ11 A59 | H | A42 E18 İ12 U4 I3 Ü3 T3 O3 | 84.98 | 15.02
00.05 | 99.95 | Ç3 Ş4 Y5 Ğ6 K6 M6 T8 S9 L10 D11 N12 R12 | I | N31 R11 L10 K9 Ş8 M7 Y7 Z5 Ğ4 S3 | 00.11 | 98.89
00.78 | 99.22 | Ç3 G4 Ğ4 T6 M6 S6 K7 N8 L9 R10 D11 B14 | İ | N22 R17 L11 Y8 M7 Ş6 K5 Z5 S4 Ç4 T3 | 00.47 | 99.53
80.88 | 19.12 | Ü4 İ6 R15 E16 O26 A28 | J | İ36 A19 E17 L7 U6 I6 D5 O3 | 86.19 | 13.81
79.49 | 20.51 | Ş3 L3 Ü4 U4 O7 R7 İ10 I11 E18 A25 | K | A26 İ14 L10 E10 I8 T7 O6 U6 Ü3 | 74.87 | 25.13
57.54 | 42.46 | T3 Ş3 L3 U4 Y5 R5 I6 K6 N7 O8 E10 İ12 A14 | L | A29 E24 İ11 I8 D5 M5 U4 L3 | 79.93 | 20.07
63.72 | 36.28 | Ş3 Ü3 T4 N4 U7 R8 I8 L9 E13 İ14 A17 | M | A29 E22 İ15 I11 U7 L4 Ü4 D3 | 87.83 | 12.17
97.89 | 02.11 | O5 Ü6 U7 I17 E17 İ20 A23 | N | D17 E13 İ12 I12 A10 L9 U6 C4 M3 | 57.15 | 42.85
00.69 | 99.31 | B4 T5 D8 Ç8 S13 K14 Y33 | O | R28 L21 N16 K9 Ğ4 C4 Y4 | 00.25 | 99.75
00.09 | 99.91 | Ç3 Y4 B9 K10 D12 S16 G41 | Ö | R20 N19 Y17 Z15 L7 Ğ4 T4 S3 | 00.02 | 99.98
90.62 | 09.38 | Ü3 R3 U6 E10 O10 I11 İ12 A37 | P | A31 E12 I11 L10 T6 O6 R5 İ5 M4 S3 | 67.27 | 32.73
93.20 | 06.80 | Ö3 Ü4 U6 I6 O10 İ16 E22 A28 | R | A16 İ14 I12 D11 E9 L6 U6 M5 K4 T4 S3 | 60.19 | 39.81
72.68 | 27.32 | M3 K4 Ü4 N5 I5 U5 R7 İ15 E18 A23 | S | A18 İ16 I16 E13 T9 O7 U7 Ö3 Ü3 | 83.26 | 16.74
94.07 | 05.93 | O3 R4 Ü8 E8 U9 I18 İ23 A26 | Ş | T16 I14 A13 L11 İ11 E9 M7 K6 U4 Ü4 | 56.60 | 43.40
49.41 | 50.59 | L3 U4 T5 İ6 R8 Ş9 S10 K10 A15 E18 | T | A19 E16 İ15 I14 Ü8 U6 L5 T4 M4 O3 | 82.03 | 17.97
00.18 | 99.82 | C3 Y4 Ğ4 T6 S6 M7 K7 L8 R9 N10 D13 B14 | U | N20 R15 L10 M9 Y7 K6 Ş5 Z5 Ğ5 T4 S4 | 00.62 | 99.38
00.02 | 99.98 | Ç3 Ş3 Z3 S5 B5 K6 M6 N6 L6 R7 Y8 G11 D13 T13 | Ü | N23 R15 Z9 K8 L7 Ş7 Y6 M6 S5 | 00.12 | 99.88
88.31 | 11.69 | Ü3 İ4 U5 A28 E41 | V | E43 A24 İ8 R5 U5 L5 | 82.06 | 17.94
94.24 | 05.76 | O4 Ü5 Ö6 U8 I10 E15 İ22 A25 | Y | A29 O17 E15 L8 I7 İ5 Ü4 U3 D3 | 81.82 | 18.18
96.32 | 03.68 | E7 Ö9 U10 Ü11 I15 A21 İ21 | Z | A19 E18 L14 İ11 I11 D7 Ü6 U4 M3 | 69.70 | 30.30

3.1 Most Frequent Digrams and Trigrams

In Turkish, about one third of the 29² = 841 possible digrams and about one tenth of the 29³ = 24,389 possible trigrams account for 96% of total usage. In Table 4, none of the first 100 digrams is a double. Fifty of them are in the form consonant-vowel (CV), 42 are in the form vowel-consonant (VC) and only 8 are in the form consonant-consonant (CC). Usage of the first ten digrams in Table 4 sums to 16.9%, the first 50 to 47.9% and the first 100 to 69.5%.

Table 4. The 100 most frequent digrams in Turkish (frequencies per 100,000)

AR 2273    Dİ 1021    NI 703    OL 586    AŞ 500    BE 433    KI 350
LA 2013    ND  980    AY 698    Sİ 578    NL 496    KE 424    RU 349
AN 1891    RA  976    YO 686    LI 576    TI 494    EY 421    Ğİ 347
ER 1822    AL  974    EK 683    RE 566    EM 494    ES 411    AZ 343
İN 1674    AK  967    RD 681    SI 565    ÜN 492    IK 407    İS 343
LE 1640    İL  870    TA 670    Mİ 564    DU 487    RL 393    Gİ 342
DE 1475    Rİ  860    AM 638    TE 562    GE 480    MI 392    ĞI 340
EN 1408    ME  785    DI 637    ET 560    AT 479    İK 379    AH 338
IN 1377    Lİ  782    SA 624    İM 541    SE 457    CA 379    YL 324
DA 1311    OR  782    İY 619    Tİ 537    ED 452    LD 362    ÜR 319
İR 1282    NE  738    Kİ 618    HA 528    UR 452    CE 361
Bİ 1253    RI  733    UN 606    AS 527    ON 452    NU 359
KA 1155    BA  718    NA 602    BU 516    KL 447    IŞ 355
YA 1135    Nİ  716    AD 592    VE 508    IL 438    İZ 353
MA 1044    EL  710    YE 588    IR 503    İŞ 434    LM 353
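The CV/VC/CC breakdown and the cumulative usage figures quoted for Table 4 can be recomputed from the table itself. The sketch below does so; `TABLE4`, `cv_shape` and `cumulative_share` are illustrative names, and the data are simply the 100 entries of Table 4 in rank order.

```python
from collections import Counter

VOWELS = set("AEIİOÖUÜ")

# Table 4 in rank order: digram followed by its frequency per 100,000.
TABLE4 = """
AR 2273 LA 2013 AN 1891 ER 1822 İN 1674 LE 1640 DE 1475 EN 1408 IN 1377 DA 1311
İR 1282 Bİ 1253 KA 1155 YA 1135 MA 1044 Dİ 1021 ND 980 RA 976 AL 974 AK 967
İL 870 Rİ 860 ME 785 Lİ 782 OR 782 NE 738 RI 733 BA 718 Nİ 716 EL 710
NI 703 AY 698 YO 686 EK 683 RD 681 TA 670 AM 638 DI 637 SA 624 İY 619
Kİ 618 UN 606 NA 602 AD 592 YE 588 OL 586 Sİ 578 LI 576 RE 566 SI 565
Mİ 564 TE 562 ET 560 İM 541 Tİ 537 HA 528 AS 527 BU 516 VE 508 IR 503
AŞ 500 NL 496 TI 494 EM 494 ÜN 492 DU 487 GE 480 AT 479 SE 457 ED 452
UR 452 ON 452 KL 447 IL 438 İŞ 434 BE 433 KE 424 EY 421 ES 411 IK 407
RL 393 MI 392 İK 379 CA 379 LD 362 CE 361 NU 359 IŞ 355 İZ 353 LM 353
KI 350 RU 349 Ğİ 347 AZ 343 İS 343 Gİ 342 ĞI 340 AH 338 YL 324 ÜR 319
""".split()

DIGRAMS = TABLE4[0::2]
COUNTS = [int(x) for x in TABLE4[1::2]]


def cv_shape(ngram):
    """Return the consonant/vowel shape of an n-gram, e.g. 'AR' -> 'VC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in ngram)


def cumulative_share(n):
    """Percentage of total usage covered by the n most frequent digrams."""
    return sum(COUNTS[:n]) / 1000.0


shape_counts = Counter(cv_shape(d) for d in DIGRAMS)

if __name__ == "__main__":
    print(dict(shape_counts))
    for n in (10, 50, 100):
        print(n, round(cumulative_share(n), 1))
```

Keeping the shape classification in a small function makes it reusable for trigrams and for ciphertext n-grams once a tentative vowel set has been chosen.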
Observe that the Turkish letter contact data (Table 3) are not symmetric. For instance, BU is in the table but its reverse UB is not. Two kinds of digrams are useful for distinguishing letters from each other: high frequency digrams (Table 4) with rare reverses (i.e., the reverse is not in Table 3), e.g., ND, OR, RD, DI, OL, BU, NL, TI, ÜN, DU, GE, ON, RL, LD, YL, and high frequency digrams whose reverses are also high frequency digrams (i.e., both the digram and its reverse are in Table 4), e.g., AR, LA, AN, ER, İN, LE, DE, EN, IN, DA, İR, KA, YA, MA, RA, AL, AK, İL, Rİ, ME, Lİ, NE, RI, EL, AY, EK, TA, AM, SA, UN, NA, AD, YE, Sİ, LI, RE, Mİ, TE, İM, HA, AS, IR, SE, ED, UR, IL.

Table 5 shows that none of the most common Turkish trigrams contains two vowels or three consonants in sequence; that is, there are no VVV, VVC, CVV, or CCC patterns. The most common trigram pattern in Table 5 is VCV (46 times), followed by CVC (35 times). CCV and VCC are seen 11 and 8 times, respectively. The first 10, 50 and 100 trigrams represent, respectively, 7.2%, 19.0%, and 28.7% of total usage.
Table 5. The 100 most frequent trigrams in Turkish (frequencies per 100,000)

LAR 1237    ANI 362    ESİ 283    İLİ 245    RLA 216    AĞI 194    RDI 169
BİR  952    AMA 357    NIN 280    BAŞ 243    MİŞ 213    ORD 194    SON 168
LER  949    RIN 345    YLE 277    ARD 242    YAN 212    GEL 194    ILA 167
ERİ  764    NLA 338    ADI 273    NİN 239    ECE 209    MAN 192    BEN 166
ARI  757    DAN 338    İYO 271    RDU 231    AYI 207    ACA 192    CAK 165
YOR  643    IND 336    ELE 271    MIŞ 229    LMA 207    ÖYL 191    İRİ 163
ARA  521    EDİ 326    İNE 266    OLA 227    İĞİ 207    KAD 187    EYE 163
NDA  482    ADA 321    SİN 265    IĞI 226    EDE 206    ERD 183    AŞI 162
İNİ  432    AYA 316    ANL 263    EĞİ 223    TAN 205    ORU 178    ÇIK 160
INI  428    KAR 299    KLA 262    EME 223    NDİ 204    RAK 177    KAN 159
ASI  387    ALA 298    ERE 262    INA 222    KAL 204    DİY 172
DEN  383    LAN 296    ALI 258    ANA 220    ONU 201    KLE 171
NDE  383    ENİ 294    ELİ 256    KEN 218    UNU 200    VER 170
RİN  372    SIN 294    İYE 255    İÇİ 217    END 199    EMİ 169
İLE  367    İND 291    BİL 246    IYO 217    ÇİN 198    GÖR 169
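The trigram shapes reported for Table 5 can be tallied in the same way, and the same data support pattern lookups such as "which CVC trigrams repeat their first letter". The sketch below is illustrative only; `TRIGRAMS`, `cv_shape` and `letter_pattern` are names introduced here, and the list is Table 5 in rank order.

```python
from collections import Counter

VOWELS = set("AEIİOÖUÜ")

# The 100 trigrams of Table 5 in rank order.
TRIGRAMS = """
LAR BİR LER ERİ ARI YOR ARA NDA İNİ INI ASI DEN NDE RİN İLE
ANI AMA RIN NLA DAN IND EDİ ADA AYA KAR ALA LAN ENİ SIN İND
ESİ NIN YLE ADI İYO ELE İNE SİN ANL KLA ERE ALI ELİ İYE BİL
İLİ BAŞ ARD NİN RDU MIŞ OLA IĞI EĞİ EME INA ANA KEN İÇİ IYO
RLA MİŞ YAN ECE AYI LMA İĞİ EDE TAN NDİ KAL ONU UNU END ÇİN
AĞI ORD GEL MAN ACA ÖYL KAD ERD ORU RAK DİY KLE VER EMİ GÖR
RDI SON ILA BEN CAK İRİ EYE AŞI ÇIK KAN
""".split()


def cv_shape(ngram):
    """Return the consonant/vowel shape of an n-gram, e.g. 'LAR' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in ngram)


def letter_pattern(ngram):
    """Number letters by first appearance, e.g. 'NIN' -> '121', 'LAR' -> '123'."""
    seen = {}
    return "".join(str(seen.setdefault(ch, len(seen) + 1)) for ch in ngram)


shape_counts = Counter(cv_shape(t) for t in TRIGRAMS)

# Trigrams that start and end with the same consonant around a vowel.
repeated_cvc = [t for t in TRIGRAMS
                if cv_shape(t) == "CVC" and letter_pattern(t) == "121"]

if __name__ == "__main__":
    print(dict(shape_counts))
    print(repeated_cvc)
```

A lookup of this kind is what narrows a repeated ciphertext trigram down to a handful of plaintext candidates.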
3.2 Cryptanalysis Using Letter Contact Patterns

Vowels usually distinguish themselves from consonants by not contacting each other. Table 6 shows the letter contact frequencies for the six most frequent ciphertext letters {P,L,B,Ü,Ğ,F}. The highest frequency letters {P,L} do not contact each other and are almost certainly vowels. Letters {B,Ğ} both contact {P,L}, and thus they are consonants. Letter Ü does not contact either of {P,L} and is likely to be a vowel. Letter F must be a consonant because it contacts two (presumed) vowels {L,Ü}. When the table is extended to include all ciphertext letters, it is determined that {P,L,Ü,C,H,I,Ç} are very likely to be vowels.

Table 6. Letter contact counts for the six most frequent ciphertext letters (each entry is the number of times the column letter immediately precedes the row letter)

     P  L  B  Ü  Ğ  F
P    -  -  4  -  4  -
L    -  1  1  -  1  2
B    3  1  -  4  -  -
Ü    -  -  2  -  1  2
Ğ    1  1  -  1  -  -
F    3  1  -  1  -  -
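Both the letter counts of Table 2 and the contact counts of Table 6 can be recomputed directly from the example ciphertext. The following is a small sketch of that bookkeeping; the helper names are illustrative, and contacts are counted only inside words, which is an assumption about how the tables were compiled.

```python
from collections import Counter

CIPHERTEXT = (
    "DLKLEĞPFÜ FLMLTU FLLĞ CBÖL ÖIBJL ĞÜNLEPĞ APEYPKÜBÜB SIBPJP MLYLB "
    "SÇKZPA DPBNPEPFÜBP SÜĞ. CEĞLÖLYÜ ĞPZPFYCMH PZZÜF LÖLFUBL OPİÜE. "
    "ALİV JPZYPBZÜ ĞPYBPJÜ MHZ. FCB ÜDHNH ĞPYBPBÜB MLJELŞUBÖL."
)

# Words with sentence punctuation removed.
WORDS = [w.strip(".") for w in CIPHERTEXT.split()]

# Single-letter counts (compare with Table 2).
letter_counts = Counter(ch for w in WORDS for ch in w)

# Ordered contacts: the key "XY" means X is immediately followed by Y.
contact_counts = Counter(w[i:i + 2] for w in WORDS for i in range(len(w) - 1))


def contact_table(letters):
    """Rows are letters; each cell counts how often the column letter precedes the row letter."""
    return {row: {col: contact_counts[col + row] for col in letters}
            for row in letters}


top6 = [ch for ch, _ in letter_counts.most_common(6)]

if __name__ == "__main__":
    print(top6)
    for row, cells in contact_table(top6).items():
        print(row, [cells[col] for col in top6])
```

Extending `contact_table` to all ciphertext letters gives the larger table mentioned at the end of the paragraph above.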
Now, let us focus on the two doubles, LL in FLLĞ and ZZ in PZZÜF. These doubles may be associated with the only significant doubles of Turkish {tt,ll,aa}, i.e., {L,Z} → {t,l,a}. Based on the frequency and vowel analysis, L is a strong candidate for a, so we may temporarily¹ bind L→a. This makes letter P the strongest candidate for letter e, i.e., P→e. Letter Ü is a vowel and is either ı or i. BÜB is a repeated trigram with the ‘121’ pattern, and in Table 5 the only ‘121’ patterned trigrams that start and end with a consonant (i.e., CVC) are NIN and NİN. Thus, we may bind B→n and stop here, to continue later.

¹ Cryptanalysis is mostly a trial and error process and all bindings are temporary.
4 Word Patterns

The primary vowel harmony rule is that all vowels of a Turkish word are either back vowels {A,I,U,O} or front vowels {E,İ,Ü,Ö}. The secondary vowel harmony rule states that (i) when the first vowel is flat {A,E,I,İ}, the following vowels are also flat, e.g., BAKIRCI, İSTEK, (ii) when the first vowel is rounded {U,O,Ü,Ö}, the subsequent vowels are either high and rounded {U,Ü} or low and flat {A,E}, and (iii) low and rounded vowels {O,Ö} can only be in the first syllable of a word. The last phoneme rule is that Turkish words do not end in the consonants {B,C,D,G}. Each of these rules has exceptions. Using a root word lexicon, Güngör [11] determined that only 58.8% of the words obey the primary vowel harmony rule. The secondary vowel harmony rule is obeyed by 72.2%. The most obeyed rule is the last phoneme rule, with 99.3%.

The average word length of Turkish is 6.1 letters, about 30% more than that of English. Words with 3 to 8 letters represent over 60% of total usage in Turkish text.

4.1 Common Words, Beginnings, and Endings

When word boundaries are not suppressed in the ciphertext, frequent word beginnings, endings and common words provide a wealth of information. The 100 most frequent Turkish words and the 50 most common n-grams for each of the other categories, together with their percentage of total usage, are given below.

Common words. BİR VE BU DE DA NE O GİBİ İÇİN ÇOK SONRA DAHA Kİ KADAR BEN HER DİYE DEDİ AMA HİÇ YA İLE EN VAR TÜRKİYE Mİ İKİ DEĞİL GÜN BÜYÜK BÖYLE NİN MI IN ZAMAN İN İÇİNDE OLAN BİLE OLARAK ŞİMDİ KENDİ BÜTÜN YOK NASIL ŞEY SEN BAŞKA ONUN BANA ÖNCE NIN İYİ ONU DOĞRU BENİM ÖYLE BENİ HEM HEMEN YENİ FAKAT BİZİM KÜÇÜK ARTIK İLK OLDUĞUNU ŞU KADIN KARŞI TÜRK OLDUĞU İŞTE ÇOCUK SON BİZ VARDI OLDU AYNI ADAM ANCAK OLUR ONA BİRAZ TEK BEY ESKİ YIL BUNU TAM İNSAN GÖRE UZUN İSE GÜZEL YİNE KIZ BİRİ ÇÜNKÜ GECE (23%)

Note that any n-gram enclosed within a pair of punctuation marks and/or spaces is counted as a word. For instance, “BAKAN’IN” (minister’s) is taken as two words, “BAKAN” and “IN”. The only one-letter word in Turkish is O (it/he/she). The list contains many non-content words such as BİR (one), VE (and), BU (this), DE/DA (too/also), NE (what), Kİ (that/who/which), AMA (but), İLE (with) and fewer conceptual words such as TÜRKİYE (Turkey), ZAMAN (time), İNSAN (human), GÜZEL (beautiful).
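The primary vowel harmony rule and the last phoneme rule stated at the start of this section are simple enough to check mechanically. The sketch below is one possible encoding, not the procedure used in [11]; the function names are illustrative, and KİTAP (book) is an example added here of a common word that breaks primary harmony.

```python
BACK_VOWELS = set("AIUO")
FRONT_VOWELS = set("EİÜÖ")
VOWELS = BACK_VOWELS | FRONT_VOWELS
FORBIDDEN_ENDINGS = set("BCDG")


def obeys_primary_harmony(word):
    """True if all vowels of the (uppercase) word are back or all are front."""
    vowels = [ch for ch in word if ch in VOWELS]
    return (all(v in BACK_VOWELS for v in vowels)
            or all(v in FRONT_VOWELS for v in vowels))


def obeys_last_phoneme_rule(word):
    """True if the (uppercase, non-empty) word does not end in B, C, D or G."""
    return bool(word) and word[-1] not in FORBIDDEN_ENDINGS


if __name__ == "__main__":
    for w in ("BAKIRCI", "İSTEK", "KİTAP", "RİNG"):
        print(w, obeys_primary_harmony(w), obeys_last_phoneme_rule(w))
```

Words are assumed to be given in uppercase so that the dotted and dotless I stay distinct; running the two checks over a word list would give obedience rates comparable in spirit to those quoted from [11].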
Digram word endings. AN EN İN AR IN DA ER DE Dİ AK LE Nİ NA NE NI İM DI Rİ RI OR EK YE RA DU UN YA Kİ İR LA IM Lİ Sİ IK IR LI ET TI Tİ CE SI UM IŞ RE ĞI İZ İK İŞ MA IZ Bİ (73.1%)

Trigram word endings. LAR DAN LER DEN YOR ARI INI NDA İNİ ERİ İNE INA NDE NIN NİN RDU YLE MIŞ AYA ASI MİŞ RAK IĞI RIN CAK ESİ RDI ARA İYE NRA MAK MEK TAN İĞİ DAR RİN EYE MAN LIK RUM UNU ADA RDİ ADI KEN DIR TEN DİR LİK YLA (41.6%)

Digram word beginnings. Bİ KA YA DE BA BU GE VE OL DA HA SA BE GÖ SO KO TA Gİ SE NE HE AL GÜ YE AN Dİ İÇ KE Kİ AR TE ÇO DÜ KU İN VA İS ME KI DO PA ON İL ÇA DU YO MA TÜ ÇI Mİ (67.3%)

Trigram word beginnings. BİR BAŞ İÇİ KAR GEL GÖR SON BEN OLA KAD YAP BİL KAL VER KEN ÇIK DEĞ VAR GÜN YAN GİB İST BAK DİY TÜR HER ARA OLM DED ÇOK DÜŞ DAH BUN GER OLD YER KON GEÇ PAR DUR KUR BİZ ANL ÇOC YAR YIL BUL SEN OLU YOK (30.6%)

A careful observation of the Turkish word endings and beginnings given above reveals a distinct feature of Turkish: the first two letters and the last two letters of a word each contain exactly one vowel. In other words, (almost) no Turkish word starts or ends with a consonant-consonant (CC) or vowel-vowel (VV) pattern. Very few words (about 2%), mostly of foreign origin, e.g., TREN (train), KREDİ (credit), RİNG (ring), do not obey this rule. A vowel identification method for Turkish is developed using this “no CC or VV patterns at word boundaries” rule and presented in the next subsection.

4.2 A New Vowel Identification Method for Turkish

When spacing is not suppressed in the ciphertext of a mono-alphabetic cipher, the following technique can be employed to distinguish vowels from consonants. First, make a list of digram word beginnings and endings; let us call it the PairList. Then pick a pair containing a letter that has high frequency in the ciphertext and remove the pair from the PairList. Create two empty lists, List1 and List2, and put one letter of the pair in List1 and the other in List2. Next, repeat the following step until all elements in List1 and List2 are marked (processed): pick and mark the first unmarked element (say X) in List1 or List2; in the PairList, find each pair of the form XY or YX, put Y in the list that X does not belong to, and remove that pair from the PairList. At the end of the process, remove duplicates from both lists; the smaller list will (very likely) contain the vowels and the other list the consonants.

The few words that do not obey the “no CC or VV patterns at word boundaries” rule may cause a letter to end up in both lists. If that is the case, get two counts: separately count the number of times the letter contacts the members of List1 and of List2. Since a contact between the members of a list indicates either a CC or a VV pattern, if one count is dominant, remove the letter from that list. For example, if letter X is seen many times with the elements of List1 and few times with the elements of List2, remove it from List1 and keep it in List2. If no count is dominant, remove it from both lists.
At the end, the PairList may contain pairs whose letters are not placed in either list because they do not make contact (at word beginnings or endings) with any letter in List1 or List2. In such situations, again count the number of contacts with each list’s elements for both letters of the leftover pairs, this time using all contacts, not only the digrams at word beginnings and endings. Then, using these counts, determine whether they fit in the vowel list or the consonant list.

Let us illustrate this method for our ongoing example: PairList = {DL, FL, CB, ÖI, ĞÜ, AP, SI, ML, SÇ, DP, SÜ, CE, ĞP, PZ, LÖ, OP, AL, JP, ĞP, MH, FC, ÜD, FÜ, TU, LĞ, ÖL, JL, PĞ, ÜB, JP, LB, PA, BP, ÜĞ, YÜ, ÜF, BL, ÜE, İV, ZÜ, JÜ, HZ, CB, NH}. First, pick the DL pair and create List1 = {D}, List2 = {L}. Next, pick D from List1 and process the pairs containing D, i.e., DP and ÜD, resulting in List1 = {D*}, List2 = {L,P,Ü}, where D* means D has been processed. Then, pick L from List2 and process the pairs containing L, e.g., FL, ML, AL, …, producing List1 = {D*,F,M,A,Ö,Ğ,J,B}, List2 = {L*,P,Ü}, and continue until all letters in both lists are processed. The final lists are List1 = {D,F,M,A,Ö,Ğ,J,O,Z,B,E,Y,S,N}, List2 = {L,P,Ü,C,I,H,Ç}. Since List2 is shorter, it contains the vowels. Pairs {İV} and {TU} are left in the PairList. In the ciphertext, İ occurs twice and contacts P, Ü, L, which are all vowels. Thus, İ must be a consonant and V must be a vowel. Similar analysis adds T and U to the consonant and vowel lists, respectively.
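The list-splitting procedure of this subsection amounts to two-colouring a graph whose edges are the word-boundary pairs. The sketch below is one way to write it, under stated assumptions: the PairList is derived from the ciphertext rather than typed in, the traversal starts from the most frequent letter, and the count-based handling of conflicting or unplaced letters is not implemented; such letters are only reported. All function names are illustrative.

```python
from collections import Counter, deque

CIPHERTEXT = (
    "DLKLEĞPFÜ FLMLTU FLLĞ CBÖL ÖIBJL ĞÜNLEPĞ APEYPKÜBÜB SIBPJP MLYLB "
    "SÇKZPA DPBNPEPFÜBP SÜĞ. CEĞLÖLYÜ ĞPZPFYCMH PZZÜF LÖLFUBL OPİÜE. "
    "ALİV JPZYPBZÜ ĞPYBPJÜ MHZ. FCB ÜDHNH ĞPYBPBÜB MLJELŞUBÖL."
)


def boundary_pairs(text):
    """Unordered letter pairs found at word beginnings and endings."""
    pairs = set()
    for word in (w.strip(".") for w in text.split()):
        if len(word) >= 2:
            pairs.add(frozenset(word[:2]))
            pairs.add(frozenset(word[-2:]))
    return {p for p in pairs if len(p) == 2}  # ignore doubled letters


def split_lists(text):
    """Return (likely vowels, likely consonants, conflicting letters, unplaced letters)."""
    neighbours = {}
    for pair in boundary_pairs(text):
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    freq = Counter(ch for ch in text if ch.isalpha())
    start = max(neighbours, key=lambda ch: freq[ch])  # a high frequency letter

    side = {start: 0}          # 0 and 1 play the roles of List1 and List2
    conflicts = set()
    queue = deque([start])
    while queue:
        x = queue.popleft()    # "mark" x as processed
        for y in neighbours[x]:
            if y not in side:
                side[y] = 1 - side[x]
                queue.append(y)
            elif side[y] == side[x]:
                conflicts.update((x, y))   # the letter would land in both lists

    list1 = {ch for ch, s in side.items() if s == 0}
    list2 = {ch for ch, s in side.items() if s == 1}
    vowels, consonants = sorted((list1, list2), key=len)  # smaller list first
    unplaced = set(neighbours) - set(side)
    return vowels, consonants, conflicts, unplaced


if __name__ == "__main__":
    vowels, consonants, conflicts, unplaced = split_lists(CIPHERTEXT)
    print(sorted(vowels), sorted(consonants), sorted(conflicts), sorted(unplaced))
```

On the example ciphertext no letter conflicts, and the letters of the two leftover pairs come back as unplaced, to be resolved with the full contact counts as described above.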
4.3 Cryptanalysis Using Word Patterns

We temporarily marked {L,P,Ü,C,I,H,Ç,U,V} as vowels and bound L→a, P→e, B→n, and Ü→{ı,i}. The primary vowel harmony rule gives us a way to distinguish between ı and i: if the Ü→ı association is correct, Ü will coexist in many ciphertext words with L→a; otherwise Ü→i is true and Ü will be seen together with P→e in many ciphertext words. Since {P→e,Ü→i} are seen together in eight words while {L→a,Ü→ı} are seen together in only three words, the likely option is Ü→i.

Let us concentrate on the next two highest frequency letters, Ğ and F, which appear in the rare pattern FLLĞ → ?aa?. There are only 7 four-letter words matching the unusual pattern ?aa?: faal, maaş, naaş, saat, vaat, vaaz, zaaf. Except for the word saat, each candidate contains a letter from the low frequency consonant group {f,ş,v,z}. Thus, it is likely that F→s and Ğ→t. At this point there are many openings to explore.

-a-a-tesi sa-a-- saat -n-a --n-a ti-a-et -e--e-inin --ne-e -a-an ----e- -en-e-esine -it. --ta-a-i te-es---- e--is a-as-na -e-i-. -a-- -e--en-i te-ne-i ---. s-n i---- te-nenin -a--a--n-a.

The partial words -a-a-tesi, -it, s-n, te-nenin can easily be identified as Pazartesi (Monday), git (go), son (last), and teknenin (yacht’s). Furthermore, since we have already identified t and a, the only remaining candidate for l is Z. Putting all this together, we have the partial decryption:

pazartesi sa-a-- saat on-a --n-a ti-aret -erkezinin g-ne-e -akan g-zle- pen-eresine git. orta-aki telesko-- ellis a-as-na -evir. -a-- -elkenli tekne-i --l. son ip--- teknenin -a-ra--n-a.

Completion of the decryption is left to the curious reader. Full decryption reveals that the ciphertext contains a single substitution error. Can you find it?
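Partial decryptions like the ones above are easy to regenerate while bindings are being tried and withdrawn. The sketch below applies a tentative key to the ciphertext and prints a dash for every unbound letter; `STAGE1`, `STAGE2` and `apply_key` are illustrative names, and the letter bindings in `STAGE2` are read off the identified words (pazartesi, git, son, teknenin) plus Z→l, without going beyond what the text states.

```python
CIPHERTEXT = (
    "DLKLEĞPFÜ FLMLTU FLLĞ CBÖL ÖIBJL ĞÜNLEPĞ APEYPKÜBÜB SIBPJP MLYLB "
    "SÇKZPA DPBNPEPFÜBP SÜĞ. CEĞLÖLYÜ ĞPZPFYCMH PZZÜF LÖLFUBL OPİÜE. "
    "ALİV JPZYPBZÜ ĞPYBPJÜ MHZ. FCB ÜDHNH ĞPYBPBÜB MLJELŞUBÖL."
)

# Bindings after the frequency, contact and ?aa? analysis.
STAGE1 = {"L": "a", "P": "e", "B": "n", "Ü": "i", "F": "s", "Ğ": "t"}

# Additional bindings implied by the identified words and by Z -> l.
STAGE2 = dict(STAGE1, D="p", K="z", E="r", S="g", C="o", Y="k", Z="l")


def apply_key(ciphertext, key):
    """Replace bound ciphertext letters by plaintext, unbound letters by '-'."""
    return "".join(key.get(ch, "-") if ch.isalpha() else ch
                   for ch in ciphertext)


if __name__ == "__main__":
    print(apply_key(CIPHERTEXT, STAGE1))
    print(apply_key(CIPHERTEXT, STAGE2))
```

Since all bindings are temporary, keeping each stage as its own dictionary makes it cheap to back out of a wrong guess.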
5 Conclusion

We have presented some Turkish language patterns and frequency data compiled from a large text corpus. The data presented here are relevant not only to classical cryptology but also to modern cryptology, due to their potential use in automated plaintext recognition and language identification. We have also demonstrated two things: first, the use of the data in a complete cryptanalysis example, and second, that new insight can be attained through careful and systematic study of language patterns. We have discovered a distinct pattern of the Turkish language and used it to develop a new approach for vowel identification. What we could not address due to limited space are (i) the fluctuations of the data for short text lengths, and (ii) the application of the data to other cipher types, especially substitution ciphers without word boundaries and transposition ciphers. Our future work plan includes the investigation of n-gram versatility (the number of different words in which an n-gram appears) and the positional frequency of n-grams.
References

1. Schultz, T. and Waibel, A.: Fast Bootstrapping of LVCSR Systems with Multilingual Phoneme Sets. Proceedings of EuroSpeech’97 (1997)
2. Oflazer, K.: Developing a Morphological Analyzer for Turkish. NATO ASI on Language Engineering for Lesser-studied Languages, Ankara, Turkey (2000)
3. Atlı, E.: Yazılı Türkçede Bazı Enformatik Bulgular. Uyg. Bilimlerde Sayısal Elektronik Hesap Makinalarının Kullanılması Ulusal Sempozyum, Ankara, Turkey (1972) 409-425
4. Koltuksuz, A.: Simetrik Kriptosistemler için Türkiye Türkçesinin Kriptanalitik Ölçütleri. PhD Dissertation, Computer Eng. Dept., Ege University, İzmir, Turkey (1995)
5. Shannon, C.E.: Prediction and Entropy of Printed English. Bell System Technical Journal, Vol. 30, No. 1 (1951) 50-64
6. Dalkilic, M.E. and Dalkilic, G.: On the Entropy, Redundancy and Compression of Contemporary Printed Turkish. Proc. of Intl. Symp. on Comp. and Info. Sciences (2000) 60-67
7. Dalkilic, G.: Günümüz Yazılı Türkçesinin İstatistiksel Özellikleri ve Bir Metin Sıkıştırma Uygulaması. Master Thesis, Int’l. Computer Inst., Ege Univ., İzmir, Turkey (2001)
8. Diri, B.: A Text Compression System Based on the Morphology of Turkish Language. Proc. of International Symposium on Computer and Information Sciences (2000) 12-23
9. Shasha, D.: Dr. Ecco’nun Şaşırtıcı Serüvenleri (Turkish translation of The Puzzling Adventures of Dr. Ecco). TUBITAK Popüler Bilim Kitapları 24 (1996)
10. Gaines, H. F.: Cryptanalysis. Dover, New York (1956)
11. Güngör, T.: Computer Processing of Turkish: Morphological and Lexical Investigation. PhD Dissertation, Computer Eng. Dept., Boğaziçi Univ., İstanbul, Turkey (1995)
Automatic Stemming for Indexing of an Agglutinative Language

Sehyeong Cho¹ and Seung-Soo Han²

¹ MyongJi University, Department of Computer Science, San 38-2 Yong In, KyungGi, Korea
² MyongJi University, Department of Information Engineering, San 38-2 Yong In, KyungGi, Korea
{shcho,shan}@mju.ac.kr
Abstract. Stemming is an essential process in information retrieval. Though there are extremely simple stemming algorithms for inflectional languages, the story is entirely different for agglutinative languages. It is even more difficult if a significant portion of the vocabulary is new or unknown. This paper explores the possibility of stemming an agglutinative language, in particular Korean, by unsupervised morphology learning. We use only a raw corpus and make no use of a dictionary. Unlike heuristic algorithms that are theoretically ungrounded, this method is based on widely accepted statistical methods. Although the method is currently applied only to Korean, it can be adapted to other agglutinative languages with similar characteristics, since language-specific knowledge is not used.
1 Introduction

Stemming in an agglutinative language such as Korean is a non-trivial task. It requires morphological analysis, which in turn requires a lexicon. Even with a lexicon, we have difficulty if a significant portion of the vocabulary is new or unknown. For instance, virtually all youngsters who chat over the Internet use the form transcribed as [y uh]¹ instead of [y ao] as the concluding eomi². A morphological analyzer or a POS-tagger will mark it as an error in the sentence, unless it is equipped with a brand new dictionary that encompasses all such morphemes. Unfortunately, constructing a new dictionary, or even updating one, is a time-consuming task.

¹ Throughout the paper, examples in Korean are first written in Korean. If necessary, they are followed by ARPAbet [20] symbols in square brackets. The translations are in round parentheses, where each eojul (several morphemes are concatenated to form an eojul) is represented by the meaning part, optionally followed by the role of the grammatical morphemes written between curly braces. For instance, the Korean sentence meaning “He went” is represented in this paper by its Korean text followed by [g ux g aa g aa t aa] (he{subject} go{past}).
² Literally, an eomi is a suffix. However, in the Korean language, it is a class of grammatical morphemes that determine the tense, mood, voice, or honor.

This paper describes a novel method of automatically constructing a lexicon from only a raw corpus, using machine learning. The purpose of the research is twofold. The first is to construct a lexicon-free stemmer to be used in an Internet search engine. By not requiring a human-made dictionary, we are freed from the problem of intellectual property rights. Also, machine learning makes the process of constructing a lexicon extremely fast, and therefore accelerates adaptation to the constantly evolving language. The second is to explore the possibility of unsupervised machine learning of natural languages. Of morphology learning, grammar learning, and semantics learning, we focus only on morphology learning in this paper.

Machine learning can be classified into unsupervised learning and supervised learning. A morphology learning process is regarded as supervised if the training corpus is POS-tagged; it is regarded as unsupervised if a raw corpus is used. This paper concerns only unsupervised learning, since the corpus that is available will probably be a collection of sentences in a “new” language.

Section 2 discusses related work. In Section 3, we discuss the characteristics of Korean morphology and show that simple-minded solutions are not adequate. In Section 4, we propose a multistage method based on a statistical technique. Section 5 discusses the results and future work.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 154−165, 2002. Springer-Verlag Berlin Heidelberg 2002
2 Related Work
Studies of morphology learning and lexicon-free stemmers for inflectional languages have been around for quite some time [1]. Early work used pre-constructed suffix lists and/or rules concerning the stems. More recent work attempts language-neutral morphology learning schemes [2] [3]. Marquez [4] developed a machine-learning method for part-of-speech tagging, but it requires a dictionary and learns only knowledge for ambiguity resolution. Porter [5] developed a simple lexicon-free stemmer. Porter’s algorithm uses knowledge about commonly used suffixes and deletes the suffixes from inflected or derived words. Such a simple algorithm works because English does not have many suffixes. Gaussier [6] proposed an unsupervised learner that uses an inflectional lexicon to learn inflectional operations. He proposed a method of classifying words with common prefixes as candidates, and used clustering to group words into families. However, it requires an inflectional lexicon, making the method language-dependent.

Goldsmith [3] and DeJean [7] also developed morphology learners, but they are restricted to inflectional languages. They use statistical methods to find candidates for stems and search for appropriate inflectional suffixes. There are cases where proper suffixes are inappropriately applied. Also, the ambiguity problem remains. When selecting candidate affixes, Goldsmith [3], Gaussier [6], and Schone [2] all used p-similarity. Two terms are p-similar if they share the first p letters. The affixes of p (or more)-similar words are collected first, and the K most frequent affixes are selected as candidates. The value of p is arbitrary and depends on the language or the knowledge of the researcher. The value of K is also arbitrary: Goldsmith [3] chose the top 100, and Schone [2] chose 200. Schone [2] added to Goldsmith’s and Gaussier’s methods the technique of singular value decomposition to reduce the dimension. This technique is used to uncover hidden similarity in order to resolve the ambiguity problem, and it may be compatible with the learning algorithm suggested in this paper.

Korean morphological analyzers and POS taggers are reported to be of high quality, sufficient for use in commercial products. According to Shin [8], the accuracy of Korean POS taggers ranges from 89 to 97 percent. Since even the tags assigned manually by an expert (which are used as the “gold standard”) usually contain a few percent of errors [9], this accuracy can be considered near-perfect. Unfortunately, these taggers rely on manually constructed lexicons. Nam [10] used a statistical method to construct an information base for noun-derived suffixes. The suffix list, however, comes from an already constructed dictionary. Kang [11] used the characteristics of Korean syllables for morphological analysis. For instance, an eojul³ ending with the syllable [uh n] or [n uh n] is unlikely to be a single word. Such knowledge is used as heuristic rules. Since we aim at using no lexicographic knowledge, this method is not suitable for our purpose. Despite the many pieces of work on Korean morphology, we have found no work that attempts Korean morphology learning.
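The p-similarity heuristic described above can be illustrated with a toy example. The sketch below is a deliberately crude reading of it, not the procedure of Goldsmith, Gaussier or Schone: words sharing their first p letters are grouped, the longest common prefix of each group is treated as a stem, and the K most frequent leftover endings are returned. The function name and the English word list are illustrative assumptions.

```python
from collections import Counter, defaultdict
from os.path import commonprefix


def candidate_affixes(words, p, k):
    """Return the k most frequent endings among groups of p-similar words."""
    groups = defaultdict(set)
    for word in set(words):
        if len(word) >= p:
            groups[word[:p]].add(word)      # p-similar words share a key

    counts = Counter()
    for members in groups.values():
        if len(members) < 2:
            continue                        # nothing to compare against
        stem = commonprefix(sorted(members))
        endings = {w[len(stem):] for w in members} - {""}
        counts.update(endings)              # each ending counted once per group
    return counts.most_common(k)


if __name__ == "__main__":
    sample = ["walk", "walks", "walked", "walking",
              "talk", "talks", "talked", "jump", "jumps"]
    print(candidate_affixes(sample, p=4, k=2))
```

Both p and k have to be chosen by hand here, which is exactly the arbitrariness the paragraph above points out.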
3 Characteristics of Korean Morphology and Some Statistical Observations

Korean is an agglutinative language, and affixes that indicate the case, mood, tense, or honor are agglutinated to content morphemes. This may look similar to inflection or derivation in that affixes are concatenated to word stems. However, agglutinated functional morphemes play more important grammatical and semantic roles. One of the categories, josa, is the set of words that are attached to nominals to determine the case. Thus the sentences glossed as (YungHee{subject} cat{object} like{present}) and (cat{object} YungHee{subject} like{present}) have the same meaning despite the change in the positions of the eojuls in the sentence. Another important class of grammatical morphemes is eomi. Eomis are sometimes further divided into leading eomis and trailing eomis. Eomis are used to indicate honor, tense, mood, and voice. These are concatenated into one eojul, and we must be able to identify the morpheme boundaries before we proceed to POS-tag, analyze syntax, or identify the meanings of sentences.

³ The unit of writing separated by white spaces or punctuation marks. An eojul consists of one or more morphemes.

The content morphemes usually belong to open classes; new words are therefore constantly added, and they are large in number. Grammatical morphemes usually belong to closed classes and are small in number, although the number is far higher (well over a thousand in Korean) than that of inflectional suffixes. One important property of grammatical morphemes is that they appear at the end of eojuls. They are fewer in number than content morphemes, and thus each will appear more often. Thus we have reason to believe that strings that appear as suffixes “often” are grammatical morphemes. The problem is: how do we formalize this intuition?

We tried four different statistical measures as criteria for distinguishing grammatical morphemes from others: the absolute frequency, the relative frequency (the ratio of the number of occurrences of a string at the end of eojuls to the number of all occurrences of the same string), the usage count (the number of different prefixes that are combined with the suffix to form an eojul), and the t-test value [12]. The t-test may need some explanation. Assuming that language generation is a random process, suppose that the probability of choosing a string is $p_0$. Also suppose that the (conditional) probability of the same string being selected at the end of an eojul is $p$. Then, if the string is a grammatical morpheme, $p$ will be greater than $p_0$. With $N$ the number of trials, that is, the number of eojuls, let $Z_0$ be defined as follows.
p − p0 . p0 (1 − p0 ) N We can reject or accept the null hypothesis depending on the significance level and the Z 0 value. For instance, we can reject the null hypothesis if Z 0 > 1.96 with significance level 0.05. In order to judge the relative “goodness” of the above four statistical measures, we considered recall – the number of grammatical morphemes4 found by the method divided by the number of all grammatical morphemes that actually exist in the text – and the precision – the ratio of correct morphemes among the candidates selected by the particular measure. These terms are borrowed from information retrieval. Achieving only high recall or high precision alone is trivial. To obtain high recall, we include everything in the list. To obtain high precision, we include nothing. The problem is to obtain as high a recall as possible while maintaining reasonable precision. Figure 1 plots the precision versus recall, using the four statistical measures as criteria. Overall, all four are disappointing. It is trivial to achieve either high recall or high precision alone; no single method leads us to high precision and sufficient recall at the same time. z0 =
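The first three measures described above can be computed directly from a list of eojuls. The following is a minimal sketch rather than the authors' implementation: the function name `suffix_measures` is hypothetical, every character is treated as one unit (the paper works on Korean letters), and only proper suffixes with a non-empty prefix are counted as occurring "at the end".

```python
from collections import Counter

def suffix_measures(eojuls):
    """Return {suffix: (absolute_freq, relative_freq, usage_count)}."""
    end_count = Counter()      # occurrences of a string at the end of eojuls
    total_count = Counter()    # occurrences of a string anywhere in an eojul
    prefixes = {}              # distinct prefixes seen in front of each suffix
    for word in eojuls:
        n = len(word)
        for i in range(n):
            for j in range(i + 1, n + 1):
                total_count[word[i:j]] += 1
        # Assumption: a suffix must leave a non-empty prefix.
        for i in range(1, n):
            suffix = word[i:]
            end_count[suffix] += 1
            prefixes.setdefault(suffix, set()).add(word[:i])
    return {
        s: (end_count[s], end_count[s] / total_count[s], len(prefixes[s]))
        for s in end_count
    }

# Romanized toy data, purely illustrative.
measures = suffix_measures(["catul", "dogul", "catnun", "ulsan"])
```

Ranking the suffixes by any one of the three values and cutting the list at different depths gives the kind of precision/recall trade-off that Figure 1 reports.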
4 This is only a preliminary analysis. Identifying a grammatical morpheme here only implies that it is used as a grammatical morpheme somewhere. Stemming performance using this criterion will therefore be much lower.
Fig. 1. Precision vs. recall. The right plot is an exploded view of the left half of the left plot
4 Multistage Method for Morphology Learning and Segmentation
Disappointed by the simple statistical methods tried in the previous section, we propose a multistage method for morphology learning and stemming. The method uses the t-test for the initial construction of the candidate list while ignoring repeated appearances of the same eojul. In other words, it combines two of the statistical measures discussed in the last section, the usage count and the t-test. The usage count is adopted because repeated appearances of the same eojul, provided that it indeed consists of the same morphemes, do not give any additional statistical information. The t-test is adopted because, unlike the other measures, it provides a quantitative indication of how confident we can be in the decision. For instance, using a significance level of 0.05 implies that we are 95% sure of the result of the decision. If we need more accuracy or broader coverage, we simply adjust the significance level, and therefore the threshold on Z0. Other measures, for instance the rank in terms of frequency, may indicate which suffix is more likely to be a grammatical morpheme, but not how likely it is. The complete method consists of five stages, listed below.
1. Obtain an initial list of candidates for grammatical morphemes using the t-test.
2. Exclude wrong grammatical morphemes (“deadheads”) accidentally included in the list from stage 1.
3. Generate a list of content morphemes using the list of grammatical morphemes.
4. Exclude more errors from the list of grammatical morphemes using the pattern of morpheme combinations. This is the end of the learning phase.
5. Perform stemming, using the knowledge learned in stages 1 to 4.
Each stage is described in more detail in the following subsections.
4.1 Stage 1: Obtain Initial List of Candidates for Grammatical Morphemes Using t-Test
In this stage, the t-test is performed on all suffixes, and candidate grammatical morphemes are selected based on their Z0 values. To avoid repeated analysis of the same suffixes and to compute efficiently, a backward trie is used. A backward trie is a trie in which the root branches according to the rightmost letter of a word, the next level branches according to the second letter from the right, and so on. Since identical sequences of letters lead to the same node in a trie, the same eojul is not inserted twice. Each Korean character represents a syllable and consists of three letters, representing the leading consonant, the vowel, and the optional trailing consonant. To perform the t-test, we need to know the prior probability of a given sequence of letters. The prior probabilities are estimated by maximum likelihood estimation as follows.

$$p_0(w) = \frac{\#(w, C')}{\mathrm{Pos}(w, C')}$$

Here, C' is the (imaginary) corpus derived from C by deleting all duplicate eojuls, and #(x, C) is the number of occurrences of string x in corpus C. Pos(w, C') is the number of positions that string w might fit in, computed by the following formula (see footnote 5):

$$\mathrm{Pos}(w, C') = \sum_{i = l_w}^{l_{\max}} (i - l_w + 1) \times \#_i(C')$$

where l_max is the number of syllables in the longest eojul, #_i(C') is the number of eojuls of length i (in syllables) in corpus C', and l_w = |w| / 3 (i.e., the number of syllables in w). A consonant can appear as a leading consonant or as a trailing consonant, and the two are treated as different letters. A candidate longer than one syllable and shorter than two syllables can appear only in eojuls of length greater than 1. The probability of a two-syllable candidate appearing in a three-syllable eojul is actually complicated, since appearing in the first and second syllables and appearing in the second and third syllables are not independent events: the former keeps the latter from happening. However, since the probability is much smaller than unity, we can pretend that they are independent and use an approximation. The suffix probability, that is, the probability that a string comes at the end of an eojul, is estimated by

$$p(w) = \frac{\#(w\$, C')}{\text{number of eojuls in } C' \text{ longer than } w}$$

where $ is the imaginary letter that indicates the end of an eojul. Again, this is a maximum likelihood estimate.
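To make the backward trie concrete, here is a minimal sketch under simplifying assumptions: the class name `BackwardTrie` and its methods are hypothetical, each character is treated as one letter, and the key "$" is reserved as the end-of-eojul marker (so it must not occur in the data).

```python
class BackwardTrie:
    """Trie keyed on letters read right to left, so shared suffixes share a path."""

    def __init__(self):
        self.root = {}
        self.size = 0  # number of distinct eojuls stored

    def insert(self, eojul):
        node = self.root
        for letter in reversed(eojul):
            node = node.setdefault(letter, {})
        if "$" not in node:      # "$" marks a complete eojul
            node["$"] = True
            self.size += 1       # a duplicate eojul reaches an existing marker

    def count_with_suffix(self, suffix):
        """Number of distinct stored eojuls that end with `suffix`."""
        node = self.root
        for letter in reversed(suffix):
            if letter not in node:
                return 0
            node = node[letter]
        return self._count(node)

    def _count(self, node):
        total = 0
        for key, child in node.items():
            total += 1 if key == "$" else self._count(child)
        return total

trie = BackwardTrie()
for word in ["catul", "dogul", "catul", "ulsan"]:
    trie.insert(word)
```

Because every suffix corresponds to a single path from the root, the count needed for the suffix probability is obtained by one walk down that path followed by a traversal of the subtree below it.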
5 We do not simply use the total corpus length because we want to take the eojul length distribution into account. For instance, the number of positions that a single-syllable candidate can occupy is equal to the total number of syllables in the corpus. However, a three-syllable candidate cannot possibly appear inside a two-syllable eojul.
Using the above definitions,

$$Z_0(w) = \frac{p(w) - p_0(w)}{\sqrt{p_0(w)\,(1 - p_0(w)) / N}}$$

is computed, and we reject or accept the null hypothesis with a confidence of 90%. Using a small corpus of 25,000 eojuls, 517 candidates were selected among 40,000 suffixes. The stage 1 algorithm is given in Algorithm 1.

L ← {}
for each suffix w in corpus C
    Z0(w) ← (p(w) − p0(w)) / sqrt(p0(w) (1 − p0(w)) / N)    // p and p0 as defined above
    if Z0(w) > T1, L ← L ∪ {w}
endfor
Algorithm 1. The first stage algorithm using t-test
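The selection loop of Algorithm 1 might look as follows in Python. This is a sketch under stated assumptions, not the authors' code: the function name `stage1_candidates` is hypothetical, characters stand in for Korean letters (so lengths are measured in characters rather than syllables), and only suffixes that leave a non-empty prefix are tested. The default threshold 1.3 is the value the paper reports using.

```python
import math
from collections import Counter

def stage1_candidates(eojuls, threshold=1.3):
    """Return {suffix: z} for suffixes whose z-score exceeds `threshold`."""
    corpus = set(eojuls)                       # C': duplicate eojuls removed
    n_trials = len(corpus)                     # N: number of distinct eojuls
    length_hist = Counter(len(w) for w in corpus)
    anywhere, at_end = Counter(), Counter()
    for w in corpus:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                anywhere[w[i:j]] += 1
        for i in range(1, len(w)):
            at_end[w[i:]] += 1
    selected = {}
    for s, end_hits in at_end.items():
        lw = len(s)
        # Pos(w, C'): number of places a string of this length could occupy.
        pos = sum((i - lw + 1) * c for i, c in length_hist.items() if i >= lw)
        p0 = anywhere[s] / pos
        longer = sum(c for i, c in length_hist.items() if i > lw)
        p = end_hits / longer
        if 0 < p0 < 1:
            z = (p - p0) / math.sqrt(p0 * (1 - p0) / n_trials)
            if z > threshold:
                selected[s] = z
    return selected

# Romanized toy data, purely illustrative.
result = stage1_candidates(["catul", "dogul", "cowul", "ulsan", "tulip"])
```

On a corpus this small many strings pass the test by accident, which is exactly the kind of noise the later stages are meant to remove.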
T1 is a threshold constant; we chose 1.3 for the experiment, which amounts to a significance level of 0.1. N is the number of Bernoulli trials, that is, the number of eojuls.
4.2 Stage 2: Exclusion of Wrong Candidates, or “Deadheads”
The candidates selected in stage 1 are those that have a higher probability of occurring at the end of eojuls than of occurring in general. However, some of them are chosen not because they are grammatical morphemes, but because their suffixes are. For instance, the copula “이다” [ix d aa] is a grammatical morpheme and therefore has a high suffix probability. This also causes longer strings to occur frequently at the end of eojuls just because they have “이다” as a substring. Moreover, the probability of such a longer string occurring in the middle of an eojul (i.e., p0) is extremely small, causing the Z0 value to be higher than we expect. On the other hand, even when a longer string and its own suffix are both candidates from stage 1, the occurrence of the former does not necessarily rely solely on the high probability of the latter. In other words, for a deadhead ending in the copula, the conditional probability of its leading syllable in front of “이다” is not significantly higher than the unconditional probability of that syllable, whereas for a genuine candidate the conditional probability of its leading part in front of its trailing part is significantly higher than the unconditional probability of the leading part. Let w and δ be strings in C'. In case Z0(w) > T1 and Z0(δw) > T1, let

$$p'(\delta, w) = \frac{\#(\delta w\$ \in C')}{\#(w\$ \in C')}, \qquad p_0'(\delta) = \frac{\#(\delta \in C')}{\mathrm{Pos}(\delta, C')}.$$

Using p' and p0', we can run a t-test to see whether δ's occurrence in front of w is merely a coincidence.
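The conditional test just defined can be sketched as below. The names `conditional_z` and `prune_deadheads` are hypothetical, the counts are assumed to have been gathered beforehand, and the number of trials is taken to be the number of eojuls ending in w, which is one reading of "the number of occurrences of w".

```python
import math

def conditional_z(n_dw_end, n_w_end, n_delta, pos_delta):
    """z-score of delta occurring in front of the suffix w."""
    p_cond = n_dw_end / n_w_end        # p'(delta, w)
    p_prior = n_delta / pos_delta      # p0'(delta)
    # Assumption: N_w is the number of eojuls that end in w.
    return (p_cond - p_prior) / math.sqrt(p_prior * (1 - p_prior) / n_w_end)

def prune_deadheads(candidates, split_counts, threshold=1.3):
    """Drop v = delta + w when delta's presence before w looks coincidental.

    candidates:   set of suffix strings (the stage 1 list L)
    split_counts: {(delta, w): (n_dw_end, n_w_end, n_delta, pos_delta)}
    """
    kept = set(candidates)
    for (delta, w), counts in split_counts.items():
        v = delta + w
        if v in kept and w in candidates and conditional_z(*counts) < threshold:
            kept.discard(v)
    return kept

# Invented counts for a romanized copula "ida", purely illustrative.
stage1 = {"ida", "kida", "nida"}
counts = {
    ("k", "ida"): (2, 100, 500, 10000),    # "k" before "ida" is rare
    ("n", "ida"): (60, 100, 500, 10000),   # "n" before "ida" is very common
}
remaining = prune_deadheads(stage1, counts)
```

Here "kida" is dropped because "k" is no more likely in front of "ida" than anywhere else, while "nida" survives.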
In the actual experiment, the Z0 value of one candidate was 1.4 in stage 1, while in the second stage it dropped to 0.68, indicating that its high frequency was merely by chance. On the other hand, the Z0 value of another candidate soared from 3.46 in the first stage to 20.0 in the second stage. By applying the stage 2 algorithm to the same corpus, 308 of the 517 candidates chosen in stage 1 were discarded and only 219 remained. Algorithm 2 gives the details of stage 2; N_w is the number of occurrences of w.
for each suffix v = δw ∈ L    // L is from the stage 1 algorithm
    Z0(δ, w) ← (p'(δ, w) − p0'(δ)) / sqrt(p0'(δ) (1 − p0'(δ)) / N_w)    // p' and p0' as defined above
    if Z0(δ, w) < T1, L ← L − {v}
endfor
Algorithm 2. Second stage algorithm for removing wrong candidates
4.3 Stage 3: Construction of Content Morpheme List
If we can identify the grammatical morphemes in an eojul, we can identify the content morpheme; if we can identify the content morpheme, we can identify the grammatical morphemes. Even though the list is not perfect, we are now able to find many grammatical morphemes, and we can use them to determine which strings are content morphemes. However, it is difficult to separate the right content morpheme using only the list of grammatical morphemes. First, there can be two or more suffixes in a given eojul that are valid grammatical morphemes. For instance, in “학교에는” [h aa k yo eh n uh n], “에는” is the grammatical morpheme, but “는” is also a valid grammatical morpheme. Second, there can be errors in the list of grammatical morphemes: the list may include wrong morphemes, and the right ones might not be in the list. Third, a suffix of an eojul being the same string as a valid grammatical morpheme does not necessarily mean that it is used as a grammatical morpheme. For instance, the same final string is a grammatical morpheme in the word for “far” but not in the word for “coffin”. To resolve these problems, we first select strings that can possibly be content morphemes. Here we take advantage of the observation that content morphemes and grammatical morphemes concatenate to form eojuls, and take as a criterion the number of eojuls that can be formed by concatenating a valid grammatical morpheme to the candidate, that is, the number |{δ | wδ ∈ C', δ ∈ L}|. It is tempting to use another t-test on that number. Unfortunately, the numbers are in general too low to analyze statistically. Therefore, we decided to select only the candidates that are coupled with more than a certain number of grammatical morphemes. We chose T2 = 3 as the threshold in the experiment.
M ← {}, ∀x, count(x) = 0
for all eojuls w = w1...wk
    for i = 2 to k − 1
        if wi...wk ∈ L then count(w1...wi−1)++
    endfor
endfor
for all strings w
    if count(w) > T2 then M ← M ∪ {w}
endfor
Algorithm 3. Generation algorithm for content morpheme candidates
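A compact Python rendering of this stage is sketched below. The function name `content_candidates` is hypothetical, and two details are assumptions rather than statements of the authors' method: every split with a non-empty prefix and suffix is tried, and the count is kept as a set of distinct grammatical morphemes per prefix, following the criterion |{δ | wδ ∈ C', δ ∈ L}|.

```python
from collections import defaultdict

def content_candidates(eojuls, grammatical, t2=3):
    """Prefixes that combine with more than `t2` distinct grammatical morphemes."""
    partners = defaultdict(set)   # prefix -> grammatical suffixes seen after it
    for word in set(eojuls):      # C': duplicate eojuls removed
        for i in range(1, len(word)):
            if word[i:] in grammatical:
                partners[word[:i]].add(word[i:])
    return {prefix for prefix, tails in partners.items() if len(tails) > t2}

# Romanized toy data, purely illustrative.
gm = {"ul", "nun", "i", "ege"}
words = ["catul", "catnun", "cati", "catege", "dogul", "dognun"]
strict = content_candidates(words, gm)          # default threshold T2 = 3
loose = content_candidates(words, gm, t2=1)
```

With the default threshold only "cat", seen with four different grammatical morphemes, qualifies; lowering the threshold admits "dog" as well.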
4.4 Stage 4: Exclusion of Grammatical Morpheme Candidates That Are Coupled to Non-content Morphemes
In the previous subsection, we considered a string to be a content morpheme if it is coupled with many grammatical morphemes to form eojuls. Similarly, we can decide that a certain suffix is not actually a grammatical morpheme if it is not coupled with many content morphemes. For instance, in eojuls like “그토록” [g uh t ao r ao k], the syllable “록” [r ao k] has a high Z0 value and was initially considered a good candidate for a grammatical morpheme. However, even though “록” is coupled with many different prefixes, none of the prefixes are content morphemes. We can therefore conclude that it is not a grammatical morpheme, but simply a string that appears at the end of other morphemes. The procedure is described in Algorithm 4. We used the threshold value 4 for T3.

for each w ∈ L
    for each v ∈ M
        if v · w ∈ C then count(w)++    // C is the corpus
    endfor
endfor
for each w ∈ L
    if count(w) < T3, then L ← L − {w}
endfor
Algorithm 4. Reconsideration of the grammatical morpheme list according to the number of couplings with content morphemes
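Algorithm 4 translates almost line for line into Python. The sketch below uses a hypothetical function name, `prune_by_coupling`, and romanized toy data; it keeps a suffix only if at least T3 content morpheme candidates combine with it to form an eojul that is present in the corpus.

```python
def prune_by_coupling(grammatical, content, eojuls, t3=4):
    """Keep suffixes that couple with at least `t3` content morphemes."""
    corpus = set(eojuls)
    kept = set()
    for w in grammatical:
        # Count content morphemes v such that the concatenation v + w is an eojul.
        couplings = sum(1 for v in content if v + w in corpus)
        if couplings >= t3:
            kept.add(w)
    return kept

# "rok" follows several prefixes here, but none of them is a content morpheme.
content_m = {"cat", "dog", "cow", "pig"}
suffix_l = {"ul", "rok"}
corpus_words = ["catul", "dogul", "cowul", "pigul", "xrok", "yrok"]
pruned = prune_by_coupling(suffix_l, content_m, corpus_words)
```

The nested loop costs |L| × |M| set lookups, which is acceptable for lists of a few hundred entries such as those reported in the experiment.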
4.5 Stage 5: Stemming Phase
To find the right content morpheme in an eojul, the best candidate is determined by the product of the probability that the prefix is a content morpheme and the probability that the suffix is a grammatical morpheme. The probability that a suffix of an eojul is a grammatical morpheme is determined as follows. Let the string w appear with probability p in general and with probability p_s at the end of eojuls. Further, let a be the number of occurrences of w at the end of eojuls, b the number of its occurrences elsewhere, and a' the number of times w is actually used as a grammatical morpheme. A is the number of eojuls, and B is the number of other positions where w could appear. Then p = (a + b) / (A + B) and p_s = a / A. Therefore, if w is at the end of an eojul, the probability of it being a grammatical morpheme is, by maximum likelihood estimation,

$$p_{GM} = \frac{a'}{a} = \frac{a - \frac{A}{B} b}{a} = 1 - \frac{A b}{B a}.$$

For instance, suppose we have 1,000,000 eojuls, each consisting of 3 syllables. Suppose a given syllable occurs 75,000 times in total, 70,000 of them at the end of eojuls. Then

$$p_{GM} = \frac{a'}{a} = 1 - \frac{1000000 \times 5000}{2000000 \times 70000} = 1 - 0.0357 = 0.9643.$$

An intuitive interpretation is this: the syllable also occurs in non-final positions, where it is not a grammatical morpheme, and the same rate of non-grammatical use can be expected for the final position. Therefore about 2,500 (5,000 / 2) of the 70,000 final occurrences were probably not used as a grammatical morpheme, and P = (70,000 − 2,500) / 70,000 = 0.9643. Of course, this applies only to suffixes that are believed to be grammatical morphemes. Other candidates are given a minimal probability in order to avoid zero probability. The content morpheme probability is quantized to only three values, depending on which group the candidate belongs to: the candidates included in the set M, the candidates that have at least one instance of coupling with a grammatical morpheme, and the rest. The reason for the quantization is that in general we do not have enough data for the statistics to be significant. We also have many low-frequency content morphemes, as can be expected from Zipf's law [13]. The three cases were assigned the probabilities 0.9, 0.1, and 1/|C|. If we assume that the selections of the content morpheme and the grammatical morpheme are independent events, stemming becomes the task of deciding the position m that satisfies the following formula:

$$\begin{aligned}
m &= \arg\max_k P(\mathit{head} = w_1..w_k,\ \mathit{tail} = w_{k+1}..w_n \mid w = w_1..w_n) \\
&= \arg\max_k \frac{P(w = w_1..w_n \mid \mathit{head} = w_1..w_k,\ \mathit{tail} = w_{k+1}..w_n)\, P(\mathit{head} = w_1..w_k,\ \mathit{tail} = w_{k+1}..w_n)}{P(w = w_1..w_n)} \\
&= \arg\max_k P(\mathit{head} = w_1..w_k,\ \mathit{tail} = w_{k+1}..w_n) \\
&\cong \arg\max_k P(\mathit{head} = w_1..w_k)\, P(\mathit{tail} = w_{k+1}..w_n)
\end{aligned}$$
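The decision rule reduces to scoring every split point by the product of two table lookups. The sketch below is illustrative only: `p_gm` and `stem` are hypothetical names, the probability tables are assumed to be precomputed dictionaries, and `floor` stands in for the "minimal probability" given to unknown strings.

```python
def p_gm(a, b, A, B):
    """1 - (A*b)/(B*a): share of final occurrences used as a grammatical morpheme.

    a: occurrences at the end of eojuls   b: occurrences elsewhere
    A: number of eojuls                   B: number of non-final positions
    """
    return 1 - (A * b) / (B * a)

def stem(eojul, content_prob, tail_prob, floor=1e-6):
    """Split `eojul` at the position maximizing P(head) * P(tail)."""
    best_k, best_score = len(eojul), 0.0
    for k in range(1, len(eojul)):
        head, tail = eojul[:k], eojul[k:]
        score = content_prob.get(head, floor) * tail_prob.get(tail, floor)
        if score > best_score:
            best_k, best_score = k, score
    return eojul[:best_k], eojul[best_k:]

# The worked example from the text: 70,000 final and 5,000 non-final occurrences.
example = p_gm(a=70_000, b=5_000, A=1_000_000, B=2_000_000)

# Romanized toy tables, purely illustrative.
head, tail = stem("catul", {"cat": 0.9, "ca": 0.1}, {"ul": 0.96, "tul": 0.5})
```

An eojul with no usable split (for example a single-character string) is returned whole with an empty tail.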
5 Results and Future Work
Since the aim of the research is to learn morphology where a large corpus is unavailable, a very small corpus of 25,000 eojuls was used for learning. The text consists of various articles from the Internet edition of the Dong-A Ilbo newspaper during the fourth week of January 2001. Using the stage 1 list of grammatical morphemes, the success rate (i.e., the correct stemming rate) was 74%. Using the list enhanced by the stage 2 algorithm, the success rate went up to 81.5%. After stage 4, the success rate was 85%. Success was measured on 1,000 randomly selected eojuls. The result is not as good as lexicon-based stemming, but it is usable as a stemming algorithm for an Internet search engine. It is more encouraging for very special domains where a complete dictionary is not available. Currently we treat a composite grammatical morpheme as a single morpheme; the algorithm would improve if we could find the individual morphemes within composite morphemes. Identifying composite content morphemes could also improve the stemmer's performance. Both, however, deserve separate research.
References
[1] Lovins, J.B., “Development of a stemming algorithm,” Mechanical Translation and Computational Linguistics, 11, 1968.
[2] Patrick Schone and Daniel Jurafsky, “Knowledge-free Induction of Morphology using Latent Semantic Analysis,” in Proceedings of the ACL99 Workshop: Unsupervised Learning in Natural Language Processing, University of Maryland.
[3] J. Goldsmith, “Unsupervised learning of the morphology of a natural language,” University of Chicago, http://humanities.uchicago.edu/faculty/goldsmith.
[4] Lluis Marquez, Lluis Padro, and Horacio Rodriguez, “A Machine Learning Approach to POS tagging,” Machine Learning, vol. 39, pp. 59-91, 2000.
[5] M.F. Porter, “An algorithm for suffix stripping,” Program, 14(3), pp. 130-137, 1980.
[6] E. Gaussier, “Unsupervised learning of derivational morphology from inflectional lexicons,” in Proceedings of the ACL99 Workshop: Unsupervised Learning in Natural Language Processing, University of Maryland.
[7] H. Dejean, “Morphemes as necessary concepts for structures: Discovery from untagged corpora,” University of Caen-Basse Normandie, 1998. http://www.info.unicaen.fr/DeJean/travail/article/pg11.htm
[8] SangHyun Shin, KunBae Lee, and JongHyuk Lee, “Two-stage Korean tagger based on Statistics and Rules,” Journal of Korean Information Science Society (B), vol. 24-2, pp. 160-169, 1997.2.
[9] D. Jurafsky and J.H. Martin, Speech and Language Processing, Prentice Hall, NJ, 2000.
[10] YunJin Nam and ChulYung Ok, “Construction of dictionary of noun-derived suffixes based on Corpus analysis,” Journal of Korean Information Science Society (B), vol. 23-4, pp. 389-401, 1996.4.
[11] SeungSik Kang, “Morphological Analysis of Korean Irregular verbs and adjectives using syllable characteristics,” Journal of Korean Information Science Society (B), vol. 22-10, pp. 1480-1487, 1995.10.
[12] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[13] G.K. Zipf, Human Behavior and the Principle of Least Effort, Cambridge, MA: Addison-Wesley, 1949.
Pattern Acquisition for Chinese Named Entity Recognition: A Supervised Learning Approach
Xiaoshan Fang¹ and Huanye Sheng²
¹ Computer Science & Engineering Department, Shanghai Jiao Tong University, Shanghai 200030, China
[email protected]
² Shanghai Jiao Tong University, Shanghai 200030, China
[email protected]
Abstract. This paper presents a supervised learning method for the pattern acquisition for handcrafted rule-based Chinese named entity recognition systems. We automatically extracted low frequency patterns based on the predefined high-frequency patterns and manually validated the new patterns and outputs of terms. The experiments show that the number of person names extracted from the Chinese Treebank increased by 14.3% after the use of the new patterns.
1 Introduction
Named entity (NE) recognition is one of the main tasks of the Message Understanding Conferences 6 and 7 (MUC-6 and MUC-7). In general, the named entity recognition task involves the recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions. In this paper we use Chinese person names as the object of our named entity recognition experiments. Unlike names in European languages, Chinese person names carry no upper/lower case distinction, which makes them relatively difficult to recognize. In addition, there are no spaces between Chinese words, so it is not evident from the orthography where the word boundaries are. The performance of Chinese extraction systems is therefore partly affected by their word segmentation module. Handcrafted rule-based systems for NE tasks attain high levels of performance, but constructing the rules is time-consuming. Frequently occurring patterns are easy to find, whereas low-frequency patterns are difficult to discover. Our work uses a machine-learning approach to automatically learn a set of low-occurrence patterns for external evidence of Chinese person names from a corpus, starting from predefined high-occurrence patterns. The combination of a machine learning approach and a handcrafted approach improves the system's efficiency. The experiments showed that the number of person names extracted from the Chinese Treebank increased by 14.3% after the use of the new patterns. Little work has been done in the field of automatic pattern acquisition for Chinese NE tasks. Hearst (1992, 1998) reports a method using lexico-syntactic patterns to extract lexical relations between words from English texts. Morin (1999) intended to bridge the gap between term acquisition and thesaurus construction by offering a framework for organizing multi-word candidate terms with the help of automatically
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 166–175, 2002. © Springer-Verlag Berlin Heidelberg 2002
acquired links between single-word terms. Landau introduced supervised and unsupervised methods for extracting semantic relations between words. Section 2 of this paper presents the method of pattern acquisition from a corpus. Section 3 then deals with the results of the experiments and an evaluation of the pattern frequencies. Finally, we briefly describe our future work.
2 Pattern Extraction Algorithm
Inspired by Hearst (1992, 1998), our procedure for discovering new patterns through corpus exploration is composed of the following eight steps:
1. Collect the context relations for person names, for instance person name and verb, title and person name, or person name and adjective.
2. For each context relation, use the high-occurrence pattern to collect a list of terms. For instance, for the relation title and person name, with the pattern NN+NR, we extract title terms such as the words for “reporter” and “team player”. Here NN+NR is a lexico-syntactic pattern found by a rule writer. NN and NR are POS tags in the corpus: NR is a proper noun, and NN includes all nouns except proper nouns and temporal nouns.
3. Validate the terms manually.
4. For each term, retrieve the sentences that contain this term. Transform these sentences into lexico-syntactic expressions.
5. Generalize the lexico-syntactic expressions extracted in the last step by clustering similar patterns.
6. Validate the candidate lexico-syntactic expressions.
7. Use the new patterns to extract more person names.
8. Validate the person names and go to step 3.
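Steps 2 and 4 of the procedure above are mechanical and can be sketched in a few lines; the manual validation steps are not modeled. Everything here is an illustrative assumption: the function names, the representation of a sentence as a list of (word, tag) pairs, the English glosses used in place of Chinese words, and the VV and CC tags in the toy data.

```python
def match_pattern(tagged, pattern):
    """Yield start indices where the POS tags of `tagged` match `pattern`."""
    tags = [tag for _, tag in tagged]
    for i in range(len(tags) - len(pattern) + 1):
        if tags[i:i + len(pattern)] == pattern:
            yield i

def title_terms(sentences, pattern=("NN", "NR")):
    """Step 2: collect the NN word of every NN NR match as a candidate title."""
    terms = set()
    for tagged in sentences:
        for i in match_pattern(tagged, list(pattern)):
            terms.add(tagged[i][0])
    return terms

def expressions_for(term, sentences):
    """Step 4: tag sequences of the sentences containing `term`, term kept as a word."""
    out = []
    for tagged in sentences:
        if any(word == term for word, _ in tagged):
            out.append(tuple(word if word == term else tag for word, tag in tagged))
    return out

corpus = [
    [("reporter", "NN"), ("HuangChangrui", "NR"), ("reports", "VV")],
    [("player", "NN"), ("LanWei", "NR"), ("and", "CC"), ("ChenHao", "NR"), ("win", "VV")],
]
terms = title_terms(corpus)
exprs = expressions_for("player", corpus)
```

The expressions returned for a validated term are the input to the clustering of step 5.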
Based on this method, we learned five new patterns for the relation title and person name from twenty-five texts in the Chinese Penn Treebank. Using all six patterns, we extracted 120 person names from these texts; 15 of them are new. A detailed description is given in Section 3, Experiments. The new person names can also be used for person name thesaurus construction.
2.1 Automatic Classification of Lexico-Syntactic Patterns
Hearst introduced an algorithm that automatically acquires lexico-syntactic patterns by classifying similar patterns. We used it in the fifth step of the above procedure. As described in reference [3], lexico-syntactic expressions 1 and 2 are acquired from the relation HYPERNYM:
1. NR and in NR such as LIST
2. NR such as LIST continue to plague NR
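They can be abstracted as (see footnote 1):

$$A = A_1 A_2 \cdots A_j \cdots A_k \cdots A_n \quad \text{with } \mathrm{RELATION}(A_j, A_k),\ k > j + 1 \tag{1}$$

and

$$B = B_1 B_2 \cdots B_{j'} \cdots B_{k'} \cdots B_{n'} \quad \text{with } \mathrm{RELATION}(B_{j'}, B_{k'}),\ k' > j' + 1 \tag{2}$$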
The following function measures the similarity of lexico-syntactic expressions. It relies on Hypothesis 2.1, stated after Fig. 1.

Win1(A)   Win2(A)   Win3(A)
A = A1 A2 … Aj … Ak … An
B = B1 B2 … Bj' … Bk' … Bn'
Win1(B)   Win2(B)   Win3(B)
Fig. 1. Comparison of two lexico-syntactic expressions
Hypothesis 2.1 (Syntactic isomorphy). If two lexico-syntactic expressions A and B indicate the same pattern, then the items Aj and Bj', and the items Ak and Bk', have the same syntactic function.
Let Win1(A) be the window built from the first through the (j−1)th words, Win2(A) the window built from the (j+1)th through the (k−1)th words, and Win3(A) the window built from the (k+1)th through the nth words (see Figure 2). The similarity between windows, Sim(Wini(A), Wini(B)), is defined experimentally as a function of the longest common string. All lexico-syntactic expressions are compared pairwise with this similarity measure, and similar lexico-syntactic expressions are clustered. Each cluster is associated with a candidate lexico-syntactic pattern. For instance, the expressions introduced earlier generate the unique candidate lexico-syntactic pattern:
3. NR and in NR such as LIST
1 Ai is the ith item of the lexico-syntactic expression A, and n is the number of items in A. An item can be a lemma, a punctuation mark, a symbol, or a tag (NR, LIST, etc.). The relation k > j+1 states that there is at least one item between Aj and Ak.
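$$\mathrm{Sim}(A, B) = \sum_{i=1}^{3} \mathrm{Sim}(\mathrm{Win}_i(A), \mathrm{Win}_i(B))$$

with

$$\mathrm{Win}_1(A) = A_1 A_2 \cdots A_{j-1}, \quad \mathrm{Win}_2(A) = A_{j+1} \cdots A_{k-1}, \quad \mathrm{Win}_3(A) = A_{k+1} \cdots A_n$$

and

$$\mathrm{Win}_1(B) = B_1 B_2 \cdots B_{j'-1}, \quad \mathrm{Win}_2(B) = B_{j'+1} \cdots B_{k'-1}, \quad \mathrm{Win}_3(B) = B_{k'+1} \cdots B_{n'}$$

Fig. 2. The similarity function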
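The windowed similarity of Fig. 2 can be illustrated as follows. The text only says that the window similarity is "a function of the longest common string", so this sketch makes an explicit assumption: it uses the length of the longest contiguous run of items shared by the two windows. The function names and the 0-based indexing are also choices made here for illustration.

```python
def longest_common_run(xs, ys):
    """Length of the longest contiguous run of items shared by xs and ys."""
    best = 0
    prev = [0] * (len(ys) + 1)
    for x in xs:
        cur = [0] * (len(ys) + 1)
        for j, y in enumerate(ys, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1   # extend the run ending at the previous items
                best = max(best, cur[j])
        prev = cur
    return best

def windows(expr, j, k):
    """Split expr around the related items at 0-based positions j < k."""
    return expr[:j], expr[j + 1:k], expr[k + 1:]

def similarity(a, ja, ka, b, jb, kb):
    """Sum of the window-wise similarities of two lexico-syntactic expressions."""
    return sum(longest_common_run(wa, wb)
               for wa, wb in zip(windows(a, ja, ka), windows(b, jb, kb)))

# The two HYPERNYM expressions from the text; the relation links an NR and LIST.
expr_a = "NR and in NR such as LIST".split()
expr_b = "NR such as LIST continue to plague NR".split()
score = similarity(expr_a, 3, 6, expr_b, 0, 3)
```

The two expressions share only the middle window "such as", so the score is driven entirely by Win2; expressions whose score exceeds some chosen threshold would be placed in the same cluster.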
3 Experiments
3.1 Resource
We use the Chinese Penn Treebank, published by the Linguistic Data Consortium (LDC), as the training corpus. Some data on the corpus follow:
Size: about 100K words, 4185 sentences, 325 data files
Source: 325 articles from Xinhua newswire between 1994 and 1998
Coding: GB code
Format: same as the English Penn Treebank, except that the original file information such as "DOCNO" and "DATE" is kept in the data file
Annotation: all files are annotated at least twice; one annotator does the first pass, and the resulting files are checked by a second annotator (second pass)
3.2 Relations
We consider five relations:
1. Title and person name, e.g. (reporter) and (Huang Chang-rui) (from Chinese Treebank, text 325).
Titles include words that represent a job title, an occupation, or a relative appellation such as (husband) or (wife). Some of them can only be used before person names, for example (spokesman), (soldier), (athlete), and (worker). Some are only used after person names, like (Monsignor). The others can be used on either side of a person name; among them, words like (sir) and (lady) have different meanings depending on which side of the person name they appear.
2. Person name and verb, e.g. (Shou Ye) and (emphasize) (from Chinese Treebank, text 318).
Person names are often followed by verbs. These verbs are treated as the right boundary of PNs in the NE recognition task. There are one-character verbs and two-character verbs; some of them are listed below.
One-character verbs: (say), (speak), (talk), (call), (be), (gain), (arrive), (lead on), (together with), …
Two-character verbs: (report), (meet), (introduce), (invite), (on invitation), (contact), (attend), (appear), (visit), (replace), (accept), (come to hand), (take charge), (authorize), (agree), (declare), (accede), (head for), (look in), (greet), (demission), (refuse), (pass away), …
3. Person name and adjective, e.g. (American) and (an American person name) (from Chinese Treebank, text 314), or (seventy years of age) and (Zhang Xue-liang).
Some special adjectives used before PNs are clues for NE recognition, for example (Marshal Ye Jian-yin, who is seventy years of age).
4. Person name and conjunction, e.g. (Fu Ming-xia), (and), (Chi Bing) (from Chinese Treebank, text 325).
The conjunctions found in the Chinese Treebank, such as the mark `, connect PNs in the text. Instances occur in text 325 and in text 324, for example (Chinese player Lan Wei and Chen Hao obtain the).
5. Location names and organization names used before PNs, as in (Tai Yuan Steel Company) (Li Shuang-liang) or (Chinese women's volleyball team) (Lang Ping), are also useful clues for person name recognition. Here (Tai Yuan) and (China) are location names that form part of an organization name, while (Tai Yuan Steel Company) and (Chinese women's volleyball team) are organization names. We will consider these clues in future work.
These relations describe useful context information for named entity recognition. In this paper we only consider extracting lexico-syntactic patterns for the relation title and person name.
3.3 Patterns and Pattern Classification
Table 1 shows the patterns extracted for the relation title and person name.
Table 1. A predefined pattern and the extracted patterns
No.   Pattern        Title term
(1)   NN NR          player
(2)   NN NR•NR       player
(3)   NN NR NR       player
(4)   NN NR ` NR     player
(5)   NN NR NR       reporter
Examples (English glosses): Richard; Lan Wei and Chen Hao; is Li Da-shuang and Huang Hua-dong; Micheal Dumond; Huang Chang-rui Yang Ai-guo
In Table 1, the first pattern, NN NR, is a high-occurrence pattern: it occurs 105 times in texts 301 to 325. An ordinary rule writer, not necessarily a skilled one, can easily construct it. According to the Chinese Penn Treebank, an NR is the name of a particular person, of a politically or geographically defined location (cities, countries, rivers, mountains, etc.), or of an organization (corporate, governmental, or other organizational entity). NT is the tag for temporal nouns, which can be the object of prepositions such as (at), (since), (until), or (until). All other nouns are tagged NN. We tried the procedure using just one term of the type “title used before person names” at a time. In the first case we tried a single title term, and with it we found the new patterns (2) to (4). We then tried pattern (1) and retrieved a list of new terms. With the new term (reporter), a new pattern, pattern (5), was acquired. We generalized the new patterns manually. For example, sentences (6) and (7) can be transformed into the lexico-syntactic patterns (8) and (9).
Table 2. Sample sentences and lexico-syntactic patterns
Comparing these two patterns according to Hypothesis 2.1, a candidate pattern NN NR NR is generated. After being validated, it was used to extract new person names.
3.4 Person Names
The following is a list of person names extracted from the Chinese Treebank (texts 324 and 325) with patterns (1) to (6). Entries marked with ? indicate terms found with new patterns.
Table 3. Person names extracted from the Chinese Treebank. Entries with ? indicate terms found with pattern (2), entries with ?? terms found with pattern (4), and entries with ??? terms found with pattern (5)
Text   Person names
324    (Lan Wei), (Yang Ai-guo), (Lan Wei), (Cheng Hao)?, (Xu Yi ming), (Fu Ming-xia), (Chi Bing), (Huang Chan-rui)???, (Chen Huo)??
325    (Fu Ming-xia), (Chi Bing)?, (Huang Chang-rui), (Fu Ming-xia)???, (Chi Bing)?, (Lan Wei), (Chen Hao)?, (Yang Ai-guo)???
3.5 Error Analysis
A few of the extracted words are not person names. There are two kinds of mistakes. First, the first word of a subtitle is the name of the news agency, which is tagged NR. When the last word of the title is tagged NN, the news agency is recognized as a PN. For example, the title of text 324 is (The world swimming competition: Lan Wei and Chen Hao enter the second competition), and the subtitle is (Xinhua news agency, Rome, September 1st, Yang Ai-guo, Huang Chang-rui). Here (Xinhua news agency) is extracted as a PN. This kind of mistake is easy to correct. Second, a few location names occur in the same lexico-syntactic patterns as person names. For example, text 317 contains the sentence (It includes Bangkok, Thailand, which is the host of the next Yuan Nan athletic meeting). Here (Thailand) and (Bangkok) are extracted as PNs. We used a dynamic location name dictionary to remove them.
3.6 Statistical Results
Figure 3 compares the number of person names extracted by pattern 1 with the number extracted by all four patterns.
Fig. 3. Person names extracted by the originally observed pattern and with the new patterns (x-axis: texts 301–325; series: NN NR on texts 301–325, with new patterns)
In texts 301 to 325 there are 105 sentences in total that contain these patterns, and 120 person names in total. The pattern NN NR is used 105 times. The pattern NN NR NR occurs four times, pattern NN NR NR 7 times, and pattern NNNR‘NR 4 times. The frequency of each pattern is shown in the figure above. By using the new patterns, the number of person names extracted from the Chinese Treebank increased by about 14.3%, computed as

\[
\text{Increase of PNs extracted with new patterns} = \frac{\text{number of PNs extracted with new patterns}}{\text{number of PNs extracted with NN NR}} \times 100\% \tag{3}
\]
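As a check of the arithmetic behind formula (3), the following short sketch recomputes the increase from the pattern counts reported above (105 for NN NR and 4, 7, and 4 for the three new patterns). The function name `increase_percent` is only an illustrative choice.

```python
def increase_percent(new_pattern_count: int, base_pattern_count: int) -> float:
    """Formula (3): PNs found with new patterns relative to PNs found with NN NR, in percent."""
    return new_pattern_count / base_pattern_count * 100


base = 105              # person names extracted with NN NR
new = 4 + 7 + 4         # person names extracted with the three new patterns
print(f"{increase_percent(new, base):.1f}%")  # 15 / 105 gives about 14.3%
```

The counts also add up to the stated total of 120 person names (105 + 15).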
It is interesting that the frequencies of the new patterns are low. Such patterns are easily overlooked when a rule writer looks through a large volume of text to find patterns. Figure 4 is a pattern–frequency curve for these four patterns. With all the patterns we extracted for the four relations, we get a smooth pattern–frequency curve for Chinese person name recognition. The extracted person names and title terms will be used for thesaurus construction in our future research.
Fig. 4. Pattern–frequency chart for the example patterns (series: NNNR-301-325, NNNRNR, NNNR‘NR, NNNR NR; vertical axis: share of the grand count, 0% to 100%)
4 Conclusion and Future Work
A supervised approach can be used for Chinese information extraction tasks. We present a bootstrapping algorithm that extracts patterns and generates semantic lexicons simultaneously. Accurate recognition and categorization of names in unrestricted text is a difficult task. Certain types of NEs, such as person names, not only name a person but can also serve as a source of many other kinds of names. We will try statistical methods, such as Hidden Markov Models, to recognize person names by analysing the probability that a word is a person name in a given context. The Chinese Treebank is not as large as we would wish. Since annotated Chinese corpora are very rare, our next work includes looking for a larger corpus. We plan to use an unsupervised training approach or to combine both methods.
Acknowledgements. Our work is supported by the project COLLATE at DFKI (German Research Center for Artificial Intelligence) and the Computational Linguistic Department, and by the project “Research on Information Extraction and Template Generation based Multilingual Information Retrieval”, funded by the National Natural Science Foundation of China. We would like to thank Prof. Uszkoreit, Ms. Fiyu Xu, and Mr. Tianfang Yao for their comments on our work.
The Information System for Creating and Maintaining the Electronic Archive of Documents
Alexander Marchuk, Andrey Nemov, Konstantin Fedorov, and Sergey Antyoufeev
A.P. Ershov Institute of Informatics Systems, Pr. Akad. Lavrentjev, 6, Novosibirsk, 630090, Russia
[email protected], {duh, kvf, serga}@xtech.ru
http://www.iis.nsk.su
Abstract. The problems of design and implementation of an electronic archive of documents are considered in the paper. The project is intended to create a Web-version of an archive consisting of different paper documents. Functionals implemented in the system supply operators-archivists with easy-to-use tools to insert and update information, and provide convenient and transparent interfaces allowing end-users to visualise and search for information stored in the archive. The system consists of separate modules; most of them are implemented in the three-level client-server architecture. The system has been successfully applied to the creation of the Web-version of the unique collection of documents that remained after the premature death of academician A.P. Ershov, one of the leading Russian computer scientists. This electronic archive is now available at http://ershov.ras.ru.
1 Introduction
It is now obvious that traditional "paper" methods of archive processing and maintenance are not convenient, and the task of creating an "electronic" version of the "paper" archive, together with a software system supporting this process, seems quite natural and even necessary. A thorough analysis of traditional approaches and of the problems facing archivists allowed us to formulate the main requirements of an electronic archive:
− the archive should be permanently replenished;
− not only documents but also annotations of various kinds (which are not archive documents) can be stored in the archive;
− statistical treatment of documents should be allowed;
− if the archive is not closed and is of great cultural and historical value, it should be as accessible as possible; this includes remote access and, possibly, translation of the documents into foreign languages.
An attempt to partially solve these problems has been made in the design and implementation of an electronic archive containing various documents from the archive of academician A.P. Ershov. This unique collection of letters, manuscripts, official documents, pictures, etc. provides valuable material covering the history of computing in Russia and has attracted world-wide attention.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 176–185, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Statement of the Problem
An electronic archive of documents should allow us to reduce the wear of original documents stored in an archive, to give a remote user the possibility to “visit” the archive, and to reduce the labour involved in processing a great bulk of documents (for example, context search). Since our problem is to make an electronic version of an archive that is already composed, we need, first of all, to elaborate the techniques for transforming the paper documents into an electronic form or, more precisely, into several different representations. An easy-to-use system of document and information storage should also be designed and implemented.
2.1 General Structure of the System
The information system should consist of the following main components:
− a database, along with information not stored directly in the database;
− a set of interface functionals for an archive operator that supports fast and convenient insertion, correction, deletion and viewing of information in the database;
− a set of utilities that helps an archive operator to work with a considerable volume of external information, whose processing by means of the functionals mentioned above would be difficult, inefficient, or even impossible;
− a set of interface functionals for end-users who want to look through different hierarchies of documents, to perform various kinds of search for documents and related information, and to derive various statistical data;
− a replication utility for autonomous synchronisation of the content of different copies of the information system located far from each other.
The general structure of the system is presented in Fig. 1.
Fig. 1. The general structure of the electronic archive of documents (components shown: operator, operator’s interface, set of utilities, database, replication utility, mirrored database, end-user interface, end-user)
2.2 The Basic Requirements
This section presents the basic requirements imposed on the system. The system should
− provide a user with the possibility to look through different electronic representations of a document;
− have a thoroughly developed, clear and convenient operator interface for updating the information stored in the database;
− have a set of utilities intended to automate and simplify labour- and resource-consuming work; they should be much more convenient than standard ones, e.g., the utility for complex scanning of a document followed by saving a file under a specially assigned name, and the utility for storing information about a large number of graphical files in the database;
− have tools that mirror the information part of the system onto remote mirrors, which allows us to reduce the response time of the system to requests of different users;
− have autonomous tools for making backup copies of the critical or frequently changed parts of the system (this mainly concerns the information content of the database).
The system should be scalable, i.e., it should support storage of a large (and constantly growing) number of archive documents and related information. The system should be multi-user, since there can be many archive operators working with information simultaneously, as well as many visitors working with different parts of the archive at the same time. The system should work in real-time mode, i.e., additions and corrections made by an operator should be immediately accessible to other operators, as well as to the visitors of the archive. The system should support different structurings of the documents and provide a remote user with access to different (textual, graphic, hypertext, and annotated) representations of a document and to related information (organisations, people, cities, countries, etc.).
To do this, the "paper" structure of the archive (the enumerated documents placed in the enumerated folders) should be transformed into a more convenient, intuitively clear and accessible structure, and the related objects of prime importance should be identified in the documents together with proper links between them. Taking into account the fact that any document may contain text written in several different languages, and that the system can store not only the original text but also its various translations, we have decided that the system should be multilingual. The system should work as autonomously as possible, so that manual intervention in its operation takes place only when it is really indispensable; i.e., all the processes that can be automated should be automated.
3 Main Design Decisions
Once the scope of problems to be solved had been determined, at the beginning of creating the system we faced the question of which combination of the available technologies and programming languages is best for writing each of its components, and which platforms, environments and software products should be chosen. Another important point was the choice of the database system. The decision was made on the basis of several criteria, including the "price/quality" ratio and availability (for the platforms and RDBMS) and convenience of use (for the programming languages).
3.1 The Choice of Techniques
We have decided to implement some components, such as the "operator interfaces" and the "end-user interface", in the three-level client-server architecture and make them Web-oriented. This choice allows the components to work more effectively, e.g., to support simultaneous multi-user access and to keep the system accessible almost everywhere without any special environment; generally only a Web-browser is needed. Utilities should be more closely bound to the workstation but, when required, also be connected with the archive database. One of the safest and simplest solutions is to implement them as system applications. Since this implementation is made in a system programming language, which is rather complicated, the utilities should perform only those functions that essentially simplify an operator’s work and cannot be implemented in a different way. It is worth noting that we had to elaborate not only the set of components of the system, but also a physical model of the database that would completely represent all entities stored in the archive, its objects and the links between them.
3.2 The Choice of the Platform, Environment, and Programming Languages
Microsoft software products have been chosen as the basis for creating the information system. As a server platform we use the operating system Microsoft Windows NT 4.0 Server with a further transition to Microsoft Windows 2000 Advanced Server. At present, this system is one of the most widespread and available server platforms; many software products are compatible with and adapted to it, and the system itself is being constantly upgraded according to the requirements of modern software.
Based on the choice of OS, other components have been naturally selected:
− Microsoft Internet Information Server 4.0 as a Web-server;
− Microsoft SQL Server 7.0 [3, 4] as a database management system; its additional advantage is that its SQL dialect goes far beyond the scope of the standard “SQL 92”, which allows many actions to be performed using specific features and advantages of this RDBMS; moreover, there are many access layers, e.g., ODBC drivers for Microsoft SQL Server or OLE DB Provider for Microsoft environment;
− ASP (Active Server Pages) technology for implementation of the end-user interface in the VBS (Visual Basic Script) language with the use of the OLE DB layer for interaction with the database [8].
The operator interfaces are intended for internal usage only, so they have been implemented in Perl [5], one of the easiest and most powerful languages, with convenient facilities for string processing and working with regular expressions. This allows us to simplify the implementation of many functionals used for processing documents and their representations [6, 7].
It seems impossible to implement such a system without using HTML and JavaScript [9, 10]. We have chosen the system programming language C++ (with Microsoft Visual C++ as the compiler) and a set of libraries to write the whole complex of utilities.
4 Implementation
Unfortunately, we cannot present a detailed description of the system implementation because of the lack of space. We would like to note that the structure presented in Sect. 2.1 has been implemented taking into account the requirements from Sect. 2.2, and that the components have been implemented with the technologies described in Sect. 3.1 and in the languages and on the platforms given in Sect. 3.2. To the hierarchical structure of information storage used in the physical archive, we have added a more convenient and logical structure, so that both structures can be easily used. Different representations of the documents are provided by the system; for example, it is possible to store different scanned images of a document, its recognised text, and translations into different languages. As is usual for web-oriented applications, the three-level architecture of the system implies that a user works at the first level with generally accessible means (for example, a web browser). In turn, the web browser connects to the web server (the second level), which executes applications that process the user’s requests. These applications interact with the database server (the third level) and other sources.
4.1 Database
The physical model of the database constructed at the beginning of the project has been changed and supplemented during our work, since the requirements imposed on the database and system actions have been modified. The archive database is based on a set of tables containing information on the documents and their components, such as scanned pages, textual representations, translations, and so on. The main information object is a registration card. Its input and editing are among the most labour-consuming operations. The registration card contains the following information:
− the document name;
− the type of the document;
− the language of the original;
− locations associated with the document;
− institutions;
− authors;
− addressees;
− persons related to the document;
− keywords assigned to the document;
− comments;
− related documents.
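A registration card of this kind could be modelled relationally along the following lines. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module in place of the Microsoft SQL Server 7.0 database described in the paper, it covers only a few of the card's fields, and every table, column, and function name is hypothetical rather than taken from the actual archive schema.

```python
import sqlite3

# Hypothetical, heavily reduced schema; the real archive database has about 60 tables.
SCHEMA = """
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    doc_type TEXT,
    original_language TEXT,
    comments TEXT
);
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per (document, person, role): author, addressee, or otherwise related.
CREATE TABLE document_person (
    document_id INTEGER NOT NULL REFERENCES document(id),
    person_id INTEGER NOT NULL REFERENCES person(id),
    role TEXT NOT NULL CHECK (role IN ('author', 'addressee', 'related')),
    PRIMARY KEY (document_id, person_id, role)
);
CREATE TABLE document_keyword (
    document_id INTEGER NOT NULL REFERENCES document(id),
    keyword TEXT NOT NULL,
    PRIMARY KEY (document_id, keyword)
);
"""


def find_by_keyword(conn: sqlite3.Connection, keyword: str) -> list:
    """Return the names of all documents tagged with the given keyword."""
    rows = conn.execute(
        "SELECT d.name FROM document d "
        "JOIN document_keyword k ON k.document_id = d.id "
        "WHERE k.keyword = ? ORDER BY d.name",
        (keyword,),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative sample data, not taken from the archive.
doc_id = conn.execute(
    "INSERT INTO document (name, doc_type, original_language) VALUES (?, ?, ?)",
    ("Sample letter", "letter", "ru"),
).lastrowid
person_id = conn.execute(
    "INSERT INTO person (name) VALUES (?)", ("Sample Author",)
).lastrowid
conn.execute("INSERT INTO document_person VALUES (?, ?, 'author')", (doc_id, person_id))
conn.execute("INSERT INTO document_keyword VALUES (?, ?)", (doc_id, "programming"))

print(find_by_keyword(conn, "programming"))
```

Keeping persons in their own table and linking them to documents through a role column is one way to let the same person appear as author of one document and addressee of another without duplicating the person record.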
The system interface provides the archivist with tools for processing the set of registration cards. At present, the database consists of about 60 tables, 100 references, several views and stored procedures. The total size of the SQL-script creating the database is about 70 Kb.
4.2 Data Related to the Document
As we have noted, each document can provide us with information about objects related to it. The most frequently related objects can be selected for separate consideration and storage. Theoretically, each document, be it a scientific paper or a simple letter, has an author or a group of authors, but in practice it is often impossible to know them for sure. Nevertheless, the relation "author–document" is distinguished in our information system. We can also specify persons mentioned in the document or those to whom it is addressed (e.g., in letters or telegrams). Thus, the relation "person–document" is necessary for our system. A document may be related to a city, country, or organisation, and these objects, in turn, may be related to each other. Analysis of a document can result in a set of keywords, which help us to classify it in our system; i.e., an end-user can find this document by means of a keyword search. Though the documents are grouped according to initial attributes such as the date or kind of a document, users' needs when searching the archive have required additional tables for data and for relationships between the initial and derived attributes, such as tables of persons, organisations, cities, countries, and events. Formally, the derived attributes are not facts but interpretations, and they are inserted into the system using special means, including manual input. However, they provide the data for an efficient search environment that allows a user to select the basic filtering criteria in a search request.
4.3 Interfaces of the Operators and Archive Users
Actions offered by the operator interfaces allow us to edit the content of almost all the tables of our database. The remaining tables, not edited through the operator interfaces, can be processed by the system utilities. The total size of the Perl-code of the operator interfaces is about 700 Kb. It is important to stress that the operator can work with only a web-browser installed on his computer. A screenshot of one of the operators’ interfaces is presented in Fig. 2. The end-user interfaces allow one to look through the archive documents with the help of both the logical and the physical hierarchical structures of document classification. The search system implemented here offers a rich variety of ways to find a document according to its prescribed parameters. All the above-mentioned representations of a document can be viewed in a convenient form. It is also possible to follow all references related to a document and to obtain complete information on them. Conceptually, the user interface is implemented as a well-designed separate Web-site.
Fig. 2. Screenshot of one of the functionals of the backend
The total size of the VBS-code is about 80 Kb. The process of looking through the documents in the front-end is shown in Fig. 3.
4.4 Supporting Utilities
The operator’s labour-consuming and routine work needs to be automated. However, as noted above, not everything can be properly implemented using only Web-interfaces. Scanning a large number of documents of the same type and quality (so that the scanner adjustments are the same for all documents) is one such example. Of course, we can scan sheets and save images in files in the usual way, but in this case the operator needs close to 10 mouse clicks and 5 keystrokes for each sheet. The scanning utility is implemented so that it allows the operator to set the initial adjustments once and then scan all documents with only one click per sheet, which is obviously beneficial.
Fig. 3. Screenshot of the front-end
Another implemented utility is intended to store information about a large number of files in the database; it also refreshes obsolete information on files. The utility for synchronising the information content of the archive’s database with remote mirrors should also be mentioned. The tools provided by Microsoft SQL Server 7.0 should in principle support such information exchange between remote SQL-servers but, due to specific features of these embedded tools and of the communication channels, they proved to be useless, and we had to "re-invent the wheel".
4.5 Summary
Thus, we have considered the components of the information system and described some aspects of their operation. All functionals of the user interface have been implemented in Visual Basic Script, and the total code size is about 80 Kb. The server part of the functionals of the administrator interfaces is written in Perl, and the total code size is about 700 Kb. The physical model of the database consists of approximately 60 tables, 100 references, several views and stored procedures. The total size of the SQL-script which creates our database is about 70 Kb. The total body of information, including images of scanned documents, is close to 2 Gb. This volume will continue to grow, since the processed part of the archive is only about 15%.
5 Conclusion
This paper is devoted to the development of the information system for the electronic archive of documents. Within the project, we have elaborated the general concept of creating the archive, its architecture, and the data model of the electronic archive supporting different document representations (textual, graphic, hypertext, and annotated). We have elaborated the technology and the set of tools for creating (filling, editing, and updating) the electronic archive and for working with the archive materials, their investigation and analysis. Archive technology has been studied with respect to its applicability to the creation of the electronic archive. Development of the software system consisted of four large and largely independent parts:
− development of the information model of the system;
− development and implementation of operators’ interfaces;
− development and implementation of the end-user interface;
− development and implementation of a set of supplementary utilities and programs.
The information system has been used for creating the electronic archive of the documents of academician A.P. Ershov. Since 2001, this archive has been available at http://ershov.ras.ru (the user part) and http://www.iis.nsk.su:81/archive/backend (a set of functional interfaces for operators). The stored information includes tens of thousands of documents and references to thousands of organisations and about 3 thousand persons. It should be especially stressed that the authors had to create their own original general concept and structure of the information system, as well as the physical model of the database, from "zero level". Since the system is general-purpose, it can be easily applied to creating other electronic archives of documents. Further development of the system is also possible by adding new functionals to the operator’s and user’s parts of the system, as well as to the physical model of the database.
Acknowledgements.
We are grateful to Microsoft Research for financial support of the project. Valuable contributions to the elaboration of the project concepts and its implementation have been made by the researchers of the Institute of Informatics Systems and xTech company: M.Bulyonkov, V.Philippov, and M.Philippova. The original graphic design of the front-end and backend was created by M.Mironov. The large work of document processing and editing has been done by N.Cheremnykh, I.Kraineva, N.Polyudova, I.Pavlovskaya, and L.Demina. English versions of Russian documents have been prepared by A.Bulyonkova and A.Shelukhina.
References
1. C. J. Date, "An Introduction to Database Systems", 7th edition, Addison-Wesley, 2000.
2. Alan R. Simon, "Strategic Database Technology: Management For The Year 2000", "Finances and Statistics", 1998.
3. Dmitry Artemov, Grigory Pogulsky, "Microsoft SQL Server 7.0: Installation, administration optimization". Moscow, Publishing house "Russian Edition", "Channel Trading Ltd.", 1998.
4. Reference book: "SQL Server Book online". Microsoft Corporation, 1998.
5. Reference book: "Active Perl Online Help". ActiveState Tool Corp., 2000.
6. Larry Wall, Tom Christiansen, "Programming Perl, Second Edition", O’Reilly & Associates, Inc., September 1996.
7. Sriram Srinivasan, "Advanced Perl Programming", O’Reilly & Associates, Inc., 1997.
8. Reference book: "October 2000 Release of the MSDN Library", Microsoft Corporation, 2000.
9. Rick Darnell, "JavaScript Quick Reference". St. Petersburg, Publishing house "Piter", 2000.
10. Mark R. Brown, Jerry Honeycutt, "Using HTML 4, Fourth edition", Publishing house "Williams", Moscow, 1999.
KiMPA: A Kinematics-Based Method for Polygon Approximation
Ediz Şaykol, Gürcan Güleşir, Uğur Güdükbay, and Özgür Ulusoy
Department of Computer Engineering, Bilkent University, 06533 Bilkent, Ankara, Turkey
{ediz,gulesir,gudukbay,oulusoy}@cs.bilkent.edu.tr
Abstract. In different types of information systems, such as multimedia information systems and geographic information systems, object-based information is represented via polygons corresponding to the boundaries of object regions. In many applications, the polygons have a large number of vertices and edges, so a way of representing the polygons with fewer vertices and edges is needed. This approach, called polygon approximation or polygon simplification, is basically motivated by the difficulties faced in processing polygons with a large number of vertices. Large memory and disk requirements, and the possibility of having relatively more noise, can also be considered reasons for polygon simplification. In this paper, a kinematics-based method for polygon approximation is proposed. The vertices of polygons are simplified according to the velocities and accelerations of the vertices with respect to the centroid of the polygon. Another property of the proposed method is that the user may set the number of vertices in the approximated polygon and may hierarchically simplify the output. The approximation method is demonstrated through experiments based on a set of polygonal objects.
1 Introduction
In most information systems that employ methods for representing or retrieving object-based information, polygons corresponding to the boundaries of object regions are used to represent objects or object regions. For example, in multimedia information systems, polygons are used in pattern recognition, object-based similarity [1,14], and image processing [12]. Another example is the representation of geographic information by means of polygons in geographic information systems [6]. In many cases, the polygons have a large number of vertices, and managing these polygons is not an easy task. Thus, polygon approximation is required to facilitate information processing. Obviously, the output of the approximation must be a polygon preserving all of the critical shape information in the original polygon. Besides the difficulties in managing polygons with a large number of vertices, storing such a polygon may require relatively large disk space and memory during processing, whereas the simplified form requires less memory and disk space than the original one. Another reason for using polygon approximation is that the simplified form of the original polygon tends to have less ‘noise’ as far as the small details on the object region boundaries are concerned.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 186–194, 2002. © Springer-Verlag Berlin Heidelberg 2002
In [13], a tool for object extraction in video frames, called Object Extractor, is presented, in which the boundaries of the extracted object regions are represented as polygons. Basically, the extracted polygons have at least 360 vertices, one for each angle with respect to the centroid of the object. Since such a size is quite large, as we might have a large number of objects in a single video frame, an approximation of the polygons is inevitable. Since dealing with polygons of appropriate size is one of the main motivations of polygon approximation, a reasonable size for the approximated polygons is around 30 vertices. For example, the Turning Angle Method [1], a well-known polygonal shape comparison method, may not be feasible for polygons having more than 30 vertices. In this method, a (polygonal) object is represented by a set of vertices, and shape comparison between any two objects is performed with respect to their turning angle representations. A turning angle representation of an object is the list of turning angles corresponding to each vertex on the polygon.
In this paper, the Kinematics-Based Method for Polygon Approximation (KiMPA) is proposed. The main motivation for using kinematics is to identify the most ‘important’ vertices to appear in the approximated form by associating a level of importance with each of the vertices. Thus, KiMPA eliminates less important vertices until the polygon is reduced to an appropriate size. The rest of the paper is organized as follows: Section 2 defines the polygon approximation problem formally and discusses some of the existing methods in the literature. The KiMPA approximation process is discussed in Section 3. In Section 4, experiments demonstrating the approximation method are presented for a set of polygonal objects. Finally, Section 5 concludes the paper.
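The turning angle representation mentioned in the introduction can be sketched in a few lines. The sketch below assumes a simple reading of the description, namely that the turning angle at a vertex is the signed angle between the incoming and the outgoing edge, and that the polygon is given as a list of (x, y) vertices in order; the function name `turning_angles` is hypothetical, and the actual Turning Angle Method [1] may parameterise the representation differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def turning_angles(polygon: List[Point]) -> List[float]:
    """Signed turning angle (radians) at each vertex of a closed polygon.

    Positive values are left (counter-clockwise) turns, negative values right turns.
    """
    n = len(polygon)
    angles = []
    for i in range(n):
        px, py = polygon[i - 1]          # previous vertex (wraps around)
        cx, cy = polygon[i]              # current vertex
        nx, ny = polygon[(i + 1) % n]    # next vertex (wraps around)
        ax, ay = cx - px, cy - py        # incoming edge
        bx, by = nx - cx, ny - cy        # outgoing edge
        cross = ax * by - ay * bx
        dot = ax * bx + ay * by
        angles.append(math.atan2(cross, dot))
    return angles


# A unit square listed counter-clockwise: every turn is a left turn of 90 degrees.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print([round(math.degrees(a)) for a in turning_angles(square)])
```

Because the list has one entry per vertex, comparing two shapes this way becomes more expensive as the vertex count grows, which is one reason a polygon is approximated before comparison.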
2 Related Work
In parallel with the wide use of polygons in information systems, polygon approximation methods are required in various application areas, such as pattern recognition [10], processing spatial information [6], and shape coding [8]. There are also general methods applicable to many applications. Before discussing some of the existing polygon approximation methods, the problem definition is given first.
2.1 Problem Definition
The problem of polygon approximation can be defined as follows: Let a set of points P = {p1 , p2 , ..., p|P | } denote a polygon (|P | is possibly large). Let another set of points PA = {p1 , p2 , ..., p|PA | } denote another polygon. If |P | >> |PA | and dP,PA < , where dP,PA denotes the difference between P and PA , and denotes the approximation error, then PA is an approximation of P . Based on this definition, the polygon approximation problem can be classified into two as follows:
– Given an upper bound for $\epsilon$, find the polygon with the minimum number of vertices.
– Given an upper bound for $|P_A|$, find the polygon that minimizes the approximation error $\epsilon$.
In many applications, the main goal may not fit into either of the above cases. For example in [13], due to the complexity of the extracted object boundary polygons, the approximated polygon must have fewer sides than some threshold (on $|P_A|$) and the approximation error must be below some threshold (on $\epsilon$). However, the approximated polygon must preserve the critical shape information, and thus must include the most important vertices. The most important vertices are the peaks, i.e., the sharp and extreme points. Therefore, the approximation algorithm has to eliminate the non-critical vertices first.
2.2 Polygon Approximation Methods
A first approach to the polygon approximation problem is to select $|P_A|$ vertices from $P$ arbitrarily or at random. In many cases, this does not give the desired results. Thus, more sophisticated methods are needed for an appropriate approximation.
There exist some methods that transform polygon approximation into a graph-based problem [4]. The corresponding weighted graph G for a polygon P is constructed as follows: each vertex of the polygon is set to be a node in G, and arcs are added for each vertex pair (i, j) such that i < j. An error is associated with each arc, corresponding to the presence of the line segment $p_i p_j$ in the approximated form. Due to this transformation, solving the shortest path problem in G is equivalent to the first group of polygon approximation problems mentioned in the problem definition.
The Minimax Method [9], in contrast, finds an approximation for a set of points representing a polygon where, for an approximating line l, the maximum distance between the points and l is minimized. The method is generalized for polygonal curves, not only closed polygons, and different constraints are imposed according to the definition and specifications of the problem. Another method, which minimizes the number of inflections, is proposed by Hobby [7]. In this method a special data structure is designed for the set of points so that the direction and length of the approximating line between points can be determined. The main use of such a method may be approximating scanned bitmap images by smooth polygons.
Polygon simplification algorithms in the computer graphics literature reduce the number of polygons in a three-dimensional polygonal model containing a large number of polygons [5,11,15]. However, since our aim is to simplify a two-dimensional polygon by reducing the number of vertices and edges, we seek a more specialized algorithm suitable for this purpose. Although more general polygonal simplification algorithms from the computer graphics domain can be used with adaptations for our specific needs, we preferred to design a new algorithm, since the algorithms from the graphics domain are more general and therefore more complicated. However, our algorithm is similar to polygon simplification algorithms using vertex decimation.
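The graph-based formulation described above can be made concrete with a small sketch. It is not the algorithm of [4]; it is a simple cubic-time illustration under stated assumptions: the input is an open polygonal curve, the error of an arc (i, j) is taken to be the largest distance from the skipped points to the segment $p_i p_j$, and an arc is admissible when that error does not exceed the bound. All function names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def arc_error(points, i, j):
    """Error of replacing points[i..j] by the single segment points[i]-points[j]."""
    return max((point_segment_distance(points[k], points[i], points[j])
                for k in range(i + 1, j)), default=0.0)

def min_vertex_approximation(points, eps):
    """Indices of a minimum-vertex approximation with per-arc error <= eps.

    Arcs only go from i to j with i < j, so the graph is acyclic and the
    shortest path (fewest arcs) can be found by processing nodes in order.
    """
    n = len(points)
    hops = [math.inf] * n
    parent = [-1] * n
    hops[0] = 0
    for j in range(1, n):
        for i in range(j):
            if hops[i] + 1 < hops[j] and arc_error(points, i, j) <= eps:
                hops[j] = hops[i] + 1
                parent[j] = i
    path, k = [], n - 1
    while k != -1:          # walk back from the last vertex to the first
        path.append(k)
        k = parent[k]
    return path[::-1]
```

Because consecutive vertices are always joined by an admissible arc (its error is zero), a path from the first to the last vertex always exists in this sketch.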
[Figure: a vertex v, its polar angle a_v measured from the x-axis, and the centroid c_m]
Fig. 1. Polar Angle Calculations
3 Kinematics-Based Method for Polygon Approximation
KiMPA is applied on polygons in such a way that each vertex of a polygon is associated with an importance level. In fact, the value of the importance level is already encapsulated in the polygon, and kinematics is employed to process that value. The following section presents the preliminary definitions to clarify the method.
3.1 Preliminaries
Definition 1. (Vertex Velocity) The velocity $V_v$ of a vertex $v = (x, y)$ is the rate of change of distance per angle.
Definition 2. (Vertex Acceleration) The acceleration $A_v$ of a vertex $v = (x, y)$ is the rate of change of velocity per angle.
$$V_v = \frac{\Delta d_v}{\Delta a_v}, \qquad A_v = \frac{\Delta V_v}{\Delta a_v} \tag{1}$$
In Equation (1), $d_v$ denotes the distance between v and the centroid¹ $c_m$, and $a_v$ denotes the polar angle of v (cf. Fig. 1). The key observation is as follows: the higher the velocity of a vertex, the sharper the polygon at that vertex. Hence, in order to get a good approximation of the polygonal region, the sharp features should be included. For example in Fig. 2 (a), it would be a better choice to include the points p2 and p3 in the result set rather than p1 and p2, if one of these three vertices has to be eliminated. In order to differentiate between the vertices, all of them have to be sorted with respect to their velocities and accelerations. The first step is selecting the top-K vertices in descending velocity order. If the number K is not specified by the user, the default value $2 \times |P_A|$ can be used. Among the K vertices, the top-$|P_A|$ vertices in descending acceleration order are selected as the vertices of the approximated polygon $P_A$. By processing the vertices in this manner, the most accelerated (the most ‘important’) vertices appear in the approximated polygon, since they are the vertices where the original polygon has its sharp features.
¹ The centroid, or center of mass, of a polygon can be computed as the mean of the vertex coordinates in O(|V|) time.
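A minimal sketch of Definitions 1 and 2 and of the centroid footnote follows. It assumes the vertices are already ordered by polar angle around the centroid with pairwise distinct angles, and it takes the last vertex as the predecessor of the first; these conventions, and the function names, are assumptions made for illustration. The signed differences of Equation (1) are used here.

```python
import math

def centroid(vertices):
    """Mean of the vertex coordinates, computed in O(|V|) time."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)

def vertex_kinematics(vertices):
    """Return (velocities, accelerations) following Equation (1).

    vertices: list of (x, y) tuples ordered by increasing polar angle around
    the centroid, with distinct polar angles (assumption of this sketch).
    """
    cx, cy = centroid(vertices)
    two_pi = 2.0 * math.pi
    n = len(vertices)
    dist = [math.hypot(x - cx, y - cy) for x, y in vertices]
    angle = [math.atan2(y - cy, x - cx) % two_pi for x, y in vertices]
    # Angular step from pred(v) to v, wrapped so that it stays positive.
    step = [(angle[i] - angle[i - 1]) % two_pi for i in range(n)]
    velocity = [(dist[i] - dist[i - 1]) / step[i] for i in range(n)]
    acceleration = [(velocity[i] - velocity[i - 1]) / step[i] for i in range(n)]
    return velocity, acceleration
```

For a diamond with vertices (1, 0), (0, 2), (-1, 0), (0, -2), the centroid distances alternate between 1 and 2 at steps of π/2, so the velocities alternate in sign with magnitude 2/π.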
[Figure: (a) vertices p1, p2, p3 and the centroid c_m; (b) a comb-like object]
Fig. 2. (a) KiMPA Calculations to Detect Sharpness. (b) Visualizing Sharpness on a Comb-like Object.
In order to visualize the sharpness on objects, Fig. 2 (b) shows a fluctuating and sharp comb-like figure. Such fluctuating and sharp vertices of a polygon can be detected by the change of acceleration between consecutive vertices. As in the comb-like example, if one extreme has a positive acceleration, the other extreme will have a negative one, leading to a large acceleration difference between the two vertices. Therefore, among these most accelerated points, the very first vertices have to be selected as the final output to obtain a good approximation. As an example of the KiMPA approximation, Fig. 3 shows a sample polygon corresponding to a rose object. In Fig. 3 (a), the original object is shown as a polygon of 360 vertices. After applying the KiMPA method, the object is simplified to 23 vertices, as shown in Fig. 3 (b).
Fig. 3. Application of KiMPA on an input polygon. (a) Original polygon having 360 vertices (at each degree). (b) Polygon after simplification to 23 vertices with K = 64.
3.2 KiMPA Algorithm
This section gives an algorithmic discussion of KiMPA based on the above preliminaries. As mentioned in the problem definition, the input is a set of vertices V corresponding to a polygon. In a polygon, the vertices are ordered, and this vertex ordering is needed in some calculations of the algorithm. In order to determine the importance levels of the vertices, vertex velocities have to be calculated
first. Then, the vertex accelerations can be computed from the vertex velocities. One important point is that the sign of the vertex accelerations has to be taken into consideration. In both of these calculations, the previous vertex of a vertex, according to the initial sorted polar angle ordering, is needed. At the end of this calculation process, the importance level hidden in the vertices is determined. To continue the approximation, the top-K vertices with the highest velocities are selected. Then, among these vertices, the most accelerated vertices are selected to be in the approximated form. These operations require two sorting operations on the vertex list (in fact the second one is on a much smaller set), and since the final output must be a polygon, one more sorting operation is required to represent the output as a polygon. The overall approximation algorithm is shown in Fig. 4. In the algorithm, a vertex data structure is assumed to have fields for storing distance, velocity, and acceleration.
Function Approximate(V, M, K, c_m)
// V is a set of vertices for a polygon.
// M is |P_A|, the size of the output; K is 2 × M if not specified.
// c_m is the centroid of the polygon.
1. for each vertex v in V do
2.     v.distance = findDistance(v, c_m);
3. for each vertex v in V do
4.     v.velocity = |v.distance − pred(v).distance| ÷ |v.angle − pred(v).angle|;
5. for each vertex v in V do
6.     signacc = sign(v.velocity − pred(v).velocity);
7.     v.acceleration = signacc × (v.velocity − pred(v).velocity);
8.     v.acceleration = v.acceleration ÷ |v.angle − pred(v).angle|;
9. sortByVelocity(V);
10. V1 = pickVertices(V, K);
11. sortByAcceleration(V1);
12. V2 = pickVertices(V1, M);
13. sortByAngle(V2);
14. return V2;
Fig. 4. The KiMPA Algorithm
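The steps of Fig. 4 can be sketched in Python as follows. This is an illustrative reading of the figure, not the authors' implementation: the function name `kimpa`, the tuple-based vertex format, the wrap-around predecessor, and the small guard against a zero angular step are all assumptions. Lines 6 and 7 of the figure multiply a difference by its own sign, which equals its absolute value, so the sketch uses `abs` directly.

```python
import math

def kimpa(vertices, m, k=None):
    """Select m vertices of a polygon following the steps of Fig. 4.

    vertices: list of (x, y) tuples (hypothetical input format).
    m: size of the output; k defaults to 2 * m as in the figure.
    """
    if k is None:
        k = 2 * m
    n = len(vertices)
    cx = sum(x for x, _ in vertices) / n      # centroid as mean of coordinates
    cy = sum(y for _, y in vertices) / n
    two_pi = 2.0 * math.pi

    def polar(p):
        return math.atan2(p[1] - cy, p[0] - cx) % two_pi

    pts = sorted(vertices, key=polar)         # initial polar-angle ordering
    ang = [polar(p) for p in pts]
    dist = [math.hypot(x - cx, y - cy) for x, y in pts]        # lines 1-2
    # Angular step to pred(v); the guard avoids division by zero (assumption).
    step = [max((ang[i] - ang[i - 1]) % two_pi, 1e-12) for i in range(n)]
    vel = [abs(dist[i] - dist[i - 1]) / step[i] for i in range(n)]   # lines 3-4
    acc = [abs(vel[i] - vel[i - 1]) / step[i] for i in range(n)]     # lines 5-8

    top_k = sorted(range(n), key=lambda i: vel[i], reverse=True)[:k]   # lines 9-10
    top_m = sorted(top_k, key=lambda i: acc[i], reverse=True)[:m]      # lines 11-12
    top_m.sort()        # indices follow polar-angle order               line 13
    return [pts[i] for i in top_m]                                      # line 14
```

As a quick check, a 36-vertex polygon of radius 1 with four spikes of radius 2 at 0°, 90°, 180° and 270° is reduced to exactly those four spike vertices when `m = 4`.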
The running time analysis of the KiMPA algorithm is as follows: For a polygon, sorting by polar angles may be required, and it can be handled in O(|V| log |V|) time. The running time of the Approximate() function is dominated by line 9, which is O(|V| log |V|). The three for loops require O(|V|) time, and the other sort operations in lines 11 and 13 require O(K log K) and O(M log M) time, respectively, on much smaller sets. Therefore, the KiMPA approximation algorithm takes O(|V| log |V|) time for an arbitrary set of vertices V.
4 Experiments
The experiments are based on polygonal objects extracted by our Object Extractor tool [13], which extracts polygonal boundaries of the objects of interest. The tool is a part of the rule-based video database management system that is developed for storing and retrieving video-related data and querying that data based on spatio-temporal, semantic, color and shape information [2,3]. It should be noted that an extracted object has at least one vertex (on the boundary) corresponding to every angle with respect to its centroid. Hence, the polygon has at least 360 sides. However, this number is quite large. For example, if we are to use the extracted objects for shape similarity via the Turning Angle (TA) method [1], an appropriate polygon size is around 20, so an approximation is clearly needed. Randomly selecting the vertices to be in the approximated form would not be appropriate; thus an approximation method preserving the most important vertices is required.
Fig. 5. Simplification Experiments on Objects Extracted by Object Extractor: (a) tiger, (b) bicy, and (c) sea.
The basic observation for such an experiment is as follows: each vertex on the polygon corresponds to exactly one degree. The vertex velocity of a vertex v is then the difference between the centroid distances of v and its predecessor vertex (pred(v)). Similarly, the vertex acceleration of v is the difference between the velocities of v and pred(v). In both cases, pred(v) is located one degree away from v with respect to the centroid c_m.
The results of approximations for a set of polygonal objects are shown in Fig. 5. Since these objects are extracted from digitized images, the critical shape information about the original object is preserved in the approximated polygon. The approximated polygon can be used in pattern recognition [1] and object-based similarity [14] applications.
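For the one-degree sampling used in these experiments, Equation (1) simplifies because the angular step is constant. A brief sketch of that simplification, assuming angles are measured in degrees so that the step equals 1 and using the signed differences of Equation (1) (the algorithm in Fig. 4 takes absolute values of the same quantities):

```latex
\Delta a_v = 1 \;\Longrightarrow\;
V_v = d_v - d_{\mathrm{pred}(v)}, \qquad
A_v = V_v - V_{\mathrm{pred}(v)}
    = d_v - 2\, d_{\mathrm{pred}(v)} + d_{\mathrm{pred}(\mathrm{pred}(v))}
```

Under this assumption the acceleration is a discrete second difference of the centroid distance, which is large exactly where the boundary changes direction sharply.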
5 Conclusion
In this paper, a kinematics-based method for polygon approximation, called KiMPA, is proposed. The algorithm is especially suitable for extracting and representing the boundaries of salient objects from video frames for shape-based querying in multimedia, especially video, databases. The polygons are simplified according to the vertex velocities and vertex accelerations with respect to the centroid of the polygon. Velocity and acceleration values of vertices are used to express the level of importance, and during vertex elimination the vertices with higher importance are preserved. Since polygon approximation aims to overcome the difficulties in processing polygons with a large number of vertices, not only to handle their large memory and disk requirements but also to reduce noise, KiMPA can be used to approximate a polygon while preserving the crucial information on the polygon. Thus, it can be used within information systems requiring an approximated polygonal representation. While using the algorithm, the user may specify the number of vertices to be in the approximated polygon. Besides, the simplification algorithm may be applied hierarchically, enabling the approximation of a polygon to a desired simplification level. The method can be integrated into any system and may improve the performance of the approximation. We have conducted several polygon approximations on polygonal objects, and the results of these approximations are presented. The results show that the method is successful in approximating polygonal objects and provides a way of representing an object with fewer vertices.
References
1. E. Arkin, P. Chew, D. Huttenlocher, K. Kedem, and J. Mitchell. An efficiently computable metric for comparing polygonal shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):209–215, 1991.
2. M.E. Dönderler, Ö. Ulusoy, and U. Güdükbay. A rule-based approach to represent spatio-temporal relations in video data. In T. Yakhno, editor, Advances in Information Systems (ADVIS'2000), volume 1909 of LNCS, pages 248–256, 2000.
3. M.E. Dönderler, Ö. Ulusoy, and U. Güdükbay. A rule-based video database system architecture. To appear in Information Sciences, 2002.
4. D. Eu and G.T. Toussaint. On approximating polygonal curves in two and three dimensions. CVGIP: Graphical Models and Image Processing, 56(3):231–246, 1994.
5. M. Garland and P.S. Heckbert. Surface simplification using quadric error metrics. ACM Computer Graphics, 31(Annual Conference Series):209–216, 1997.
6. S. Grumbach, P. Rigaux, and L. Segoufin. The DEDALE system for complex spatial queries. Proceedings of ACM SIGMOD Symp. on the Management of Data, pages 213–224, 1998.
7. J.D. Hobby. Polygonal approximations that minimize the number of inflections. In Proceedings of Fourth ACM-SIAM Symposium on Discrete Algorithms, pages 93–102, 1993.
8. J. Kim, A.C. Bovik, and B.L. Evans. Generalized predictive binary shape coding using polygon approximations. Signal Processing: Image Communication, 15(7–8):643–663, 2000.
9. Y. Kurozumi and W.A. Davis. Polygonal approximation by the minimax method. Computer Graphics and Image Processing, 19:248–264, 1982.
10. C.C. Lu and J.G. Dunham. Hierarchical shape recognition using polygon approximation and dynamic alignment. Proceedings of 1988 IEEE International Conference on Acoustic, Speech and Signal Processing, Vol. II, pages 976–979, 1988.
11. J.R. Rossignac and P. Borrel. Multiresolution 3d approximations for rendering complex scenes. In B. Falcidieno and T. Kunii, editors, Modeling in Computer Graphics: Methods and Applications, pages 455–465, 1993.
12. J. Russ. The Image Processing Handbook. CRC Press, in cooperation with IEEE Press, 1999.
13. E. Şaykol, U. Güdükbay, and Ö. Ulusoy. A semi-automatic object extraction tool for querying in multimedia databases. In S. Adali and S. Tripathi, editors, 7th Workshop on Multimedia Information Systems MIS'01, Capri, Italy, pages 11–20, 2001.
14. E. Şaykol, U. Güdükbay, and Ö. Ulusoy. A histogram-based approach for object-based query-by-shape-and-color in multimedia databases. Technical Report BUCE-0201, Bilkent University, 2002.
15. W.J. Schroeder, J.A. Zarge, and W.E. Lorensen. Decimation of triangle meshes. Computer Graphics, 26(2):65–70, 1992.
The Implementation of a Robotic Replanning Framework
S. Yıldırım and T. Tunalı
[...] 0) do replanning; }}
The replanner has three alternatives to choose from. An intermediate state which is known to be on the optimal or suboptimal path always has priority over the other two alternatives. In the following algorithm, if the function get_intermediate_state() returns true, that is, there is a previously obtained intermediate state for a similar situation, then the plan stored in the plan base is retrieved and put into execution. Otherwise, the other two alternatives are considered for replanning. The variable “dm” corresponds to a distance metric that measures the difference between two states; its value is obtained in the latest call of the function “execute()”. If dm is less than a previously set threshold value “thr”, then the alternative of backtracking to the state before the unexpected event and executing the rest of the original plan is chosen. In this case, new_plan corresponds to a sequence of actions that brings the state after the unexpected happening to the state right before the effects of the unexpected happening. Meanwhile, the alter1_cost() function calculates the cost of
using alternative 1 for replanning by taking into account the planning time for backtracking and the total distance traveled to reach the goal state. There is a learning mechanism behind choosing a proper alternative. The threshold is assigned a random value at startup, and the system learns its value as it progresses. The details of the learning mechanism are in [6]. On the other hand, if the value of dm is not less than the threshold value, the alternative of generating a completely new plan from the state with the effects of the unexpected happening to the goal state is selected. Cost calculation (alter2_cost()) is done and used accordingly. Plans are put into execution when they are generated.
replanner() {
    bool yes = get_intermediate_state();
    if (yes) {
        // A plan for a similar situation is stored in the plan base: reuse it.
        new_plan = retrieve_plan_for_intermediate_state();
        execute(new_plan);
    } else if (dm != 0 && dm < thr) {
        // Alternative 1: backtrack to the expected state, then resume the original plan.
        alter1_cost();
        new_plan = planner(Observed_World_Model, Expected_World_Model);
        execute(new_plan);
        execute(plan);  // rest of plan will be executed.
    } else {
        // Alternative 2: plan from the observed state directly to the goal.
        alter2_cost();
        new_plan = planner(Observed_World_Model, Goal_State);
        execute(new_plan);
    }
}
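The selection logic of replanner() can be isolated as a small pure function, which makes the three-way choice easy to test apart from planning and execution. This is only a sketch of the branching shown above; the function name and the returned labels are hypothetical, and cost bookkeeping and threshold learning are left out.

```python
def choose_alternative(has_intermediate_state, dm, thr):
    """Mirror the branch order of replanner().

    has_intermediate_state: result of get_intermediate_state().
    dm: distance metric between the observed and expected states.
    thr: current (learned) threshold value.
    """
    if has_intermediate_state:
        # A stored plan for a similar situation always has priority.
        return "reuse_stored_plan"
    if dm != 0 and dm < thr:
        # Small deviation: backtrack, then run the rest of the original plan.
        return "backtrack_then_resume"
    # Otherwise plan from the observed state to the goal state.
    return "plan_from_observed_state"
```

Note that, following the condition as written in the pseudocode, a value of dm equal to zero falls through to the last branch.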
3 Domain Dependent Planning
In order to achieve a general planner, planning for different domains should be allowed. In this study, general planning is achieved by using a means-ends analysis planning mechanism [9] that is represented by predicate calculus and allows representation of different planning domains, domain operators and rules. Means-ends analysis is a paradigm that applies domain-specific search control rules on a search tree during planning and hence chooses a sequence of operators. Operators are nodes of a search tree. The functionality of control rules can be extended to incorporate finding an optimal sequence of actions if there is one. In the Vision Guided Planner (VGP), there are two blocks world domains. The first one is the mixed pieces domain and the other is the container load/unload domain in a port. Each domain has its own search control rules, and the planner activates the domain-specific control rules when planning for a domain. The objects in a domain are represented by a structure called object, which has attributes such as its type, name, size, present x and y coordinates, goal x and y coordinates, etc. The planning algorithm for the mixed pieces domain solves the mixed pieces problem and is later expressed in the form of predicate calculus. The algorithm is as follows:
Mixed_Pieces() {
    Object obj;
    Read_State(Initial);
    Read_State(Goal);
    int unprocessed = find_number_of_unprocessed_blocks();
    strcpy(obj.type, "ARM");
    while (unprocessed > 0) {
        if (strcmp(obj.type, "BLOCK") == 0) {
            find_closest_destination_from_block(obj);
            if (empty_destination(obj)) {
                move_block(obj);
                strcpy(obj.type, "ARM");
            } else {
                find_closest_empty_cell_from_destination(obj);
                // The function above also moves the block occupying the
                // destination cell of obj to the empty cell.
                move_block(obj);  // move block from initial to goal position.
                strcpy(obj.type, "ARM");
            }
        } else if (strcmp(obj.type, "ARM") == 0) {
            find_closest_destination_from_arm(obj);
            move_arm(obj);  // move to the coordinates of the block to be moved.
            strcpy(obj.type, "BLOCK");  // assign next moving object as "BLOCK".
        }
        unprocessed = unprocessed - 1;
    }
}
The planning algorithms for container insertion and removal are the same as their corresponding control rules. Two functions in the algorithms will be presented below:
int find_closest_empty_destination_from_arm(object *obj, int *x, int *y, int *z) {
    int i, j, k;
    for (i = 1; i [...]
[...] =0
//Other OCL expressions are defined here...
//Rules are defined as in fipa-rdf1...
)
:language fipa-rdf-extended
)
Fig. 4. Carrying the requested ontology in FIPA-ACL messages
requester. In Fig. 4, one example for each element of the "SimpleBook" ontology is shown. Comments are added for readability; they are not part of the content language.
3.2 Creating Ontology Dependent Query Interfaces: An Example Reusable Action
The user interfaces, which assist the users in formulating queries about a specific domain by using that domain’s ontology, are constructed dynamically using the transferred ontologies. In this way, users can formulate queries by using ontologies which are updated or added to the system at runtime. How this action works is explained in the following paragraphs. The ontology, which is transferred in XML syntax with the fipa-rdf-extended content language, is parsed using an XML SAX parser. During parsing, choice lists are formed dynamically for the class attributes which have a controlled vocabulary associated with them, so that users can select a term from the choice list during query formulation. Textboxes are formed for the class attributes which do not have a controlled vocabulary associated with them; users can enter a keyword in these textboxes during query formulation. Class names are used in forming the labels related to their attributes’ choice lists. A button named "query button" is also formed. In this button’s action method, first of all, the XOT model is constructed using the items selected from the choice lists or the values entered into the textboxes.
R.C. Erdur and O. Dikenelli
Then, the XOT model is mapped to RDF and the content is placed in a FIPA ACL message with the "Request" performative. This FIPA ACL message is then put into the agent’s communication queue and the action is completed. It is the responsibility of the planner to send this query to the related agents and get the results. These user interface elements are added to a panel, and the panel is passed to the agent’s user interface object to be displayed. Although this action is for user agents, it is also used in information agents with a minor modification. Information agents also have to create user interfaces where ontology dependent information can be entered. These user interfaces assist in modeling the information that a repository provides in an ontology dependent way. The choice lists related to the controlled vocabularies and the textboxes are formed in the same way as explained above. The only difference is in the action method of the button element in this user interface. While the action method of the button in the query interface of the user agents collects the user selections and data and forms a FIPA ACL message for querying, the action method of the button in the user interfaces of the information agents collects the user selections and puts an objective in the agent’s objective queue, indicating that the agent has to store the entered data locally as the meta-knowledge (ontological knowledge) related to the repository. Since handling the action method of the buttons in different ways is the responsibility of the agent's user interface object, this action is the same in both the user and the information agents; hence, it is collected as an ontology related reusable action and added to the reusable actions layer.
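The widget-forming step described above (a choice list for attributes with a controlled vocabulary, a textbox otherwise) can be sketched with a SAX handler. The XML element names used here (`ontology`, `class`, `attribute`, `term`) are hypothetical, since the actual fipa-rdf-extended syntax is not reproduced in this text, and plain dictionaries stand in for real user interface components.

```python
import xml.sax

class QueryFormBuilder(xml.sax.ContentHandler):
    """Collect one widget description per class attribute while parsing."""

    def __init__(self):
        super().__init__()
        self.widgets = []
        self._class = None    # name of the class currently being parsed
        self._attr = None     # name of the attribute currently being parsed
        self._terms = None    # controlled-vocabulary terms of that attribute
        self._text = None     # character buffer while inside a <term>

    def startElement(self, name, attrs):
        if name == "class":
            self._class = attrs["name"]
        elif name == "attribute":
            self._attr = attrs["name"]
            self._terms = []
        elif name == "term":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "term":
            self._terms.append("".join(self._text).strip())
            self._text = None
        elif name == "attribute":
            # Controlled vocabulary present -> choice list, otherwise textbox.
            kind = "choice" if self._terms else "textbox"
            self.widgets.append({"label": f"{self._class}.{self._attr}",
                                 "kind": kind, "options": self._terms})
            self._attr = None
            self._terms = None

def build_query_form(xml_text):
    """Parse an ontology document and return the list of widget descriptions."""
    handler = QueryFormBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.widgets
```

Because the widgets are derived entirely from the parsed document, an updated or newly added ontology yields a new form without any change to the handler, which is the property the text relies on.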
4 Example Domain and Experiences
We have used the developed framework in the application area of "Component Based Software Reuse on Internet", for which we have constructed a multi-agent system prototype for searching and retrieving reusable software components over the Internet [8]. The application area we have chosen has the characteristics of an open environment, since there are many component servers providing components in different domains, and each domain has its own domain ontology for modeling its components' meta-knowledge. For example, component servers providing components in the domain of telecommunications model their components' meta-knowledge by using "telecommunications_domain_ontology", while component servers providing components in the multimedia domain use "multimedia_domain_ontology" to model their components' meta-knowledge. These domain ontologies may change at run time, or new domains with their own domain ontologies may join the environment. In these circumstances, our framework can transfer the updated or newly added ontologies, and it provides the users with mechanisms for personalizing these ontologies. The constructed multi-agent system prototype has been evaluated in terms of its adaptability to open environments by changing the global ontologies used and by adding new domain ontologies to the system. These experiences have shown us that the agents instantiated by using the developed framework can adapt themselves to dynamically changing ontologies; hence, our ideas are realizable and consistent.
A FIPA-Compliant Agent Framework with an Extra Layer
5 Conclusion
In this study, we have developed an agent framework with an additional layer containing actions that are common to all agents. Some of these actions are generic, such as registering with or deregistering from the AMS or DF. Our contribution is that we have added actions related to ontology dependent reusable behaviour to this additional layer. Transferring ontologies, creating ontology based query interfaces dynamically, and personalizing the transferred ontologies are some examples of these behaviours. These behaviours are very important, because they give the agents the ability to run in open environments with many dynamically changing ontologies. The developed framework is an alternative to others, since it has an additional layer containing reusable actions. However, other frameworks can be extended by using our approach and by adding a reusable actions layer like the one in the developed framework. The developed framework can be extended in several ways. The reusable actions layer is a flexible layer, and we will improve it by adding new reusable actions as our framework is used in new applications and as experience increases. The communication and conversation & execution layers are kept as simple as possible in the current prototype, because the main goal was to test the reusable actions layer, which was built on top of them. In later prototypes, we will extend these layers with advanced reliability and concurrency techniques.
References
1. Bellifemine, F., Poggi, A., and Rimassa, G.: Developing Multi-agent Systems with a FIPA-compliant Agent Framework. Software Practice and Experience, 31 (2001) 103–128.
2. Berardi, D., Calvanese, D., and Giuseppe, G.: Reasoning on UML Class Diagrams Using Description Logic Based Systems. Workshop on Application of Description Logics ADL'2001. Available online at http://ceur.ws.org/vol44.
3. Brugali, D. and Sycara, K.: A Model for Reusable Agent Systems. In Implementing Application Frameworks, Fayad, M.E., Schmidt, D.C., and Johnson, R.E. (eds.), John Wiley and Sons, USA, (1999) 155–169.
4. Cheyer, A. and Martin, D.: The Open Agent Architecture. Autonomous Agents and Multi-Agent Systems, 4 (2001) 143–148.
5. Cranefield, S., Haustein, S., and Purvis, M.: UML-Based Ontology Modeling for Software Agents. Proceedings of the Workshop on Ontologies in Agent Systems held at the 5th International Conference on Autonomous Agents, (2001) Montreal, Canada. Available online at http://cis.otago.ac.nz/OASWorkshop.
6. Cranefield, S.: Networked Knowledge Representation and Exchange using UML and RDF. Journal of Digital Information, Vol. 1, Issue 8, (2001).
7. DARPA Agent Markup Language Web Site. www.daml.org, (2002).
8. Erdur, R.C. and Dikenelli, O.: A Multi-agent System Infrastructure for Software Component Market-place: An Ontological Perspective. ACM Sigmod Record, Vol. 31, Number 1, March, (2002) 55–60.
9. Fensel, D., Harmelen, F., Horrocks, I., McGuinness, D., and Patel-Schneider, P.F.: OIL: An Ontology Infrastructure For The Semantic Web. IEEE Intelligent Systems, 16(2) (2001) 38–45.
10. FIPA XC00011B: FIPA RDF Content Language Specification. http://www.fipa.org, (2001).
11. FIPA XC00023G: FIPA Agent Management Specification. http://www.fipa.org, (2000).
12. FIPA XC00086C: FIPA Ontology Service Specification. http://www.fipa.org, (2000).
13. Genesereth, M.R. and Fikes, R.E.: KIF Version 3.0 Reference Manual. Technical Report Logic-92-1, Computer Science Dept, Stanford University, (1992).
14. Gruber, T.R.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2) (1993).
15. Kendall, E., Krishna, P.V.M., Pathak, C.V. and Suresh, C.B.: A Framework for Agent Systems. In Implementing Application Frameworks, Fayad, M.E., Schmidt, D.C., and Johnson, R.E. (eds.), John Wiley and Sons, USA, (1999) 113–152.
16. Lea, D.: Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading, MA, (1997).
17. Nwana, H.S., Ndumu, D.T., Lee, L.C., and Coll, J.C.: ZEUS: A Tool-kit for Building Distributed Multi-agent Systems. Applied Artificial Intelligence Journal, 13(1), (1999) 129–186.
18. Omicini, A.: SODA: Societies and Infrastructures in the Analysis and Design of Agent-based Systems. In Ciancarini, P. and Wooldridge, M. (eds.), Proc. 1st Int. Workshop on Agent-Oriented Software Engineering (AOSE 2000), Limerick, Ireland, Volume 1957 of LNCS, (2001) Springer-Verlag, Berlin.
19. Poslad, S., Buckle, P., and Hadingham, R.: The FIPA-OS Agent Platform: Open Source for Open Standards. Proceedings of the 5th International Conference and Exhibition on the Practical Application of Intelligent Agents and Multi-agents, UK, (2000). Available at http://www.fipa.org/resources/byyear.html
20. Sycara, K., Klusch, M., Widoff, S., and Lu, J.: Larks: Dynamic Matchmaking Among Heterogeneous Software Agents in Cyberspace. Autonomous Agents and Multi-agent Systems, March, (2001).
21. Zambonelli, F., Jennings, N., and Wooldridge, M.: Organisational Rules as an Abstraction for the Analysis and Design of Multi-agent Systems. Int. Journal of Knowledge and Software Engineering, Vol. 11, No. 3, (2001).
22. Zambonelli, F., Jennings, N., Omicini, A. and Wooldridge, M.: Agent Oriented Software Engineering for Internet Applications. In Coordination of Internet Agents: Models, Technologies, and Applications, Omicini et al. (eds.), Chap. 13, (2001) Springer Verlag.
Vibrational Genetic Algorithm (VGA) for Solving Continuous Covering Location Problems
Murat Ermis¹, Füsun Ülengin², and Abdurrahman Hacioglu³
¹ Dept. of Industrial Engineering, Turkish Air Force Academy, 34807 Yesilyurt, Istanbul, Turkey
[email protected]
² Dept. of Industrial Engineering, Istanbul Technical University, 80680 Maçka, Istanbul, Turkey
[email protected]
³ Dept. of Aeronautical Engineering, Turkish Air Force Academy, 34807 Yesilyurt, Istanbul, Turkey
[email protected]
Abstract. This paper deals with a continuous space problem in which demand centers are independently served from a given number of independent, uncapacitated supply centers. Installation costs are assumed not to depend on either the actual location or the actual throughput of the supply centers. Transportation costs are considered to be proportional to the squared Euclidean distance travelled, and a mini-sum criterion is adopted. In order to solve this location problem, a new heuristic method, called Vibrational Genetic Algorithm (VGA), is applied. VGA assures efficient diversity in the population and consequently provides a faster solution. We use VGA with vibrational mutation: a wave with random amplitude is introduced into the population periodically, beginning with the initial step of the genetic process. This operation spreads out the population over the design space and increases the exploration performance of the genetic process, which makes it easier for the genetic algorithm to pass over local optima. Since the problem is recognized as identical to certain cluster analysis and vector quantization problems, we also applied Kohonen maps, which are Artificial Neural Networks (ANN) capable of extracting the main features of the input data through a self-organizing process based on local adaptation rules. The numerical results and a comparison are presented.
1 Introduction
A location problem arises whenever a question is raised like “Where are we going to put the thing(s)?”, immediately followed by two further questions: “Which places are available?” and “On what basis do we choose?” The answer to the first of these two questions determines the locational space. If this space is described by way of continuous variables, the problem is a continuous location problem [1]. Location-allocation (LA) problems arise in many practical settings: distribution, public services, telecommunication switching centers, warehouses, fire stations, etc. Given the location of a set of demand centers and their associated demands, they are generally based on finding the number and location of supply centers and the corresponding allocation of the demand to them so as to satisfy a certain optimization criterion.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 293–302, 2002. © Springer-Verlag Berlin Heidelberg 2002

Although there are network based, uni-dimensional, spherical and general $\mathbb{R}^K$ instances, most LA problems occur on a plane. For planar problems, Minkowski’s $l_p$ norm is usually assumed, with rectilinear and Euclidean being the most common cases. Continuous location problems assume that one cannot give an exhaustive list of all individual available places, as is the case in discrete location problems. We do not really know which sites are available, but rather that these are all over the place, and we want to find out where to look for good candidates. Thus, continuous location models can be considered as site generating and will always have some geometrical flavour. However, we have to take into account that, in order to be eligible, the sites must come from some feasible regions.

Goldberg provides a good introduction to GAs, including a summary of applications. GA is a generic solution technique that can be applied to problems in areas such as scheduling, computing and business, once an encoding scheme for the solutions of the problem has been developed and the parameter settings have been decided. However, although it is possible to design a GA for any optimization problem, GAs are not equally effective for all problems. In fact, GAs perform best for problems with a small number of constraints and a complicated objective function [2]. This makes the continuous covering location problem a good candidate for solution via a GA. In this paper we apply the Vibrational Genetic Algorithm (VGA) [3] to continuous covering location problems. VGA maintains an effective exploration/exploitation balance and diversity, and thus performs well in terms of convergence rate. We define vibration in VGA as a kind of oscillation, produced in some wave form, in certain parameters used in the genetic process.
2 Problem Formulation

The LA problem analyzed in this paper rests on the following assumptions:
- Demand centers are independently serviced
- Supply centers may be located anywhere in $\mathbb{R}^K$
- Every demand shall be allocated to its closest supply center
- Installation costs are not considered and the number of supply centers is given
- Transportation costs are assumed proportional to the squared Euclidean distance

To formulate this problem, we define the following inputs and sets:
$I$ : the set of demand nodes, indexed by $i$
$J$ : the set of candidate facility locations, indexed by $j$
$x_i$ : the location of demand node $i$
$w_j$ : the location of candidate facility $j$
$d_{ij}$ : distance between demand node $i$ and candidate site $j$
$h_i$ : relative demand of demand center $i$

and the following decision variable:
$y_{ij} = 1$ if demand node $i$ is assigned to facility $j$, and $y_{ij} = 0$ otherwise.
With this notation, the problem can be formulated as follows:

Minimize
$$\sum_{i \in I} \sum_{j \in J} h_i \, d_{ij} \, y_{ij} \qquad (1)$$
subject to
$$\sum_{j \in J} y_{ij} = 1, \quad \forall i \in I \qquad (2)$$
$$y_{ij} \in \{0, 1\}, \qquad (3)$$
where
$$d_{ij} = \| w_j - x_i \|^2 . \qquad (4)$$

For the probabilistic demand case:

Minimize
$$\int_{x \in X} D(x) \, \| w_{Y(x)} - x \|^2 \, dx , \qquad (5)$$
where
$X$ : demand region
$D$ : multivariate distribution of the demand ($D : X \to [0,1]$)
$Y$ : allocation function ($Y : X \to \{1, 2, \ldots, |J|\}$)
$w_j$ : locations of the candidate facilities
Given the location of the supply centers, the corresponding allocation of the demand would be to the closest facility. The question “What is the closest demand point?” yields the (closest point) Voronoi polyhedra [4].
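The closest-facility allocation and the objective of equation (1) can be illustrated with a minimal Python sketch. The function names and the small data set are hypothetical and chosen only for illustration; the distance is the squared Euclidean distance of equation (4).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points, as in equation (4)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def allocate_and_cost(demand_points, demands, facilities):
    """Assign every demand node to its closest facility and evaluate objective (1).

    Returns the list of facility indices (one per demand node) and the total
    demand-weighted squared distance.
    """
    assignment = []
    total = 0.0
    for x, h in zip(demand_points, demands):
        dists = [squared_distance(w, x) for w in facilities]
        j = min(range(len(facilities)), key=dists.__getitem__)
        assignment.append(j)
        total += h * dists[j]
    return assignment, total


# Hypothetical toy instance: three demand nodes, two supply centers.
demand_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]
demands = [1.0, 2.0, 1.0]
facilities = [(0.5, 0.0), (4.0, 3.0)]
assignment, cost = allocate_and_cost(demand_points, demands, facilities)
```

Because each node is assigned to exactly one facility, constraints (2) and (3) hold by construction, so a search method only has to choose the facility coordinates.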
Fig. 1. Closest point Voronoi diagram
When weighted distances are considered, the weights to be used have to be specified. It is therefore of interest to observe the jumpy behavior of optimal sites when weights are modified [5], or the highly disconnected full set of possible optimal solutions for any kind of weight [6]. Other distance functions have been considered, mainly the rectangular distance ([7], [8]), but the application potential of these models seems rather limited, since it is not clear how a push effect would be propagated by rectangular distance. For the discrete demand case and for (non-squared) Euclidean distances, a local search heuristic solution approach was proposed by Cooper [1]. More promising from a computational point of view may be the use of combinatorial optimization techniques such as simulated annealing [9], tabu search or genetic algorithms (GA). GAs have been tested on different types of combinatorial optimization problems and seem to perform well in some instances [1]. However, little effort has been directed towards designing GAs for location-allocation problems ([1], [10]).
3 The Proposed Genetic Algorithm
As noted by Obayashi et al. [11], the mutation probability for a real coded GA should be set at a higher value than when binary coding is used. The reason is that in binary coding a change in a single bit can cause a significant change in the value of the design variable, whereas in real number coding a similar change has a lesser effect. A higher mutation probability is therefore justified as a means of enabling the algorithm to search the design space thoroughly. This aim can be achieved by vibrational mutation, which is described and formulated in Section 3.2. The mutation based vibrational concept relies on sampling various parts of the domain simultaneously, so that the global optimum can be found as fast as possible. To achieve this, all individuals in the population are periodically mutated in a vibrational manner so that they spread out over the domain space. It is thus possible to escape from local optima quickly and to explore fitter individuals.

3.1 Outline of VGA

The proposed GA, like conventional GA approaches, involves the following four elements:
1) Solution Representation: the encoding scheme is real coded and is based on chromosomes (strings) of length p (the number of supply centers); the genes within a chromosome correspond to the coordinates of the facilities.
2) Initial Population: the individual chromosomes in the initial population are randomly generated; a large population size is not needed, as will be seen in the computational results.
3) Selection: the selection procedure is Stochastic Universal Sampling [12].
4) Recombination: the operators used to mate the parents and produce the offspring chromosomes are the BLX-α crossover technique [13], which is the most common approach for the recombination of two parents represented by vectors of real numbers, and the vibrational mutation operator.
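The BLX-α recombination named in item 4 can be sketched as follows. This is a minimal illustration of the standard blend crossover, not the authors' code: the function name, the default value alpha = 0.5 and the flat list-of-floats chromosome layout are assumptions, since the paper does not state them.

```python
import random


def blx_alpha(parent1, parent2, alpha=0.5, rng=random):
    """Blend crossover (BLX-alpha) for two real coded chromosomes.

    Each child gene is drawn uniformly from the interval spanned by the two
    parent genes, widened on both sides by alpha times the interval length.
    alpha = 0.5 is an assumed default, not a value taken from the paper.
    """
    child = []
    for g1, g2 in zip(parent1, parent2):
        low, high = min(g1, g2), max(g1, g2)
        spread = high - low
        child.append(rng.uniform(low - alpha * spread, high + alpha * spread))
    return child


# Hypothetical usage: two facilities on a line, one gene per coordinate.
rng = random.Random(0)
child = blx_alpha([0.0, 2.0], [1.0, 2.0], alpha=0.5, rng=rng)
```

When both parents carry the same gene value the interval collapses, so the child inherits that value unchanged; this is one reason an additional mutation operator is needed to restore diversity.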
After the Selection and Recombination steps have been repeated several times, the idea of survival of the fittest in nature translates into the survival of the best solution in the population of solutions. VGA can be written in pseudocode as follows:

main() {
    read the data               /* demand and demand center locations matrices of the n interacting nodes */
    initialize min_cost         /* the best cost solution */
    initialize the random number generator
    genetic()
    report min_cost
}

genetic() {
    gc = 0                      /* initialize the generation counter */
    seeds = 0
    initialize min_cost         /* the minimum cost solution for genetic */
    generate P(gc)              /* the population of solutions for generation gc */
    evaluate P(gc)              /* calculate the cost of a chromosome using the distance based rule */
    while (gc < max_gc) {
        gc = gc + 1
        seeds = seeds + 1
        select P(gc) from P(gc-1)
        recombinate P(gc)
        if (seeds == IP) {
            vibrational_mutation()
            seeds = 0
        }
        evaluate P(gc)
        update min_cost
    }
}
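The selection step of the pseudocode uses Stochastic Universal Sampling [12]. The sketch below shows the usual form of that procedure under stated assumptions: the function name is hypothetical, and the weights are taken to be non-negative selection weights where larger means better. How costs are mapped to such weights for this minimization problem is not specified in the paper and is left to the caller.

```python
import random


def stochastic_universal_sampling(weights, n_select, rng=random):
    """Pick n_select indices with equally spaced pointers over the weights.

    A single random offset is drawn, then pointers are placed one 'step'
    apart along the cumulative weight line, so each individual is chosen
    close to its expected number of times.
    """
    total = sum(weights)
    step = total / n_select
    start = rng.uniform(0.0, step)
    chosen = []
    cumulative = 0.0   # total weight of the individuals before 'index'
    index = 0
    for k in range(n_select):
        pointer = start + k * step
        # Advance until the pointer falls inside the segment of 'index';
        # the bound check guards against floating-point round-off.
        while index < len(weights) - 1 and cumulative + weights[index] < pointer:
            cumulative += weights[index]
            index += 1
        chosen.append(index)
    return chosen


# Hypothetical usage: the third individual has twice the weight of the others.
picked = stochastic_universal_sampling([1.0, 1.0, 2.0], 4, rng=random.Random(1))
```

With these weights the step equals 1, so whatever the random offset is, the first two individuals are picked once and the third twice.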
3.2 Vibrational Mutation

The vibration approach, used in a mutational manner, is applied after the recombination step, periodically and starting from the initial steps of the process. The wave introduced into the population is random, and during this operation all genes in all chromosomes are mutated as follows:

$$y^{m}_{new,i} = y^{m}_{i} \, [\, 1 + MA \cdot u_1 \, (0.5 - u_2) \,], \quad m = 1, \ldots, n, \quad i = 1, \ldots, kn, \quad u_1, u_2 \in [0, 1] \qquad (6)$$

where $y^{m}_{i}$ are the control points (genes), kn is the total number of genes in a chromosome, n is the total number of individuals in the population, MA is the main amplitude, and $u_1$ and $u_2$ are random variables in [0, 1]. The vibration starts from a certain gene position in the first chromosome and continues through the genes at the same position in the other chromosomes. This process is applied to all individuals in the population every IP generations, where the mutation rate Pm is equal to 1/IP and IP is an integer. The period of the vibrational effect (IP) has to be chosen with care, since the length of the period affects the performance of the method. The vibration results in a random distribution of the new population throughout the solution domain. The new population obtained under the vibrational effect then passes through the regular genetic procedure for a while, because obtaining fit individuals from the vibrated population may take time. Then the vibrational effect is applied again and a random distribution around the previous population is obtained. The merit of the vibrational random distribution is that it can reach the global optimum without becoming trapped in local optima, or can escape from them once trapped. This is the most important feature of the vibrational genetic algorithm. However, it is important to emphasize that, while the genetic procedure continues, the main amplitude of the vibrational wave is reduced if the average fitness value of the population tends to increase. When the average fitness value reaches a certain level, e.g. close to the global optimum, it is unnecessary to distribute the population over a wide band. Instead, as a computationally efficient alternative, a random distribution over a narrower band helps to get as close as possible to the global maximum. For this purpose, the main amplitude MA is evaluated during the genetic procedure as follows:

$$MA = \left[ \frac{\log(1 + AF_0)}{\log(1 + AF_k)} \right]^{r} . \qquad (7)$$
AF0 and AFk are the average fitness values at the initial step and at the current step of the genetic process, respectively, and r is a real number. At the initial step of the evolutionary process MA is 1. If a greater or lower value is desired, MA should be multiplied by a parameter. MA designates the maximum band width of the random wave. In equation (7), r determines how fast MA decreases: the greater r is, the faster the decrease. The basic structure of VGA was given above; the vibrational mutation procedure can be outlined briefly as follows. At the first step of the genetic process, after the evaluation of fitness, the selection of fit individuals and recombination, vibration is applied to all new chromosomes. First, the initial genes (j=1) in all chromosomes (i=1 to n) are vibrated; then the genes at the second position (j=2) in all chromosomes (i=1 to n) are vibrated, and so on. This operation ends when the last genes (j=kn) have been vibrated. During the next IP-1 steps of the genetic process (mutation rate Pm = 1/IP), only the regular operations (evaluation of fitness, selection, recombination) are performed. The vibration is repeated every IP steps while the genetic process goes on.
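Equations (6) and (7) can be sketched in Python as below. This is an illustration under assumptions rather than the authors' implementation: the function names and the list-of-lists population layout are hypothetical, and the average fitness values are assumed to be positive so that the logarithms in equation (7) are nonzero.

```python
import math
import random


def main_amplitude(af0, afk, r):
    """Main amplitude MA of equation (7), assuming af0 > 0 and afk > 0."""
    return (math.log(1.0 + af0) / math.log(1.0 + afk)) ** r


def vibrational_mutation(population, ma, rng=random):
    """Apply equation (6) to every gene of every chromosome.

    Genes are visited position by position (all chromosomes at position 1,
    then position 2, ...), following the outline in the text. A new
    population is returned; the input is left unchanged.
    """
    n = len(population)
    kn = len(population[0])
    mutated = [list(chromosome) for chromosome in population]
    for i in range(kn):        # gene position
        for m in range(n):     # chromosome
            u1 = rng.random()
            u2 = rng.random()
            mutated[m][i] *= 1.0 + ma * u1 * (0.5 - u2)
    return mutated


# Hypothetical usage: at the initial step AFk equals AF0, so MA is 1.
ma0 = main_amplitude(0.05, 0.05, r=5)
population = [[1.0, 2.0], [3.0, 4.0]]
vibrated = vibrational_mutation(population, ma0, rng=random.Random(42))
```

With MA = 1 each gene is scaled by a factor between 0.5 and 1.5, which gives a sense of the band width that MA controls.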
4 Computational Results
In this section, a continuous space LA problem involving 49 demand centers given by Lazano [14] is solved, and the results of the proposed VGA and of the self-organizing map (SOM) approach are illustrated. We have computed VGA and SOM solutions for different numbers of facilities (from two to ten). All experiments are performed with the same population size (n = 6), since VGA accelerates convergence even for a small population without causing premature convergence. As a result, the computational time of the evolutionary process is very small compared with other GA approaches, and the algorithm converges to the minimum fitness value after approximately 5000 generations. The crossover rate, Pc, is one, and the mutation rate, Pm, depends on the period of the vibrational mutation technique. The parents are selected by Stochastic Universal Sampling (SUS) [12]. For vibrational mutation, the value of r in equation (7) was set to 5. Figure 2 shows the convergence history for different numbers of facilities (supply centers). As seen in the figure, VGA reduces the fitness value very quickly at first and reaches the stationary state within a few periods. Figure 3 depicts the convergence history of the average fitness value of the chromosomes in each generation.
Fig. 2. Convergence history for different number of facilities with n=6 (fitness versus LA computation calls; curves for 6, 7, 8, 9 and 10 facilities)

Fig. 3. Average fitness (AF) values for different number of facilities with n=6 (fitness versus LA computation calls; curves for 6, 7, 8, 9 and 10 facilities)
The continuous space location problem is identical to certain Cluster Analysis and Vector Quantization problems. The SOM that we used to solve the continuous space LA problem and to compare with the results of VGA is the Kohonen map; it has an n-dimensional input layer (location vectors of the demand centers) and an output layer corresponding to the m facilities (supply centers). Figure 4 illustrates the best results of the SOM and VGA approaches for different numbers of facilities. Figure 5 depicts the locations of 10 facilities found by VGA and the corresponding demand centers. For the same number of facilities, the locations calculated by SOM are shown in Figure 6. In order to see the performance of the algorithm on large problems and to compare it with SOM, the AP data set, which is derived from a real-world location problem for Australia Post and is described in Ernst and Krishnamoorthy [16], is also used. The AP data set can be obtained from OR-Library [17]. This data set consists of 200 nodes, which represent postcode districts. Problems of size n = 50, 100, 150, 200 can be obtained from the data set. Table 1 summarizes all the results obtained by conventional GA, VGA and SOM for different numbers of demand nodes and facilities. The results in the table are the best fitness values after 36,000 calls (i.e. 3,000 generations); VGA attains the best value in every case, tying with SOM, to the reported precision, for 200 demand nodes and 6 facilities.
Fig. 4. Best solution values given by VGA and SOM for different number of facilities (fitness versus number of supply centers, from 2 to 10)

Fig. 5. Locations of 10 supply centers found by VGA

Fig. 6. Locations of 10 supply centers found by SOM
Conventional GA requires much more computation time than VGA. For the case of 200 demand nodes and 10 facilities, VGA converges after 9,612 computation calls, whereas GA reaches its best fitness value, which is worse than that of VGA, only after 72,012 calls (Fig. 7).
5 Conclusion
In this paper, we proposed an application of a new GA approach to continuous covering location problems and compared the results with SOM results. We know that one of the main problems that arises in real coded GA is premature convergence.
Table 1. Comparison of results, n = 12 (fitness values, ×10^6)

# of demand nodes   # of supply centers   GA     VGA    SOM
50                  6                     5.68   5.38   5.63
50                  8                     4.11   3.95   4.14
50                  10                    3.32   3.06   3.25
100                 6                     4.58   4.48   4.61
100                 8                     3.94   3.88   3.97
100                 10                    3.44   3.16   3.34
150                 6                     5.42   5.31   5.40
150                 8                     4.56   4.48   4.53
150                 10                    4.00   3.99   4.00
200                 6                     5.68   5.60   5.60
200                 8                     4.87   4.70   4.73
200                 10                    4.27   4.06   4.11
Fig. 7. Number of LA computation calls for the same fitness value computed by VGA and GA (fitness versus LA computation calls, from 0 to 80000)
Small populations in particular are dramatically affected by premature convergence [15]. The results clearly demonstrate that the use of vibration accelerates convergence even when a small population is used. The performance of VGA is consistently better than that of regular genetic algorithms and SOM. In particular, the computational results show that vibrational mutation has a great effect on performance, because the vibrational mutation technique uses a greater mutation rate and can make a significant change in the value of the location variable. The vibrational mutation therefore spreads the population out over the solution space and enables the algorithm to search the solution space thoroughly.
References

1. Drezner, Z., Hamacher, H.W.: Facility Location: Applications and Theory. Springer-Verlag, Berlin Heidelberg New York (2002) 37–81
2. Reeves, C.R.: Genetic Algorithms. In: Modern Heuristic Techniques for Combinatorial Problems. Wiley, New York (1993) 151–196
3. Hacioglu, A., Özkol, I.: Vibrational Genetic Algorithm As a New Concept in Airfoil Design. Aircraft Eng. and Aerospace Technology, Vol 3 (2002)
4. Okabe, A., Boots, B., Sugihara, K.: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. Wiley, Chichester (1992)
5. Erkut, E., Öncü, T.: A parametric 1-maximin Location Problem. Journal of the Operational Research Society, Vol. 42 (1991) 49–55
6. Carrizosa, E., Munoz, M., Puerto, J.: On the Set of Optimal Points to the Weighted maximin Problem. Studies in Locational Analysis, Vol. 7 (1995) 21–33
7. Drezner, Z., Wesolowsky, G.: Minimax and maximin Facility Location Problems on a Sphere. Naval Research Logistics Quarterly, Vol. 30 (1983) 305–312
8. Appa, G., Giannikos, I.: Is Linear Programming Necessary for Single Facility Location with maximin of Rectilinear Distance? Journal of the Operational Research Society, Vol. 45 (1994) 97–107
9. Liu, C-M., Kao, R-L., Wang, A-H.: Solving Location-Allocation Problems with Rectilinear Distances by Simulated Annealing. Journal of the Operational Research Society, Vol. 45 (1994) 1304–1315
10. Abdinnour-Helm, S.: A Hybrid Heuristic for the Uncapacitated Hub Location Problem. European Journal of Operational Research, Vol. 106 (1998) 489–499
11. Obayashi, S., Takanashi, S., Takeguchi, Y.: Niching and Elitist Model for MOGAs. Parallel Problem Solving from Nature-PPSN V, Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg New York 260–269
12. Baker, J.E.: Reducing Bias and Inefficiency in the Selection Algorithm. Proceedings of the Second International Conference on Genetic Algorithms, Morgan Kaufmann Publishers (1987) 14–21
13. Eshelman, L.J., Schaffer, J.D.: Real Coded Genetic Algorithms and Interval Schema. Foundations of Genetic Algorithms 2, Morgan Kaufmann Publishers (1993) 187–202
14. Lazano, F., Guerrero, F., Onieva, L., Larraneta, J.: Kohonen Maps for Solving A Class of Location-Allocation Problems. European Journal of Operations Research, Vol. 108 (1998) 106–117
15. Sefrioui, M., Periaux, J.: Genetic Algorithms, Game Theory and Hierarchical Models: Some Theoretical Backgrounds Part I: Real Coded GA. von Karman Institute for Fluid Dynamics Lecture Series 2000-07
16. Ernst, A.T., Krishnamoorthy, M.: Efficient Algorithms for the Uncapacitated Single Allocation p-hub Median Problem. Location Science, 139–154
17. Beasley, J.E.: OR-Library: Distributing Test Problems by Electronic Mail. Journal of the Operational Research Society, Vol. 41 (1990) 1069–1072
Minimal Addition-Subtraction Chains Using Genetic Algorithms

Nadia Nedjah and Luiza de Macedo Mourelle

Department of Systems Engineering and Computation, Faculty of Engineering, State University of Rio de Janeiro, Rio de Janeiro, Brazil
{nadia, ldmm}@eng.uerj.br
http://www.eng.uerj.br/~ldmm
Abstract. Addition and addition-subtraction chains consist of a sequence of integers that allow one to efficiently compute the power $T^E$, where T varies but E is constant. The shorter the addition (addition-subtraction) chain is, the more efficient the computation. Solving the optimisation problem that yields the shortest addition (addition-subtraction) chain is NP-hard. There exist some heuristics that attempt to obtain reduced addition (addition-subtraction) chains. We obtain minimal addition (addition-subtraction) chains using genetic algorithms.
1 Introduction
Recently, several cryptosystems based on the Abelian group defined over elliptic curves have been proposed. In these cryptosystems [10], the inverse of an element is easily obtained. Hence, for such groups, one can compute exponentiations by an interleaved sequence of multiplications and divisions. The performance of existing cryptosystems [10] is primarily determined by the implementation efficiency of the modular multiplication/division. As the operands are usually large (i.e. 1024 bits or more), and in order to improve the time requirements of the encryption/decryption operations, it is essential to attempt to minimise the number of modular multiplications/divisions performed.

A simple procedure to compute $C = T^E \bmod M$ is based on the paper-and-pencil method. This method requires E-1 modular multiplications. It computes all powers of T: $T \to T^2 \to T^3 \to \cdots \to T^{E-1} \to T^E$, where $\to$ denotes a multiplication. The paper-and-pencil method computes more multiplications than necessary. For instance, to compute $T^{31}$, it needs 30 multiplications. However, $T^{31}$ can be computed using only 7 multiplications: $T \to T^2 \to T^3 \to T^5 \to T^{10} \to T^{11} \to T^{21} \to T^{31}$. But if division is allowed, $T^{31}$ can be computed using only 5 multiplications and one division: $T \to T^2 \to T^4 \to T^8 \to T^{16} \to T^{32} \xrightarrow{-} T^{31}$, where $\xrightarrow{-}$ denotes a division. The basic question is: what is the least number of multiplications and/or divisions needed to compute $T^E$, given that the only operation allowed is multiplying or dividing two already computed powers of T? Answering this question is NP-hard [8], but there are several efficient algorithms that can find near-optimal solutions. In the rest of the paper, addition chains will be considered as a particular case of addition-subtraction chains.

T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 303–313, 2002. © Springer-Verlag Berlin Heidelberg 2002

Evolutionary algorithms are computer-based solving systems which use evolutionary computational models as a key element in their design and implementation. A variety of evolutionary algorithms have been proposed, of which genetic algorithms represent the major class [5]. Their conceptual base is the simulation of the evolution of individual structures via the Darwinian natural selection process. The process depends on the performance of the individual structures as defined by their environment. Genetic algorithms are well suited to providing an efficient solution to NP-hard problems [2].

In the rest of this paper, we present a novel method, based on the addition-subtraction chain method, that attempts to minimise the number of modular multiplications and divisions necessary to compute exponentiations. It does so using genetic algorithms. The paper is structured as follows: Section 2 presents the most cited addition-subtraction chain based methods; Section 3 gives an overview of genetic algorithm concepts; Section 4 explains how these concepts can be used to compute a minimal addition-subtraction chain; Section 5 presents some useful results.
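The two ways of computing the 31st power described in the introduction can be checked with a short Python sketch. The function name, the brute-force search for the pair of earlier chain members, and the modulus 101 are assumptions made for illustration; the modular inverse uses the built-in `pow(x, -1, m)`, which requires Python 3.8 or later.

```python
def power_along_chain(t, modulus, chain):
    """Compute t**chain[-1] mod modulus by following an addition-subtraction chain.

    Each chain member must be the sum or the difference of two earlier
    members; a sum costs one modular multiplication, a difference one
    modular division. Returns the result and the number of operations.
    """
    if chain[0] != 1:
        raise ValueError("chain must start with 1")
    powers = {1: t % modulus}
    operations = 0
    for k in range(1, len(chain)):
        target, earlier = chain[k], chain[:k]
        step = next(
            ((a, b, kind)
             for a in earlier for b in earlier
             for kind, value in (("mul", a + b), ("div", a - b))
             if value == target),
            None,
        )
        if step is None:
            raise ValueError(f"{target} is not a sum or difference of earlier members")
        a, b, kind = step
        if kind == "mul":
            powers[target] = powers[a] * powers[b] % modulus
        else:
            # Division: multiply by the modular inverse (Python 3.8+).
            powers[target] = powers[a] * pow(powers[b], -1, modulus) % modulus
        operations += 1
    return powers[chain[-1]], operations


# Hypothetical usage with base 7 and the prime modulus 101.
add_result, add_ops = power_along_chain(7, 101, [1, 2, 3, 5, 10, 11, 21, 31])
sub_result, sub_ops = power_along_chain(7, 101, [1, 2, 4, 8, 16, 32, 31])
```

The first chain uses 7 operations and the second 6 (5 multiplications and one division), matching the counts given in the text, and both return the same value as the built-in `pow(7, 31, 101)`.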
2 The Addition-Subtraction Chain Based Methods
The addition-subtraction chain based methods for a given exponent E attempt to find a chain of numbers such that the first number of the chain is 1 and the last is the exponent E, and in which each member of the chain is the sum/difference of two previous members. A formal definition of an addition-subtraction chain is given in Definition 1.

Definition 1. An addition-subtraction chain of length l for an integer n is a sequence of integers $(a_0, a_1, a_2, \ldots, a_l)$ such that $a_0 = 1$, $a_l = n$ and $a_k = a_i \pm a_j$, $0 \le i \le j < k \le l$.

For the addition-only chain based methods the exponent is represented by its binary representation, while for the addition-subtraction based methods the exponent is recoded, generally using the Booth algorithm [4]. The recoding of the exponent should normally reduce the number of multiplications and divisions that are necessary to compute the exponentiation. Generally speaking, the recoding techniques use the fact that

$$2^{i+j-1} + 2^{i+j-2} + \cdots + 2^{i} = 2^{i+j} - 2^{i}$$

to collapse a block of ones in order to obtain a minimal sparse representation. Using the notation $\bar{1} = -1$ to represent signed binary numbers, the signed canonical binary representation of $23_{10} = 10111_2$ is $10\bar{1}00\bar{1}$. Note that the binary representation contains four digits different from 0, while the recoded representation contains only three such digits. However, recoding methods need an efficient way to compute the inverse of elements, as is the case with cryptosystems based on elliptic curves, where the inverse is easily obtained.

Finding a minimal addition-subtraction chain for a given number is NP-hard. Therefore, heuristics are used to attempt to approach such a chain. The most used heuristic consists of scanning the digits of E from the least significant to the most significant digit and grouping them in partitions $P_i$. The size of the partitions can be constant or variable [4], [6], [7]. Modular exponentiation methods based on constant-size partitioning of the exponent are usually called (recoding) m-ary methods, where m is a power of two and $\log_2 m$ is the size of a partition, and modular exponentiation methods based on variable-size partitioning are usually called (recoding) window methods. There exist several strategies to partition the exponent. The generic computation of such methods is given in Algorithm 1, where $E^*$ is the recoded exponent and $V_i$ and $L_i$ denote the value and length of partition $P_i$ respectively.

Algorithm 1. (Recoding)ModularExponentiation(T, M, E)
  Partition E*/E using the given strategy;
  for each P_i in the partition of E*/E compute T^{V_i} mod M;
  C = T^{V_{b-1}} mod M;
  for i = b-2 downto 0
    C = C^{2^{L_i}} mod M;
    if V_i ≠ 0 then C = C × T^{V_i} mod M;
  return C;
end algorithm.
Observe that when E* is used, the value Vi of a partition Pi can be null, positive or negative. When Vi is negative, the multiplication in line 3 of Algorithm 1 is actually a division. The algorithm based on a given addition-subtraction chain, used to compute the modular exponentiation $C = T^E \bmod M$, is specified by Algorithm 2.

Algorithm 2. ChainBasedMethod(T, M, E)
  Let [a0 = 1 a1 a2 ... al = E] be an addition chain for E;
  power[0] = T mod M;
  for k = 1 to l
    let ak = ai ± aj | i

m or (cij⊕wij) > n THEN change the element to 0
4. For all non-zero elements, apply .
5. IF bij > m or dij > n THEN change the element to 0.
6. IF all elements = 0 THEN goto 12 ELSE increase path length by 1.
7. IF there exist identical elements THEN remove all elements except one.
8. In DM, apply and look for data dependency and mutate DM.
9. IF the selected LCM value is bigger than that of the final DO loop THEN change that element to 0.
10. Apply to every non-zero element.
11. Goto 6
12. IF there exists added number THEN apply .
13. Except for the mutated sentences, perform doall Si to every sentence.
3 Extraction of Parallelism Using the Data Dependency Removal Method
In order to extract the parallelism from the loop with procedure calls, we used the data dependency removal method. The method transforms procedures used in the loop using inlining. It then applies the data dependency removal algorithm.
Algorithm 4
1. Draw the procedure call multi-graph
2. Expand it into the augmented call graph
3. Calculate the information between procedures
4. Dependency analysis
5. IF (one procedure is called and the caller related variables are not changed) THEN goto ELSE goto
6. Insert the data dependency removal algorithm and make parallel code
Algorithm 5
1. Initialization
   S = number of sentences
   Declaration of dynamic memory arrays DMA, DMB, DMC
   sw = 0
2. Use the GCD arithmetic function call for the diophantine method calculation, and derive pass = 1 and the S*S two-dimensional dependence matrix DMB.
   IF (the index variable is the same) THEN rename
3. Set loop index variables i, j. They are N1, N2 at most.
   For all elements i, j of the DMB matrix
     IF ((aij⊕uij) > N1 || (cij⊕wij) > N2) THEN change that element to 0
   IF sw == 0 THEN DMA = DMB, sw = 1
4. Using the DMB matrix, on all nonzero elements
   Forall i, j
     IF (uij & wij == 1) THEN uniform
     ELSE IF (uij & wij != 1) THEN nonuniform
     ELSE complex type
   change to doall Si
5. IF the same element exists THEN remove all except one
6. IF all elements = 0 THEN go to 8
   ELSE
     increase pass by 1; by matrix product, derive the S*S dependence matrix
     DMC = DMA * DMB
     pass++
     DMB = DMC
     free DMC
   ENDIF
7. Go to 3
8. Except for the changed sentences, change all sentences into DOALL Si sentences.
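Step 2 of Algorithm 5 refers to a GCD calculation for the diophantine dependence equation. The classical GCD test that this step appears to rely on can be sketched as follows; the function name and the two-variable form a*i + b*j = c are assumptions for illustration, and loop bounds are ignored, so a positive answer only means a dependence may exist.

```python
from math import gcd


def gcd_test(a, b, c):
    """Return True if a*i + b*j = c can have an integer solution (i, j).

    A linear diophantine equation has integer solutions exactly when
    gcd(a, b) divides c. Loop bounds are not checked here, so True means
    'dependence possible', while False proves independence.
    """
    g = gcd(a, b)
    if g == 0:            # both coefficients are zero
        return c == 0
    return c % g == 0


# Hypothetical example: a write to X(2*I) and a read of X(2*J+1) would need
# 2*I - 2*J = 1, which has no integer solution, so they never overlap.
even_vs_odd = gcd_test(2, -2, 1)
may_overlap = gcd_test(3, 6, 9)
```

A statement pair that fails the test can be marked as independent in the dependence matrix without any further analysis.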
Algorithm 6
/* Apply the inter procedure transformation first, and then apply inlining except for the inlining procedures */
{
  1. repeat (until all the procedures are optimized) {
       /* inter procedure changed property */
     }
  2. repeat (until the inter procedure is optimized) {
       /* except inlining inter procedure transformation */
     }
  3. apply inlining in a reverse topological order
}

Algorithm 7
/* after inlining, perform changes in procedure */
{
  1. apply inlining in a reverse topological order
  2. repeat (until all the procedures are optimized) {
       /* inter procedure changed property */
     }
}
4 Performance Analysis
To show that the proposed method is better than the conventional ones, we evaluated the data distance using two of the most widely used sample codes for the uniform and nonuniform types. For the uniform code in Figure 1(a), the data dependency removal method is compared with linear transformation methods such as interchanging, tiling, skewing, unimodular, selective cycle shrinking, Hollander, and Chen&Wang. Figure 2(a) shows the performance of the unimodular, selective cycle shrinking, Hollander, Chen&Wang and new methods when the distance is uniform. We performed the comparison on a CRAY T3E machine with the fixed values N1=20 and N2=100, using 4 processes. The unimodular method and the cycle shrinking method are similar in performance because they divide the blocks, process them sequentially, and then calculate the value of the dependence distance. The Hollander method is the worst, because it processes the white nodes and black nodes serially. The best performance is achieved when the data dependency removal method is applied until there is no more data dependency. The number of parallel codes used is 2N1-4 for tiling, interchanging and selective shrinking, (N2-4)/4 for the skewing, unimodular and Chen&Wang methods, and (N1×N2)/2 for the Hollander method, whereas for the data dependency removal method it is only 2+N1/10. Figure 2(b) shows the performance of the DCH method, the IDCH method and the new method when the distance is nonuniform. For this measurement, we increased the number of iterations from 50 to 500, using 4 processes. With the DCH method the number is N1/2, with the IDCH method it is N1×N2/Tn (Tn: number of tiles), and for the data dependency removal method it is 1+Min(N1,N2)/4. This shows that the proposed method is effective for both uniform and nonuniform code.

    DO I = 3, N1
      DO J = 5, N2
        A(I, J) = B(I-3, J-5)
        B(I, J) = A(I-2, J-4)
      ENDDO
    ENDDO

(a) uniform code

    DO I = 1, N1
      DO J = 1, N2
        A(2*I+J+1, I+J+3) = A(2*J+3, I+1)
      ENDDO
    ENDDO

(b) nonuniform code

Fig. 1. Example Code I
(a) uniform
(b) nonuniform
Fig. 2. Performance results of Example code I
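The difference between the two loops of Fig. 1 can be made concrete by enumerating dependence distances by brute force. The following Python sketch is an illustration only, not the algorithm of the paper: the helper `distance_vectors` and the small bound `N = 10` are assumptions, and for the nonuniform loop we assume that element A(2*I+J+1, I+J+3) is written and element A(2*J+3, I+1) is read.

```python
def distance_vectors(write_idx, read_idx, i_range, j_range):
    """Collect (di, dj) between iterations that write and read the same element."""
    writers = {}
    for i in i_range:
        for j in j_range:
            writers.setdefault(write_idx(i, j), []).append((i, j))
    distances = set()
    for i in i_range:
        for j in j_range:
            # every earlier or later iteration that wrote the element read here
            for wi, wj in writers.get(read_idx(i, j), []):
                distances.add((i - wi, j - wj))
    return distances


# Uniform loop of Fig. 1(a), with the N1 and N2 used in the measurements.
N1, N2 = 20, 100
i_rng, j_rng = range(3, N1 + 1), range(5, N2 + 1)
uniform_a = distance_vectors(lambda i, j: (i, j),
                             lambda i, j: (i - 2, j - 4), i_rng, j_rng)
uniform_b = distance_vectors(lambda i, j: (i, j),
                             lambda i, j: (i - 3, j - 5), i_rng, j_rng)

# Nonuniform loop of Fig. 1(b); N is an assumed small bound for illustration.
N = 10
rng = range(1, N + 1)
nonuniform = distance_vectors(lambda i, j: (2 * i + j + 1, i + j + 3),
                              lambda i, j: (2 * j + 3, i + 1), rng, rng)

print(uniform_a, uniform_b, len(nonuniform))
```

Under these assumptions the uniform loop yields a single vector per array, (2, 4) for A and (3, 5) for B, whereas the nonuniform loop yields several different vectors, which is why a fixed-distance transformation does not apply to it directly.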
The comparison and analysis of the data dependency removal method across multiple procedures was also performed on the CRAY T3E system. We increased the number of processors through 2, 4, 8, 16, and 32 and applied the data dependency distance method to uniform, nonuniform, and complex code. Using the example in Figure 3, we compared it with the loop extraction, loop embedding, and procedure cloning transformation methods. The interprocedural data dependency removal method performs inline expansion and then removes data dependencies until none remain. The same process is applied to loop extraction and loop embedding, which can reduce the overhead of procedure calls. Procedure cloning divides the code into a sequential part and a parallel part, and it produces the best parallelism. For nonuniform and complex data dependency distance, parallelization is possible only with the expanded data dependency removal method.
Interprocedural Transformations for Extracting Maximum Parallelism
Therefore, we parallelize the code for the data dependency removal method and run it sequentially for all other methods.

    SUBROUTINE P
      real a(n, n)
      integer i
      do i = 1, 10
        call Q(a, i)
        call Q(a, i+1)
      enddo

    SUBROUTINE Q(f, i)
      real f(n, n)
      integer i, j
      do j = 1, 100
        f(i, j) = f(i, j) + ...
      enddo

(a) uniform code

    SUBROUTINE P
      real a(n, n)
      integer i
      do i = 1, 10
        call Q(a, i*3)
        call Q(a, i*5)
      enddo

    SUBROUTINE Q(f, i)
      real f(n, n)
      integer i, j
      do j = 1, 100
        f(i, j*5) = f(i, j*4) + ...
      enddo

(b) nonuniform code

    SUBROUTINE P
      real a(n, n)
      integer i
      do i = 1, 10
        call Q(a, i)
        call Q(a, i*5)
      enddo

    SUBROUTINE Q(f, i)
      real f(n, n)
      integer i, j
      do j = 1, 100
        f(i, j) = f(i, j) + ...
      enddo

(c) complex code

Fig. 3. Example code II

[Figure 4: performance plots, panels (a) uniform code, (b) nonuniform code, (c) complex code; not reproduced]

Fig. 4. Performance of Example code II
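To see why the three variants of Fig. 3 behave differently once Q is inlined into P, one can look only at the row argument passed at each call site. The sketch below is our own illustration, not the paper's algorithm: `row_distances` is a hypothetical helper that records which iterations of the i loop touch the same row of `a` and reports the gaps between them; column accesses inside Q are deliberately ignored.

```python
from itertools import combinations


def row_distances(row_args, iterations=range(1, 11)):
    """Distances between iterations of P's loop that touch the same row of a."""
    touched = {}
    for i in iterations:
        for row_of in row_args:
            touched.setdefault(row_of(i), set()).add(i)
    distances = set()
    for its in touched.values():
        # every pair of distinct iterations sharing a row is a dependence
        for earlier, later in combinations(sorted(its), 2):
            distances.add(later - earlier)
    return distances


# Row arguments of the two calls to Q in each variant of subroutine P.
uniform = row_distances([lambda i: i, lambda i: i + 1])          # Fig. 3(a)
nonuniform = row_distances([lambda i: i * 3, lambda i: i * 5])   # Fig. 3(b)
complex_code = row_distances([lambda i: i, lambda i: i * 5])     # Fig. 3(c)

print(uniform, nonuniform, complex_code)
```

In this simplified view, variant (a) has the single distance 1, while variants (b) and (c) each produce two different distances, so no single fixed distance describes them.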
Performance analysis is done by comparing the runtime performance on different numbers of processors. The results are summarized in Fig. 4. For uniform, nonuniform, and complex code alike, the expanded data dependency removal method improves as processors are added. When the distance is uniform, the procedure cloning transformation method is better than the loop embedding and loop extraction methods, and loop extraction is the least efficient. For nonuniform and complex code, only the data dependency removal method can be parallelized. In that sense, this method is the best.
5 Conclusion
Most programs spend most of their execution time in loops. For this reason, there are many ongoing studies on transforming sequential programs into parallel programs. Most of these studies focus on extracting parallelism and then transforming it into interprocedural parallelism. However, these methods can only be applied to uniform code. This paper proposed an algorithm that is applicable to code with both uniform and nonuniform dependency distance. To demonstrate this, we first applied the data dependency removal method to a single loop; the results show that the method is very efficient. We then applied the method to the interprocedural algorithm and showed that it is efficient there as well.
References

1. Banerjee, U.: Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, 1993.
2. Chen, Y-S and S-D Wang: A Parallelizing Compilation Approach to Maximizing Parallelism within Uniform Dependence Nested Loops. Dept. of Electrical Engineering, National Taiwan University, 1993.
3. Hall, M. W.: Managing Interprocedural Optimization. Ph.D. thesis, Dept. of Computer Science, Rice University, 1991.
4. Hall, M. W., K. Kennedy, and K. S. McKinley: Interprocedural Transformations for Parallel Code Generation. Technical Report 1149-s, Dept. of Computer Science, Rice University, 1991.
5. D’Hollander, E. H.: Partitioning and Labeling of Loops by Unimodular Transformations. IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 4, July, 1992.
6. McKinley, K. S.: A Compiler Optimization Algorithm for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 9(8): 769–787, August, 1998.
7. Punyamurtula, S., and V. Chaudhary: Compile-Time Partitioning of Nested Loop Iteration Spaces with Non-uniform Dependences. In Journal of Parallel Algorithm and Architecture, 1996.
8. Tzen, T. H. and L. M. Ni: Dependence Uniformization: A Loop Parallelization Technique. IEEE Transactions on Parallel and Distributed Systems, May, 1993.
9. Wolfe, M. J.: High Performance compiler for Parallel Computing. Oregon Graduate Institute of Science & Technology, 1996.
On Methods’ Materialization in Object–Relational Data Warehouse

Bartosz Bebel, Zbyszko Królikowski, and Robert Wrembel

Poznan University of Technology, Institute of Computing Science
Piotrowo 3A, 60-965 Poznan, Poland
{Bartosz.Bebel, Zbyszko.Krolikowski, Robert.Wrembel}@cs.put.poznan.pl
Abstract. Object–relational data warehousing systems are increasingly applied to the integration and analysis of complex data. For such warehouses, the application of materialized object–oriented views is promising. Such views are able to: (1) integrate not only data of complex structure but also manage the behavior of data, i.e. methods, (2) transform simple relational data to data of complex structure and vice versa. This paper addresses a technique for materializing method results in object–oriented views. We have developed a method materialization technique, called hierarchical materialization, and evaluated it in a number of experiments concerning methods without input arguments as well as methods with input arguments, for various patterns of method dependencies. The obtained results show that hierarchical materialization may significantly reduce methods' execution times. In this paper we present the results concerning methods without input arguments.
T. Yakhno (Ed.): ADVIS 2002, LNCS 2457, pp. 425–434, 2002. © Springer-Verlag Berlin Heidelberg 2002

1 Introduction

New, dynamically growing branches of industry impose new requirements on processing and storing large amounts of data. The information collected in an enterprise often differs in data format and complexity (e.g., relational, object–relational, object–oriented, on-line, Web pages, semi–structured, spreadsheets, flat files) and is stored in information systems that usually have different functionality. As the management of an enterprise requires a comprehensive view of most of its data, one of the important tasks of an information system is to provide integrated access to all data sources within an enterprise. In practice, data warehousing systems are very often used as a "tool" for data integration. A data warehouse is mostly implemented as a set of materialized views.

Object–relational data warehouse systems are suitable for integrating and warehousing complex data [4, 6]. In this field, materialized object–oriented views (MOOV) are very promising; however, few concepts have been contributed to this domain so far. We have developed an approach to MOOV, called the View Schema Approach (VSA). In VSA [10, 11, 12], an object-oriented view (OOV) is defined as a view schema of arbitrarily complex structure and behavior, composed of view classes. Each view class is derived from one or more classes in a database schema. A view class is derived by an OQL-like command defining its structure, behavior, and set of instances. View classes in a view schema are connected by inheritance and aggregation relationships. Several view schemas may be defined in one database and each of them is uniquely identified by its name.

While materializing an OOV one should consider the materialization of objects’ structure as well as objects’ methods. The materialization of a method consists in computing the result of the method once, storing it persistently in a database, and then using the persistent value when the method is invoked, rather than computing it every time the method is invoked. When a method result is made persistent, it has to be kept up to date when the data used to compute it change. When method m is materialized, it may be reasonable to materialize also the intermediate results of methods called from m. We call this technique hierarchical materialization. When an object used to materialize the result of method m is updated, m has to be recomputed. This recomputation can use unaffected intermediate results that have already been materialized, thus reducing the time spent on recomputation. Hierarchical materialization of methods is suitable for environments where updates and deletions of objects are less frequent than queries, e.g., data warehousing systems.

This paper focuses on a framework for the materialization of method results in object–oriented views and discusses its implementation and experimental evaluation. Section 2 discusses related approaches to method materialization. Section 3 outlines our concept of hierarchical materialization of methods. Section 4 presents experimental results concerning hierarchical materialization. Finally, Section 5 summarizes the paper and points out areas for future work.
2 Related Work

Several approaches to object–oriented views have been proposed in scientific publications. Few of them support materialization and maintenance of an OOV (see [11] for an overview). None of the approaches, however, supports the materialization of methods in a materialized OOV.

There are a few approaches that address method precomputation, i.e. materialization, in the context of indexing techniques and query optimization [7, 1, 8, 9]. The work of [7] sets up an analytical framework for estimating the costs of caching complex objects. Two data representations are considered, i.e., procedural representation and object identity based representation.

In the approach of [1], the results of materialized methods are stored in an index structure based on a B–tree, called a method–index. A method–index on a method M stores in its key values the results of the invocation of M on the instances of the indexed class. Before executing M, the system searches the method–index for M. If the appropriate entry is found, the already precomputed value is used. Otherwise, M is executed for an object. The application of method materialization proposed in [1] is limited to methods that: (1) do not have input arguments, (2) use only atomic type attributes to compute their values, and (3) do not modify values of objects. Otherwise, a method is left non–materialized.

The concept of [8, 9] proposes data structures supporting the materialization of methods. It uses the so-called Reverse Reference Relation, which stores the information about an object used to materialize method m, the name of the materialized method, and the set of objects passed as arguments of m. Furthermore, the approach also maintains information about the attributes, called relevant attributes, whose values were used to materialize method m. Method m has to be recomputed only when the value of a relevant attribute of an object used in m is updated. Each modification of an object used for computing the value of method m results in the rematerialization of the method's materialized value. The approach, however, does not consider the dependencies among methods.

Another approach to method precomputation concerns so-called inverse methods [5]. An inverse method is used for transforming a value or object from one representation (type) to another. When such a method is used in a query, it is computed once, instead of computing the appropriate method for each object returned by the query. The result of an inverse method is not persistent. It is stored in memory, is accessible only to the current query, and is removed from memory when the query ends.
3 Hierarchical Materialization of Methods

3.1 Hierarchical Materialization at a Glance

When hierarchical materialization is applied to method mi, the result of mi is stored persistently and, additionally, the intermediate results of methods called from mi are also stored persistently. Once mi is set as materialized, the result of the first invocation of mi for object oi is stored persistently. Each subsequent invocation of mi for the same object oi uses the already materialized value. When an object oi, used to materialize the result of method mi, is updated or deleted, mi has to be recomputed. This recomputation can use unaffected intermediate materialized results, thus reducing the recomputation time overhead.

Methods may have various numbers of input arguments, which can be of various types. Generally, methods that have input arguments are not good candidates for materialization. However, in our approach, a method with input arguments can be materialized and maintained within acceptable time provided that: (1) the method has few input arguments and (2) each of the arguments has a narrow, discrete domain. Moreover, hierarchical materialization may be useful only (1) for those methods that call other methods whose computation is costly and (2) for environments where updates and deletions of objects are less frequent than queries, e.g., a data warehouse. The following example informally illustrates hierarchical materialization.

Example 1. Let us consider view class V_Computer, composed of V_CDDrive, V_Disk, and V_MainBoard. V_MainBoard in turn is composed of V_RAM and V_CPU. Instances of these view classes are shown in Figure 1. Each view class in this view schema has a method, called power_cons, that returns the electric power consumed by a computer component. The power consumed by each instance of V_MainBoard is the sum of the power consumed by V_RAM, V_CPU, and V_MainBoard itself. Similarly, the power consumed by each instance of V_Computer is the sum of the power consumed by V_CDDrive, V_MainBoard, and V_Disk.

Let us further assume that the instance of V_Computer identified by vcom1 is composed of objects vcd1 (an instance of view class V_CDDrive), vd20 (an instance of view class V_Disk), and vmb100 (an instance of V_MainBoard), which in turn is composed of vram200, vram201, vram202, and vcpu10. The result of V_Computer::power_cons is materialized for an instance of V_Computer only when this method is invoked for this instance. Furthermore, all the methods called from V_Computer::power_cons are also materialized when they are executed.

Let us assume that the power_cons method was invoked for vcom1. In our example, the hierarchical materialization mechanism results in materializing the values of the following methods: V_RAM::power_cons for objects vram200, vram201, and vram202, V_CPU::power_cons for object vcpu10, V_MainBoard::power_cons for object vmb100, V_Disk::power_cons for object vd20, V_CDDrive::power_cons for object vcd1, and finally V_Computer::power_cons for object vcom1.
[Figure 1: object diagram with instances vcom1 : V_Computer, vcd1 : V_CDDrive, vd20 : V_Disk, vmb100 : V_MainBoard, vram200, vram201, vram202 : V_RAM, and vcpu10 : V_CPU; the calls are labeled 1: power_cons( ), 2: power_cons( ), 3: power_cons( ), 4: power_cons( ), and 5: power_cons(Integer)]

Fig. 1. An example of materialized view schema instances
Having materialized the methods discussed above, let us consider that the component object vd20 has been replaced with another disk instance, say vd301, with greater power consumption. Thus, the result of V_Computer::power_cons materialized for vcom1 is no longer valid and it has to be recomputed. However, during the recomputation of vcom1.power_cons the unaffected materialized results of methods can be reused, i.e., vcd1.power_cons, vmb100.power_cons have not been changed and they can be used to compute a new value of vcom1.power_cons.
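Example 1 can be replayed with a few lines of Python. This is a minimal sketch under stated assumptions, not the prototype's implementation: the class `Component`, its `materialized` attribute, and all power values are hypothetical, and invalidation of vcom1 is done by hand here.

```python
class Component:
    """Hypothetical stand-in for a view-class instance with a power_cons method."""

    evaluations = 0  # how many times a power_cons body was really executed

    def __init__(self, oid, own_power, parts=()):
        self.oid = oid
        self.own_power = own_power   # assumed wattage, not taken from the paper
        self.parts = list(parts)
        self.materialized = None     # None means "no materialized result"

    def power_cons(self):
        if self.materialized is None:  # compute once, then reuse the stored value
            Component.evaluations += 1
            self.materialized = self.own_power + sum(
                p.power_cons() for p in self.parts)
        return self.materialized


rams = [Component(f"vram{n}", 5) for n in (200, 201, 202)]
vcpu10 = Component("vcpu10", 60)
vmb100 = Component("vmb100", 20, rams + [vcpu10])
vcd1 = Component("vcd1", 15)
vd20 = Component("vd20", 10)
vcom1 = Component("vcom1", 0, [vcd1, vd20, vmb100])

first_result = vcom1.power_cons()
first_evaluations = Component.evaluations  # all eight methods get materialized

# Replace disk vd20 by vd301 and invalidate only the affected result.
vd301 = Component("vd301", 12)
vcom1.parts = [vcd1, vd301, vmb100]
vcom1.materialized = None
Component.evaluations = 0
second_result = vcom1.power_cons()
second_evaluations = Component.evaluations  # only vcom1 and vd301 are computed
```

With these assumed values, the first invocation executes eight method bodies, while the recomputation after the disk swap executes only two, because the results stored for vcd1 and vmb100 are reused.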
3.2 Data Structures
In order to materialize methods in a view class and maintain the materialized results, three additional data structures have been proposed: View Methods, Materialized Method Results Structure, and Graph of Method Calls (cf. [2, 11]).

View Methods (VM for short) makes available the data dictionary information about all methods implemented in view classes and their signatures.

The chain of method dependencies, where one method calls another, is called the Graph of Method Calls (GMC for short). GMC is used by the procedure that maintains the materialized results of methods. When materialized method mj becomes invalid, all the materialized methods that use the value of mj also become invalid. The content of GMC is used to invalidate those methods.

As the same method can be invoked for different instances of a given view class, and the same method can be invoked with different values of input arguments, the system has to maintain the mappings between: (1) the materialized value of a method, (2) the object for which it was invoked, and (3) the values of input arguments. The mappings are represented in a structure called the Materialized Method Results Structure (MMRS for short). MMRS is also used by the procedure that maintains the materialized results of methods. When method mi is invoked for a given object oi and this method has been previously set as materialized, MMRS is searched for the result of mi invoked for oi. If the result is not found, mi is computed for oi and the result is stored in MMRS. Otherwise, the materialized result of mi is read instead of executing mi. When an object used to compute the materialized value of mi is updated or deleted, the materialized value becomes invalid and the appropriate record is removed from MMRS.

The removal of the result of method mj causes the results of methods that called mj to become invalid as well, and they have to be removed from MMRS. The removal of materialized results from MMRS is executed recursively up to the root of GMC. In order to ease the traversal of the aggregation hierarchy in the inverse direction, our prototype system maintains so-called inverse references for each object. An inverse reference for object oj is the reference from oj to other objects that reference oj. For example, the inverse reference for object vcpu10 (cf. Figure 1) contains one object identifier, vmb100, that points to the instance of the V_MainBoard view class.
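The interplay of MMRS lookups and recursive invalidation can be sketched as follows. This is an illustrative model only: the dictionary layout, the functions `invoke` and `invalidate`, and the stored values are assumptions, and because every class in Example 1 uses the same method name, the GMC is collapsed here into "the same method on each referencing object".

```python
def invoke(mmrs, oid, method, args, compute):
    """Return the materialized result if present, otherwise compute and store it."""
    key = (oid, method, args)
    if key not in mmrs:
        mmrs[key] = compute()
    return mmrs[key]


def invalidate(mmrs, inverse_refs, oid, method):
    """Remove results of `method` for `oid` and, recursively, for its referrers."""
    removed = []
    pending = [oid]
    while pending:
        current = pending.pop()
        for key in [k for k in mmrs if k[0] == current and k[1] == method]:
            del mmrs[key]
            removed.append(key)
        # follow inverse references towards the root of the hierarchy
        pending.extend(inverse_refs.get(current, ()))
    return removed


# Assumed content after power_cons was invoked for vcom1 (values are made up).
mmrs = {(oid, "power_cons", ()): value for oid, value in [
    ("vram200", 5), ("vram201", 5), ("vram202", 5), ("vcpu10", 60),
    ("vmb100", 95), ("vcd1", 15), ("vd20", 10), ("vcom1", 120)]}
inverse_refs = {
    "vram200": ["vmb100"], "vram201": ["vmb100"], "vram202": ["vmb100"],
    "vcpu10": ["vmb100"], "vmb100": ["vcom1"],
    "vcd1": ["vcom1"], "vd20": ["vcom1"]}

removed = invalidate(mmrs, inverse_refs, "vcpu10", "power_cons")
removed_oids = [key[0] for key in removed]
new_value = invoke(mmrs, "vcpu10", "power_cons", (), lambda: 65)
```

Updating vcpu10 removes the records for vcpu10, vmb100, and vcom1, while the five records for the RAM modules, the CD drive, and the disk stay in MMRS and remain available for the next recomputation.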
4 Experimental Evaluation

The proposed hierarchical materialization technique has been implemented within the so-called View Schema Approach Prototype (VSAP). The prototype has been implemented partially as an application written in C/C++ and partially as packages, functions, and procedures stored in the Oracle9i DBMS, using its object–oriented features. All dictionary tables, object tables, and data have been stored in this database. The experiment evaluating hierarchical materialization has been performed in the Oracle9i (rel. 9.0.1.2.0) database management system. Oracle9i was running under Linux on a PC with an Intel Celeron 333MHz and 512MB of RAM; the size of the database buffer was 16MB.

The experiments were performed for different shapes of the Graph of Method Calls, as shown in Figure 2. For example, in the first case (Figure 2a) method m1 called m11 and m12. The result of m1 was computed as the sum of the results returned by m11 and m12. Similarly, m11 called m111 and m112 and summed up their results. m111, in turn, called m1111 and m1112 and summed up their results. The same computation pattern was used for the rest of the methods in this GMC. The graph also represents the aggregation hierarchy of objects. A complex object at the root of the hierarchy references two other objects at the lower level. Each of these lower–level objects in turn references two further objects. As a consequence, each root complex object is composed of 126 component objects. The size of one root complex object, including its components, was 15kB.

The experiments were performed for various numbers of root complex objects (from 100 to 5000). We have also evaluated the performance of materialized methods with input arguments (cf. [3]); however, due to space limitations we present only the results for methods without input arguments and only for 2000 root complex objects, which is representative of the other numbers of root complex objects.
Fig. 2. Different shapes of Graph of Method Calls used in the experiment
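The numbers quoted for shape a) are consistent with a complete binary call tree of seven levels (the leaf name m1111111 has seven digits). The sketch below only checks that reading; `build_gmc` is a hypothetical helper, and the tree depth is our inference rather than a figure stated in the text.

```python
def build_gmc(depth, fanout=2, root="m1"):
    """Return {method: [called methods]} for a complete tree of method calls."""
    gmc = {}
    level = [root]
    for _ in range(depth - 1):
        next_level = []
        for method in level:
            callees = [f"{method}{k}" for k in range(1, fanout + 1)]
            gmc[method] = callees
            next_level.extend(callees)
        level = next_level
    for leaf in level:
        gmc[leaf] = []  # leaf methods call nothing
    return gmc


gmc_a = build_gmc(depth=7)
component_methods = len(gmc_a) - 1  # everything below the root m1

# Updating an object used by a leaf invalidates one branch up to m1.
leaf = "m1111111"
branch = [leaf[:n] for n in range(len(leaf), 1, -1)]
```

Here `component_methods` evaluates to 126, matching the 126 component objects per root, and `branch` lists the seven methods from m1111111 up to m1 that must be invalidated and rematerialized, out of 127 in total.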
4.1 Storage Space Overhead
Additional data structures supporting the hierarchical materialization need storage space. The space overhead for storing one record in MMRS is computed by the following formula: 64B + nb_arg * domain * 2B, where nb_arg is the number of input arguments of a materialized method and domain is the number of different values an argument can have. The size of MMRS thus grows with the number of input arguments and with the size of their domains. If the domain of an input argument is large, the resulting size of MMRS degrades system performance. The space overhead for storing one record in GMC is constant, equals 120B, and is independent of the number of complex objects being processed. Table 1 summarizes the space overhead for storing the content of MMRS and GMC for the shapes of GMC shown in Figure 2. Only methods without input arguments are taken into account.

Table 1. Space overhead for storing the content of MMRS and GMC

GMC shape    MMRS [kB]    GMC [kB]
a)           15 875       14.8
b)           45 500       42.5
c)           3 875        3.5

4.2 Method Performances
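Before turning to the timings, the shape a) row of Table 1 can be cross-checked against the per-record costs given in Section 4.1. The calculation below rests on assumptions of ours that the text does not state explicitly: one MMRS record per (root object, method) pair, one GMC record per call edge, and 1 kB = 1024 B. The function name and the example with two arguments are hypothetical.

```python
def mmrs_record_bytes(nb_arg=0, domain=0):
    """Size of one MMRS record: 64B + nb_arg * domain * 2B."""
    return 64 + nb_arg * domain * 2


GMC_RECORD_BYTES = 120
ROOT_OBJECTS = 2000
METHODS_SHAPE_A = 127  # m1 plus the 126 methods it calls directly or indirectly

# One MMRS record per root object and method (assumption).
mmrs_kb = ROOT_OBJECTS * METHODS_SHAPE_A * mmrs_record_bytes() / 1024
# One GMC record per call edge; a tree with 127 nodes has 126 edges (assumption).
gmc_kb = (METHODS_SHAPE_A - 1) * GMC_RECORD_BYTES / 1024

# A method with 2 arguments of 10 possible values each (illustrative numbers).
with_args = mmrs_record_bytes(nb_arg=2, domain=10)

print(mmrs_kb, round(gmc_kb, 1), with_args)
```

Under these assumptions the computed sizes are 15 875 kB for MMRS and about 14.8 kB for GMC, in agreement with the shape a) row, and the record of the illustrative two-argument method grows from 64 B to 104 B.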
Fig. 3 shows the total time overhead for: (1) the execution of method m1 without materialization (Exe), (2) the execution of method m1 with the materialization of its result (E+M), (3) reading the materialized result of method m1 (RM), (4) the invalidation of method m1 (Inv), and (5) the rematerialization of a previously invalidated method m1 (Rem). These five kinds of time overhead were measured for m1 without input arguments, for 2000 root complex objects, for shapes a), b), and c) of GMC. Average times, computed per root complex object, are presented in Fig. 4. In these experiments, the invalidation of m1 was caused by updating an object used to compute the result of a leaf–method (e.g., m1111111 in Figure 2a); thus one branch of GMC was invalidated, from the bottom method up to the top method m1.

[Figure 3: bar chart; y-axis: total time [sec]; x-axis: Exe, E+M, RM, Inv, Rem; one series per GMC shape a), b), c)]

Fig. 3. Total times of processing method m1, for 2000 root complex objects (one branch of GMC invalidated); three "shapes" of GMC were taken into account

In order to measure the usefulness of hierarchical materialization we computed the following time coefficient: tc = (Inv + Rem) / Exe. Taking into account the following times shown in Fig. 4 for shape a):
(Exe) = 1.2 sec for method m1;
(Inv) = 0.17 sec for method m1;
(Rem) = 0.15 sec for method m1;
the value of tc equals approximately 0.27, meaning that thanks to the hierarchical materialization technique, method m1 (the root of the Graph of Method Calls) was brought up to date after an update approximately 3.7 times faster than by executing it without materialization. For shape b) of GMC the value of tc equals 0.11, which means that the execution time of m1 was reduced 9.1 times. For shape c) of GMC the value of tc equals 0.77, which means that the execution time of m1 was reduced 1.3 times. The tc coefficient depends (1) on the shape of GMC and (2) on the number of objects being updated, which in consequence affects the number of methods whose results have to be invalidated and rematerialized.

We have performed other experiments measuring the impact of invalidating more than one branch of GMC. Because of space constraints, however, we are unable to present them in detail. A brief summary of those experiments is as follows. Two branches of GMC were invalidated, i.e., leaf–methods m1111111 and m1222111 for shapes a) and b), and m111 and m131 for shape c). The results are as follows:
• for shape a) the tc coefficient equals 0.5, which means that the execution time of m1 was halved;
• for shape b) the tc coefficient equals 0.21, which means that the execution time of m1 was reduced 4.8 times;
• for shape c) the tc coefficient equals 1.4, which means that the execution time of m1 increased by a factor of 1.4, i.e., materialization was slower than plain execution (a "speed-up" of only 0.7).
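The relation between the time coefficient and the quoted speed-ups is simple arithmetic, reproduced here as a small Python check. The function name is ours; the input times and the rounded tc values are the ones reported in this section.

```python
def time_coefficient(inv, rem, exe):
    """tc = (Inv + Rem) / Exe, as defined in Section 4.2."""
    return (inv + rem) / exe


tc_a = time_coefficient(inv=0.17, rem=0.15, exe=1.2)

# Reported tc values, keyed by (shape, number of invalidated branches).
reported_tc = {("a", 1): 0.27, ("b", 1): 0.11, ("c", 1): 0.77,
               ("a", 2): 0.5, ("b", 2): 0.21, ("c", 2): 1.4}

# The speed-up is the reciprocal of tc; values below 1 mean a slowdown.
speedup = {case: round(1 / tc, 1) for case, tc in reported_tc.items()}

print(round(tc_a, 2), speedup)
```

The reciprocal of each rounded tc gives the factors 3.7, 9.1, 1.3, 2.0, 4.8, and 0.7 quoted in the text; a value below 1, as for shape c) with two invalidated branches, indicates that invalidation plus rematerialization cost more than executing m1 directly.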
[Figure 4: bar chart; y-axis: average time [sec]; x-axis: Exe, E+M, RM, Inv, Rem; one series per GMC shape a), b), c)]

Fig. 4. Average times of processing method m1, for 2000 root complex objects (one branch of GMC invalidated); three "shapes" of GMC were taken into account

5 Conclusions and Future Work

The support for view materialization and maintenance is required when applying object–oriented views in the process of warehousing data of complex structure and behavior. In this paper we presented a framework for object–oriented view materialization and maintenance with respect to methods. To the best of our knowledge, it is the first approach to method materialization applied to object–oriented views. Moreover, we have proposed a novel method materialization technique, called hierarchical materialization. As the experiments showed, this technique reduces the maintenance cost of materialized methods, as unaffected intermediate results do not have to be recomputed and can thus be reused during the recomputation of another affected method.

Hierarchical materialization of methods has the following limitations.
1. It is suitable for environments where updates and deletions of objects are less frequent than queries, e.g., data warehousing systems.
2. It is suitable for materializing only those methods that have no input arguments at all, or have a few input arguments, each of which can take few different values (cf. [3]).
3. This technique gives better performance for methods whose on–line computation is costly.

Even though the methods used in the experiments performed simple arithmetical operations, the hierarchical materialization technique gave better system performance for GMC a) and b). A greater performance gain can be expected when materializing methods whose computation is more costly than those used in the experiments. The results that we obtained clearly show that hierarchical materialization gives good results for Graphs of Method Calls that have a few levels and where one method calls a few other methods (cf. shapes a) and b)). Shape c) gave better performance only when one branch was invalidated; for two branches the execution time deteriorated.
The current implementation of hierarchical materialization has the following drawbacks: (1) the decision whether a method should be materialized or not is explicitly made by a view designer during the system tuning activity, (2) the graph of method calls is not extracted automatically from method bodies, (3) the GMC has to reflect the aggregation relationships. In other words, if method mi (in view class Vi) calls mj (in view class Vj), then an aggregation relationship must exist between Vi and Vj.
References

1. Bertino E.: Method precomputation in object–oriented databases. Proceedings of ACM–SIGOIS and IEEE–TC–OA International Conference on Organizational Computing Systems, 1991
2. Bebel B., Wrembel R.: Hierarchical Materialisation of Methods in Object-Oriented Views: Design, Maintenance, and Experimental Evaluation. Proc. of the DOLAP’01 Conference, Atlanta, USA, November, 2001
3. Bebel B., Wrembel R.: Method Materialization Using the Hierarchical Technique: Experimental Evaluation (in preparation)
4. Eder J., Frank H., Morzy T., Wrembel R., Zakrzewicz M.: Designing an Object-Relational Database System: Project ORDAWA. Proc. of challenges of ADBIS-DASFAA 2000, Prague, Czech Republic, September, 2000, pp. 223–227
5. Eder J., Frank H., Liebhart W.: Optimization of Object–Oriented Queries by Inverse Methods. Proc. of East/West Database Workshop, Austria, 1994
6. Huynh T.N., Mangisengi O., Tjoa A.M.: Metadata for Object–Relational Data Warehouse. Proc. of DMDW'2000, Sweden, 2000
7. Jhingran A.: Precomputation in a Complex Object Environment. Proc. of IEEE Data Engineering, Japan, 1991, pp. 652–659
8. Kemper A., Kilger C., Moerkotte G.: Function Materialization in Object Bases. Proc. of SIGMOD, 1991, pp. 258–267
9. Kemper A., Kilger C., Moerkotte G.: Function Materialization in Object Bases: Design, Realization, and Evaluation. IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 4, 1994
10. Wrembel R.: On Materialising Object-Oriented Views. In Barzdins J., Caplinskas A. (eds.): Databases and Information Systems. Kluwer Academic Publishers, 2001, ISBN 0-7923-6823-1, pp. 15–28
11. Wrembel R.: The Construction and Maintenance of Materialised Object–Oriented Views in Data Warehousing Systems. PhD thesis, Poznan University of Technology, Institute of Computing Science, Poznan, Poland, March, 2001
12. Wrembel R.: On Defining and Building Object–Oriented Views for Advanced Applications. Proc. of the Workshop on Object–Oriented Business Solutions, Budapest, Hungary, June, 2001
Author Index

Antyoufeev, Sergey 176
Aragão, Marcelo A.T. 244
Arslan, Ahmet 255, 264
Atanassov, Krassimir 21
Baser, Güngör 223
Bebel, Bartosz 425
Bitirim, Yıltan 93
Böhlen, Michael 65
Bulgak, Ayse 386
Bulgun, Ender Yazgan 223, 366
Çakir, Sen 366
Çebi, Yalçın 205
Chang, Yu-Sug 415
Chang, Yue-Shan 273
Cheng, Ming-Chun 273
Chiao, Hsin-Ta 273
Cho, Sehyeong 154
Choi, Jungwan 336
Chountas, Panagiotis 21
Codognet, Philippe 242
Dalkılıç, Gökhan 144, 205
Dalkılıç, Mehmet Emin 144
Demir, Deniz 395
Dikenelli, Oğuz 283
Domínguez, Eladio 54
Dunlap, Joanna C. 375
Ekin, Emine 324
El-Darzi, Elia 21
Eminov, Diliaver 386
Erdogan, Nadia 327
Erdur, Rıza Cenk 283
Ermiş, Murat 293
Etaner-Uyar, A. Sima 314
Fang, Xiaoshan 166
Fedorov, Konstantin 176
Fernandes, Alvaro A.A. 244
Gilbert, Austin 232
Gokcen, Ibrahim 104
Gordon, Minor 232
Grabinger, Scott 375
Güdükbay, Uğur 186
Güleşir, Gürcan 186
Gülseçen, Sevinç 385
Gültekin, İrfan 255
Guo, Baofeng 123
Hacıoğlu, Abdurrahman 293
Han, Ki-Jun 356
Han, Seung-Soo 154
Harmanci, A. Emre 314
Hwang, Sung-Ho 356
Jensen, Ole Guttorm 65
Ježek, Karel 83
Jiang, Jianmin 123
Kasap, Mustafa 223
Kavukcu, Ergun 395
Kemikli, Erdal 327
Khan, Sharifullah 31
Kilic, Alper 264
Kim, Jung Sup 213
Kim, Ki Chang 213
Kim, Soo Duk 213
Kim, Yoo-Sung 213
Kodogiannis, Vassilis 21
Królikowski, Zbyszko 425
Kut, Alp 223, 366
Latiful Hoque, Abu Sayed Md. 11
Lee, Dong Ho 213
Lee, Hye-Jung 415
Lee, Im-Yeong 415
Li, Pengjie 123
Lloret, Jorge 54
Macedo Mourelle, Luiza de 303, 405
Marchuk, Alexander 176
McGregor, Douglas 11
Mott, Peter L. 31
Nam, S.W. 336
Nedjah, Nadia 303, 405
Nemov, Andrey 176
Nygård, Mads 43
Paprzycki, Marcin 232
Park, Doo-Soon 415
Park, Jaehyun 213
Peng, Jing 104
Petrounias, Ilias 21
Rahayu, Johanna Wenny 1
Sadat, Fatiha 114
Şaykol, Ediz 186
Sever, Hayri 93, 133
Sevilmis, Can 346
Sheng, Huanye 166
Taniar, David 1
Tolun, Mehmet R. 133
Tonta, Yaşar 93
Topcuoglu, Haluk 346, 395
Tøssebro, Erlend 43
Tunalı, Turhan 195
Uemura, Shunsuke 114
Ülengin, Füsun 293
Ulusoy, Özgür 186
Wilson, John 11
Won, Youjip 336
Wrembel, Robert 425
Yakhno, Tatyana 324
Yıldırım, Şule 195
Yoshikawa, Masatoshi 114
Yuan, Shyan-Ming 273
Zapata, María Antonia 54
Zíma, Martin 83